Yukeaaa's Arxiv
Computer Vision and Pattern Recognition 127
☆ TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System
Large-scale data has driven breakthroughs in robotics, from language models to vision-language-action models in bimanual manipulation. However, humanoid robotics lacks equally effective data collection frameworks. Existing humanoid teleoperation systems either use decoupled control or depend on expensive motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid teleoperation and data collection system that preserves full whole-body control while advancing scalability. Our system leverages PICO4U VR for obtaining real-time whole-body human motions, with a custom 2-DoF robot neck (cost around $250) for egocentric vision, enabling holistic human-to-humanoid control. We demonstrate long-horizon dexterous and mobile humanoid skills and we can collect 100 demonstrations in 15 minutes with an almost 100% success rate. Building on this pipeline, we propose a hierarchical visuomotor policy framework that autonomously controls the full humanoid body based on egocentric vision. Our visuomotor policy successfully demonstrates whole-body dexterous manipulation and dynamic kicking tasks. The entire system is fully reproducible and open-sourced at https://yanjieze.com/TWIST2 . Our collected dataset is also open-sourced at https://twist-data.github.io .
comment: Website: https://yanjieze.com/TWIST2
☆ Densemarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks
We propose DenseMarks - a new learned representation for human heads, enabling high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. In order to train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking heads videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmarks and segmentation constraints, as well as imposing spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head tracking, and stereo reconstruction. Due to the strong supervision, our method is robust to pose variations and covers the entire head, including hair. Additionally, the canonical space bottleneck makes sure the obtained representations are consistent across diverse poses and individuals. We demonstrate state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models. The code and the model checkpoint will be made available to the public.
comment: Project page: https://diddone.github.io/densemarks/ .Video: https://youtu.be/o8DOOYFW0gI .21 pages, 13 figures, 2 tables
☆ PLUTO-4: Frontier Pathology Foundation Models
Foundation models trained on large-scale pathology image corpora have demonstrated strong transfer capabilities across diverse histopathology tasks. Building on this progress, we introduce PLUTO-4, our next generation of pathology foundation models that extend the Pathology-Universal Transformer (PLUTO) to frontier scale. We share two complementary Vision Transformer architectures in the PLUTO-4 family: a compact and efficient PLUTO-4S model optimized for multi-scale deployment using a FlexiViT setup with 2D-RoPE embeddings, and a frontier-scale PLUTO-4G model trained with a single patch size to maximize representation capacity and stability. Both models are pretrained using a self-supervised objective derived from DINOv2 on a large multi-institutional corpus containing 551,164 WSIs from 137,144 patients across over 50 institutions, spanning over 60 disease types and over 100 stains. Comprehensive evaluation across public and internal benchmarks demonstrates that PLUTO-4 achieves state-of-the-art performance on tasks requiring varying spatial and biological context, including patch-level classification, segmentation, and slide-level diagnosis. The compact PLUTO-4S provides high-throughput and robust performance for practical deployment, while PLUTO-4G establishes new performance frontiers across multiple pathology benchmarks, including an 11% improvement in dermatopathology diagnosis. These diverse improvements underscore PLUTO-4's potential to transform real-world applications as a backbone for translational research and diagnostic use cases.
☆ AI-Generated Image Detection: An Empirical Study and Future Research Directions
The threats posed by AI-generated media, particularly deepfakes, are now raising significant challenges for multimedia forensics, misinformation detection, and biometric system resulting in erosion of public trust in the legal system, significant increase in frauds, and social engineering attacks. Although several forensic methods have been proposed, they suffer from three critical gaps: (i) use of non-standardized benchmarks with GAN- or diffusion-generated images, (ii) inconsistent training protocols (e.g., scratch, frozen, fine-tuning), and (iii) limited evaluation metrics that fail to capture generalization and explainability. These limitations hinder fair comparison, obscure true robustness, and restrict deployment in security-critical applications. This paper introduces a unified benchmarking framework for systematic evaluation of forensic methods under controlled and reproducible conditions. We benchmark ten SoTA forensic methods (scratch, frozen, and fine-tuned) and seven publicly available datasets (GAN and diffusion) to perform extensive and systematic evaluations. We evaluate performance using multiple metrics, including accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity. We also further analyze model interpretability using confidence curves and Grad-CAM heatmaps. Our evaluations demonstrate substantial variability in generalization, with certain methods exhibiting strong in-distribution performance but degraded cross-model transferability. This study aims to guide the research community toward a deeper understanding of the strengths and limitations of current forensic approaches, and to inspire the development of more robust, generalizable, and explainable solutions.
☆ When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To solve this, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high-quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.
comment: 28 pages, 15 figures
☆ VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.
comment: Project page: https://csu-jpg.github.io/VCode Github: https://github.com/CSU-JPG/VCode
☆ PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
We present PercHead, a method for single-image 3D head reconstruction and semantic 3D editing - two tasks that are inherently challenging due to severe view occlusions, weak perceptual supervision, and the ambiguity of editing in 3D space. We develop a unified base model for reconstructing view-consistent 3D heads from a single input image. The model employs a dual-branch encoder followed by a ViT-based decoder that lifts 2D features into 3D space through iterative cross-attention. Rendering is performed using Gaussian Splatting. At the heart of our approach is a novel perceptual supervision strategy based on DINOv2 and SAM2.1, which provides rich, generalized signals for both geometric and appearance fidelity. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles compared to established baselines. Furthermore, this base model can be seamlessly extended for semantic 3D editing by swapping the encoder and finetuning the network. In this variant, we disentangle geometry and style through two distinct input modalities: a segmentation map to control geometry and either a text prompt or a reference image to specify appearance. We highlight the intuitive and powerful 3D editing capabilities of our model through a lightweight, interactive GUI, where users can effortlessly sculpt geometry by drawing segmentation maps and stylize appearance via natural language or image prompts. Project Page: https://antoniooroz.github.io/PercHead Video: https://www.youtube.com/watch?v=4hFybgTk4kE
comment: Project Page: https://antoniooroz.github.io/PercHead/ Video: https://www.youtube.com/watch?v=4hFybgTk4kE
☆ Dynamic Reflections: Probing Video Representations with Text Alignment
The alignment of representations from different modalities has recently been shown to provide insights on the structural similarities and downstream capabilities of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment highly depends on the richness of both visual (static images vs. multi-frame videos) and text (single caption vs. a collection) data provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations. Secondly, we investigate the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, providing initial evidence that strong alignment against text encoders may be linked to general-purpose video representation and understanding. Finally, we correlate temporal reasoning with cross-modal alignment providing a challenging test-bed for vision and language models. Overall, our work introduces video-text alignment as an informative zero-shot way to probe the representation power of different encoders for spatio-temporal data. Project page can be found at https://video-prh.github.io/
comment: 21 pages, 12 figures
☆ LLEXICORP: End-user Explainability of Convolutional Neural Networks
Convolutional neural networks (CNNs) underpin many modern computer vision systems. With applications ranging from common to critical areas, a need to explain and understand the model and its decisions (XAI) emerged. Prior works suggest that in the top layers of CNNs, the individual channels can be attributed to classifying human-understandable concepts. Concept relevance propagation (CRP) methods can backtrack predictions to these channels and find images that most activate these channels. However, current CRP workflows are largely manual: experts must inspect activation images to name the discovered concepts and must synthesize verbose explanations from relevance maps, limiting the accessibility of the explanations and their scalability. To address these issues, we introduce Large Language model EXplaIns COncept Relevance Propagation (LLEXICORP), a modular pipeline that couples CRP with a multimodal large language model. Our approach automatically assigns descriptive names to concept prototypes and generates natural-language explanations that translate quantitative relevance distributions into intuitive narratives. To ensure faithfulness, we craft prompts that teach the language model the semantics of CRP through examples and enforce a separation between naming and explanation tasks. The resulting text can be tailored to different audiences, offering low-level technical descriptions for experts and high-level summaries for non-technical stakeholders. We qualitatively evaluate our method on various images from ImageNet on a VGG16 model. Our findings suggest that integrating concept-based attribution methods with large language models can significantly lower the barrier to interpreting deep neural networks, paving the way for more transparent AI systems.
☆ An unscented Kalman filter method for real time input-parameter-state estimation
The input-parameter-state estimation capabilities of a novel unscented Kalman filter is examined herein on both linear and nonlinear systems. The unknown input is estimated in two stages within each time step. Firstly, the predicted dynamic states and the system parameters provide an estimation of the input. Secondly, the corrected with measurements states and parameters provide a final estimation. Importantly, it is demonstrated using the perturbation analysis that, a system with at least a zero or a non-zero known input can potentially be uniquely identified. This output-only methodology allows for a better understanding of the system compared to classical output-only parameter identification strategies, given that all the dynamic states, the parameters, and the input are estimated jointly and in real-time.
comment: author-accepted manuscript (AAM) published in Mechanical Systems and Signal Processing
☆ VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models
Understanding and predicting emotion from videos has gathered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges. Emotions are characterized by dynamic and cues-dependent properties, making it difficult to understand complex and evolving emotional states with reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce a emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.
comment: 41 pages, 26 figures
☆ Modality-Transition Representation Learning for Visible-Infrared Person Re-Identification
Visible-infrared person re-identification (VI-ReID) technique could associate the pedestrian images across visible and infrared modalities in the practical scenarios of background illumination changes. However, a substantial gap inherently exists between these two modalities. Besides, existing methods primarily rely on intermediate representations to align cross-modal features of the same person. The intermediate feature representations are usually create by generating intermediate images (kind of data enhancement), or fusing intermediate features (more parameters, lack of interpretability), and they do not make good use of the intermediate features. Thus, we propose a novel VI-ReID framework via Modality-Transition Representation Learning (MTRL) with a middle generated image as a transmitter from visible to infrared modals, which are fully aligned with the original visible images and similar to the infrared modality. After that, using a modality-transition contrastive loss and a modality-query regularization loss for training, which could align the cross-modal features more effectively. Notably, our proposed framework does not need any additional parameters, which achieves the same inference speed to the backbone while improving its performance on VI-ReID task. Extensive experimental results illustrate that our model significantly and consistently outperforms existing SOTAs on three typical VI-ReID datasets.
☆ Differentiable Hierarchical Visual Tokenization NeurIPS 2025
Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.
comment: NeurIPS 2025 Spotlight
☆ Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models
Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method consistently outperforms others across scenarios, (3) pruning sensitivity varies significantly across tasks, with OCR being most vulnerable, and (4) pruning ratio is the dominant factor governing performance degradation. We believe UniPruneBench will serve as a reliable foundation for future research on efficient multimodal modeling.
☆ Robust Face Liveness Detection for Biometric Authentication using Single Image
Biometric technologies are widely adopted in security, legal, and financial systems. Face recognition can authenticate a person based on the unique facial features such as shape and texture. However, recent works have demonstrated the vulnerability of Face Recognition Systems (FRS) towards presentation attacks. Using spoofing (aka.,presentation attacks), a malicious actor can get illegitimate access to secure systems. This paper proposes a novel light-weight CNN framework to identify print/display, video and wrap attacks. The proposed robust architecture provides seamless liveness detection ensuring faster biometric authentication (1-2 seconds on CPU). Further, this also presents a newly created 2D spoof attack dataset consisting of more than 500 videos collected from 60 subjects. To validate the effectiveness of this architecture, we provide a demonstration video depicting print/display, video and wrap attack detection approaches. The demo can be viewed in the following link: https://rak.box.com/s/m1uf31fn5amtjp4mkgf1huh4ykfeibaa
☆ UniChange: Unifying Change Detection with Multimodal Large Language Model
Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at https://github.com/Erxucomeon/UniChange.
☆ Zero-Shot Multi-Animal Tracking in the Wild
Multi-animal tracking is crucial for understanding animal ecology and behavior. However, it remains a challenging task due to variations in habitat, motion patterns, and species appearance. Traditional approaches typically require extensive model fine-tuning and heuristic design for each application scenario. In this work, we explore the potential of recent vision foundation models for zero-shot multi-animal tracking. By combining a Grounding Dino object detector with the Segment Anything Model 2 (SAM 2) tracker and carefully designed heuristics, we develop a tracking framework that can be applied to new datasets without any retraining or hyperparameter adaptation. Evaluations on ChimpAct, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40 demonstrate strong and consistent performance across diverse species and environments. The code is available at https://github.com/ecker-lab/SAM2-Animal-Tracking.
☆ TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.
comment: 13 pages, 8 figures, 3 tables. The first two authors contributed equally. Project Page: https://iyatomilab.github.io/TAUE
☆ Resource-efficient Automatic Refinement of Segmentations via Weak Supervision from Light Feedback
Delineating anatomical regions is a key task in medical image analysis. Manual segmentation achieves high accuracy but is labor-intensive and prone to variability, thus prompting the development of automated approaches. Recently, a breadth of foundation models has enabled automated segmentations across diverse anatomies and imaging modalities, but these may not always meet the clinical accuracy standards. While segmentation refinement strategies can improve performance, current methods depend on heavy user interactions or require fully supervised segmentations for training. Here, we present SCORE (Segmentation COrrection from Regional Evaluations), a weakly supervised framework that learns to refine mask predictions only using light feedback during training. Specifically, instead of relying on dense training image annotations, SCORE introduces a novel loss that leverages region-wise quality scores and over/under-segmentation error labels. We demonstrate SCORE on humerus CT scans, where it considerably improves initial predictions from TotalSegmentator, and achieves performance on par with existing refinement methods, while greatly reducing their supervision requirements and annotation time. Our code is available at: https://gitlab.inria.fr/adelangl/SCORE.
☆ A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding
Subject-agnostic brain decoding, which aims to reconstruct continuous visual experiences from fMRI without subject-specific training, holds great potential for clinical applications. However, this direction remains underexplored due to challenges in cross-subject generalization and the complex nature of brain signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. By disentangling and leveraging features from early visual cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary cognitive information essential for visual reconstruction. Furthermore, we introduce a feature-level contrastive learning strategy to enhance the extraction of subject-invariant semantic representations, thereby enhancing subject-agnostic applicability to previously unseen subjects. Unlike conventional pipelines that need more than 12 hours of per-subject data and heavy computation, VCFlow sacrifices only 7\% accuracy on average yet generates each reconstructed video in 10 seconds without any retraining, offering a fast and clinically scalable solution. The source code will be released upon acceptance of the paper.
comment: 9 pages main text with 6 figures (excluding references), supplementary material included
☆ Seeing Across Time and Views: Multi-Temporal Cross-View Learning for Robust Video Person Re-Identification
Video-based person re-identification (ReID) in cross-view domains (for example, aerial-ground surveillance) remains an open problem because of extreme viewpoint shifts, scale disparities, and temporal inconsistencies. To address these challenges, we propose MTF-CVReID, a parameter-efficient framework that introduces seven complementary modules over a ViT-B/16 backbone. Specifically, we include: (1) Cross-Stream Feature Normalization (CSFN) to correct camera and view biases; (2) Multi-Resolution Feature Harmonization (MRFH) for scale stabilization across altitudes; (3) Identity-Aware Memory Module (IAMM) to reinforce persistent identity traits; (4) Temporal Dynamics Modeling (TDM) for motion-aware short-term temporal encoding; (5) Inter-View Feature Alignment (IVFA) for perspective-invariant representation alignment; (6) Hierarchical Temporal Pattern Learning (HTPL) to capture multi-scale temporal regularities; and (7) Multi-View Identity Consistency Learning (MVICL) that enforces cross-view identity coherence using a contrastive learning paradigm. Despite adding only about 2 million parameters and 0.7 GFLOPs over the baseline, MTF-CVReID maintains real-time efficiency (189 FPS) and achieves state-of-the-art performance on the AG-VPReID benchmark across all altitude levels, with strong cross-dataset generalization to G2A-VReID and MARS datasets. These results show that carefully designed adapter-based modules can substantially enhance cross-view robustness and temporal consistency without compromising computational efficiency. The source code is available at https://github.com/MdRashidunnabi/MTF-CVReID
☆ The Urban Vision Hackathon Dataset and Models: Towards Image Annotations and Accurate Vision Models for Indian Traffic
This report describes the UVH-26 dataset, the first public release by AIM@IISc of a large-scale dataset of annotated traffic-camera images from India. The dataset comprises 26,646 high-resolution (1080p) images sampled from 2800 Bengaluru's Safe-City CCTV cameras over a 4-week period, and subsequently annotated through a crowdsourced hackathon involving 565 college students from across India. In total, 1.8 million bounding boxes were labeled across 14 vehicle classes specific to India: Cycle, 2-Wheeler (Motorcycle), 3-Wheeler (Auto-rickshaw), LCV (Light Commercial Vehicles), Van, Tempo-traveller, Hatchback, Sedan, SUV, MUV, Mini-bus, Bus, Truck and Other. Of these, 283k-316k consensus ground truth bounding boxes and labels were derived for distinct objects in the 26k images using Majority Voting and STAPLE algorithms. Further, we train multiple contemporary detectors, including YOLO11-S/X, RT-DETR-S/X, and DAMO-YOLO-T/L using these datasets, and report accuracy based on mAP50, mAP75 and mAP50:95. Models trained on UVH-26 achieve 8.4-31.5% improvements in mAP50:95 over equivalent baseline models trained on COCO dataset, with RT-DETR-X showing the best performance at 0.67 (mAP50:95) as compared to 0.40 for COCO-trained weights for common classes (Car, Bus, and Truck). This demonstrates the benefits of domain-specific training data for Indian traffic scenarios. The release package provides the 26k images with consensus annotations based on Majority Voting (UVH-26-MV) and STAPLE (UVH-26-ST) and the 6 fine-tuned YOLO and DETR models on each of these datasets. By capturing the heterogeneity of Indian urban mobility directly from operational traffic-camera streams, UVH-26 addresses a critical gap in existing global benchmarks, and offers a foundation for advancing detection, classification, and deployment of intelligent transportation systems in emerging nations with complex traffic conditions.
☆ SigmaCollab: An Application-Driven Dataset for Physically Situated Collaboration
We introduce SigmaCollab, a dataset enabling research on physically situated human-AI collaboration. The dataset consists of a set of 85 sessions in which untrained participants were guided by a mixed-reality assistive AI agent in performing procedural tasks in the physical world. SigmaCollab includes a set of rich, multimodal data streams, such as the participant and system audio, egocentric camera views from the head-mounted device, depth maps, head, hand and gaze tracking information, as well as additional annotations performed post-hoc. While the dataset is relatively small in size (~ 14 hours), its application-driven and interactive nature brings to the fore novel research challenges for human-AI collaboration, and provides more realistic testing grounds for various AI models operating in this space. In future work, we plan to use the dataset to construct a set of benchmarks for physically situated collaboration in mixed-reality task assistive scenarios. SigmaCollab is available at https://github.com/microsoft/SigmaCollab.
☆ Forecasting Future Anatomies: Longitudianl Brain Mri-to-Mri Prediction
Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer's disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant's entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.
☆ Unsupervised Learning for Industrial Defect Detection: A Case Study on Shearographic Data
Shearography is a non-destructive testing method for detecting subsurface defects, offering high sensitivity and full-field inspection capabilities. However, its industrial adoption remains limited due to the need for expert interpretation. To reduce reliance on labeled data and manual evaluation, this study explores unsupervised learning methods for automated anomaly detection in shearographic images. Three architectures are evaluated: a fully connected autoencoder, a convolutional autoencoder, and a student-teacher feature matching model. All models are trained solely on defect-free data. A controlled dataset was developed using a custom specimen with reproducible defect patterns, enabling systematic acquisition of shearographic measurements under both ideal and realistic deformation conditions. Two training subsets were defined: one containing only undistorted, defect-free samples, and one additionally including globally deformed, yet defect-free, data. The latter simulates practical inspection conditions by incorporating deformation-induced fringe patterns that may obscure localized anomalies. The models are evaluated in terms of binary classification and, for the student-teacher model, spatial defect localization. Results show that the student-teacher approach achieves superior classification robustness and enables precise localization. Compared to the autoencoder-based models, it demonstrates improved separability of feature representations, as visualized through t-SNE embeddings. Additionally, a YOLOv8 model trained on labeled defect data serves as a reference to benchmark localization quality. This study underscores the potential of unsupervised deep learning for scalable, label-efficient shearographic inspection in industrial environments.
comment: 15 pages, 6 figures, 1 table; accepted for AI-2025 Forty-fifth SGAI International Conference on Artificial Intelligence CAMBRIDGE, ENGLAND 16-18 DECEMBER 2025
☆ LiteVoxel: Low-memory Intelligent Thresholding for Efficient Voxel Rasterization
Sparse-voxel rasterization is a fast, differentiable alternative for optimization-based scene reconstruction, but it tends to underfit low-frequency content, depends on brittle pruning heuristics, and can overgrow in ways that inflate VRAM. We introduce LiteVoxel, a self-tuning training pipeline that makes SV rasterization both steadier and lighter. Our loss is made low-frequency aware via an inverse-Sobel reweighting with a mid-training gamma-ramp, shifting gradient budget to flat regions only after geometry stabilize. Adaptation replaces fixed thresholds with a depth-quantile pruning logic on maximum blending weight, stabilized by EMA-hysteresis guards and refines structure through ray-footprint-based, priority-driven subdivision under an explicit growth budget. Ablations and full-system results across Mip-NeRF 360 (6scenes) and Tanks & Temples (3scenes) datasets show mitigation of errors in low-frequency regions and boundary instability while keeping PSNR/SSIM, training time, and FPS comparable to a strong SVRaster pipeline. Crucially, LiteVoxel reduces peak VRAM by ~40%-60% and preserves low-frequency detail that prior setups miss, enabling more predictable, memory-efficient training without sacrificing perceptual quality.
☆ Keeping it Local, Tiny and Real: Automated Report Generation on Edge Computing Devices for Mechatronic-Based Cognitive Systems
Recent advancements in Deep Learning enable hardware-based cognitive systems, that is, mechatronic systems in general and robotics in particular with integrated Artificial Intelligence, to interact with dynamic and unstructured environments. While the results are impressive, the application of such systems to critical tasks like autonomous driving as well as service and care robotics necessitate the evaluation of large amount of heterogeneous data. Automated report generation for Mobile Robotics can play a crucial role in facilitating the evaluation and acceptance of such systems in various domains. In this paper, we propose a pipeline for generating automated reports in natural language utilizing various multi-modal sensors that solely relies on local models capable of being deployed on edge computing devices, thus preserving the privacy of all actors involved and eliminating the need for external services. In particular, we evaluate our implementation on a diverse dataset spanning multiple domains including indoor, outdoor and urban environments, providing quantitative as well as qualitative evaluation results. Various generated example reports and other supplementary materials are available via a public repository.
comment: 6 pages, 4 figures, 1 table; accepted for MECATRONICS-REM 2025 International Conference, PARIS, FRANCE December 3-5 2025
☆ ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing
Shot assembly is a crucial step in film production and video editing, involving the sequencing and arrangement of shots to construct a narrative, convey information, or evoke emotions. Traditionally, this process has been manually executed by experienced editors. While current intelligent video editing technologies can handle some automated video editing tasks, they often fail to capture the creator's unique artistic expression in shot assembly.To address this challenge, we propose an energy-based optimization method for video shot assembly. Specifically, we first perform visual-semantic matching between the script generated by a large language model and a video library to obtain subsets of candidate shots aligned with the script semantics. Next, we segment and label the shots from reference videos, extracting attributes such as shot size, camera motion, and semantics. We then employ energy-based models to learn from these attributes, scoring candidate shot sequences based on their alignment with reference styles. Finally, we achieve shot assembly optimization by combining multiple syntax rules, producing videos that align with the assembly style of the reference videos. Our method not only automates the arrangement and combination of independent shots according to specific logic, narrative requirements, or artistic styles but also learns the assembly style of reference videos, creating a coherent visual sequence or holistic visual expression. With our system, even users with no prior video editing experience can create visually compelling videos. Project page: https://sobeymil.github.io/esa.com
☆ Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes
The automation of workflows in advanced microscopy is a key goal where foundation models like Language Models (LLMs) and Vision-Language Models (VLMs) show great potential. However, adapting these general-purpose models for specialized scientific tasks is critical, and the optimal domain adaptation strategy is often unclear. To address this, we introduce PtychoBench, a new multi-modal, multi-task benchmark for ptychographic analysis. Using this benchmark, we systematically compare two specialization strategies: Supervised Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies on a visual artifact detection task with VLMs and a textual parameter recommendation task with LLMs in a data-scarce regime. Our findings reveal that the optimal specialization pathway is task-dependent. For the visual task, SFT and ICL are highly complementary, with a fine-tuned model guided by context-aware examples achieving the highest mean performance (Micro-F1 of 0.728). Conversely, for the textual task, ICL on a large base model is the superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a powerful "super-expert" SFT model (0-shot Micro-F1 of 0.839). We also confirm the superiority of context-aware prompting and identify a consistent contextual interference phenomenon in fine-tuned models. These results, benchmarked against strong baselines including GPT-4o and a DINOv3-based classifier, offer key observations for AI in science: the optimal specialization path in our benchmark is dependent on the task modality, offering a clear framework for developing more effective science-based agentic systems.
☆ DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding NeurIPS 2025
Recent advances in multi-modal models have demonstrated strong performance in tasks such as image generation and reasoning. However, applying these models to the fire domain remains challenging due to the lack of publicly available datasets with high-quality fire domain annotations. To address this gap, we introduce DetectiumFire, a large-scale, multi-modal dataset comprising of 22.5k high-resolution fire-related images and 2.5k real-world fire-related videos covering a wide range of fire types, environments, and risk levels. The data are annotated with both traditional computer vision labels (e.g., bounding boxes) and detailed textual prompts describing the scene, enabling applications such as synthetic data generation and fire risk reasoning. DetectiumFire offers clear advantages over existing benchmarks in scale, diversity, and data quality, significantly reducing redundancy and enhancing coverage of real-world scenarios. We validate the utility of DetectiumFire across multiple tasks, including object detection, diffusion-based image generation, and vision-language reasoning. Our results highlight the potential of this dataset to advance fire-related research and support the development of intelligent safety systems. We release DetectiumFire to promote broader exploration of fire understanding in the AI community. The dataset is available at https://kaggle.com/datasets/38b79c344bdfc55d1eed3d22fbaa9c31fad45e27edbbe9e3c529d6e5c4f93890
comment: Advances in Neural Information Processing Systems 2025 (NeurIPS 2025), Poster, https://neurips.cc/virtual/2025/loc/san-diego/poster/121400
☆ Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization
With the rapid growth of the low-altitude economy, UAVs have become crucial for measurement and tracking in patrol systems. However, in GNSS-denied areas, satellite-based localization methods are prone to failure. This paper presents a cross-view UAV localization framework that performs map matching via object detection, aimed at effectively addressing cross-temporal, cross-view, heterogeneous aerial image matching. In typical pipelines, UAV visual localization is formulated as an image-retrieval problem: features are extracted to build a localization map, and the pose of a query image is estimated by matching it to a reference database with known poses. Because publicly available UAV localization datasets are limited, many approaches recast localization as a classification task and rely on scene labels in these datasets to ensure accuracy. Other methods seek to reduce cross-domain differences using polar-coordinate reprojection, perspective transformations, or generative adversarial networks; however, they can suffer from misalignment, content loss, and limited realism. In contrast, we leverage modern object detection to accurately extract salient instances from UAV and satellite images, and integrate a graph neural network to reason about inter-image and intra-image node relationships. Using a fine-grained, graph-based node-similarity metric, our method achieves strong retrieval and localization performance. Extensive experiments on public and real-world datasets show that our approach handles heterogeneous appearance differences effectively and generalizes well, making it applicable to scenarios with larger modality gaps, such as infrared-visible image matching. Our dataset will be publicly available at the following URL: https://github.com/liutao23/ODGNNLoc.git.
comment: 20 pages, Submitted to IEEE TIM
☆ OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis and relighting have shown promising results, most techniques still heavily rely on the synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. We believe that OLATverse represents a pivotal step toward integrating the next generation of inverse rendering and relighting methods with real-world data. The full dataset, along with all post-processing workflows, will be publicly released at https://vcai.mpi-inf.mpg.de/projects/OLATverse/.
☆ MVAFormer: RGB-based Multi-View Spatio-Temporal Action Recognition with Transformer ICIP2024
Multi-view action recognition aims to recognize human actions using multiple camera views and deals with occlusion caused by obstacles or crowds. In this task, cooperation among views, which generates a joint representation by combining multiple views, is vital. Previous studies have explored promising cooperation methods for improving performance. However, since their methods focus only on the task setting of recognizing a single action from an entire video, they are not applicable to the recently popular spatio-temporal action recognition~(STAR) setting, in which each person's action is recognized sequentially. To address this problem, this paper proposes a multi-view action recognition method for the STAR setting, called MVAFormer. In MVAFormer, we introduce a novel transformer-based cooperation module among views. In contrast to previous studies, which utilize embedding vectors with lost spatial information, our module utilizes the feature map for effective cooperation in the STAR setting, which preserves the spatial information. Furthermore, in our module, we divide the self-attention for the same and different views to model the relationship between multiple views effectively. The results of experiments using a newly collected dataset demonstrate that MVAFormer outperforms the comparison baselines by approximately $4.4$ points on the F-measure.
comment: Selected as Best Industry Paper Award at ICIP2024
☆ HAGI++: Head-Assisted Gaze Imputation and Generation
Mobile eye tracking plays a vital role in capturing human visual attention across both real-world and extended reality (XR) environments, making it an essential tool for applications ranging from behavioural research to human-computer interaction. However, missing values due to blinks, pupil detection errors, or illumination changes pose significant challenges for further gaze data analysis. To address this challenge, we introduce HAGI++ - a multi-modal diffusion-based approach for gaze data imputation that, for the first time, uses the integrated head orientation sensors to exploit the inherent correlation between head and eye movements. HAGI++ employs a transformer-based diffusion model to learn cross-modal dependencies between eye and head representations and can be readily extended to incorporate additional body movements. Extensive evaluations on the large-scale Nymeria, Ego-Exo4D, and HOT3D datasets demonstrate that HAGI++ consistently outperforms conventional interpolation methods and deep learning-based time-series imputation baselines in gaze imputation. Furthermore, statistical analyses confirm that HAGI++ produces gaze velocity distributions that closely match actual human gaze behaviour, ensuring more realistic gaze imputations. Moreover, by incorporating wrist motion captured from commercial wearable devices, HAGI++ surpasses prior methods that rely on full-body motion capture in the extreme case of 100% missing gaze data (pure gaze generation). Our method paves the way for more complete and accurate eye gaze recordings in real-world settings and has significant potential for enhancing gaze-based analysis and interaction across various application domains.
comment: Extended version of our UIST'25 paper "HAGI: Head-Assisted Gaze Imputation for Mobile Eye Trackers"
☆ KAO: Kernel-Adaptive Optimization in Diffusion for Satellite Image
Satellite image inpainting is a crucial task in remote sensing, where accurately restoring missing or occluded regions is essential for robust image analysis. In this paper, we propose KAO, a novel framework that utilizes Kernel-Adaptive Optimization within diffusion models for satellite image inpainting. KAO is specifically designed to address the challenges posed by very high-resolution (VHR) satellite datasets, such as DeepGlobe and the Massachusetts Roads Dataset. Unlike existing methods that rely on preconditioned models requiring extensive retraining or postconditioned models with significant computational overhead, KAO introduces a Latent Space Conditioning approach, optimizing a compact latent space to achieve efficient and accurate inpainting. Furthermore, we incorporate Explicit Propagation into the diffusion process, facilitating forward-backward fusion, which improves the stability and precision of the method. Experimental results demonstrate that KAO sets a new benchmark for VHR satellite image restoration, providing a scalable, high-performance solution that balances the efficiency of preconditioned models with the flexibility of postconditioned models.
comment: 18 pages
☆ From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics
Video Understanding, Scene Interpretation and Commonsense Reasoning are highly challenging tasks enabling the interpretation of visual information, allowing agents to perceive, interact with and make rational decisions in its environment. Large Language Models (LLMs) and Visual Language Models (VLMs) have shown remarkable advancements in these areas in recent years, enabling domain-specific applications as well as zero-shot open vocabulary tasks, combining multiple domains. However, the required computational complexity poses challenges for their application on edge devices and in the context of Mobile Robotics, especially considering the trade-off between accuracy and inference time. In this paper, we investigate the capabilities of state-of-the-art VLMs for the task of Scene Interpretation and Action Recognition, with special regard to small VLMs capable of being deployed to edge devices in the context of Mobile Robotics. The proposed pipeline is evaluated on a diverse dataset consisting of various real-world cityscape, on-campus and indoor scenarios. The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on challenges, weaknesses, inherent model biases and the application of the gained information. Supplementary material is provided via the following repository: https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
comment: 15 pages, 6 figures, 1 table; accepted for AI-2025 Forty-fifth SGAI International Conference on Artificial Intelligence CAMBRIDGE, ENGLAND 16-18 DECEMBER 2025
☆ A Kullback-Leibler divergence method for input-system-state identification
The capability of a novel Kullback-Leibler divergence method is examined herein within the Kalman filter framework to select the input-parameter-state estimation execution with the most plausible results. This identification suffers from the uncertainty related to obtaining different results from different initial parameter set guesses, and the examined approach uses the information gained from the data in going from the prior to the posterior distribution to address the issue. Firstly, the Kalman filter is performed for a number of different initial parameter sets providing the system input-parameter-state estimation. Secondly, the resulting posterior distributions are compared simultaneously to the initial prior distributions using the Kullback-Leibler divergence. Finally, the identification with the least Kullback-Leibler divergence is selected as the one with the most plausible results. Importantly, the method is shown to select the better performed identification in linear, nonlinear, and limited information applications, providing a powerful tool for system monitoring.
comment: 32 pages, 17 figures, published in Journal of Sound and Vibration
☆ Synthetic Crop-Weed Image Generation and its Impact on Model Generalization
Precise semantic segmentation of crops and weeds is necessary for agricultural weeding robots. However, training deep learning models requires large annotated datasets, which are costly to obtain in real fields. Synthetic data can reduce this burden, but the gap between simulated and real images remains a challenge. In this paper, we present a pipeline for procedural generation of synthetic crop-weed images using Blender, producing annotated datasets under diverse conditions of plant growth, weed density, lighting, and camera angle. We benchmark several state-of-the-art segmentation models on synthetic and real datasets and analyze their cross-domain generalization. Our results show that training on synthetic images leads to a sim-to-real gap of 10%, surpassing previous state-of-the-art methods. Moreover, synthetic data demonstrates good generalization properties, outperforming real datasets in cross-domain scenarios. These findings highlight the potential of synthetic agricultural datasets and support hybrid strategies for more efficient model training.
☆ ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension EMNLP25
Complex chart understanding tasks demand advanced visual recognition and reasoning capabilities from multimodal large language models (MLLMs). However, current research provides limited coverage of complex chart scenarios and computation-intensive reasoning tasks prevalent in real-world applications. This study proposes an automated multi-stage code-driven pipeline for systematically generating visual reasoning datasets to address these limitations. The pipeline integrates retrieval-augmented generation (RAG) to retrieve professional chart templates and employs chain-of-thought (CoT) strategies to generate reasoning codes that simulate real data distributions, thereby driving chart rendering and question-related statistical computations. Through model-based evaluation, the pipeline enhances chart diversity and data quality. Using this framework, we construct ChartM$^3$, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples for enabling practical performance assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL) experiments demonstrate that our dataset significantly improves reasoning capabilities and cross-domain generalization performance, enabling smaller models to achieve performance comparable to larger-scale models in complex chart comprehension.
comment: 23 pages, EMNLP25 Accepted
☆ IllumFlow: Illumination-Adaptive Low-Light Enhancement via Conditional Rectified Flow and Retinex Decomposition
We present IllumFlow, a novel framework that synergizes conditional Rectified Flow (CRF) with Retinex theory for low-light image enhancement (LLIE). Our model addresses low-light enhancement through separate optimization of illumination and reflectance components, effectively handling both lighting variations and noise. Specifically, we first decompose an input image into reflectance and illumination components following Retinex theory. To model the wide dynamic range of illumination variations in low-light images, we propose a conditional rectified flow framework that represents illumination changes as a continuous flow field. While complex noise primarily resides in the reflectance component, we introduce a denoising network, enhanced by flow-derived data augmentation, to remove reflectance noise and chromatic aberration while preserving color fidelity. IllumFlow enables precise illumination adaptation across lighting conditions while naturally supporting customizable brightness enhancement. Extensive experiments on low-light enhancement and exposure correction demonstrate superior quantitative and qualitative performance over existing methods.
☆ Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs, ViTs and Self-Supervised ViTs
Cats and humans differ in ocular anatomy. Most notably, Felis Catus (domestic cats) have vertically elongated pupils linked to ambush predation; yet, how such specializations manifest in downstream visual representations remains incompletely understood. We present a unified, frozen-encoder benchmark that quantifies feline-human cross-species representational alignment in the wild, across convolutional networks, supervised Vision Transformers, windowed transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel Alignment (linear and RBF) and Representational Similarity Analysis, with additional distributional and stability tests reported in the paper. Across models, DINO ViT-B/16 attains the most substantial alignment (mean CKA-RBF $\approx0.814$, mean CKA-linear $\approx0.745$, mean RSA $\approx0.698$), peaking at early blocks, indicating that token-level self-supervision induces early-stage features that bridge species-specific statistics. Supervised ViTs are competitive on CKA yet show weaker geometric correspondence than DINO (e.g., ViT-B/16 RSA $\approx0.53$ at block8; ViT-L/16 $\approx0.47$ at block14), revealing depth-dependent divergences between similarity and representational geometry. CNNs remain strong baselines but below plain ViTs on alignment, and windowed transformers underperform plain ViTs, implicating architectural inductive biases in cross-species alignment. Results indicate that self-supervision coupled with ViT inductive biases yields representational geometries that more closely align feline and human visual systems than widely used CNNs and windowed Transformers, providing testable neuroscientific hypotheses about where and how cross-species visual computations converge. We release our code and dataset for reference and reproducibility.
☆ MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization
The development of clinically reliable artificial intelligence (AI) systems for mammography is hindered by profound heterogeneity in data quality, metadata standards, and population distributions across public datasets. This heterogeneity introduces dataset-specific biases that severely compromise the generalizability of the model, a fundamental barrier to clinical deployment. We present MammoClean, a public framework for standardization and bias quantification in mammography datasets. MammoClean standardizes case selection, image processing (including laterality and intensity correction), and unifies metadata into a consistent multi-view structure. We provide a comprehensive review of breast anatomy, imaging characteristics, and public mammography datasets to systematically identify key sources of bias. Applying MammoClean to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify substantial distributional shifts in breast density and abnormality prevalence. Critically, we demonstrate the direct impact of data corruption: AI models trained on corrupted datasets exhibit significant performance degradation compared to their curated counterparts. By using MammoClean to identify and mitigate bias sources, researchers can construct unified multi-dataset training corpora that enable development of robust models with superior cross-domain generalization. MammoClean provides an essential, reproducible pipeline for bias-aware AI development in mammography, facilitating fairer comparisons and advancing the creation of safe, effective systems that perform equitably across diverse patient populations and clinical settings. The open-source code is publicly available from: https://github.com/Minds-R-Lab/MammoClean.
☆ A Novel Grouping-Based Hybrid Color Correction Algorithm for Color Point Clouds
Color consistency correction for color point clouds is a fundamental yet important task in 3D rendering and compression applications. In the past, most previous color correction methods aimed at correcting color for color images. The purpose of this paper is to propose a grouping-based hybrid color correction algorithm for color point clouds. Our algorithm begins by estimating the overlapping rate between the aligned source and target point clouds, and then adaptively partitions the target points into two groups, namely the close proximity group Gcl and the moderate proximity group Gmod, or three groups, namely Gcl, Gmod, and the distant proximity group Gdist, when the estimated overlapping rate is low or high, respectively. To correct color for target points in Gcl, a K-nearest neighbors based bilateral interpolation (KBI) method is proposed. To correct color for target points in Gmod, a joint KBI and the histogram equalization (JKHE) method is proposed. For target points in Gdist, a histogram equalization (HE) method is proposed for color correction. Finally, we discuss the grouping-effect free property and the ablation study in our algorithm. The desired color consistency correction benefit of our algorithm has been justified through 1086 testing color point cloud pairs against the state-of-the-art methods. The C++ source code of our algorithm can be accessed from the website: https://github.com/ivpml84079/Point-cloud-color-correction.
Self-Supervised Moving Object Segmentation of Sparse and Noisy Radar Point Clouds SC 2025
Moving object segmentation is a crucial task for safe and reliable autonomous mobile systems like self-driving cars, improving the reliability and robustness of subsequent tasks like SLAM or path planning. While the segmentation of camera or LiDAR data is widely researched and achieves great results, it often introduces an increased latency by requiring the accumulation of temporal sequences to gain the necessary temporal context. Radar sensors overcome this problem with their ability to provide a direct measurement of a point's Doppler velocity, which can be exploited for single-scan moving object segmentation. However, radar point clouds are often sparse and noisy, making data annotation for use in supervised learning very tedious, time-consuming, and cost-intensive. To overcome this problem, we address the task of self-supervised moving object segmentation of sparse and noisy radar point clouds. We follow a two-step approach of contrastive self-supervised representation learning with subsequent supervised fine-tuning using limited amounts of annotated data. We propose a novel clustering-based contrastive loss function with cluster refinement based on dynamic points removal to pretrain the network to produce motion-aware representations of the radar data. Our method improves label efficiency after fine-tuning, effectively boosting state-of-the-art performance by self-supervised pretraining.
comment: Accepted for publication at IEEE International Conference on Intelligent Transportation Systems (ITSC 2025), 8 pages, 3 figures
☆ RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.
☆ CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language model that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.
☆ M3PD Dataset: Dual-view Photoplethysmography (PPG) Using Front-and-rear Cameras of Smartphones in Lab and Clinical Settings
Portable physiological monitoring is essential for early detection and management of cardiovascular disease, but current methods often require specialized equipment that limits accessibility or impose impractical postures that patients cannot maintain. Video-based photoplethysmography on smartphones offers a convenient noninvasive alternative, yet it still faces reliability challenges caused by motion artifacts, lighting variations, and single-view constraints. Few studies have demonstrated reliable application to cardiovascular patients, and no widely used open datasets exist for cross-device accuracy. To address these limitations, we introduce the M3PD dataset, the first publicly available dual-view mobile photoplethysmography dataset, comprising synchronized facial and fingertip videos captured simultaneously via front and rear smartphone cameras from 60 participants (including 47 cardiovascular patients). Building on this dual-view setting, we further propose F3Mamba, which fuses the facial and fingertip views through Mamba-based temporal modeling. The model reduces heart-rate error by 21.9 to 30.2 percent over existing single-view baselines while improving robustness in challenging real-world scenarios. Data and code: https://github.com/Health-HCI-Group/F3Mamba.
☆ GAFD-CC: Global-Aware Feature Decoupling with Confidence Calibration for OOD Detection
Out-of-distribution (OOD) detection is paramount to ensuring the reliability and robustness of learning models in real-world applications. Existing post-hoc OOD detection methods detect OOD samples by leveraging their features and logits information without retraining. However, they often overlook the inherent correlation between features and logits, which is crucial for effective OOD detection. To address this limitation, we propose Global-Aware Feature Decoupling with Confidence Calibration (GAFD-CC). GAFD-CC aims to refine decision boundaries and increase discriminative performance. Firstly, it performs global-aware feature decoupling guided by classification weights. This involves aligning features with the direction of global classification weights to decouple them. From this, GAFD-CC extracts two types of critical information: positively correlated features that promote in-distribution (ID)/OOD boundary refinement and negatively correlated features that suppress false positives and tighten these boundaries. Secondly, it adaptively fuses these decoupled features with multi-scale logit-based confidence for comprehensive and robust OOD detection. Extensive experiments on large-scale benchmarks demonstrate GAFD-CC's competitive performance and strong generalization ability compared to those of state-of-the-art methods.
☆ Cycle-Sync: Robust Global Camera Pose Estimation through Enhanced Cycle-Consistent Synchronization NeurIPS 2025
We introduce Cycle-Sync, a robust and global framework for estimating camera poses (both rotations and locations). Our core innovation is a location solver that adapts message-passing least squares (MPLS) -- originally developed for group synchronization -- to camera location estimation. We modify MPLS to emphasize cycle-consistent information, redefine cycle consistencies using estimated distances from previous iterations, and incorporate a Welsch-type robust loss. We establish the strongest known deterministic exact-recovery guarantee for camera location estimation, showing that cycle consistency alone -- without access to inter-camera distances -- suffices to achieve the lowest sample complexity currently known. To further enhance robustness, we introduce a plug-and-play outlier rejection module inspired by robust subspace recovery, and we fully integrate cycle consistency into MPLS for rotation synchronization. Our global approach avoids the need for bundle adjustment. Experiments on synthetic and real datasets show that Cycle-Sync consistently outperforms leading pose estimators, including full structure-from-motion pipelines with bundle adjustment.
comment: NeurIPS 2025 spotlight paper
☆ 3D Point Cloud Object Detection on Edge Devices for Split Computing
The field of autonomous driving technology is rapidly advancing, with deep learning being a key component. Particularly in the field of sensing, 3D point cloud data collected by LiDAR is utilized to run deep neural network models for 3D object detection. However, these state-of-the-art models are complex, leading to longer processing times and increased power consumption on edge devices. The objective of this study is to address these issues by leveraging Split Computing, a distributed machine learning inference method. Split Computing aims to lessen the computational burden on edge devices, thereby reducing processing time and power consumption. Furthermore, it minimizes the risk of data breaches by only transmitting intermediate data from the deep neural network model. Experimental results show that splitting after voxelization reduces the inference time by 70.8% and the edge device execution time by 90.0%. When splitting within the network, the inference time is reduced by up to 57.1%, and the edge device execution time is reduced by up to 69.5%.
comment: 6 pages. This version includes minor lstlisting configuration adjustments for successful compilation. No changes to content or layout. Originally published at ACM/IEEE RAGE 2024
☆ Link prediction Graph Neural Networks for structure recognition of Handwritten Mathematical Expressions ICDAR2025
We propose a Graph Neural Network (GNN)-based approach for Handwritten Mathematical Expression (HME) recognition by modeling HMEs as graphs, where nodes represent symbols and edges capture spatial dependencies. A deep BLSTM network is used for symbol segmentation, recognition, and spatial relation classification, forming an initial primitive graph. A 2D-CFG parser then generates all possible spatial relations, while the GNN-based link prediction model refines the structure by removing unnecessary connections, ultimately forming the Symbol Label Graph. Experimental results demonstrate the effectiveness of our approach, showing promising performance in HME structure recognition.
comment: accepted for ICDAR2025-WML
☆ SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning
We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at https://github.com/BytedanceDouyinContent/SAIL-RL.
☆ Are Euler angles a useful rotation parameterisation for pose estimation with Normalizing Flows? BMVC 2025
Object pose estimation is a task that is of central importance in 3D Computer Vision. Given a target image and a canonical pose, a single point estimate may very often be sufficient; however, a probabilistic pose output is related to a number of benefits when pose is not unambiguous due to sensor and projection constraints or inherent object symmetries. With this paper, we explore the usefulness of using the well-known Euler angles parameterisation as a basis for a Normalizing Flows model for pose estimation. Isomorphic to spatial rotation, 3D pose has been parameterized in a number of ways, either in or out of the context of parameter estimation. We explore the idea that Euler angles, despite their shortcomings, may lead to useful models in a number of aspects, compared to a model built on a more complex parameterisation.
comment: BMVC 2025 workshop proceedings (Smart Cameras for Smarter Autonomous Vehicles & Robots)
☆ Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework
Medical Report Generation (MRG) is a key part of modern medical diagnostics, as it automatically generates reports from radiological images to reduce radiologists' burden. However, reliable MRG models for lesion description face three main challenges: insufficient domain knowledge understanding, poor text-visual entity embedding alignment, and spurious correlations from cross-modal biases. Previous work only addresses single challenges, while this paper tackles all three via a novel hierarchical task decomposition approach, proposing the HTSC-CIF framework. HTSC-CIF classifies the three challenges into low-, mid-, and high-level tasks: 1) Low-level: align medical entity features with spatial locations to enhance domain knowledge for visual encoders; 2) Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling (images) to boost cross-modal alignment via mutual guidance; 3) High-level: a cross-modal causal intervention module (via front-door intervention) to reduce confounders and improve interpretability. Extensive experiments confirm HTSC-CIF's effectiveness, significantly outperforming state-of-the-art (SOTA) MRG methods. Code will be made public upon paper acceptance.
☆ Monocular absolute depth estimation from endoscopy via domain-invariant feature learning and latent consistency
Monocular depth estimation (MDE) is a critical task to guide autonomous medical robots. However, obtaining absolute (metric) depth from an endoscopy camera in surgical scenes is difficult, which limits supervised learning of depth on real endoscopic images. Current image-level unsupervised domain adaptation methods translate synthetic images with known depth maps into the style of real endoscopic frames and train depth networks using these translated images with their corresponding depth maps. However a domain gap often remains between real and translated synthetic images. In this paper, we present a latent feature alignment method to improve absolute depth estimation by reducing this domain gap in the context of endoscopic videos of the central airway. Our methods are agnostic to the image translation process and focus on the depth estimation itself. Specifically, the depth network takes translated synthetic and real endoscopic frames as input and learns latent domain-invariant features via adversarial learning and directional feature consistency. The evaluation is conducted on endoscopic videos of central airway phantoms with manually aligned absolute depth maps. Compared to state-of-the-art MDE methods, our approach achieves superior performance on both absolute and relative depth metrics, and consistently improves results across various backbones and pretrained weights. Our code is available at https://github.com/MedICL-VU/MDE.
☆ Collaborative Attention and Consistent-Guided Fusion of MRI and PET for Alzheimer's Disease Diagnosis
Alzheimer's disease (AD) is the most prevalent form of dementia, and its early diagnosis is essential for slowing disease progression. Recent studies on multimodal neuroimaging fusion using MRI and PET have achieved promising results by integrating multi-scale complementary features. However, most existing approaches primarily emphasize cross-modal complementarity while overlooking the diagnostic importance of modality-specific features. In addition, the inherent distributional differences between modalities often lead to biased and noisy representations, degrading classification performance. To address these challenges, we propose a Collaborative Attention and Consistent-Guided Fusion framework for MRI and PET based AD diagnosis. The proposed model introduces a learnable parameter representation (LPR) block to compensate for missing modality information, followed by a shared encoder and modality-independent encoders to preserve both shared and specific representations. Furthermore, a consistency-guided mechanism is employed to explicitly align the latent distributions across modalities. Experimental results on the ADNI dataset demonstrate that our method achieves superior diagnostic performance compared with existing fusion strategies.
☆ Can Foundation Models Revolutionize Mobile AR Sparse Sensing?
Mobile sensing systems have long faced a fundamental trade-off between sensing quality and efficiency due to constraints in computation, power, and other limitations. Sparse sensing, which aims to acquire and process only a subset of sensor data, has been a key strategy for maintaining performance under such constraints. However, existing sparse sensing methods often suffer from reduced accuracy, as missing information across space and time introduces uncertainty into many sensing systems. In this work, we investigate whether foundation models can change the landscape of mobile sparse sensing. Using real-world mobile AR data, our evaluations demonstrate that foundation models offer significant improvements in geometry-aware image warping, a central technique for enabling accurate reuse of cross-frame information. Furthermore, our study demonstrates the scalability of foundation model-based sparse sensing and shows its leading performance in 3D scene reconstruction. Collectively, our study reveals critical aspects of the promises and the open challenges of integrating foundation models into mobile sparse sensing systems.
☆ High-Resolution Magnetic Particle Imaging System Matrix Recovery Using a Vision Transformer with Residual Feature Network
This study presents a hybrid deep learning framework, the Vision Transformer with Residual Feature Network (VRF-Net), for recovering high-resolution system matrices in Magnetic Particle Imaging (MPI). MPI resolution often suffers from downsampling and coil sensitivity variations. VRF-Net addresses these challenges by combining transformer-based global attention with residual convolutional refinement, enabling recovery of both large-scale structures and fine details. To reflect realistic MPI conditions, the system matrix is degraded using a dual-stage downsampling strategy. Training employed paired-image super-resolution on the public Open MPI dataset and a simulated dataset incorporating variable coil sensitivity profiles. For system matrix recovery on the Open MPI dataset, VRF-Net achieved nRMSE = 0.403, pSNR = 39.08 dB, and SSIM = 0.835 at 2x scaling, and maintained strong performance even at challenging scale 8x (pSNR = 31.06 dB, SSIM = 0.717). For the simulated dataset, VRF-Net achieved nRMSE = 4.44, pSNR = 28.52 dB, and SSIM = 0.771 at 2x scaling, with stable performance at higher scales. On average, it reduced nRMSE by 88.2%, increased pSNR by 44.7%, and improved SSIM by 34.3% over interpolation and CNN-based methods. In image reconstruction of Open MPI phantoms, VRF-Net further reduced reconstruction error to nRMSE = 1.79 at 2x scaling, while preserving structural fidelity (pSNR = 41.58 dB, SSIM = 0.960), outperforming existing methods. These findings demonstrate that VRF-Net enables sharper, artifact-free system matrix recovery and robust image reconstruction across multiple scales, offering a promising direction for future in vivo applications.
☆ Estimation of Segmental Longitudinal Strain in Transesophageal Echocardiography by Deep Learning
Segmental longitudinal strain (SLS) of the left ventricle (LV) is an important prognostic indicator for evaluating regional LV dysfunction, in particular for diagnosing and managing myocardial ischemia. Current techniques for strain estimation require significant manual intervention and expertise, limiting their efficiency and making them too resource-intensive for monitoring purposes. This study introduces the first automated pipeline, autoStrain, for SLS estimation in transesophageal echocardiography (TEE) using deep learning (DL) methods for motion estimation. We present a comparative analysis of two DL approaches: TeeFlow, based on the RAFT optical flow model for dense frame-to-frame predictions, and TeeTracker, based on the CoTracker point trajectory model for sparse long-sequence predictions. As ground truth motion data from real echocardiographic sequences are hardly accessible, we took advantage of a unique simulation pipeline (SIMUS) to generate a highly realistic synthetic TEE (synTEE) dataset of 80 patients with ground truth myocardial motion to train and evaluate both models. Our evaluation shows that TeeTracker outperforms TeeFlow in accuracy, achieving a mean distance error in motion estimation of 0.65 mm on a synTEE test dataset. Clinical validation on 16 patients further demonstrated that SLS estimation with our autoStrain pipeline aligned with clinical references, achieving a mean difference (95\% limits of agreement) of 1.09% (-8.90% to 11.09%). Incorporation of simulated ischemia in the synTEE data improved the accuracy of the models in quantifying abnormal deformation. Our findings indicate that integrating AI-driven motion estimation with TEE can significantly enhance the precision and efficiency of cardiac function assessment in clinical settings.
comment: 13 pages, IEEE Journal of Biomedical and Health Informatics
☆ Object-Centric 3D Gaussian Splatting for Strawberry Plant Reconstruction and Phenotyping
Strawberries are among the most economically significant fruits in the United States, generating over $2 billion in annual farm-gate sales and accounting for approximately 13% of the total fruit production value. Plant phenotyping plays a vital role in selecting superior cultivars by characterizing plant traits such as morphology, canopy structure, and growth dynamics. However, traditional plant phenotyping methods are time-consuming, labor-intensive, and often destructive. Recently, neural rendering techniques, notably Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have emerged as powerful frameworks for high-fidelity 3D reconstruction. By capturing a sequence of multi-view images or videos around a target plant, these methods enable non-destructive reconstruction of complex plant architectures. Despite their promise, most current applications of 3DGS in agricultural domains reconstruct the entire scene, including background elements, which introduces noise, increases computational costs, and complicates downstream trait analysis. To address this limitation, we propose a novel object-centric 3D reconstruction framework incorporating a preprocessing pipeline that leverages the Segment Anything Model v2 (SAM-2) and alpha channel background masking to achieve clean strawberry plant reconstructions. This approach produces more accurate geometric representations while substantially reducing computational time. With a background-free reconstruction, our algorithm can automatically estimate important plant traits, such as plant height and canopy width, using DBSCAN clustering and Principal Component Analysis (PCA). Experimental results show that our method outperforms conventional pipelines in both accuracy and efficiency, offering a scalable and non-destructive solution for strawberry plant phenotyping.
comment: 11 pages, 4 figures, 3 tables
☆ Language-Enhanced Generative Modeling for PET Synthesis from MRI and Blood Biomarkers
Background: Alzheimer's disease (AD) diagnosis heavily relies on amyloid-beta positron emission tomography (Abeta-PET), which is limited by high cost and limited accessibility. This study explores whether Abeta-PET spatial patterns can be predicted from blood-based biomarkers (BBMs) and MRI scans. Methods: We collected Abeta-PET images, T1-weighted MRI scans, and BBMs from 566 participants. A language-enhanced generative model, driven by a large language model (LLM) and multimodal information fusion, was developed to synthesize PET images. Synthesized images were evaluated for image quality, diagnostic consistency, and clinical applicability within a fully automated diagnostic pipeline. Findings: The synthetic PET images closely resemble real PET scans in both structural details (SSIM = 0.920 +/- 0.003) and regional patterns (Pearson's r = 0.955 +/- 0.007). Diagnostic outcomes using synthetic PET show high agreement with real PET-based diagnoses (accuracy = 0.80). Using synthetic PET, we developed a fully automatic AD diagnostic pipeline integrating PET synthesis and classification. The synthetic PET-based model (AUC = 0.78) outperforms T1-based (AUC = 0.68) and BBM-based (AUC = 0.73) models, while combining synthetic PET and BBMs further improved performance (AUC = 0.79). Ablation analysis supports the advantages of LLM integration and prompt engineering. Interpretation: Our language-enhanced generative model synthesizes realistic PET images, enhancing the utility of MRI and BBMs for Abeta spatial pattern assessment and improving the diagnostic workflow for Alzheimer's disease.
comment: 31 pages, 8 figures
☆ OmniField: Conditioned Neural Fields for Robust Multimodal Spatiotemporal Learning
Multimodal spatiotemporal learning on real-world experimental data is constrained by two challenges: within-modality measurements are sparse, irregular, and noisy (QA/QC artifacts) but cross-modally correlated; the set of available modalities varies across space and time, shrinking the usable record unless models can adapt to arbitrary subsets at train and test time. We propose OmniField, a continuity-aware framework that learns a continuous neural field conditioned on available modalities and iteratively fuses cross-modal context. A multimodal crosstalk block architecture paired with iterative cross-modal refinement aligns signals prior to the decoder, enabling unified reconstruction, interpolation, forecasting, and cross-modal prediction without gridding or surrogate preprocessing. Extensive evaluations show that OmniField consistently outperforms eight strong multimodal spatiotemporal baselines. Under heavy simulated sensor noise, performance remains close to clean-input levels, highlighting robustness to corrupted measurements.
comment: 25 pages, 12 figures, 8 tables
☆ MM-UNet: Morph Mamba U-shaped Convolutional Networks for Retinal Vessel Segmentation
Accurate detection of retinal vessels plays a critical role in reflecting a wide range of health status indicators in the clinical diagnosis of ocular diseases. Recently, advances in deep learning have led to a surge in retinal vessel segmentation methods, which have significantly contributed to the quantitative analysis of vascular morphology. However, retinal vasculature differs significantly from conventional segmentation targets in that it consists of extremely thin and branching structures, whose global morphology varies greatly across images. These characteristics continue to pose challenges to segmentation precision and robustness. To address these issues, we propose MM-UNet, a novel architecture tailored for efficient retinal vessel segmentation. The model incorporates Morph Mamba Convolution layers, which replace pointwise convolutions to enhance branching topological perception through morph, state-aware feature sampling. Additionally, Reverse Selective State Guidance modules integrate reverse guidance theory with state-space modeling to improve geometric boundary awareness and decoding efficiency. Extensive experiments conducted on two public retinal vessel segmentation datasets demonstrate the superior performance of the proposed method in segmentation accuracy. Compared to the existing approaches, MM-UNet achieves F1-score gains of 1.64 $\%$ on DRIVE and 1.25 $\%$ on STARE, demonstrating its effectiveness and advancement. The project code is public via https://github.com/liujiawen-jpg/MM-UNet.
comment: This paper was accepted by IEEE BIBM 2025 conference
☆ Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models ICCV2025
In this technical report, we introduce a framework to address Grounded Video Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our proposed approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning \& QA, (2) Spatio-temporal Grounding and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. To this end, we achieve the HOTA score of 0.4968, which marks a significant improvement over the previous year's winning score of 0.2704 on GVQA task.
comment: 1st place winner of Grounded Videoqa track at the ICCV2025 Perception Test
☆ Autobiasing Event Cameras for Flickering Mitigation
Understanding and mitigating flicker effects caused by rapid variations in light intensity is critical for enhancing the performance of event cameras in diverse environments. This paper introduces an innovative autonomous mechanism for tuning the biases of event cameras, effectively addressing flicker across a wide frequency range -25 Hz to 500 Hz. Unlike traditional methods that rely on additional hardware or software for flicker filtering, our approach leverages the event cameras inherent bias settings. Utilizing a simple Convolutional Neural Networks -CNNs, the system identifies instances of flicker in a spatial space and dynamically adjusts specific biases to minimize its impact. The efficacy of this autobiasing system was robustly tested using a face detector framework under both well-lit and low-light conditions, as well as across various frequencies. The results demonstrated significant improvements: enhanced YOLO confidence metrics for face detection, and an increased percentage of frames capturing detected faces. Moreover, the average gradient, which serves as an indicator of flicker presence through edge detection, decreased by 38.2 percent in well-lit conditions and by 53.6 percent in low-light conditions. These findings underscore the potential of our approach to significantly improve the functionality of event cameras in a range of adverse lighting scenarios.
☆ Fast Measuring Pavement Crack Width by Cascading Principal Component Analysis
Accurate quantification of pavement crack width plays a pivotal role in assessing structural integrity and guiding maintenance interventions. However, achieving precise crack width measurements presents significant challenges due to: (1) the complex, non-uniform morphology of crack boundaries, which limits the efficacy of conventional approaches, and (2) the demand for rapid measurement capabilities from arbitrary pixel locations to facilitate comprehensive pavement condition evaluation. To overcome these limitations, this study introduces a cascaded framework integrating Principal Component Analysis (PCA) and Robust PCA (RPCA) for efficient crack width extraction from digital images. The proposed methodology comprises three sequential stages: (1) initial crack segmentation using established detection algorithms to generate a binary representation, (2) determination of the primary orientation axis for quasi-parallel cracks through PCA, and (3) extraction of the Main Propagation Axis (MPA) for irregular crack geometries using RPCA. Comprehensive evaluations were conducted across three publicly available datasets, demonstrating that the proposed approach achieves superior performance in both computational efficiency and measurement accuracy compared to existing state-of-the-art techniques.
☆ From Instance Segmentation to 3D Growth Trajectory Reconstruction in Planktonic Foraminifera
Planktonic foraminifera, marine protists characterized by their intricate chambered shells, serve as valuable indicators of past and present environmental conditions. Understanding their chamber growth trajectory provides crucial insights into organismal development and ecological adaptation under changing environments. However, automated tracing of chamber growth from imaging data remains largely unexplored, with existing approaches relying heavily on manual segmentation of each chamber, which is time-consuming and subjective. In this study, we propose an end-to-end pipeline that integrates instance segmentation, a computer vision technique not extensively explored in foraminifera, with a dedicated chamber ordering algorithm to automatically reconstruct three-dimensional growth trajectories from high-resolution computed tomography scans. We quantitatively and qualitatively evaluate multiple instance segmentation methods, each optimized for distinct spatial features of the chambers, and examine their downstream influence on growth-order reconstruction accuracy. Experimental results on expert-annotated datasets demonstrate that the proposed pipeline substantially reduces manual effort while maintaining biologically meaningful accuracy. Although segmentation models exhibit under-segmentation in smaller chambers due to reduced voxel fidelity and subtle inter-chamber connectivity, the chamber-ordering algorithm remains robust, achieving consistent reconstruction of developmental trajectories even under partial segmentation. This work provides the first fully automated and reproducible pipeline for digital foraminiferal growth analysis, establishing a foundation for large-scale, data-driven ecological studies.
♻ ☆ GS-Verse: Mesh-based Gaussian Splatting for Physics-aware Interaction in Virtual Reality
As the demand for immersive 3D content grows, the need for intuitive and efficient interaction methods becomes paramount. Current techniques for physically manipulating 3D content within Virtual Reality (VR) often face significant limitations, including reliance on engineering-intensive processes and simplified geometric representations, such as tetrahedral cages, which can compromise visual fidelity and physical accuracy. In this paper, we introduce GS-Verse (Gaussian Splatting for Virtual Environment Rendering and Scene Editing), a novel method designed to overcome these challenges by directly integrating an object's mesh with a Gaussian Splatting (GS) representation. Our approach enables more precise surface approximation, leading to highly realistic deformations and interactions. By leveraging existing 3D mesh assets, GS-Verse facilitates seamless content reuse and simplifies the development workflow. Moreover, our system is designed to be physics-engine-agnostic, granting developers robust deployment flexibility. This versatile architecture delivers a highly realistic, adaptable, and intuitive approach to interactive 3D manipulation. We rigorously validate our method against the current state-of-the-art technique that couples VR with GS in a comparative user study involving 18 participants. Specifically, we demonstrate that our approach is statistically significantly better for physics-aware stretching manipulation and is also more consistent in other physics-based manipulations like twisting and shaking. Further evaluation across various interactions and scenes confirms that our method consistently delivers high and reliable performance, showing its potential as a plausible alternative to existing methods.
♻ ☆ DIsoN: Decentralized Isolation Networks for Out-of-Distribution Detection in Medical Imaging NeurIPS 2025
Safe deployment of machine learning (ML) models in safety-critical domains such as medical imaging requires detecting inputs with characteristics not seen during training, known as out-of-distribution (OOD) detection, to prevent unreliable predictions. Effective OOD detection after deployment could benefit from access to the training data, enabling direct comparison between test samples and the training data distribution to identify differences. State-of-the-art OOD detection methods, however, either discard the training data after deployment or assume that test samples and training data are centrally stored together, an assumption that rarely holds in real-world settings. This is because shipping the training data with the deployed model is usually impossible due to the size of training databases, as well as proprietary or privacy constraints. We introduce the Isolation Network, an OOD detection framework that quantifies the difficulty of separating a target test sample from the training data by solving a binary classification task. We then propose Decentralized Isolation Networks (DIsoN), which enables the comparison of training and test data when data-sharing is impossible, by exchanging only model parameters between the remote computational nodes of training and deployment. We further extend DIsoN with class-conditioning, comparing a target sample solely with training data of its predicted class. We evaluate DIsoN on four medical imaging datasets (dermatology, chest X-ray, breast ultrasound, histopathology) across 12 OOD detection tasks. DIsoN performs favorably against existing methods while respecting data-privacy. This decentralized OOD detection framework opens the way for a new type of service that ML developers could provide along with their models: providing remote, secure utilization of their training data for OOD detection services. Code: https://github.com/FelixWag/DIsoN
comment: Accepted at NeurIPS 2025
♻ ☆ A Practical Investigation of Spatially-Controlled Image Generation with Transformers
Enabling image generation models to be spatially controlled is an important area of research, empowering users to better generate images according to their own fine-grained specifications via e.g. edge maps, poses. Although this task has seen impressive improvements in recent times, a focus on rapidly producing stronger models has come at the cost of detailed and fair scientific comparison. Differing training data, model architectures and generation paradigms make it difficult to disentangle the factors contributing to performance. Meanwhile, the motivations and nuances of certain approaches become lost in the literature. In this work, we aim to provide clear takeaways across generation paradigms for practitioners wishing to develop transformer-based systems for spatially-controlled generation, clarifying the literature and addressing knowledge gaps. We perform controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive (AR) models. First, we establish control token prefilling as a simple, general and performant baseline approach for transformers. We then investigate previously underexplored sampling time enhancements, showing that extending classifier-free guidance to control, as well as softmax truncation, have a strong impact on control-generation consistency. Finally, we re-clarify the motivation of adapter-based approaches, demonstrating that they mitigate "forgetting" and maintain generation quality when trained on limited downstream data, but underperform full training in terms of generation-control consistency.
comment: TMLR https://openreview.net/forum?id=loT6xhgLYK
♻ ☆ The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs
Coral reefs are declining worldwide due to climate change and local stressors. To inform effective conservation or restoration, monitoring at the highest possible spatial and temporal resolution is necessary. Conventional coral reef surveying methods are limited in scalability due to their reliance on expert labor time, motivating the use of computer vision tools to automate the identification and abundance estimation of live corals from images. However, the design and evaluation of such tools has been impeded by the lack of large high quality datasets. We release the Coralscapes dataset, the first general-purpose dense semantic segmentation dataset for coral reefs, covering 2075 images, 39 benthic classes, and 174k segmentation masks annotated by experts. Coralscapes has a similar scope and the same structure as the widely used Cityscapes dataset for urban scene segmentation, allowing benchmarking of semantic segmentation models in a new challenging domain which requires expert knowledge to annotate. We benchmark a wide range of semantic segmentation models, and find that transfer learning from Coralscapes to existing smaller datasets consistently leads to state-of-the-art performance. Coralscapes will catalyze research on efficient, scalable, and standardized coral reef surveying methods based on computer vision, and holds the potential to streamline the development of underwater ecological robotics.
♻ ☆ Image Super-Resolution with Guarantees via Conformalized Generative Models NeurIPS 2025
The increasing use of generative ML foundation models for image restoration tasks such as super-resolution calls for robust and interpretable uncertainty quantification methods. We address this need by presenting a novel approach based on conformal prediction techniques to create a 'confidence mask' capable of reliably and intuitively communicating where the generated image can be trusted. Our method is adaptable to any black-box generative model, including those locked behind an opaque API, requires only easily attainable data for calibration, and is highly customizable via the choice of a local image similarity metric. We prove strong theoretical guarantees for our method that span fidelity error control (according to our local image similarity metric), reconstruction quality, and robustness in the face of data leakage. Finally, we empirically evaluate these results and establish our method's solid performance.
comment: To appear at NeurIPS 2025. 17 pages, 7 figures
♻ ☆ Positive Semi-definite Latent Factor Grouping-Boosted Cluster-reasoning Instance Disentangled Learning for WSI Representation
Multiple instance learning (MIL) has been widely used for representing whole-slide pathology images. However, spatial, semantic, and decision entanglements among instances limit its representation and interpretability. To address these challenges, we propose a latent factor grouping-boosted cluster-reasoning instance disentangled learning framework for whole-slide image (WSI) interpretable representation in three phases. First, we introduce a novel positive semi-definite latent factor grouping that maps instances into a latent subspace, effectively mitigating spatial entanglement in MIL. To alleviate semantic entanglement, we employs instance probability counterfactual inference and optimization via cluster-reasoning instance disentangling. Finally, we employ a generalized linear weighted decision via instance effect re-weighting to address decision entanglement. Extensive experiments on multicentre datasets demonstrate that our model outperforms all state-of-the-art models. Moreover, it attains pathologist-aligned interpretability through disentangled representations and a transparent decision-making process.
comment: Our code is available at https://github.com/Prince-Lee-PathAI/PG-CIDL
♻ ☆ Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey
3D reconstruction and view synthesis are foundational problems in computer vision, graphics, and immersive technologies such as augmented reality (AR), virtual reality (VR), and digital twins. Traditional methods rely on computationally intensive iterative optimization in a complex chain, limiting their applicability in real-world scenarios. Recent advances in feed-forward approaches, driven by deep learning, have revolutionized this field by enabling fast and generalizable 3D reconstruction and view synthesis. This survey offers a comprehensive review of feed-forward techniques for 3D reconstruction and view synthesis, with a taxonomy according to the underlying representation architectures including point cloud, 3D Gaussian Splatting (3DGS), Neural Radiance Fields (NeRF), etc. We examine key tasks such as pose-free reconstruction, dynamic 3D reconstruction, and 3D-aware image and video synthesis, highlighting their applications in digital humans, SLAM, robotics, and beyond. In addition, we review commonly used datasets with detailed statistics, along with evaluation protocols for various downstream tasks. We conclude by discussing open research challenges and promising directions for future work, emphasizing the potential of feed-forward approaches to advance the state of the art in 3D vision.
comment: A project page associated with this survey is available at https://fnzhan.com/projects/Feed-Forward-3D
♻ ☆ Prompt to Restore, Restore to Prompt: Cyclic Prompting for Universal Adverse Weather Removal
Universal adverse weather removal (UAWR) seeks to address various weather degradations within a unified framework. Recent methods are inspired by prompt learning using pre-trained vision-language models (e.g., CLIP), leveraging degradation-aware prompts to facilitate weather-free image restoration, yielding significant improvements. In this work, we propose CyclicPrompt, an innovative cyclic prompt approach designed to enhance the effectiveness, adaptability, and generalizability of UAWR. CyclicPrompt Comprises two key components: 1) a composite context prompt that integrates weather-related information and context-aware representations into the network to guide restoration. This prompt differs from previous methods by marrying learnable input-conditional vectors with weather-specific knowledge, thereby improving adaptability across various degradations. 2) The erase-and-paste mechanism, after the initial guided restoration, substitutes weather-specific knowledge with constrained restoration priors, inducing high-quality weather-free concepts into the composite prompt to further fine-tune the restoration process. Therefore, we can form a cyclic "Prompt-Restore-Prompt" pipeline that adeptly harnesses weather-specific knowledge, textual contexts, and reliable textures. Extensive experiments on synthetic and real-world datasets validate the superior performance of CyclicPrompt. The code is available at: https://github.com/RongxinL/CyclicPrompt.
♻ ☆ Mobile Robotic Multi-View Photometric Stereo SP
Multi-View Photometric Stereo (MVPS) is a popular method for fine-detailed 3D acquisition of an object from images. Despite its outstanding results on diverse material objects, a typical MVPS experimental setup requires a well-calibrated light source and a monocular camera installed on an immovable base. This restricts the use of MVPS on a movable platform, limiting us from taking MVPS benefits in 3D acquisition for mobile robotics applications. To this end, we introduce a new mobile robotic system for MVPS. While the proposed system brings advantages, it introduces additional algorithmic challenges. Addressing them, in this paper, we further propose an incremental approach for mobile robotic MVPS. Our approach leverages a supervised learning setup to predict per-view surface normal, object depth, and per-pixel uncertainty in model-predicted results. A refined depth map per view is obtained by solving an MVPS-driven optimization problem proposed in this paper. Later, we fuse the refined depth map while tracking the camera pose w.r.t the reference frame to recover globally consistent object 3D geometry. Experimental results show the advantages of our robotic system and algorithm, featuring the local high-frequency surface detail recovery with globally consistent object shape. Our work is beyond any MVPS system yet presented, providing encouraging results on objects with unknown reflectance properties using fewer frames without a tiring calibration and installation process, enabling computationally efficient robotic automation approach to photogrammetry. The proposed approach is nearly 100 times computationally faster than the state-of-the-art MVPS methods such as [1, 2] while maintaining the similar results when tested on subjects taken from the benchmark DiLiGenT MV dataset [3].
comment: Acknowledgment Added. Published at International Society Journal of Photogrammetry and Remote Sensing (ISPRS). 32 pages, 14 Figures, 5 Tables
♻ ☆ GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution
Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but pose challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA (avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies: Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key semantics.Integrating these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench.
comment: NeurlPS 2025 Spotlight
♻ ☆ Label tree semantic losses for rich multi-class medical image segmentation
Rich and accurate medical image segmentation is poised to underpin the next generation of AI-defined clinical practice by delineating critical anatomy for pre-operative planning, guiding real-time intra-operative navigation, and supporting precise post-operative assessment. However, commonly used learning methods for medical and surgical imaging segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the labels space. This becomes particularly problematic as the cardinality and richness of labels increases to include subtly different classes. In this work, we propose two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations to extend the applicability of our proposed losses. Extensive experiments are reported on two medical and surgical image segmentation tasks, namely head MRI for whole brain parcellation (WBP) with full supervision and neurosurgical hyperspectral imaging (HSI) for scene understanding with sparse annotations. Results demonstrate that our proposed method reaches state-of-the-art performance in both cases.
♻ ☆ Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Recently, augmenting vision-language-action models (VLAs) with world-models has shown promise in robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, we propose training techniques such as independent noise perturbations for each modality and a decoupled flow matching loss, which enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Furthermore, based on the decoupled training framework, we introduce a sampling method where we sample action and vision tokens asynchronously at different rates, which shows improvement through inference-time scaling. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods, with our inference-time scaling approach providing an additional 2-5% gain on success rate. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%, confirming its effectiveness beyond simulation. Lastly, we demonstrate the effectiveness of DUST in large-scale pretraining with action-free videos from BridgeV2, where DUST leads to significant gain when transferred to the RoboCasa benchmark.
comment: 20 pages, 10 figures
♻ ☆ Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions ICCV2025
As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Consequently, their capabilities in this crucial domain are mostly unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video, text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.
comment: ICCV2025 Workshop
♻ ☆ Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment SP
In this work, we rethink the approach to video super-resolution by introducing a method based on the Diffusion Posterior Sampling framework, combined with an unconditional video diffusion transformer operating in latent space. The video generation model, a diffusion transformer, functions as a space-time model. We argue that a powerful model, which learns the physics of the real world, can easily handle various kinds of motion patterns as prior knowledge, thus eliminating the need for explicit estimation of optical flows or motion parameters for pixel alignment. Furthermore, a single instance of the proposed video diffusion transformer model can adapt to different sampling conditions without re-training. Empirical results on synthetic and real-world datasets illustrate the feasibility of diffusion-based, alignment-free video super-resolution.
comment: ICSPS 2025
♻ ☆ Robust Identity Perceptual Watermark Against Deepfake Face Swapping
Notwithstanding offering convenience and entertainment to society, Deepfake face swapping has caused critical privacy issues with the rapid development of deep generative models. Due to imperceptible artifacts in high-quality synthetic images, passive detection models against face swapping in recent years usually suffer performance damping regarding the generalizability issue in cross-domain scenarios. Therefore, several studies have been attempted to proactively protect the original images against malicious manipulations by inserting invisible signals in advance. However, existing proactive defense approaches demonstrate unsatisfactory results with respect to visual quality, detection accuracy, and source tracing ability. In this study, to fulfill the research gap, we propose a robust identity perceptual watermarking framework that concurrently performs detection and source tracing against Deepfake face swapping proactively. We innovatively assign identity semantics regarding the image contents to the watermarks and devise an unpredictable and nonreversible chaotic encryption system to ensure watermark confidentiality. The watermarks are robustly encoded and recovered by jointly training an encoder-decoder framework along with adversarial image manipulations. For a suspect image, falsification is accomplished by justifying the consistency between the content-matched identity perceptual watermark and the recovered robust watermark, without requiring the ground-truth. Moreover, source tracing can be accomplished based on the identity semantics that the recovered watermark carries. Extensive experiments demonstrate state-of-the-art detection and source tracing performance against Deepfake face swapping with promising watermark robustness for both cross-dataset and cross-manipulation settings.
comment: In peer review
♻ ☆ RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing NeurIPS 2025
Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy, incorporating two key innovations: 1) a rotation-aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi-scale token prediction objectives that address the extreme variations in object scales inherent to RS imagery. Systematic empirical studies validate that Mamba adheres to RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency. The source code and pretrained models will be released at https://github.com/MiliLab/RoMA.
comment: NeurIPS 2025
♻ ☆ ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary rather than isomorphic modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on approximately 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7 percent over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
comment: project page: https://thinkmorph.github.io/
♻ ☆ Training Convolutional Neural Networks with the Forward-Forward algorithm SC
Recent successes in image analysis with deep neural networks are achieved almost exclusively with Convolutional Neural Networks (CNNs), typically trained using the backpropagation (BP) algorithm. In a 2022 preprint, Geoffrey Hinton proposed the Forward-Forward (FF) algorithm as a biologically inspired alternative, where positive and negative examples are jointly presented to the network and training is guided by a locally defined goodness function. Here, we extend the FF paradigm to CNNs. We introduce two spatially extended labeling strategies, based on Fourier patterns and morphological transformations, that enable convolutional layers to access label information across all spatial positions. On CIFAR10, we show that deeper FF-trained CNNs can be optimized successfully and that morphology-based labels prevent shortcut solutions on dataset with more complex and fine features. On CIFAR100, carefully designed label sets scale effectively to 100 classes. Class Activation Maps reveal that FF-trained CNNs learn meaningful and complementary features across layers. Together, these results demonstrate that FF training is feasible beyond fully connected networks, provide new insights into its learning dynamics and stability, and highlight its potential for neuromorphic computing and biologically inspired learning.
comment: PEER-REVIEWED VERSION PUBLISHED ON "SCIENTIFIC REPORTS" (2025) DOI: 10.1038/s41598-025-26235-2
♻ ☆ Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. \texttt{UniWorld-V2}, trained with this framework, achieves \textbf{state-of-the-art} results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available to support further research.
♻ ☆ OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data
Existing benchmarks for multimodal learning in Earth science offer limited, siloed coverage of Earth's spheres and their cross-sphere interactions, typically restricting evaluation to the human-activity sphere of atmosphere and to at most 16 tasks. These limitations: \textit{narrow-source heterogeneity (single/few data sources), constrained scientific granularity, and limited-sphere extensibility}. Therefore, we introduce \textbf{OmniEarth-Bench}, the first multimodal benchmark that systematically spans all six spheres: atmosphere, lithosphere, oceanosphere, cryosphere, biosphere, and human-activity sphere, and cross-spheres. Built with a scalable, modular-topology data inference framework and native multi-observation sources and expert-in-the-loop curation, OmniEarth-Bench produces 29,855 standardized, expert-curated annotations. All annotations are organized into a four-level hierarchy (Sphere, Scenario, Ability, Task), encompassing 109 expert-curated evaluation tasks. Experiments on 9 state-of-the-art MLLMs reveal that even the most advanced models struggle with our benchmarks, where none of them reach 35\% accuracy, revealing systematic gaps in Earth-system cognitive ability. The dataset and evaluation code were released at OmniEarth-Bench (https://anonymous.4open.science/r/OmniEarth-Bench-B1BD).
♻ ☆ Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.
comment: https://genie-envisioner.github.io/
♻ ☆ A Quantitative Evaluation Framework for Explainable AI in Semantic Segmentation
Ensuring transparency and trust in artificial intelligence (AI) models is essential as they are increasingly deployed in safety-critical and high-stakes domains. Explainable AI (XAI) has emerged as a promising approach to address this challenge; however, the rigorous evaluation of XAI methods remains vital for balancing the trade-offs between model complexity, predictive performance, and interpretability. While substantial progress has been made in evaluating XAI for classification tasks, strategies tailored to semantic segmentation remain limited. Moreover, objectively assessing XAI approaches is difficult, since qualitative visual explanations provide only preliminary insights. Such qualitative methods are inherently subjective and cannot ensure the accuracy or stability of explanations. To address these limitations, this work introduces a comprehensive quantitative evaluation framework for assessing XAI in semantic segmentation, accounting for both spatial and contextual task complexities. The framework systematically integrates pixel-level evaluation strategies with carefully designed metrics to yield fine-grained interpretability insights. Simulation results using recently adapted class activation mapping (CAM)-based XAI schemes demonstrate the efficiency, robustness, and reliability of the proposed methodology. These findings advance the development of transparent, trustworthy, and accountable semantic segmentation models.
♻ ☆ Breaking Down Monocular Ambiguity: Exploiting Temporal Evolution for 3D Lane Detection
Monocular 3D lane detection aims to estimate the 3D position of lanes from frontal-view (FV) images. However, existing methods are fundamentally constrained by the inherent ambiguity of single-frame input, which leads to inaccurate geometric predictions and poor lane integrity, especially for distant lanes.To overcome this, we propose to unlock the rich information embedded in the temporal evolution of the scene as the vehicle moves. Our proposed Geometry-aware Temporal Aggregation Network (GTA-Net) systematically leverages the temporal information from complementary perspectives.First, Temporal Geometry Enhancement Module (TGEM) learns geometric consistency across consecutive frames, effectively recovering depth information from motion to build a reliable 3D scene representation.Second, to enhance lane integrity, Temporal Instance-aware Query Generation (TIQG) module aggregates instance cues from past and present frames. Crucially, for lanes that are ambiguous in the current view, TIQG innovatively synthesizes a pseudo future perspective to generate queries that reveal lanes which would otherwise be missed.The experiments demonstrate that GTA-Net achieves new SoTA results, significantly outperforming existing monocular 3D lane detection solutions.
♻ ☆ Cross-modal Diffusion Modelling for Super-resolved Spatial Transcriptomics
The recent advancement of spatial transcriptomics (ST) allows to characterize spatial gene expression within tissue for discovery research. However, current ST platforms suffer from low resolution, hindering in-depth understanding of spatial gene expression. Super-resolution approaches promise to enhance ST maps by integrating histology images with gene expressions of profiled tissue spots. However, current super-resolution methods are limited by restoration uncertainty and mode collapse. Although diffusion models have shown promise in capturing complex interactions between multi-modal conditions, it remains a challenge to integrate histology images and gene expression for super-resolved ST maps. This paper proposes a cross-modal conditional diffusion model for super-resolving ST maps with the guidance of histology images. Specifically, we design a multi-modal disentangling network with cross-modal adaptive modulation to utilize complementary information from histology images and spatial gene expression. Moreover, we propose a dynamic cross-attention modelling strategy to extract hierarchical cell-to-tissue information from histology images. Lastly, we propose a co-expression-based gene-correlation graph network to model the co-expression relationship of multiple genes. Experiments show that our method outperforms other state-of-the-art methods in ST super-resolution on three public datasets.
♻ ☆ Progressive Growing of Patch Size: Curriculum Learning for Accelerated and Improved Medical Image Segmentation MICCAI2024
In this work, we introduce Progressive Growing of Patch Size, an automatic curriculum learning approach for 3D medical image segmentation. Our approach progressively increases the patch size during model training, resulting in an improved class balance for smaller patch sizes and accelerated convergence of the training process. We evaluate our curriculum approach in two settings: a resource-efficient mode and a performance mode, both regarding Dice score performance and computational costs across 15 diverse and popular 3D medical image segmentation tasks. The resource-efficient mode matches the Dice score performance of the conventional constant patch size sampling baseline with a notable reduction in training time to only 44%. The performance mode improves upon constant patch size segmentation results, achieving a statistically significant relative mean performance gain of 1.28% in Dice Score. Remarkably, across all 15 tasks, our proposed performance mode manages to surpass the constant patch size baseline in Dice Score performance, while simultaneously reducing training time to only 89%. The benefits are particularly pronounced for highly imbalanced tasks such as lesion segmentation tasks. Rigorous experiments demonstrate that our performance mode not only improves mean segmentation performance but also reduces performance variance, yielding more trustworthy model comparison. Furthermore, our findings reveal that the proposed curriculum sampling is not tied to a specific architecture but represents a broadly applicable strategy that consistently boosts performance across diverse segmentation models, including UNet, UNETR, and SwinUNETR. In summary, we show that this simple yet elegant transformation on input data substantially improves both Dice Score performance and training runtime, while being compatible across diverse segmentation backbones.
comment: Journal Extension of "Progressive Growing of Patch Size: Resource-Efficient Curriculum Learning for Dense Prediction Tasks" (MICCAI2024) submitted to MedIA
♻ ☆ CGF-DETR: Cross-Gated Fusion DETR for Enhanced Pneumonia Detection in Chest X-rays
Pneumonia remains a leading cause of morbidity and mortality worldwide, necessitating accurate and efficient automated detection systems. While recent transformer-based detectors like RT-DETR have shown promise in object detection tasks, their application to medical imaging, particularly pneumonia detection in chest X-rays, remains underexplored. This paper presents CGF-DETR, an enhanced real-time detection transformer specifically designed for pneumonia detection. We introduce XFABlock in the backbone to improve multi-scale feature extraction through convolutional attention mechanisms integrated with CSP architecture. To achieve efficient feature aggregation, we propose SPGA module that replaces standard multi-head attention with dynamic gating mechanisms and single-head self-attention. Additionally, GCFC3 is designed for the neck to enhance feature representation through multi-path convolution fusion while maintaining real-time performance via structural re-parameterization. Extensive experiments on the RSNA Pneumonia Detection dataset demonstrate that CGF-DETR achieves 82.2% mAP@0.5, outperforming the baseline RT-DETR-l by 3.7% while maintaining comparable inference speed at 48.1 FPS. Our ablation studies confirm that each proposed module contributes meaningfully to the overall performance improvement, with the complete model achieving 50.4% mAP@[0.5:0.95]
♻ ☆ Improving Generalization in MRI-Based Deep Learning Models for Total Knee Replacement Prediction
Knee osteoarthritis (KOA) is a common joint disease that causes pain and mobility issues. While MRI-based deep learning models have demonstrated superior performance in predicting total knee replacement (TKR) and disease progression, their generalizability remains challenging, particularly when applied to imaging data from different sources. In this study, we show that replacing batch normalization with instance normalization, using data augmentation, and applying contrastive loss improves generalization. For training and evaluation, we used MRI data from the Osteoarthritis Initiative (OAI) database, considering sagittal fat-suppressed intermediate-weighted turbo spin-echo (FS-IW-TSE) images as the source domain and sagittal fat-suppressed three-dimensional (3D) dual-echo in steady state (DESS) images as the target domain. The results demonstrated a statistically significant improvement in classification metrics across both domains by replacing batch normalization with instance normalization in the baseline model, generating augmented input views using the Global Intensity Non-linear (GIN) augmentation method, and incorporating a supervised contrastive loss alongside the classification loss to align representations of samples with the same label. The GIN method with contrastive loss performed better than all evaluated single-source domain generalization methods when using 3D instance normalization. Comparing GIN with and without contrastive loss (for both normalization types) showed that adding contrastive loss consistently led to better performance.
♻ ☆ Research on Expressway Congestion Warning Technology Based on YOLOv11-DIoU and GRU-Attention
Expressway traffic congestion severely reduces travel efficiency and hinders regional connectivity. Existing "detection-prediction" systems have critical flaws: low vehicle perception accuracy under occlusion and loss of long-sequence dependencies in congestion forecasting. This study proposes an integrated technical framework to resolve these issues.For traffic flow perception, two baseline algorithms were optimized. Traditional YOLOv11 was upgraded to YOLOv11-DIoU by replacing GIoU Loss with DIoU Loss, and DeepSort was improved by fusing Mahalanobis (motion) and cosine (appearance) distances. Experiments on Chang-Shen Expressway videos showed YOLOv11-DIoU achieved 95.7\% mAP (6.5 percentage points higher than baseline) with 5.3\% occlusion miss rate. DeepSort reached 93.8\% MOTA (11.3 percentage points higher than SORT) with only 4 ID switches. Using the Greenberg model (for 10-15 vehicles/km high-density scenarios), speed and density showed a strong negative correlation (r=-0.97), conforming to traffic flow theory. For congestion warning, a GRU-Attention model was built to capture congestion precursors. Trained 300 epochs with flow, density, and speed, it achieved 99.7\% test accuracy (7-9 percentage points higher than traditional GRU). In 10-minute advance warnings for 30-minute congestion, time error was $\leq$ 1 minute. Validation with an independent video showed 95\% warning accuracy, over 90\% spatial overlap of congestion points, and stable performance in high-flow ($>$5 vehicles/second) scenarios.This framework provides quantitative support for expressway congestion control, with promising intelligent transportation applications.
♻ ☆ Towards classification-based representation learning for place recognition on LiDAR scans
Place recognition is a crucial task in autonomous driving, allowing vehicles to determine their position using sensor data. While most existing methods rely on contrastive learning, we explore an alternative approach by framing place recognition as a multi-class classification problem. Our method assigns discrete location labels to LiDAR scans and trains an encoder-decoder model to classify each scan's position directly. We evaluate this approach on the NuScenes dataset and show that it achieves competitive performance compared to contrastive learning-based methods while offering advantages in training efficiency and stability.
♻ ☆ MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
Current video-to-audio (V2A) methods struggle in complex multi-event scenarios (video scenarios involving multiple sound sources, sound events, or transitions) due to two critical limitations. First, existing methods face challenges in precisely aligning intricate semantic information together with rapid dynamic features. Second, foundational training lacks quantitative preference optimization for semantic-temporal alignment and audio quality. As a result, it fails to enhance integrated generation quality in cluttered multi-event scenes. To address these core limitations, this study proposes a novel V2A framework: MultiSoundGen. It introduces direct preference optimization (DPO) into the V2A domain, leveraging audio-visual pretraining (AVP) to enhance performance in complex multi-event scenarios. Our contributions include two key innovations: the first is SlowFast Contrastive AVP (SF-CAVP), a pioneering AVP model with a unified dual-stream architecture. SF-CAVP explicitly aligns core semantic representations and rapid dynamic features of audio-visual data to handle multi-event complexity; second, we integrate the DPO method into V2A task and propose AVP-Ranked Preference Optimization (AVP-RPO). It uses SF-CAVP as a reward model to quantify and prioritize critical semantic-temporal matches while enhancing audio quality. Experiments demonstrate that MultiSoundGen achieves state-of-the-art (SOTA) performance in multi-event scenarios, delivering comprehensive gains across distribution matching, audio quality, semantic alignment, and temporal synchronization. Demos are available at https://v2aresearch.github.io/MultiSoundGen/.
♻ ☆ CWSSNet: Hyperspectral Image Classification Enhanced by Wavelet Domain Convolution
Hyperspectral remote sensing technology has significant application value in fields such as forestry ecology and precision agriculture, while also putting forward higher requirements for fine ground object classification. However, although hyperspectral images are rich in spectral information and can improve recognition accuracy, they tend to cause prominent feature redundancy due to their numerous bands, high dimensionality, and spectral mixing characteristics. To address this, this study used hyperspectral images from the ZY1F satellite as a data source and selected Yugan County, Shangrao City, Jiangxi Province as the research area to perform ground object classification research. A classification framework named CWSSNet was proposed, which integrates 3D spectral-spatial features and wavelet convolution. This framework integrates multimodal information us-ing a multiscale convolutional attention module and breaks through the classification performance bottleneck of traditional methods by introducing multi-band decomposition and convolution operations in the wavelet domain. The experiments showed that CWSSNet achieved 74.50\%, 82.73\%, and 84.94\% in mean Intersection over Union (mIoU), mean Accuracy (mAcc), and mean F1-score (mF1) respectively in Yugan County. It also obtained the highest Intersection over Union (IoU) in the classifica-tion of water bodies, vegetation, and bare land, demonstrating good robustness. Additionally, when the training set proportion was 70\%, the increase in training time was limited, and the classification effect was close to the optimal level, indicating that the model maintains reliable performance under small-sample training conditions.
♻ ☆ Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which extremely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.
comment: Accepted by IEEE Transactions on Neural Networks and Learning Systems
♻ ☆ WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging
Knowledge distillation (KD) has traditionally relied on a static teacher-student framework, where a large, well-trained teacher transfers knowledge to a single student model. However, these approaches often suffer from knowledge degradation, inefficient supervision, and reliance on either a very strong teacher model or large labeled datasets. To address these, we present the first-ever Weakly-supervised Chain-based KD network (WeCKD) that redefines knowledge transfer through a structured sequence of interconnected models. Unlike conventional KD, it forms a progressive distillation chain, where each model not only learns from its predecessor but also refines the knowledge before passing it forward. This structured knowledge transfer further enhances feature learning and addresses the limitations of one-step KD. Each model in the chain is trained on only a fraction of the dataset and shows that effective learning can be achieved with minimal supervision. Extensive evaluation on six imaging datasets across otoscopic, microscopic, and magnetic resonance imaging modalities shows that it generalizes and outperforms existing methods. Furthermore, the proposed distillation chain resulted in cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data, which highlights its potential for real-world adoption.
♻ ☆ Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
A deep understanding of kinematic structures and movable components is essential for enabling robots to manipulate objects and model their own articulated forms. Such understanding is captured through articulated objects, which are essential for tasks such as physical simulation, motion planning, and policy learning. However, creating these models, particularly for objects with high degrees of freedom (DoF), remains a significant challenge. Existing methods typically rely on motion sequences or strong assumptions from hand-curated datasets, which hinders scalability. In this paper, we introduce Kinematify, an automated framework that synthesizes articulated objects directly from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static geometry. To achieve this, we combine MCTS search for structural inference with geometry-driven optimization for joint reasoning, producing physically consistent and functionally valid descriptions. We evaluate Kinematify on diverse inputs from both synthetic and real-world environments, demonstrating improvements in registration and kinematic topology accuracy over prior work.
comment: project page: https://sites.google.com/deemos.com/kinematify
♻ ☆ SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion
With the rapid advancement of the digital society, the proliferation of satellites in the Satellite Internet of Things (Sat-IoT) has led to the continuous accumulation of large-scale multi-temporal and multi-source images across diverse application scenarios. However, existing methods fail to fully exploit the complementary information embedded in both temporal and source dimensions. For example, Multi-Image Super-Resolution (MISR) enhances reconstruction quality by leveraging temporal complementarity across multiple observations, yet the limited fine-grained texture details in input images constrain its performance. Conversely, pansharpening integrates multi-source images by injecting high-frequency spatial information from panchromatic data, but typically relies on pre-interpolated low-resolution inputs and assumes noise-free alignment, making it highly sensitive to noise and misregistration. To address these issues, we propose SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion. Specifically, SatFusion first employs a Multi-Temporal Image Fusion (MTIF) module to achieve deep feature alignment with the panchromatic image. Then, a Multi-Source Image Fusion (MSIF) module injects fine-grained texture information from the panchromatic data. Finally, a Fusion Composition module adaptively integrates the complementary advantages of both modalities while dynamically refining spectral consistency, supervised by a weighted combination of multiple loss functions. Extensive experiments on the WorldStrat, WV3, QB, and GF2 datasets demonstrate that SatFusion significantly improves fusion quality, robustness under challenging conditions, and generalizability to real-world Sat-IoT scenarios. The code is available at: https://github.com/dllgyufei/SatFusion.git.
♻ ☆ Deep Fourier-embedded Network for RGB and Thermal Salient Object Detection
The rapid development of deep learning has significantly improved salient object detection (SOD) combining both RGB and thermal (RGB-T) images. However, existing Transformer-based RGB-T SOD models with quadratic complexity are memory-intensive, limiting their application in high-resolution bimodal feature fusion. To overcome this limitation, we propose a purely Fourier Transform-based model, namely Deep Fourier-embedded Network (FreqSal), for accurate RGB-T SOD. Specifically, we leverage the efficiency of Fast Fourier Transform with linear complexity to design three key components: (1) To fuse RGB and thermal modalities, we propose Modal-coordinated Perception Attention, which aligns and enhances bimodal Fourier representation in multiple dimensions; (2) To clarify object edges and suppress noise, we design Frequency-decomposed Edge-aware Block, which deeply decomposes and filters Fourier components of low-level features; (3) To accurately decode features, we propose Fourier Residual Channel Attention Block, which prioritizes high-frequency information while aligning channel-wise global relationships. Additionally, even when converged, existing deep learning-based SOD models' predictions still exhibit frequency gaps relative to ground-truth. To address this problem, we propose Co-focus Frequency Loss, which dynamically weights hard frequencies during edge frequency reconstruction by cross-referencing bimodal edge information in the Fourier domain. Extensive experiments on ten bimodal SOD benchmark datasets demonstrate that FreqSal outperforms twenty-nine existing state-of-the-art bimodal SOD models. Comprehensive ablation studies further validate the value and effectiveness of our newly proposed components. The code is available at https://github.com/JoshuaLPF/FreqSal.
comment: Accepted by TCSVT2025
♻ ☆ Crucial-Diff: A Unified Diffusion Model for Crucial Image and Annotation Synthesis in Data-scarce Scenarios
The scarcity of data in various scenarios, such as medical, industry and autonomous driving, leads to model overfitting and dataset imbalance, thus hindering effective detection and segmentation performance. Existing studies employ the generative models to synthesize more training samples to mitigate data scarcity. However, these synthetic samples are repetitive or simplistic and fail to provide "crucial information" that targets the downstream model's weaknesses. Additionally, these methods typically require separate training for different objects, leading to computational inefficiencies. To address these issues, we propose Crucial-Diff, a domain-agnostic framework designed to synthesize crucial samples. Our method integrates two key modules. The Scene Agnostic Feature Extractor (SAFE) utilizes a unified feature extractor to capture target information. The Weakness Aware Sample Miner (WASM) generates hard-to-detect samples using feedback from the detection results of downstream model, which is then fused with the output of SAFE module. Together, our Crucial-Diff framework generates diverse, high-quality training data, achieving a pixel-level AP of 83.63% and an F1-MAX of 78.12% on MVTec. On polyp dataset, Crucial-Diff reaches an mIoU of 81.64% and an mDice of 87.69%. Code is publicly available at https://github.com/JJessicaYao/Crucial-diff.
comment: Accepted by IEEE Transactions on Image Processing (TIP), 2025
♻ ☆ InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmark confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.
♻ ☆ FitPro: A Zero-Shot Framework for Interactive Text-based Pedestrian Retrieval in Open World
Text-based Pedestrian Retrieval (TPR) deals with retrieving specific target pedestrians in visual scenes according to natural language descriptions. Although existing methods have achieved progress under constrained settings, interactive retrieval in the open-world scenario still suffers from limited model generalization and insufficient semantic understanding. To address these challenges, we propose FitPro, an open-world interactive zero-shot TPR framework with enhanced semantic comprehension and cross-scene adaptability. FitPro has three innovative components: Feature Contrastive Decoding (FCD), Incremental Semantic Mining (ISM), and Query-aware Hierarchical Retrieval (QHR). The FCD integrates prompt-guided contrastive decoding to generate high-quality structured pedestrian descriptions from denoised images, effectively alleviating semantic drift in zero-shot scenarios. The ISM constructs holistic pedestrian representations from multi-view observations to achieve global semantic modeling in multi-turn interactions, thereby improving robustness against viewpoint shifts and fine-grained variations in descriptions. The QHR dynamically optimizes the retrieval pipeline according to query types, enabling efficient adaptation to multi-modal and multi-view inputs. Extensive experiments on five public datasets and two evaluation protocols demonstrate that FitPro significantly overcomes the generalization limitations and semantic modeling constraints of existing methods in interactive retrieval, paving the way for practical deployment.
comment: 12pages,6 figures
♻ ☆ 3DViT-GAT: A Unified Atlas-Based 3D Vision Transformer and Graph Learning Framework for Major Depressive Disorder Detection Using Structural MRI Data
Major depressive disorder (MDD) is a prevalent mental health condition that negatively impacts both individual well-being and global public health. Automated detection of MDD using structural magnetic resonance imaging (sMRI) and deep learning (DL) methods holds increasing promise for improving diagnostic accuracy and enabling early intervention. Most existing methods employ either voxel-level features or handcrafted regional representations built from predefined brain atlases, limiting their ability to capture complex brain patterns. This paper develops a unified pipeline that utilizes Vision Transformers (ViTs) for extracting 3D region embeddings from sMRI data and Graph Neural Network (GNN) for classification. We explore two strategies for defining regions: (1) an atlas-based approach using predefined structural and functional brain atlases, and (2) an cube-based method by which ViTs are trained directly to identify regions from uniformly extracted 3D patches. Further, cosine similarity graphs are generated to model interregional relationships, and guide GNN-based classification. Extensive experiments were conducted using the REST-meta-MDD dataset to demonstrate the effectiveness of our model. With stratified 10-fold cross-validation, the best model obtained 81.51\% accuracy, 85.94\% sensitivity, 76.36\% specificity, 80.88\% precision, and 83.33\% F1-score. Further, atlas-based models consistently outperformed the cube-based approach, highlighting the importance of using domain-specific anatomical priors for MDD detection.
comment: 17 pages, 3 figure, 9 tables
♻ ☆ Parameterized Prompt for Incremental Object Detection
Recent studies have demonstrated that incorporating trainable prompts into pretrained models enables effective incremental learning. However, the application of prompts in incremental object detection (IOD) remains underexplored. Existing prompts pool based approaches assume disjoint class sets across incremental tasks, which are unsuitable for IOD as they overlook the inherent co-occurrence phenomenon in detection images. In co-occurring scenarios, unlabeled objects from previous tasks may appear in current task images, leading to confusion in prompts pool. In this paper, we hold that prompt structures should exhibit adaptive consolidation properties across tasks, with constrained updates to prevent catastrophic forgetting. Motivated by this, we introduce Parameterized Prompts for Incremental Object Detection (P$^2$IOD). Leveraging neural networks global evolution properties, P$^2$IOD employs networks as the parameterized prompts to adaptively consolidate knowledge across tasks. To constrain prompts structure updates, P$^2$IOD further engages a parameterized prompts fusion strategy. Extensive experiments on PASCAL VOC2007 and MS COCO datasets demonstrate that P$^2$IOD's effectiveness in IOD and achieves the state-of-the-art performance among existing baselines.
♻ ☆ Light Future: Multimodal Action Frame Prediction via InstructPix2Pix WACV 2026
Predicting future motion trajectories is a critical capability across domains such as robotics, autonomous systems, and human activity forecasting, enabling safer and more intelligent decision-making. This paper proposes a novel, efficient, and lightweight approach for robot action prediction, offering significantly reduced computational cost and inference latency compared to conventional video prediction models. Importantly, it pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks, extending its utility beyond static image editing. We implement a deep learning-based visual prediction framework that forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. We repurpose and fine-tune the InstructPix2Pix model to accept both visual and textual inputs, enabling multimodal future frame prediction. Experiments on the RoboTWin dataset (generated based on real-world scenarios) demonstrate that our method achieves superior SSIM and PSNR compared to state-of-the-art baselines in robot action prediction tasks. Unlike conventional video prediction models that require multiple input frames, heavy computation, and slow inference latency, our approach only needs a single image and a text prompt as input. This lightweight design enables faster inference, reduced GPU demands, and flexible multimodal control, particularly valuable for applications like robotics and sports motion trajectory analytics, where motion trajectory precision is prioritized over visual fidelity.
comment: 9 pages including appendix, 4 tables, 8 figures, to be submitted to WACV 2026
♻ ☆ Diffusion Transformer meets Multi-level Wavelet Spectrum for Single Image Super-Resolution ICCV 2025
Discrete Wavelet Transform (DWT) has been widely explored to enhance the performance of image superresolution (SR). Despite some DWT-based methods improving SR by capturing fine-grained frequency signals, most existing approaches neglect the interrelations among multiscale frequency sub-bands, resulting in inconsistencies and unnatural artifacts in the reconstructed images. To address this challenge, we propose a Diffusion Transformer model based on image Wavelet spectra for SR (DTWSR). DTWSR incorporates the superiority of diffusion models and transformers to capture the interrelations among multiscale frequency sub-bands, leading to a more consistence and realistic SR image. Specifically, we use a Multi-level Discrete Wavelet Transform to decompose images into wavelet spectra. A pyramid tokenization method is proposed which embeds the spectra into a sequence of tokens for transformer model, facilitating to capture features from both spatial and frequency domain. A dual-decoder is designed elaborately to handle the distinct variances in low-frequency and high-frequency sub-bands, without omitting their alignment in image generation. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method, with high performance on both perception quality and fidelity.
comment: ICCV 2025 Oral Paper
♻ ☆ Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization
Cross-view geo-localization (CVGL) between drone and satellite imagery remains challenging due to severe viewpoint gaps and the presence of hard negatives, which are visually similar but geographically mismatched samples. Existing mining or reweighting strategies often use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence. In this paper, we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy. At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. At the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a training-progress signal to attenuate noisy gradients during early optimization and progressively enhance hard-negative mining as training matures. Experiments on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness and robustness of the proposed DPHR, achieving consistent improvements over state-of-the-art methods.
comment: 5 pages, 3 figures
♻ ☆ Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras NeurIPS 2025
Event cameras offer microsecond-level latency and robustness to motion blur, making them ideal for understanding dynamic environments. Yet, connecting these asynchronous streams to human language remains an open challenge. We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, we provide over 30,000 validated referring expressions, each enriched with four grounding attributes -- appearance, status, relation to viewer, and relation to other objects -- bridging spatial, temporal, and relational reasoning. To fully exploit these cues, we propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). Our method adapts to different modalities and scene dynamics, achieving consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings. We hope our dataset and approach will establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy.
comment: NeurIPS 2025 Spotlight; 43 pages, 17 figures, 16 tables; Project Page at https://talk2event.github.io
♻ ☆ FractalForensics: Proactive Deepfake Detection and Localization via Fractal Watermarks
Proactive Deepfake detection via robust watermarks has seen interest ever since passive Deepfake detectors encountered challenges in identifying high-quality synthetic images. However, while demonstrating reasonable detection performance, they lack localization functionality and explainability in detection results. Additionally, the unstable robustness of watermarks can significantly affect the detection performance. In this study, we propose novel fractal watermarks for proactive Deepfake detection and localization, namely FractalForensics. Benefiting from the characteristics of fractals, we devise a parameter-driven watermark generation pipeline that derives fractal-based watermarks and performs one-way encryption of the selected parameters. Subsequently, we propose a semi-fragile watermarking framework for watermark embedding and recovery, trained to be robust against benign image processing operations and fragile when facing Deepfake manipulations in a black-box setting. Moreover, we introduce an entry-to-patch strategy that implicitly embeds the watermark matrix entries into image patches at corresponding positions, achieving localization of Deepfake manipulations. Extensive experiments demonstrate satisfactory robustness and fragility of our approach against common image processing operations and Deepfake manipulations, outperforming state-of-the-art semi-fragile watermarking algorithms and passive detectors for Deepfake detection. Furthermore, by highlighting the areas manipulated, our method provides explainability for the proactive Deepfake detection results.
comment: ACM Multimedia 2025 Oral
♻ ☆ Detection and Geographic Localization of Natural Objects in the Wild: A Case Study on Palms
Palms are ecologically and economically indicators of tropical forest health, biodiversity, and human impact that support local economies and global forest product supply chains. While palm detection in plantations is well-studied, efforts to map naturally occurring palms in dense forests remain limited by overlapping crowns, uneven shading, and heterogeneous landscapes. We develop PRISM (Processing, Inference, Segmentation, and Mapping), a flexible pipeline for detecting and localizing palms in dense tropical forests using large orthomosaic images. Orthomosaics are created from thousands of aerial images and spanning several to hundreds of gigabytes. Our contributions are threefold. First, we construct a large UAV-derived orthomosaic dataset collected across 21 ecologically diverse sites in western Ecuador, annotated with 8,830 bounding boxes and 5,026 palm center points. Second, we evaluate multiple state-of-the-art object detectors based on efficiency and performance, integrating zero-shot SAM 2 as the segmentation backbone, and refining the results for precise geographic mapping. Third, we apply calibration methods to align confidence scores with IoU and explore saliency maps for feature explainability. Though optimized for palms, PRISM is adaptable for identifying other natural objects, such as eastern white pines. Future work will explore transfer learning for lower-resolution datasets (0.5 to 1m).
comment: 15 pages, 8 figures, 4 tables
♻ ☆ Towards Predicting Any Human Trajectory In Context NeurIPS 2025
Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, the need to fine-tune for each new scenario is often impractical for deployment on edge devices. To address this challenge, we introduce TrajICL, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables adaptation without fine-tuning on the scenario-specific data at inference time without requiring weight updates. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. Project Page: https://fujiry0.github.io/TrajICL-project-page/.
comment: NeurIPS 2025
♻ ☆ Real-Time Neural Video Compression with Unified Intra and Inter Coding
Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 12.1% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.
comment: 10 pages
♻ ☆ ID-Composer: Multi-Subject Video Synthesis with Hierarchical Identity Preservation
Video generative models pretrained on large-scale datasets can produce high-quality videos, but are often conditioned on text or a single image, limiting controllability and applicability. We introduce ID-Composer, a novel framework that addresses this gap by tackling multi-subject video generation from a text prompt and reference images. This task is challenging as it requires preserving subject identities, integrating semantics across subjects and modalities, and maintaining temporal consistency. To faithfully preserve the subject consistency and textual information in synthesized videos, ID-Composer designs a hierarchical identity-preserving attention mechanism, which effectively aggregates features within and across subjects and modalities. To effectively allow for the semantic following of user intention, we introduce semantic understanding via pretrained vision-language model (VLM), leveraging VLM's superior semantic understanding to provide fine-grained guidance and capture complex interactions between multiple subjects. Considering that standard diffusion loss often fails in aligning the critical concepts like subject ID, we employ an online reinforcement learning phase to drive the overall training objective of ID-Composer into RLVR. Extensive experiments demonstrate that our model surpasses existing methods in identity preservation, temporal consistency, and video quality.
♻ ☆ Joint Lossless Compression and Steganography for Medical Images via Large Language Models
Recently, large language models (LLMs) have driven promising progress in lossless image compression. However, directly adopting existing paradigms for medical images suffers from an unsatisfactory trade-off between compression performance and efficiency. Moreover, existing LLM-based compressors often overlook the security of the compression process, which is critical in modern medical scenarios. To this end, we propose a novel joint lossless compression and steganography framework. Inspired by bit plane slicing (BPS), we find it feasible to securely embed privacy messages into medical images in an invisible manner. Based on this insight, an adaptive modalities decomposition strategy is first devised to partition the entire image into two segments, providing global and local modalities for subsequent dual-path lossless compression. During this dual-path stage, we innovatively propose a segmented message steganography algorithm within the local modality path to ensure the security of the compression process. Coupled with the proposed anatomical priors-based low-rank adaptation (A-LoRA) fine-tuning strategy, extensive experimental results demonstrate the superiority of our proposed method in terms of compression ratios, efficiency, and security. The source code will be made publicly available.
♻ ☆ Learning with Category-Equivariant Architectures for Human Activity Recognition
We propose CatEquiv, a category-equivariant neural network for Human Activity Recognition (HAR) from inertial sensors that systematically encodes temporal, amplitude, and structural symmetries. We introduce a symmetry category that jointly represents cyclic time shifts, positive gain scalings, and the sensor-hierarchy poset, capturing the categorical symmetry structure of the data. CatEquiv achieves equivariance with respect to the categorical symmetry product. On UCI-HAR under out-of-distribution perturbations, CatEquiv attains markedly higher robustness compared with circularly padded CNNs and plain CNNs. These results demonstrate that enforcing categorical symmetries yields strong invariance and generalization without additional model capacity.
♻ ☆ WXSOD: A Benchmark for Robust Salient Object Detection in Adverse Weather Conditions
Salient object detection (SOD) in complex environments remains a challenging research topic. Most existing methods perform well in natural scenes with negligible noise, and tend to leverage multi-modal information (e.g., depth and infrared) to enhance accuracy. However, few studies are concerned with the damage of weather noise on SOD performance due to the lack of dataset with pixel-wise annotations. To bridge this gap, this paper introduces a novel Weather-eXtended Salient Object Detection (WXSOD) dataset. It consists of 14,945 RGB images with diverse weather noise, along with the corresponding ground truth annotations and weather labels. To verify algorithm generalization, WXSOD contains two test sets, i.e., a synthesized test set and a real test set. The former is generated by adding weather noise to clean images, while the latter contains real-world weather noise. Based on WXSOD, we propose an efficient baseline, termed Weather-aware Feature Aggregation Network (WFANet), which adopts a fully supervised two-branch architecture. Specifically, the weather prediction branch mines weather-related deep features, while the saliency detection branch fuses semantic features extracted from the backbone with weather features for SOD. Comprehensive comparisons against 17 SOD methods shows that our WFANet achieves superior performance on WXSOD. The code and benchmark results will be made publicly available at https://github.com/C-water/WXSOD
comment: Under review
♻ ☆ CrossRay3D: Geometry and Distribution Guidance for Efficient Multimodal 3D Detection
The sparse cross-modality detector offers more advantages than its counterpart, the Bird's-Eye-View (BEV) detector, particularly in terms of adaptability for downstream tasks and computational cost savings. However, existing sparse detectors overlook the quality of token representation, leaving it with a sub-optimal foreground quality and limited performance. In this paper, we identify that the geometric structure preserved and the class distribution are the key to improving the performance of the sparse detector, and propose a Sparse Selector (SS). The core module of SS is Ray-Aware Supervision (RAS), which preserves rich geometric information during the training stage, and Class-Balanced Supervision, which adaptively reweights the salience of class semantics, ensuring that tokens associated with small objects are retained during token sampling. Thereby, outperforming other sparse multi-modal detectors in the representation of tokens. Additionally, we design Ray Positional Encoding (Ray PE) to address the distribution differences between the LiDAR modality and the image. Finally, we integrate the aforementioned module into an end-to-end sparse multi-modality detector, dubbed CrossRay3D. Experiments show that, on the challenging nuScenes benchmark, CrossRay3D achieves state-of-the-art performance with 72.4 mAP and 74.7 NDS, while running 1.84 faster than other leading methods. Moreover, CrossRay3D demonstrates strong robustness even in scenarios where LiDAR or camera data are partially or entirely missing.
comment: 13 pages
♻ ☆ DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but often fail to generalize to long-tail scenarios due to the lack of essential world knowledge to understand and reason about surrounding environments. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module directly generating semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. The experiment on the ICCV 2025 Autonomous Grand Challenge leaderboard shows that DiffVLA++ achieves EPDMS of 49.12.
♻ ☆ 3DBonsai: Structure-Aware Bonsai Modeling Using Conditioned 3D Gaussian Splatting
Recent advancements in text-to-3D generation have shown remarkable results by leveraging 3D priors in combination with 2D diffusion. However, previous methods utilize 3D priors that lack detailed and complex structural information, limiting them to generating simple objects and presenting challenges for creating intricate structures such as bonsai. In this paper, we propose 3DBonsai, a novel text-to-3D framework for generating 3D bonsai with complex structures. Technically, we first design a trainable 3D space colonization algorithm to produce bonsai structures, which are then enhanced through random sampling and point cloud augmentation to serve as the 3D Gaussian priors. We introduce two bonsai generation pipelines with distinct structural levels: fine structure conditioned generation, which initializes 3D Gaussians using a 3D structure prior to produce detailed and complex bonsai, and coarse structure conditioned generation, which employs a multi-view structure consistency module to align 2D and 3D structures. Moreover, we have compiled a unified 2D and 3D Chinese-style bonsai dataset. Our experimental results demonstrate that 3DBonsai significantly outperforms existing methods, providing a new benchmark for structure-aware 3D bonsai generation.
♻ ☆ DRIP: Dynamic patch Reduction via Interpretable Pooling
Recently, the advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, owing to the large-scale and hence expensive pretraining, the efficiency concern has discouraged researchers from attempting to pretrain a vision language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification/zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact into scientific domains.
comment: Need more refinement
♻ ☆ Weakly Supervised Object Segmentation by Background Conditional Divergence
As a computer vision task, automatic object segmentation remains challenging in specialized image domains without massive labeled data, such as synthetic aperture sonar images, remote sensing, biomedical imaging, etc. In any domain, obtaining pixel-wise segmentation masks is expensive. In this work, we propose a method for training a masking network to perform binary object segmentation using weak supervision in the form of image-wise presence or absence of an object of interest, which provides less information but may be obtained more quickly from manual or automatic labeling. A key step in our method is that the segmented objects can be placed into background-only images to create realistic images of the objects with counterfactual backgrounds. To create a contrast between the original and counterfactual background images, we propose to first cluster the background-only images and then, during learning, create counterfactual images that blend objects segmented from their original source backgrounds to backgrounds chosen from a targeted cluster. One term in the training loss is the divergence between these counterfactual images and the real object images with backgrounds of the target cluster. The other term is a supervised loss for background-only images. While an adversarial critic could provide the divergence, we use sample-based divergences. We conduct experiments on side-scan and synthetic aperture sonar in which our approach succeeds compared to previous unsupervised segmentation baselines that were only tested on natural images. Furthermore, to show generality we extend our experiments to natural images, obtaining reasonable performance with our method that avoids pretrained networks, generative networks, and adversarial critics. The code for this work can be found at \href{GitHub}{https://github.com/bakerhassan/WSOS}.
comment: Published in TMLR: https://openreview.net/forum?id=2JJZhfGvMW
♻ ☆ Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification NeurIPS 2025
Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59-without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.
comment: Published in NeurIPS 2025
♻ ☆ Visual Program Distillation with Template-Based Augmentation EMNLP
Adapting visual programming or prompting large language models (LLMs) to generate executable code for visual tasks like visual question answering (VQA) for specialized tasks or domains remains challenging due to high annotation and inference costs. We propose a low-cost visual program distillation method that can be used for models with at most 1 billion parameters and requires no human-generated program annotations. We achieve this through synthetic data augmentation based on decoupling programs into higher-level skills, called templates, and their corresponding arguments. Experimental results show that, with a relatively small amount of question/answer data, small language models can generate high-quality specialized visual programs with the added benefit of much faster inference
comment: EMNLP Camera Ready
Machine Learning 150
☆ Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available. %We release an open-source implementation to support continued research on scalable and reliable omni-modal reasoning.
comment: 16 pages, 7 figures, 14 tables. Under Review
☆ In Good GRACEs: Principled Teacher Selection for Knowledge Distillation
Knowledge distillation is an efficient strategy to use data generated by large "teacher" language models to train smaller capable "student" models, but selecting the optimal teacher for a specific student-task combination requires expensive trial-and-error. We propose a lightweight score called GRACE to quantify how effective a teacher will be for post-training a student model. GRACE measures distributional properties of the student's gradients without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE connects to leave-one-out stability of gradient-based algorithms, which controls the generalization performance of the distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86% Spearman correlation) with the performance of the distilled LLaMA and OLMo students. In particular, training a student using the GRACE-selected teacher can improve the performance by up to 7.4% over naively using the best-performing teacher. Further, GRACE can provide guidance on crucial design choices in distillation, including (1) the best temperature to use when generating from the teacher, (2) the best teacher to use given a size constraint, and (3) the best teacher to use within a specific model family. Altogether, our findings demonstrate that GRACE can efficiently and effectively identify a strongly compatible teacher for a given student and provide fine-grained guidance on how to perform distillation.
☆ TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System
Large-scale data has driven breakthroughs in robotics, from language models to vision-language-action models in bimanual manipulation. However, humanoid robotics lacks equally effective data collection frameworks. Existing humanoid teleoperation systems either use decoupled control or depend on expensive motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid teleoperation and data collection system that preserves full whole-body control while advancing scalability. Our system leverages PICO4U VR for obtaining real-time whole-body human motions, with a custom 2-DoF robot neck (cost around $250) for egocentric vision, enabling holistic human-to-humanoid control. We demonstrate long-horizon dexterous and mobile humanoid skills and we can collect 100 demonstrations in 15 minutes with an almost 100% success rate. Building on this pipeline, we propose a hierarchical visuomotor policy framework that autonomously controls the full humanoid body based on egocentric vision. Our visuomotor policy successfully demonstrates whole-body dexterous manipulation and dynamic kicking tasks. The entire system is fully reproducible and open-sourced at https://yanjieze.com/TWIST2 . Our collected dataset is also open-sourced at https://twist-data.github.io .
comment: Website: https://yanjieze.com/TWIST2
☆ GeoCrossBench: Cross-Band Generalization for Remote Sensing
The number and diversity of remote sensing satellites grows over time, while the vast majority of labeled data comes from older satellites. As the foundation models for Earth observation scale up, the cost of (re-)training to support new satellites grows too, so the generalization capabilities of the models towards new satellites become increasingly important. In this work we introduce GeoCrossBench, an extension of the popular GeoBench benchmark with a new evaluation protocol: it tests the in-distribution performance; generalization to satellites with no band overlap; and generalization to satellites with additional bands with respect to the training set. We also develop a self-supervised extension of ChannelViT, ChiViT, to improve its cross-satellite performance. First, we show that even the best foundation models for remote sensing (DOFA, TerraFM) do not outperform general purpose models like DINOv3 in the in-distribution setting. Second, when generalizing to new satellites with no band overlap, all models suffer 2-4x drop in performance, and ChiViT significantly outperforms the runner-up DINOv3. Third, the performance of all tested models drops on average by 5-25\% when given additional bands during test time. Finally, we show that fine-tuning just the last linear layer of these models using oracle labels from all bands can get relatively consistent performance across all satellites, highlighting that the benchmark is far from being saturated. We publicly release the code and the datasets to encourage the development of more future-proof remote sensing models with stronger cross-satellite generalization.
☆ Accelerated Frank-Wolfe Algorithms: Complementarity Conditions and Sparsity
We develop new accelerated first-order algorithms in the Frank-Wolfe (FW) family for minimizing smooth convex functions over compact convex sets, with a focus on two prominent constraint classes: (1) polytopes and (2) matrix domains given by the spectrahedron and the unit nuclear-norm ball. A key technical ingredient is a complementarity condition that captures solution sparsity -- face dimension for polytopes and rank for matrices. We present two algorithms: (1) a purely linear optimization oracle (LOO) method for polytopes that has optimal worst-case first-order (FO) oracle complexity and, aside of a finite \emph{burn-in} phase and up to a logarithmic factor, has LOO complexity that scales with $r/\sqrt{\epsilon}$, where $\epsilon$ is the target accuracy and $r$ is the solution sparsity $r$ (independently of the ambient dimension), and (2) a hybrid scheme that combines FW with a sparse projection oracle (e.g., low-rank SVDs for matrix domains with low-rank solutions), which also has optimal FO oracle complexity, and after a finite burn-in phase, only requires $O(1/\sqrt{\epsilon})$ sparse projections and LOO calls (independently of both the ambient dimension and the rank of optimal solutions). Our results close a gap on how to accelerate recent advancements in linearly-converging FW algorithms for strongly convex optimization, without paying the price of the dimension.
☆ Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning
Tabular data remain the predominant format for real-world applications. Yet, developing effective neural models for tabular data remains challenging due to heterogeneous feature types and complex interactions occurring at multiple scales. Recent advances in tabular in-context learning (ICL), such as TabPFN and TabICL, have achieved state-of-the-art performance comparable to gradient-boosted trees (GBTs) without task-specific fine-tuning. However, current architectures exhibit key limitations: (1) single-scale feature processing that overlooks hierarchical dependencies, (2) dense attention with quadratic scaling in table width, and (3) strictly sequential component processing that prevents iterative representation refinement and cross-component communication. To address these challenges, we introduce Orion-MSP, a tabular ICL architecture featuring three key innovations: (1) multi-scale processing to capture hierarchical feature interactions; (2) block-sparse attention combining windowed, global, and random patterns for scalable efficiency and long-range connectivity; and (3) a Perceiver-style memory enabling safe bidirectional information flow across components. Across diverse benchmarks, Orion-MSP matches or surpasses state-of-the-art performance while scaling effectively to high-dimensional tables, establishing a new standard for efficient tabular in-context learning. The model is publicly available at https://github.com/Lexsi-Labs/Orion-MSP .
☆ Assessing win strength in MLB win prediction models
In Major League Baseball, strategy and planning are major factors in determining the outcome of a game. Previous studies have aided this by building machine learning models for predicting the winning team of any given game. We extend this work by training a comprehensive set of machine learning models using a common dataset. In addition, we relate the win probabilities produced by these models to win strength as measured by score differential. In doing so we show that the most common machine learning models do indeed demonstrate a relationship between predicted win probability and the strength of the win. Finally, we analyze the results of using predicted win probabilities as a decision making mechanism on run-line betting. We demonstrate positive returns when utilizing appropriate betting strategies, and show that naive use of machine learning models for betting lead to significant loses.
☆ TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models
Tabular foundation models represent a growing paradigm in structured data learning, extending the benefits of large-scale pretraining to tabular domains. However, their adoption remains limited due to heterogeneous preprocessing pipelines, fragmented APIs, inconsistent fine-tuning procedures, and the absence of standardized evaluation for deployment-oriented metrics such as calibration and fairness. We present TabTune, a unified library that standardizes the complete workflow for tabular foundation models through a single interface. TabTune provides consistent access to seven state-of-the-art models supporting multiple adaptation strategies, including zero-shot inference, meta-learning, supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). The framework automates model-aware preprocessing, manages architectural heterogeneity internally, and integrates evaluation modules for performance, calibration, and fairness. Designed for extensibility and reproducibility, TabTune enables consistent benchmarking of adaptation strategies of tabular foundation models. The library is open source and available at https://github.com/Lexsi-Labs/TabTune .
☆ Fast, Private, and Protected: Safeguarding Data Privacy and Defending Against Model Poisoning Attacks in Federated Learning
Federated Learning (FL) is a distributed training paradigm wherein participants collaborate to build a global model while ensuring the privacy of the involved data, which remains stored on participant devices. However, proposals aiming to ensure such privacy also make it challenging to protect against potential attackers seeking to compromise the training outcome. In this context, we present Fast, Private, and Protected (FPP), a novel approach that aims to safeguard federated training while enabling secure aggregation to preserve data privacy. This is accomplished by evaluating rounds using participants' assessments and enabling training recovery after an attack. FPP also employs a reputation-based mechanism to mitigate the participation of attackers. We created a dockerized environment to validate the performance of FPP compared to other approaches in the literature (FedAvg, Power-of-Choice, and aggregation via Trimmed Mean and Median). Our experiments demonstrate that FPP achieves a rapid convergence rate and can converge even in the presence of malicious participants performing model poisoning attacks.
☆ Can LLMs subtract numbers?
We present a systematic study of subtraction in large language models (LLMs). While prior benchmarks emphasize addition and multiplication, subtraction has received comparatively little attention despite being structurally distinct as a non-commutative operation. We evaluate eight pretrained LLMs spanning four families on addition and subtraction problems. Our experiments reveal that subtraction accuracy lags behind addition by a wide margin. We find that the errors for ($a-b$) are concentrated in cases where ($a
comment: Work-in-progress; MathNLP non-archival presentation
☆ Enhancing Federated Learning Privacy with QUBO
Federated learning (FL) is a widely used method for training machine learning (ML) models in a scalable way while preserving privacy (i.e., without centralizing raw data). Prior research shows that the risk of exposing sensitive data increases cumulatively as the number of iterations where a client's updates are included in the aggregated model increase. Attackers can launch membership inference attacks (MIA; deciding whether a sample or client participated), property inference attacks (PIA; inferring attributes of a client's data), and model inversion attacks (MI; reconstructing inputs), thereby inferring client-specific attributes and, in some cases, reconstructing inputs. In this paper, we mitigate risk by substantially reducing per client exposure using a quantum computing-inspired quadratic unconstrained binary optimization (QUBO) formulation that selects a small subset of client updates most relevant for each training round. In this work, we focus on two threat vectors: (i) information leakage by clients during training and (ii) adversaries who can query or obtain the global model. We assume a trusted central server and do not model server compromise. This method also assumes that the server has access to a validation/test set with global data distribution. Experiments on the MNIST dataset with 300 clients in 20 rounds showed a 95.2% per-round and 49% cumulative privacy exposure reduction, with 147 clients' updates never being used during training while maintaining in general the full-aggregation accuracy or even better. The method proved to be efficient at lower scale and more complex model as well. A CINIC-10 dataset-based experiment with 30 clients resulted in 82% per-round privacy improvement and 33% cumulative privacy.
comment: 8 pages, 9 figures
☆ Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold
Despite the popularity of the Adam optimizer in practice, most theoretical analyses study Stochastic Gradient Descent (SGD) as a proxy for Adam, and little is known about how the solutions found by Adam differ. In this paper, we show that Adam implicitly reduces a unique form of sharpness measure shaped by its adaptive updates, leading to qualitatively different solutions from SGD. More specifically, when the training loss is small, Adam wanders around the manifold of minimizers and takes semi-gradients to minimize this sharpness measure in an adaptive manner, a behavior we rigorously characterize through a continuous-time approximation using stochastic differential equations. We further demonstrate how this behavior differs from that of SGD in a well-studied setting: when training overparameterized models with label noise, SGD has been shown to minimize the trace of the Hessian matrix, $\tr(\mH)$, whereas we prove that Adam minimizes $\tr(\Diag(\mH)^{1/2})$ instead. In solving sparse linear regression with diagonal linear networks, this distinction enables Adam to achieve better sparsity and generalization than SGD. Finally, our analysis framework extends beyond Adam to a broad class of adaptive gradient methods, including RMSProp, Adam-mini, Adalayer and Shampoo, and provides a unified perspective on how these adaptive optimizers reduce sharpness, which we hope will offer insights for future optimizer design.
☆ STAR-VAE: Latent Variable Transformers for Scalable and Controllable Molecular Generation
The chemical space of drug-like molecules is vast, motivating the development of generative models that must learn broad chemical distributions, enable conditional generation by capturing structure-property representations, and provide fast molecular generation. Meeting the objectives depends on modeling choices, including the probabilistic modeling approach, the conditional generative formulation, the architecture, and the molecular input representation. To address the challenges, we present STAR-VAE (Selfies-encoded, Transformer-based, AutoRegressive Variational Auto Encoder), a scalable latent-variable framework with a Transformer encoder and an autoregressive Transformer decoder. It is trained on 79 million drug-like molecules from PubChem, using SELFIES to guarantee syntactic validity. The latent-variable formulation enables conditional generation: a property predictor supplies a conditioning signal that is applied consistently to the latent prior, the inference network, and the decoder. Our contributions are: (i) a Transformer-based latent-variable encoder-decoder model trained on SELFIES representations; (ii) a principled conditional latent-variable formulation for property-guided generation; and (iii) efficient finetuning with low-rank adapters (LoRA) in both encoder and decoder, enabling fast adaptation with limited property and activity data. On the GuacaMol and MOSES benchmarks, our approach matches or exceeds baselines, and latent-space analyses reveal smooth, semantically structured representations that support both unconditional exploration and property-aware generation. On the Tartarus benchmarks, the conditional model shifts docking-score distributions toward stronger predicted binding. These results suggest that a modernized, scale-appropriate VAE remains competitive for molecular generation when paired with principled conditioning and parameter-efficient finetuning.
comment: 16 pages, 3 figures, 2 tables
☆ VecComp: Vector Computing via MIMO Digital Over-the-Air Computation
Recently, the ChannelComp framework has proposed digital over-the-air computation by designing digital modulations that enable the computation of arbitrary functions. Unlike traditional analog over-the-air computation, which is restricted to nomographic functions, ChannelComp enables a broader range of computational tasks while maintaining compatibility with digital communication systems. This framework is intended for applications that favor local information processing over the mere acquisition of data. However, ChannelComp is currently designed for scalar function computation, while numerous data-centric applications necessitate vector-based computations, and it is susceptible to channel fading. In this work, we introduce a generalization of the ChannelComp framework, called VecComp, by integrating ChannelComp with multiple-antenna technology. This generalization not only enables vector function computation but also ensures scalability in the computational complexity, which increases only linearly with the vector dimension. As such, VecComp remains computationally efficient and robust against channel impairments, making it suitable for high-dimensional, data-centric applications. We establish a non-asymptotic upper bound on the mean squared error of VecComp, affirming its computation efficiency under fading channel conditions. Numerical experiments show the effectiveness of VecComp in improving the computation of vector functions and fading compensation over noisy and fading multiple-access channels.
☆ From Solo to Symphony: Orchestrating Multi-Agent Collaboration with Single-Agent Demos
Training a team of agents from scratch in multi-agent reinforcement learning (MARL) is highly inefficient, much like asking beginners to play a symphony together without first practicing solo. Existing methods, such as offline or transferable MARL, can ease this burden, but they still rely on costly multi-agent data, which often becomes the bottleneck. In contrast, solo experiences are far easier to obtain in many important scenarios, e.g., collaborative coding, household cooperation, and search-and-rescue. To unlock their potential, we propose Solo-to-Collaborative RL (SoCo), a framework that transfers solo knowledge into cooperative learning. SoCo first pretrains a shared solo policy from solo demonstrations, then adapts it for cooperation during multi-agent training through a policy fusion mechanism that combines an MoE-like gating selector and an action editor. Experiments across diverse cooperative tasks show that SoCo significantly boosts the training efficiency and performance of backbone algorithms. These results demonstrate that solo demonstrations provide a scalable and effective complement to multi-agent data, making cooperative learning more practical and broadly applicable.
☆ ConMeZO: Adaptive Descent-Direction Sampling for Gradient-Free Finetuning of Large Language Models
Zeroth-order or derivative-free optimization (MeZO) is an attractive strategy for finetuning large language models (LLMs) because it eliminates the memory overhead of backpropagation. However, it converges slowly due to the inherent curse of dimensionality when searching for descent directions in the high-dimensional parameter space of billion-scale LLMs. We propose ConMeZO, a novel zeroth-order optimizer that accelerates convergence by adaptive directional sampling. Instead of drawing the direction uniformly at random, ConMeZO restricts the sampling to a cone centered around a momentum estimate. This concentrates the search in directions where the true gradient is more likely to lie and thus reduces the effect of high dimensions. We prove that ConMeZO achieves the same worst-case convergence rate as MeZO. Empirically, when finetuning LLMs on natural language tasks, ConMeZO is up to 2X faster than MeZO while retaining the low-memory footprint of zeroth-order methods.
☆ DANIEL: A Distributed and Scalable Approach for Global Representation Learning with EHR Applications
Classical probabilistic graphical models face fundamental challenges in modern data environments, which are characterized by high dimensionality, source heterogeneity, and stringent data-sharing constraints. In this work, we revisit the Ising model, a well-established member of the Markov Random Field (MRF) family, and develop a distributed framework that enables scalable and privacy-preserving representation learning from large-scale binary data with inherent low-rank structure. Our approach optimizes a non-convex surrogate loss function via bi-factored gradient descent, offering substantial computational and communication advantages over conventional convex approaches. We evaluate our algorithm on multi-institutional electronic health record (EHR) datasets from 58,248 patients across the University of Pittsburgh Medical Center (UPMC) and Mass General Brigham (MGB), demonstrating superior performance in global representation learning and downstream clinical tasks, including relationship detection, patient phenotyping, and patient clustering. These results highlight a broader potential for statistical inference in federated, high-dimensional settings while addressing the practical challenges of data complexity and multi-institutional integration.
☆ Agentic World Modeling for 6G: Near-Real-Time Generative State-Space Reasoning
We argue that sixth-generation (6G) intelligence is not fluent token prediction but the capacity to imagine and choose -- to simulate future scenarios, weigh trade-offs, and act with calibrated uncertainty. We reframe open radio access network (O-RAN) near-real-time (Near-RT) control via counterfactual dynamics and a world modeling (WM) paradigm that learns an action-conditioned generative state space. This enables quantitative "what-if" forecasting beyond large language models (LLMs) as the primary modeling primitive. Actions such as physical resource blocks (PRBs) are treated as first-class control inputs in a causal world model, and both aleatoric and epistemic uncertainty are modeled for prediction and what-if analysis. An agentic, model predictive control (MPC)-based cross-entropy method (CEM) planner operates over short horizons, using prior-mean rollouts within data-driven PRB bounds to maximize a deterministic reward. The model couples multi-scale structured state-space mixtures (MS3M) with a compact stochastic latent to form WM-MS3M, summarizing key performance indicators (KPIs) histories and predicting next-step KPIs under hypothetical PRB sequences. On realistic O-RAN traces, WM-MS3M cuts mean absolute error (MAE) by 1.69% versus MS3M with 32% fewer parameters and similar latency, and achieves 35-80% lower root mean squared error (RMSE) than attention/hybrid baselines with 2.3-4.1x faster inference, enabling rare-event simulation and offline policy screening.
comment: 13 Pages, 3 Figures, 4 Tables
☆ Calibration improves detection of mislabeled examples
Mislabeled data is a pervasive issue that undermines the performance of machine learning systems in real-world applications. An effective approach to mitigate this problem is to detect mislabeled instances and subject them to special treatment, such as filtering or relabeling. Automatic mislabeling detection methods typically rely on training a base machine learning model and then probing it for each instance to obtain a trust score that each provided label is genuine or incorrect. The properties of this base model are thus of paramount importance. In this paper, we investigate the impact of calibrating this model. Our empirical results show that using calibration methods improves the accuracy and robustness of mislabeled instance detection, providing a practical and effective solution for industrial applications.
☆ Does Interpretability of Knowledge Tracing Models Support Teacher Decision Making?
Knowledge tracing (KT) models are a crucial basis for pedagogical decision-making, namely which task to select next for a learner and when to stop teaching a particular skill. Given the high stakes of pedagogical decisions, KT models are typically required to be interpretable, in the sense that they should implement an explicit model of human learning and provide explicit estimates of learners' abilities. However, to our knowledge, no study to date has investigated whether the interpretability of KT models actually helps human teachers to make teaching decisions. We address this gap. First, we perform a simulation study to show that, indeed, decisions based on interpretable KT models achieve mastery faster compared to decisions based on a non-interpretable model. Second, we repeat the study but ask $N=12$ human teachers to make the teaching decisions based on the information provided by KT models. As expected, teachers rate interpretable KT models higher in terms of usability and trustworthiness. However, the number of tasks needed until mastery hardly differs between KT models. This suggests that the relationship between model interpretability and teacher decisions is not straightforward: teachers do not solely rely on KT models to make decisions and further research is needed to investigate how learners and teachers actually understand and use KT models.
comment: in press at the Workshop on Epistemics and Decision-Making in AI-Supported Education, AIED 2025
☆ Optimizing Kernel Discrepancies via Subset Selection
Kernel discrepancies are a powerful tool for analyzing worst-case errors in quasi-Monte Carlo (QMC) methods. Building on recent advances in optimizing such discrepancy measures, we extend the subset selection problem to the setting of kernel discrepancies, selecting an m-element subset from a large population of size $n \gg m$. We introduce a novel subset selection algorithm applicable to general kernel discrepancies to efficiently generate low-discrepancy samples from both the uniform distribution on the unit hypercube, the traditional setting of classical QMC, and from more general distributions $F$ with known density functions by employing the kernel Stein discrepancy. We also explore the relationship between the classical $L_2$ star discrepancy and its $L_\infty$ counterpart.
☆ Curriculum Design for Trajectory-Constrained Agent: Compressing Chain-of-Thought Tokens in LLMs NeurIPS'25
Training agents to operate under strict constraints during deployment, such as limited resource budgets or stringent safety requirements, presents significant challenges, especially when these constraints render the task complex. In this work, we propose a curriculum learning strategy that gradually tightens constraints during training, enabling the agent to incrementally master the deployment requirements. Inspired by self-paced learning techniques in unconstrained reinforcement learning (RL), our approach facilitates a smoother transition to challenging environments by initially training on simplified versions of the constraints and progressively introducing the full deployment conditions. We provide a theoretical analysis using an RL agent in a binary-tree Markov Decision Process (MDP) to demonstrate that our curriculum strategy can accelerate training relative to a baseline approach that imposes the trajectory constraints from the outset. Moreover, we empirically validate the effectiveness and generality of our method across both RL and large language model (LLM) agents in diverse settings, including a binary-tree MDP, a multi-task navigation domain, and a math reasoning task with two benchmarks. These results highlight the potential of curriculum design in enhancing the efficiency and performance of agents operating under complex trajectory constraints during deployment. Moreover, when applied to LLMs, our strategy enables compression of output chain-of-thought tokens, achieving a substantial inference speedup on consumer hardware, demonstrating its effectiveness for resource-constrained deployment.
comment: NeurIPS'25 paper
☆ Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes
Large language models (LLMs) are increasingly prevalent across diverse applications. However, their enormous size limits storage and processing capabilities to a few well-resourced stakeholders. As a result, most applications rely on pre-trained LLMs, fine-tuned for specific tasks. However, even storing the fine-tuned versions of these models remains a significant challenge due to the wide range of tasks they address. Recently, studies show that fine-tuning these models primarily affects a small fraction of parameters, highlighting the need for more efficient storage of fine-tuned models. This paper focuses on efficient storage of parameter updates in pre-trained models after fine-tuning. To address this challenge, we leverage the observation that fine-tuning updates are both low-rank and sparse, which can be utilized for storage efficiency. However, using only low-rank approximation or sparsification may discard critical singular components that enhance model expressivity. We first observe that given the same memory budget, sparsified low-rank approximations with larger ranks outperform standard low-rank approximations with smaller ranks. Building on this, we propose our method, optimal singular damage, that selectively sparsifies low-rank approximated updates by leveraging the interleaved importance of singular vectors, ensuring that the most impactful components are retained. We demonstrate through extensive experiments that our proposed methods lead to significant storage efficiency and superior accuracy within the same memory budget compared to employing the low-rank approximation or sparsification individually.
☆ RL-Aided Cognitive ISAC: Robust Detection and Sensing-Communication Trade-offs
This paper proposes a reinforcement learning (RL)-aided cognitive framework for massive MIMO-based integrated sensing and communication (ISAC) systems employing a uniform planar array (UPA). The focus is on enhancing radar sensing performance in environments with unknown and dynamic disturbance characteristics. A Wald-type detector is employed for robust target detection under non-Gaussian clutter, while a SARSA-based RL algorithm enables adaptive estimation of target positions without prior environmental knowledge. Based on the RL-derived sensing information, a joint waveform optimization strategy is formulated to balance radar sensing accuracy and downlink communication throughput. The resulting design provides an adaptive trade-off between detection performance and achievable sum rate through an analytically derived closed-form solution. Monte Carlo simulations demonstrate that the proposed cognitive ISAC framework achieves significantly improved detection probability compared to orthogonal and non-learning adaptive baselines, while maintaining competitive communication performance. These results underline the potential of RL-assisted sensing for robust and spectrum-efficient ISAC in next-generation wireless networks.
comment: 29 pages, 14 figures. Invited paper, submitted to the EURASIP Journal on Wireless Communications and Networking (JWCN)
☆ Scalable Evaluation and Neural Models for Compositional Generalization NeurIPS
Compositional generalization-a key open challenge in modern machine learning-requires models to predict unknown combinations of known concepts. However, assessing compositional generalization remains a fundamental challenge due to the lack of standardized evaluation protocols and the limitations of current benchmarks, which often favor efficiency over rigor. At the same time, general-purpose vision architectures lack the necessary inductive biases, and existing approaches to endow them compromise scalability. As a remedy, this paper introduces: 1) a rigorous evaluation framework that unifies and extends previous approaches while reducing computational requirements from combinatorial to constant; 2) an extensive and modern evaluation on the status of compositional generalization in supervised vision backbones, training more than 5000 models; 3) Attribute Invariant Networks, a class of models establishing a new Pareto frontier in compositional generalization, achieving a 23.43% accuracy improvement over baselines while reducing parameter overhead from 600% to 16% compared to fully disentangled counterparts.
comment: Accepted at the Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025
☆ In Situ Training of Implicit Neural Compressors for Scientific Simulations via Sketch-Based Regularization
Focusing on implicit neural representations, we present a novel in situ training protocol that employs limited memory buffers of full and sketched data samples, where the sketched data are leveraged to prevent catastrophic forgetting. The theoretical motivation for our use of sketching as a regularizer is presented via a simple Johnson-Lindenstrauss-informed result. While our methods may be of wider interest in the field of continual learning, we specifically target in situ neural compression using implicit neural representation-based hypernetworks. We evaluate our method on a variety of complex simulation data in two and three dimensions, over long time horizons, and across unstructured grids and non-Cartesian geometries. On these tasks, we show strong reconstruction performance at high compression rates. Most importantly, we demonstrate that sketching enables the presented in situ scheme to approximately match the performance of the equivalent offline method.
comment: 17 pages, 8 figures, 4 tables
☆ Nesterov-Accelerated Robust Federated Learning Over Byzantine Adversaries
We investigate robust federated learning, where a group of workers collaboratively train a shared model under the orchestration of a central server in the presence of Byzantine adversaries capable of arbitrary and potentially malicious behaviors. To simultaneously enhance communication efficiency and robustness against such adversaries, we propose a Byzantine-resilient Nesterov-Accelerated Federated Learning (Byrd-NAFL) algorithm. Byrd-NAFL seamlessly integrates Nesterov's momentum into the federated learning process alongside Byzantine-resilient aggregation rules to achieve fast and safeguarding convergence against gradient corruption. We establish a finite-time convergence guarantee for Byrd-NAFL under non-convex and smooth loss functions with relaxed assumption on the aggregated gradients. Extensive numerical experiments validate the effectiveness of Byrd-NAFL and demonstrate the superiority over existing benchmarks in terms of convergence speed, accuracy, and resilience to diverse Byzantine attack strategies.
☆ Apriel-H1: Towards Efficient Enterprise Reasoning Models
Large Language Models (LLMs) achieve remarkable reasoning capabilities through transformer architectures with attention mechanisms. However, transformers suffer from quadratic time and memory complexity in the attention module (MHA) and require caching key-value states during inference, which severely limits throughput and scalability. High inference throughput is critical for agentic tasks, long-context reasoning, efficient deployment under high request loads, and more efficient test-time compute scaling. State Space Models (SSMs) such as Mamba offer a promising alternative with linear inference complexity and a constant memory footprint via recurrent computation with fixed-size hidden states. In this technical report we introduce the Apriel-H1 family of hybrid LLMs that combine transformer attention and SSM sequence mixers for efficient reasoning at 15B model size. These models are obtained through incremental distillation from a pretrained reasoning transformer, Apriel-Nemotron-15B-Thinker, progressively replacing less critical attention layers with linear Mamba blocks. We release multiple post-distillation variants of Apriel-H1-15B-Thinker with different SSM-to-MHA ratios and analyse how reasoning performance degrades as more Mamba layers replace MHA. Additionally, we release a 30/50 hybrid variant of Apriel-H1, further fine-tuned on a supervised dataset of reasoning traces, achieving over 2x higher inference throughput when deployed in the production-ready vLLM environment, with minimal degradation in reasoning performance. This shows that distilled hybrid SSM-Transformer architectures can deliver substantial efficiency gains over the pretrained transformer equivalent without substantially compromising the reasoning quality.
☆ Federated Attention: A Distributed Paradigm for Collaborative LLM Inference over Edge Networks
Large language models (LLMs) are proliferating rapidly at the edge, delivering intelligent capabilities across diverse application scenarios. However, their practical deployment in collaborative scenarios confronts fundamental challenges: privacy vulnerabilities, communication overhead, and computational bottlenecks. To address these, we propose Federated Attention (FedAttn), which integrates the federated paradigm into the self-attention mechanism, creating a new distributed LLM inference framework that simultaneously achieves privacy protection, communication efficiency, and computational efficiency. FedAttn enables participants to perform local self-attention over their own token representations while periodically exchanging and aggregating Key-Value (KV) matrices across multiple Transformer blocks, collaboratively generating LLM responses without exposing private prompts. Further, we identify a structural duality between contextual representation refinement in FedAttn and parameter optimization in FL across private data, local computation, and global aggregation. This key insight provides a principled foundation for systematically porting federated optimization techniques to collaborative LLM inference. Building on this framework, we theoretically analyze how local self-attention computation within participants and heterogeneous token relevance among participants shape error propagation dynamics across Transformer blocks. Moreover, we characterize the fundamental trade-off between response quality and communication/computation efficiency, which is governed by the synchronization interval and the number of participants. Experimental results validate our theoretical analysis, and reveal significant optimization opportunities through sparse attention and adaptive KV aggregation, highlighting FedAttn's potential to deliver scalability and efficiency in real-world edge deployments.
☆ Natural-gas storage modelling by deep reinforcement learning
We introduce GasRL, a simulator that couples a calibrated representation of the natural gas market with a model of storage-operator policies trained with deep reinforcement learning (RL). We use it to analyse how optimal stockpile management affects equilibrium prices and the dynamics of demand and supply. We test various RL algorithms and find that Soft Actor Critic (SAC) exhibits superior performance in the GasRL environment: multiple objectives of storage operators - including profitability, robust market clearing and price stabilisation - are successfully achieved. Moreover, the equilibrium price dynamics induced by SAC-derived optimal policies have characteristics, such as volatility and seasonality, that closely match those of real-world prices. Remarkably, this adherence to the historical distribution of prices is obtained without explicitly calibrating the model to price data. We show how the simulator can be used to assess the effects of EU-mandated minimum storage thresholds. We find that such thresholds have a positive effect on market resilience against unanticipated shifts in the distribution of supply shocks. For example, with unusually large shocks, market disruptions are averted more often if a threshold is in place.
comment: 8 pages, 5 figures, published on
☆ Recursively Enumerably Representable Classes and Computable Versions of the Fundamental Theorem of Statistical Learning
We study computable probably approximately correct (CPAC) learning, where learners are required to be computable functions. It had been previously observed that the Fundamental Theorem of Statistical Learning, which characterizes PAC learnability by finiteness of the Vapnik-Chervonenkis (VC-)dimension, no longer holds in this framework. Recent works recovered analogs of the Fundamental Theorem in the computable setting, for instance by introducing an effective VC-dimension. Guided by this, we investigate the connection between CPAC learning and recursively enumerable representable (RER) classes, whose members can be algorithmically listed. Our results show that the effective VC-dimensions can take arbitrary values above the traditional one, even for RER classes, which creates a whole family of (non-)examples for various notions of CPAC learning. Yet the two dimensions coincide for classes satisfying sufficiently strong notions of CPAC learning. We then observe that CPAC learnability can also be characterized via containment of RER classes that realize the same samples. Furthermore, it is shown that CPAC learnable classes satisfying a unique identification property are necessarily RER. Finally, we establish that agnostic learnability can be guaranteed for RER classes, by considering the relaxed notion of nonuniform CPAC learning.
☆ The stability of shallow neural networks on spheres: A sharp spectral analysis
We present an estimation of the condition numbers of the \emph{mass} and \emph{stiffness} matrices arising from shallow ReLU$^k$ neural networks defined on the unit sphere~$\mathbb{S}^d$. In particular, when $\{\theta_j^*\}_{j=1}^n \subset \mathbb{S}^d$ is \emph{antipodally quasi-uniform}, the condition number is sharp. Indeed, in this case, we obtain sharp asymptotic estimates for the full spectrum of eigenvalues and characterize the structure of the corresponding eigenspaces, showing that the smallest eigenvalues are associated with an eigenbasis of low-degree polynomials while the largest eigenvalues are linked to high-degree polynomials. This spectral analysis establishes a precise correspondence between the approximation power of the network and its numerical stability.
☆ Verifying LLM Inference to Prevent Model Weight Exfiltration
As large AI models become increasingly valuable assets, the risk of model weight exfiltration from inference servers grows accordingly. An attacker controlling an inference server may exfiltrate model weights by hiding them within ordinary model outputs, a strategy known as steganography. This work investigates how to verify model responses to defend against such attacks and, more broadly, to detect anomalous or buggy behavior during inference. We formalize model exfiltration as a security game, propose a verification framework that can provably mitigate steganographic exfiltration, and specify the trust assumptions associated with our scheme. To enable verification, we characterize valid sources of non-determinism in large language model inference and introduce two practical estimators for them. We evaluate our detection framework on several open-weight models ranging from 3B to 30B parameters. On MOE-Qwen-30B, our detector reduces exfiltratable information to <0.5% with false-positive rate of 0.01%, corresponding to a >200x slowdown for adversaries. Overall, this work further establishes a foundation for defending against model weight exfiltration and demonstrates that strong protection can be achieved with minimal additional cost to inference providers.
☆ A Non-Adversarial Approach to Idempotent Generative Modelling
Idempotent Generative Networks (IGNs) are deep generative models that also function as local data manifold projectors, mapping arbitrary inputs back onto the manifold. They are trained to act as identity operators on the data and as idempotent operators off the data manifold. However, IGNs suffer from mode collapse, mode dropping, and training instability due to their objectives, which contain adversarial components and can cause the model to cover the data manifold only partially -- an issue shared with generative adversarial networks. We introduce Non-Adversarial Idempotent Generative Networks (NAIGNs) to address these issues. Our loss function combines reconstruction with the non-adversarial generative objective of Implicit Maximum Likelihood Estimation (IMLE). This improves on IGN's ability to restore corrupted data and generate new samples that closely match the data distribution. We moreover demonstrate that NAIGNs implicitly learn the distance field to the data manifold, as well as an energy-based model.
☆ Neural Network Interoperability Across Platforms
The development of smart systems (i.e., systems enhanced with AI components) has thrived thanks to the rapid advancements in neural networks (NNs). A wide range of libraries and frameworks have consequently emerged to support NN design and implementation. The choice depends on factors such as available functionalities, ease of use, documentation and community support. After adopting a given NN framework, organizations might later choose to switch to another if performance declines, requirements evolve, or new features are introduced. Unfortunately, migrating NN implementations across libraries is challenging due to the lack of migration approaches specifically tailored for NNs. This leads to increased time and effort to modernize NNs, as manual updates are necessary to avoid relying on outdated implementations and ensure compatibility with new features. In this paper, we propose an approach to automatically migrate neural network code across deep learning frameworks. Our method makes use of a pivot NN model to create an abstraction of the NN prior to migration. We validate our approach using two popular NN frameworks, namely PyTorch and TensorFlow. We also discuss the challenges of migrating code between the two frameworks and how they were approached in our method. Experimental evaluation on five NNs shows that our approach successfully migrates their code and produces NNs that are functionally equivalent to the originals. Artefacts from our work are available online.
☆ A Large Language Model for Corporate Credit Scoring
We introduce Omega^2, a Large Language Model-driven framework for corporate credit scoring that combines structured financial data with advanced machine learning to improve predictive reliability and interpretability. Our study evaluates Omega^2 on a multi-agency dataset of 7,800 corporate credit ratings drawn from Moody's, Standard & Poor's, Fitch, and Egan-Jones, each containing detailed firm-level financial indicators such as leverage, profitability, and liquidity ratios. The system integrates CatBoost, LightGBM, and XGBoost models optimized through Bayesian search under temporal validation to ensure forward-looking and reproducible results. Omega^2 achieved a mean test AUC above 0.93 across agencies, confirming its ability to generalize across rating systems and maintain temporal consistency. These results show that combining language-based reasoning with quantitative learning creates a transparent and institution-grade foundation for reliable corporate credit-risk assessment.
☆ Redundancy Maximization as a Principle of Associative Memory Learning
Associative memory, traditionally modeled by Hopfield networks, enables the retrieval of previously stored patterns from partial or noisy cues. Yet, the local computational principles which are required to enable this function remain incompletely understood. To formally characterize the local information processing in such systems, we employ a recent extension of information theory - Partial Information Decomposition (PID). PID decomposes the contribution of different inputs to an output into unique information from each input, redundant information across inputs, and synergistic information that emerges from combining different inputs. Applying this framework to individual neurons in classical Hopfield networks we find that below the memory capacity, the information in a neuron's activity is characterized by high redundancy between the external pattern input and the internal recurrent input, while synergy and unique information are close to zero until the memory capacity is surpassed and performance drops steeply. Inspired by this observation, we use redundancy as an information-theoretic learning goal, which is directly optimized for each neuron, dramatically increasing the network's memory capacity to 1.59, a more than tenfold improvement over the 0.14 capacity of classical Hopfield networks and even outperforming recent state-of-the-art implementations of Hopfield networks. Ultimately, this work establishes redundancy maximization as a new design principle for associative memories and opens pathways for new associative memory models based on information-theoretic goals.
comment: 21 pages, 8 figures
☆ TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.
comment: 13 pages, 8 figures, 3 tables. The first two authors contributed equally. Project Page: https://iyatomilab.github.io/TAUE
☆ Directional-Clamp PPO
Proximal Policy Optimization (PPO) is widely regarded as one of the most successful deep reinforcement learning algorithms, known for its robustness and effectiveness across a range of problems. The PPO objective encourages the importance ratio between the current and behavior policies to move to the "right" direction -- starting from importance sampling ratios equal to 1, increasing the ratios for actions with positive advantages and decreasing those with negative advantages. A clipping function is introduced to prevent over-optimization when updating the importance ratio in these "right" direction regions. Many PPO variants have been proposed to extend its success, most of which modify the objective's behavior by altering the clipping in the "right" direction regions. However, due to randomness in the rollouts and stochasticity of the policy optimization, we observe that the ratios frequently move to the "wrong" direction during the PPO optimization. This is a key factor hindering the improvement of PPO, but it has been largely overlooked. To address this, we propose the Directional-Clamp PPO algorithm (DClamp-PPO), which further penalizes the actions going to the strict "wrong" direction regions, where the advantage is positive (negative) and importance ratio falls below (above) $1 - \beta$ ($1+\beta$), for a tunable parameter $\beta \in (0, 1)$. The penalty is by enforcing a steeper loss slope, i.e., a clamp, in those regions. We demonstrate that DClamp-PPO consistently outperforms PPO, as well as its variants, by focusing on modifying the objective's behavior in the "right" direction, across various MuJoCo environments, using different random seeds. The proposed method is shown, both theoretically and empirically, to better avoid "wrong" direction updates while keeping the importance ratio closer to 1.
☆ RIS-Assisted 3D Spherical Splatting for Object Composition Visualization using Detection Transformers
The pursuit of immersive and structurally aware multimedia experiences has intensified interest in sensing modalities that reconstruct objects beyond the limits of visible light. Conventional optical pipelines degrade under occlusion or low illumination, motivating the use of radio-frequency (RF) sensing, whose electromagnetic waves penetrate materials and encode both geometric and compositional information. Yet, uncontrolled multipath propagation restricts reconstruction accuracy. Recent advances in Programmable Wireless Environments (PWEs) mitigate this limitation by enabling software-defined manipulation of propagation through Reconfigurable Intelligent Surfaces (RISs), thereby providing controllable illumination diversity. Building on this capability, this work introduces a PWE-driven RF framework for three-dimensional object reconstruction using material-aware spherical primitives. The proposed approach combines RIS-enabled field synthesis with a Detection Transformer (DETR) that infers spatial and material parameters directly from extracted RF features. Simulation results confirm the framework's ability to approximate object geometries and classify material composition with an overall accuracy of 79.35%, marking an initial step toward programmable and physically grounded RF-based 3D object composition visualization.
comment: Submitted to IEEE ICC 2026
☆ Dynamic Priors in Bayesian Optimization for Hyperparameter Optimization
Hyperparameter optimization (HPO), for example, based on Bayesian optimization (BO), supports users in designing models well-suited for a given dataset. HPO has proven its effectiveness on several applications, ranging from classical machine learning for tabular data to deep neural networks for computer vision and transformers for natural language processing. However, HPO still sometimes lacks acceptance by machine learning experts due to its black-box nature and limited user control. Addressing this, first approaches have been proposed to initialize BO methods with expert knowledge. However, these approaches do not allow for online steering during the optimization process. In this paper, we introduce a novel method that enables repeated interventions to steer BO via user input, specifying expert knowledge and user preferences at runtime of the HPO process in the form of prior distributions. To this end, we generalize an existing method, $\pi$BO, preserving theoretical guarantees. We also introduce a misleading prior detection scheme, which allows protection against harmful user inputs. In our experimental evaluation, we demonstrate that our method can effectively incorporate multiple priors, leveraging informative priors, whereas misleading priors are reliably rejected or overcome. Thereby, we achieve competitiveness to unperturbed BO.
comment: 9 pages
☆ Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning NeurIPS 2025
Offline reinforcement learning (RL) suffers from extrapolation errors induced by out-of-distribution (OOD) actions. To address this, offline RL algorithms typically impose constraints on action selection, which can be systematically categorized into density, support, and sample constraints. However, we show that each category has inherent limitations: density and sample constraints tend to be overly conservative in many scenarios, while the support constraint, though least restrictive, faces challenges in accurately modeling the behavior policy. To overcome these limitations, we propose a new neighborhood constraint that restricts action selection in the Bellman target to the union of neighborhoods of dataset actions. Theoretically, the constraint not only bounds extrapolation errors and distribution shift under certain conditions, but also approximates the support constraint without requiring behavior policy modeling. Moreover, it retains substantial flexibility and enables pointwise conservatism by adapting the neighborhood radius for each data point. In practice, we employ data quality as the adaptation criterion and design an adaptive neighborhood constraint. Building on an efficient bilevel optimization framework, we develop a simple yet effective algorithm, Adaptive Neighborhood-constrained Q learning (ANQ), to perform Q learning with target actions satisfying this constraint. Empirically, ANQ achieves state-of-the-art performance on standard offline RL benchmarks and exhibits strong robustness in scenarios with noisy or limited data.
comment: Accepted to NeurIPS 2025 (Spotlight)
☆ Forecasting Future Anatomies: Longitudianl Brain Mri-to-Mri Prediction
Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer's disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant's entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.
☆ Theoretical Guarantees for Causal Discovery on Large Random Graphs
We investigate theoretical guarantees for the false-negative rate (FNR) -- the fraction of true causal edges whose orientation is not recovered, under single-variable random interventions and an $\epsilon$-interventional faithfulness assumption that accommodates latent confounding. For sparse Erd\H{o}s--R\'enyi directed acyclic graphs, where the edge probability scales as $p_e = \Theta(1/d)$, we show that the FNR concentrates around its mean at rate $O(\frac{\log d}{\sqrt d})$, implying that large deviations above the expected error become exponentially unlikely as dimensionality increases. This concentration ensures that derived upper bounds hold with high probability in large-scale settings. Extending the analysis to generalized Barab\'asi--Albert graphs reveals an even stronger phenomenon: when the degree exponent satisfies $\gamma > 3$, the deviation width scales as $O(d^{\beta - \frac{1}{2}})$ with $\beta = 1/(\gamma - 1) < \frac{1}{2}$, and hence vanishes in the limit. This demonstrates that realistic scale-free topologies intrinsically regularize causal discovery, reducing variability in orientation error. These finite-dimension results provide the first dimension-adaptive, faithfulness-robust guarantees for causal structure recovery, and challenge the intuition that high dimensionality and network heterogeneity necessarily hinder accurate discovery. Our simulation results corroborate these theoretical predictions, showing that the FNR indeed concentrates and often vanishes in practice as dimensionality grows.
☆ Rawlsian many-to-one matching with non-linear utility
We study a many-to-one matching problem, such as the college admission problem, where each college can admit multiple students. Unlike classical models, colleges evaluate sets of students through non-linear utility functions that capture diversity between them. In this setting, we show that classical stable matchings may fail to exist. To address this, we propose alternative solution concepts based on Rawlsian fairness, aiming to maximize the minimum utility across colleges. We design both deterministic and stochastic algorithms that iteratively improve the outcome of the worst-off college, offering a practical approach to fair allocation when stability cannot be guaranteed.
comment: 17 pages, 7 figures
☆ Agentic AI for Mobile Network RAN Management and Optimization
Agentic AI represents a new paradigm for automating complex systems by using Large AI Models (LAMs) to provide human-level cognitive abilities with multimodal perception, planning, memory, and reasoning capabilities. This will lead to a new generation of AI systems that autonomously decompose goals, retain context over time, learn continuously, operate across tools and environments, and adapt dynamically. The complexity of 5G and upcoming 6G networks renders manual optimization ineffective, pointing to Agentic AI as a method for automating decisions in dynamic RAN environments. However, despite its rapid advances, there is no established framework outlining the foundational components and operational principles of Agentic AI systems nor a universally accepted definition. This paper contributes to ongoing research on Agentic AI in 5G and 6G networks by outlining its core concepts and then proposing a practical use case that applies Agentic principles to RAN optimization. We first introduce Agentic AI, tracing its evolution from classical agents and discussing the progress from workflows and simple AI agents to Agentic AI. Core design patterns-reflection, planning, tool use, and multi-agent collaboration-are then described to illustrate how intelligent behaviors are orchestrated. These theorical concepts are grounded in the context of mobile networks, with a focus on RAN management and optimization. A practical 5G RAN case study shows how time-series analytics and LAM-driven agents collaborate for KPI-based autonomous decision-making.
☆ Causal Graph Neural Networks for Healthcare
Healthcare artificial intelligence systems routinely fail when deployed across institutions, with documented performance drops and perpetuation of discriminatory patterns embedded in historical data. This brittleness stems, in part, from learning statistical associations rather than causal mechanisms. Causal graph neural networks address this triple crisis of distribution shift, discrimination, and inscrutability by combining graph-based representations of biomedical data with causal inference principles to learn invariant mechanisms rather than spurious correlations. This Review examines methodological foundations spanning structural causal models, disentangled causal representation learning, and techniques for interventional prediction and counterfactual reasoning on graphs. We analyse applications demonstrating clinical value across psychiatric diagnosis through brain network analysis, cancer subtyping via multi-omics causal integration, continuous physiological monitoring with mechanistic interpretation, and drug recommendation correcting prescription bias. These advances establish foundations for patient-specific Causal Digital Twins, enabling in silico clinical experimentation, with integration of large language models for hypothesis generation and causal graph neural networks for mechanistic validation. Substantial barriers remain, including computational requirements precluding real-time deployment, validation challenges demanding multi-modal evidence triangulation beyond cross-validation, and risks of causal-washing where methods employ causal terminology without rigorous evidentiary support. We propose tiered frameworks distinguishing causally-inspired architectures from causally-validated discoveries and identify critical research priorities making causal rather than purely associational claims.
☆ Many-vs-Many Missile Guidance via Virtual Targets
This paper presents a novel approach to many-vs-many missile guidance using virtual targets (VTs) generated by a Normalizing Flows-based trajectory predictor. Rather than assigning n interceptors directly to m physical targets through conventional weapon target assignment algorithms, we propose a centralized strategy that constructs n VT trajectories representing probabilistic predictions of maneuvering target behavior. Each interceptor is guided toward its assigned VT using Zero-Effort-Miss guidance during midcourse flight, transitioning to Proportional Navigation guidance for terminal interception. This approach treats many-vs-many engagements as many-vs-distribution scenarios, exploiting numerical superiority (n > m) by distributing interceptors across diverse trajectory hypotheses rather than pursuing identical deterministic predictions. Monte Carlo simulations across various target-interceptor configurations (1-6 targets, 1-8 interceptors) demonstrate that the VT method matches or exceeds baseline straight-line prediction performance by 0-4.1% when n = m, with improvements increasing to 5.8-14.4% when n > m. The results confirm that probabilistic VTs enable effective exploitation of numerical superiority, significantly increasing interception probability in many-vs-many scenarios.
comment: will be submitted to Journal of Guidance, Control, and Dynamics as Technical Note
☆ An End-to-End Learning Approach for Solving Capacitated Location-Routing Problems
The capacitated location-routing problems (CLRPs) are classical problems in combinatorial optimization, which require simultaneously making location and routing decisions. In CLRPs, the complex constraints and the intricate relationships between various decisions make the problem challenging to solve. With the emergence of deep reinforcement learning (DRL), it has been extensively applied to address the vehicle routing problem and its variants, while the research related to CLRPs still needs to be explored. In this paper, we propose the DRL with heterogeneous query (DRLHQ) to solve CLRP and open CLRP (OCLRP), respectively. We are the first to propose an end-to-end learning approach for CLRPs, following the encoder-decoder structure. In particular, we reformulate the CLRPs as a markov decision process tailored to various decisions, a general modeling framework that can be adapted to other DRL-based methods. To better handle the interdependency across location and routing decisions, we also introduce a novel heterogeneous querying attention mechanism designed to adapt dynamically to various decision-making stages. Experimental results on both synthetic and benchmark datasets demonstrate superior solution quality and better generalization performance of our proposed approach over representative traditional and DRL-based baselines in solving both CLRP and OCLRP.
☆ Variational Geometric Information Bottleneck: Learning the Shape of Understanding
We propose a unified information-geometric framework that formalizes understanding in learning as a trade-off between informativeness and geometric simplicity. An encoder phi is evaluated by U(phi) = I(phi(X); Y) - beta * C(phi), where C(phi) penalizes curvature and intrinsic dimensionality, enforcing smooth, low-complexity manifolds. Under mild manifold and regularity assumptions, we derive non-asymptotic bounds showing that generalization error scales with intrinsic dimension while curvature controls approximation stability, directly linking geometry to sample efficiency. To operationalize this theory, we introduce the Variational Geometric Information Bottleneck (V-GIB), a variational estimator that unifies mutual-information compression and curvature regularization through tractable geometric proxies such as the Hutchinson trace, Jacobian norms, and local PCA. Experiments across synthetic manifolds, few-shot settings, and real-world datasets (Fashion-MNIST, CIFAR-10) reveal a robust information-geometry Pareto frontier, stable estimators, and substantial gains in interpretive efficiency. Fractional-data experiments on CIFAR-10 confirm that curvature-aware encoders maintain predictive power under data scarcity, validating the predicted efficiency-curvature law. Overall, V-GIB provides a principled and measurable route to representations that are geometrically coherent, data-efficient, and aligned with human-understandable structure.
☆ BRAINS: A Retrieval-Augmented System for Alzheimer's Detection and Monitoring ICML
As the global burden of Alzheimer's disease (AD) continues to grow, early and accurate detection has become increasingly critical, especially in regions with limited access to advanced diagnostic tools. We propose BRAINS (Biomedical Retrieval-Augmented Intelligence for Neurodegeneration Screening) to address this challenge. This novel system harnesses the powerful reasoning capabilities of Large Language Models (LLMs) for Alzheimer's detection and monitoring. BRAINS features a dual-module architecture: a cognitive diagnostic module and a case-retrieval module. The Diagnostic Module utilizes LLMs fine-tuned on cognitive and neuroimaging datasets -- including MMSE, CDR scores, and brain volume metrics -- to perform structured assessments of Alzheimer's risk. Meanwhile, the Case Retrieval Module encodes patient profiles into latent representations and retrieves similar cases from a curated knowledge base. These auxiliary cases are fused with the input profile via a Case Fusion Layer to enhance contextual understanding. The combined representation is then processed with clinical prompts for inference. Evaluations on real-world datasets demonstrate BRAINS effectiveness in classifying disease severity and identifying early signs of cognitive decline. This system not only shows strong potential as an assistive tool for scalable, explainable, and early-stage Alzheimer's disease detection, but also offers hope for future applications in the field.
comment: Accepted for publication in ICMLA 2025
☆ Learning CNF formulas from uniform random solutions in the local lemma regime
We study the problem of learning a $n$-variables $k$-CNF formula $\Phi$ from its i.i.d. uniform random solutions, which is equivalent to learning a Boolean Markov random field (MRF) with $k$-wise hard constraints. Revisiting Valiant's algorithm (Commun. ACM'84), we show that it can exactly learn (1) $k$-CNFs with bounded clause intersection size under Lov\'asz local lemma type conditions, from $O(\log n)$ samples; and (2) random $k$-CNFs near the satisfiability threshold, from $\widetilde{O}(n^{\exp(-\sqrt{k})})$ samples. These results significantly improve the previous $O(n^k)$ sample complexity. We further establish new information-theoretic lower bounds on sample complexity for both exact and approximate learning from i.i.d. uniform random solutions.
☆ NOWS: Neural Operator Warm Starts for Accelerating Iterative Solvers
Partial differential equations (PDEs) underpin quantitative descriptions across the physical sciences and engineering, yet high-fidelity simulation remains a major computational bottleneck for many-query, real-time, and design tasks. Data-driven surrogates can be strikingly fast but are often unreliable when applied outside their training distribution. Here we introduce Neural Operator Warm Starts (NOWS), a hybrid strategy that harnesses learned solution operators to accelerate classical iterative solvers by producing high-quality initial guesses for Krylov methods such as conjugate gradient and GMRES. NOWS leaves existing discretizations and solver infrastructures intact, integrating seamlessly with finite-difference, finite-element, isogeometric analysis, finite volume method, etc. Across our benchmarks, the learned initialization consistently reduces iteration counts and end-to-end runtime, resulting in a reduction of the computational time of up to 90 %, while preserving the stability and convergence guarantees of the underlying numerical algorithms. By combining the rapid inference of neural operators with the rigor of traditional solvers, NOWS provides a practical and trustworthy approach to accelerate high-fidelity PDE simulations.
☆ SKGE: Spherical Knowledge Graph Embedding with Geometric Regularization
Knowledge graph embedding (KGE) has become a fundamental technique for representation learning on multi-relational data. Many seminal models, such as TransE, operate in an unbounded Euclidean space, which presents inherent limitations in modeling complex relations and can lead to inefficient training. In this paper, we propose Spherical Knowledge Graph Embedding (SKGE), a model that challenges this paradigm by constraining entity representations to a compact manifold: a hypersphere. SKGE employs a learnable, non-linear Spherization Layer to map entities onto the sphere and interprets relations as a hybrid translate-then-project transformation. Through extensive experiments on three benchmark datasets, FB15k-237, CoDEx-S, and CoDEx-M, we demonstrate that SKGE consistently and significantly outperforms its strong Euclidean counterpart, TransE, particularly on large-scale benchmarks such as FB15k-237 and CoDEx-M, demonstrating the efficacy of the spherical geometric prior. We provide an in-depth analysis to reveal the sources of this advantage, showing that this geometric constraint acts as a powerful regularizer, leading to comprehensive performance gains across all relation types. More fundamentally, we prove that the spherical geometry creates an "inherently hard negative sampling" environment, naturally eliminating trivial negatives and forcing the model to learn more robust and semantically coherent representations. Our findings compellingly demonstrate that the choice of manifold is not merely an implementation detail but a fundamental design principle, advocating for geometric priors as a cornerstone for designing the next generation of powerful and stable KGE models.
☆ Accounting for Underspecification in Statistical Claims of Model Superiority
Machine learning methods are increasingly applied in medical imaging, yet many reported improvements lack statistical robustness: recent works have highlighted that small but significant performance gains are highly likely to be false positives. However, these analyses do not take \emph{underspecification} into account -- the fact that models achieving similar validation scores may behave differently on unseen data due to random initialization or training dynamics. Here, we extend a recent statistical framework modeling false outperformance claims to include underspecification as an additional variance component. Our simulations demonstrate that even modest seed variability ($\sim1\%$) substantially increases the evidence required to support superiority claims. Our findings underscore the need for explicit modeling of training variance when validating medical imaging systems.
comment: Medical Imaging meets EurIPS Workshop: MedEurIPS 2025
☆ An Adaptive Sampling Framework for Detecting Localized Concept Drift under Label Scarcity
Concept drift and label scarcity are two critical challenges limiting the robustness of predictive models in dynamic industrial environments. Existing drift detection methods often assume global shifts and rely on dense supervision, making them ill-suited for regression tasks with local drifts and limited labels. This paper proposes an adaptive sampling framework that combines residual-based exploration and exploitation with EWMA monitoring to efficiently detect local concept drift under labeling budget constraints. Empirical results on synthetic benchmarks and a case study on electricity market demonstrate superior performance in label efficiency and drift detection accuracy.
☆ Improving Unlearning with Model Updates Probably Aligned with Gradients
We formulate the machine unlearning problem as a general constrained optimization problem. It unifies the first-order methods from the approximate machine unlearning literature. This paper then introduces the concept of feasible updates as the model's parameter update directions that help with unlearning while not degrading the utility of the initial model. Our design of feasible updates is based on masking, \ie\ a careful selection of the model's parameters worth updating. It also takes into account the estimation noise of the gradients when processing each batch of data to offer a statistical guarantee to derive locally feasible updates. The technique can be plugged in, as an add-on, to any first-order approximate unlearning methods. Experiments with computer vision classifiers validate this approach.
comment: Accepted to AISec'25 co-located with the 32nd ACM Conference on Computer and Communications Security
☆ Arithmetic Circuits and Neural Networks for Regular Matroids
We prove that there exist uniform $(+,\times,/)$-circuits of size $O(n^3)$ to compute the basis generating polynomial of regular matroids on $n$ elements. By tropicalization, this implies that there exist uniform $(\max,+,-)$-circuits and ReLU neural networks of the same size for weighted basis maximization of regular matroids. As a consequence in linear programming theory, we obtain a first example where taking the difference of two extended formulations can be more efficient than the best known individual extended formulation of size $O(n^6)$ by Aprile and Fiorini. Such differences have recently been introduced as virtual extended formulations. The proof of our main result relies on a fine-tuned version of Seymour's decomposition of regular matroids which allows us to identify and maintain graphic substructures to which we can apply a local version of the star-mesh transformation.
☆ MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization
The development of clinically reliable artificial intelligence (AI) systems for mammography is hindered by profound heterogeneity in data quality, metadata standards, and population distributions across public datasets. This heterogeneity introduces dataset-specific biases that severely compromise the generalizability of the model, a fundamental barrier to clinical deployment. We present MammoClean, a public framework for standardization and bias quantification in mammography datasets. MammoClean standardizes case selection, image processing (including laterality and intensity correction), and unifies metadata into a consistent multi-view structure. We provide a comprehensive review of breast anatomy, imaging characteristics, and public mammography datasets to systematically identify key sources of bias. Applying MammoClean to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify substantial distributional shifts in breast density and abnormality prevalence. Critically, we demonstrate the direct impact of data corruption: AI models trained on corrupted datasets exhibit significant performance degradation compared to their curated counterparts. By using MammoClean to identify and mitigate bias sources, researchers can construct unified multi-dataset training corpora that enable development of robust models with superior cross-domain generalization. MammoClean provides an essential, reproducible pipeline for bias-aware AI development in mammography, facilitating fairer comparisons and advancing the creation of safe, effective systems that perform equitably across diverse patient populations and clinical settings. The open-source code is publicly available from: https://github.com/Minds-R-Lab/MammoClean.
☆ A Spatially Informed Gaussian Process UCB Method for Decentralized Coverage Control
We present a novel decentralized algorithm for coverage control in unknown spatial environments modeled by Gaussian Processes (GPs). To trade-off between exploration and exploitation, each agent autonomously determines its trajectory by minimizing a local cost function. Inspired by the GP-UCB (Upper Confidence Bound for GPs) acquisition function, the proposed cost combines the expected locational cost with a variance-based exploration term, guiding agents toward regions that are both high in predicted density and model uncertainty. Compared to previous work, our algorithm operates in a fully decentralized fashion, relying only on local observations and communication with neighboring agents. In particular, agents periodically update their inducing points using a greedy selection strategy, enabling scalable online GP updates. We demonstrate the effectiveness of our algorithm in simulation.
Self-Supervised Moving Object Segmentation of Sparse and Noisy Radar Point Clouds SC 2025
Moving object segmentation is a crucial task for safe and reliable autonomous mobile systems like self-driving cars, improving the reliability and robustness of subsequent tasks like SLAM or path planning. While the segmentation of camera or LiDAR data is widely researched and achieves great results, it often introduces an increased latency by requiring the accumulation of temporal sequences to gain the necessary temporal context. Radar sensors overcome this problem with their ability to provide a direct measurement of a point's Doppler velocity, which can be exploited for single-scan moving object segmentation. However, radar point clouds are often sparse and noisy, making data annotation for use in supervised learning very tedious, time-consuming, and cost-intensive. To overcome this problem, we address the task of self-supervised moving object segmentation of sparse and noisy radar point clouds. We follow a two-step approach of contrastive self-supervised representation learning with subsequent supervised fine-tuning using limited amounts of annotated data. We propose a novel clustering-based contrastive loss function with cluster refinement based on dynamic points removal to pretrain the network to produce motion-aware representations of the radar data. Our method improves label efficiency after fine-tuning, effectively boosting state-of-the-art performance by self-supervised pretraining.
comment: Accepted for publication at IEEE International Conference on Intelligent Transportation Systems (ITSC 2025), 8 pages, 3 figures
☆ H-Infinity Filter Enhanced CNN-LSTM for Arrhythmia Detection from Heart Sound Recordings ICSE
Early detection of heart arrhythmia can prevent severe future complications in cardiac patients. While manual diagnosis still remains the clinical standard, it relies heavily on visual interpretation and is inherently subjective. In recent years, deep learning has emerged as a powerful tool to automate arrhythmia detection, offering improved accuracy, consistency, and efficiency. Several variants of convolutional and recurrent neural network architectures have been widely explored to capture spatial and temporal patterns in physiological signals. However, despite these advancements, current models often struggle to generalize well in real-world scenarios, especially when dealing with small or noisy datasets, which are common challenges in biomedical applications. In this paper, a novel CNN-H-Infinity-LSTM architecture is proposed to identify arrhythmic heart signals from heart sound recordings. This architecture introduces trainable parameters inspired by the H-Infinity filter from control theory, enhancing robustness and generalization. Extensive experimentation on the PhysioNet CinC Challenge 2016 dataset, a public benchmark of heart audio recordings, demonstrates that the proposed model achieves stable convergence and outperforms existing benchmarks, with a test accuracy of 99.42% and an F1 score of 98.85%.
comment: This is a preprint of a paper to appear at the 15th IEEE International Conference on Systems Engineering and Technology (ICSET 2025)
☆ AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to 95% attack success rate on Llama-3.1-8B within six turns a 24 percent improvement over single turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.
☆ A new class of Markov random fields enabling lightweight sampling
This work addresses the problem of efficient sampling of Markov random fields (MRF). The sampling of Potts or Ising MRF is most often based on Gibbs sampling, and is thus computationally expensive. We consider in this work how to circumvent this bottleneck through a link with Gaussian Markov Random fields. The latter can be sampled in several cost-effective ways, and we introduce a mapping from real-valued GMRF to discrete-valued MRF. The resulting new class of MRF benefits from a few theoretical properties that validate the new model. Numerical results show the drastic performance gain in terms of computational efficiency, as we sample at least 35x faster than Gibbs sampling using at least 37x less energy, all the while exhibiting empirical properties close to classical MRFs.
☆ LUMA-RAG: Lifelong Multimodal Agents with Provably Stable Streaming Alignment
Retrieval-Augmented Generation (RAG) has emerged as the dominant paradigm for grounding large language model outputs in verifiable evidence. However, as modern AI agents transition from static knowledge bases to continuous multimodal streams encompassing text, images, video, and audio, two critical challenges arise: maintaining index freshness without prohibitive re-indexing costs, and preserving cross-modal semantic consistency across heterogeneous embedding spaces. We present LUMA-RAG, a lifelong multimodal agent architecture featuring three key innovations: (i) a streaming, multi-tier memory system that dynamically spills embeddings from a hot HNSW tier to a compressed IVFPQ tier under strict memory budgets; (ii) a streaming CLAP->CLIP alignment bridge that maintains cross-modal consistency through incremental orthogonal Procrustes updates; and (iii) stability-aware retrieval telemetry providing Safe@k guarantees by jointly bounding alignment drift and quantization error. Experiments demonstrate robust text-to-image retrieval (Recall@10 = 0.94), graceful performance degradation under product quantization offloading, and provably stable audio-to-image rankings (Safe@1 = 1.0), establishing LUMA-RAG as a practical framework for production multimodal RAG systems.
☆ Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation CIKM 2025
Query augmentation makes queries more meaningful by appending further information to the queries to find relevant documents. Current studies have proposed Large Language Model (LLM)-based embedders, which learn representation for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLM. During inference, these jointly trained embedders have conducted query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency and query augmentation can be detrimental to performance for some queries. Also, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level. One includes queries that require augmentation and the other includes queries that do not. Then, we introduces a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for others. Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, providing much faster embedding latency.
comment: Accepted to MMGenSR Workshop (CIKM 2025)
☆ An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks
The widespread deployment of Large Language Models (LLMs) as public-facing web services and APIs has made their security a core concern for the web ecosystem. Jailbreak attacks, as one of the significant threats to LLMs, have recently attracted extensive research. In this paper, we reveal a jailbreak strategy which can effectively evade current defense strategies. It can extract valuable information from failed or partially successful attack attempts and contains self-evolution from attack interactions, resulting in sufficient strategy diversity and adaptability. Inspired by continuous learning and modular design principles, we propose ASTRA, a jailbreak framework that autonomously discovers, retrieves, and evolves attack strategies to achieve more efficient and adaptive attacks. To enable this autonomous evolution, we design a closed-loop "attack-evaluate-distill-reuse" core mechanism that not only generates attack prompts but also automatically distills and generalizes reusable attack strategies from every interaction. To systematically accumulate and apply this attack knowledge, we introduce a three-tier strategy library that categorizes strategies into Effective, Promising, and Ineffective based on their performance scores. The strategy library not only provides precise guidance for attack generation but also possesses exceptional extensibility and transferability. We conduct extensive experiments under a black-box setting, and the results show that ASTRA achieves an average Attack Success Rate (ASR) of 82.7%, significantly outperforming baselines.
☆ Evolving Graph Learning for Out-of-Distribution Generalization in Non-stationary Environments
Graph neural networks have shown remarkable success in exploiting the spatial and temporal patterns on dynamic graphs. However, existing GNNs exhibit poor generalization ability under distribution shifts, which is inevitable in dynamic scenarios. As dynamic graph generation progresses amid evolving latent non-stationary environments, it is imperative to explore their effects on out-of-distribution (OOD) generalization. This paper proposes a novel Evolving Graph Learning framework for OOD generalization (EvoOOD) by environment-aware invariant pattern recognition. Specifically, we first design an environment sequential variational auto-encoder to model environment evolution and infer the underlying environment distribution. Then, we introduce a mechanism for environment-aware invariant pattern recognition, tailored to address environmental diversification through inferred distributions. Finally, we conduct fine-grained causal interventions on individual nodes using a mixture of instantiated environment samples. This approach helps to distinguish spatio-temporal invariant patterns for OOD prediction, especially in non-stationary environments. Experimental results demonstrate the superiority of EvoGOOD on both real-world and synthetic dynamic datasets under distribution shifts. To the best of our knowledge, it is the first attempt to study the dynamic graph OOD generalization problem from the environment evolution perspective.
☆ Human-Machine Ritual: Synergic Performance through Real-Time Motion Recognition NeurIPS 2025
We introduce a lightweight, real-time motion recognition system that enables synergic human-machine performance through wearable IMU sensor data, MiniRocket time-series classification, and responsive multimedia control. By mapping dancer-specific movement to sound through somatic memory and association, we propose an alternative approach to human-machine collaboration, one that preserves the expressive depth of the performing body while leveraging machine learning for attentive observation and responsiveness. We demonstrate that this human-centered design reliably supports high accuracy classification (<50 ms latency), offering a replicable framework to integrate dance-literate machines into creative, educational, and live performance contexts.
comment: 8 pages, 5 figures. Camera-ready manuscript for the Creative AI Track of NeurIPS 2025
☆ Reducing normalizing flow complexity for MCMC preconditioning
Preconditioning is a key component of MCMC algorithms that improves sampling efficiency by facilitating exploration of geometrically complex target distributions through an invertible map. While linear preconditioners are often sufficient for moderately complex target distributions, recent work has explored nonlinear preconditioning with invertible neural networks as components of normalizing flows (NFs). However, empirical and theoretical studies show that overparameterized NF preconditioners can degrade sampling efficiency and fit quality. Moreover, existing NF-based approaches do not adapt their architectures to the target distribution. Related work outside of MCMC similarly finds that suitably parameterized NFs can achieve comparable or superior performance with substantially less training time or data. We propose a factorized preconditioning architecture that reduces NF complexity by combining a linear component with a conditional NF, improving adaptability to target geometry. The linear preconditioner is applied to dimensions that are approximately Gaussian, as estimated from warmup samples, while the conditional NF models more complex dimensions. Our method yields significantly better tail samples on two complex synthetic distributions and consistently better performance on a sparse logistic regression posterior across varying likelihood and prior strengths. It also achieves higher effective sample sizes on hierarchical Bayesian model posteriors with weak likelihoods and strong funnel geometries. This approach is particularly relevant for hierarchical Bayesian model analyses with limited data and could inform current theoretical and software strides in neural MCMC design.
comment: 22 pages, 6 figures
☆ Learning A Universal Crime Predictor with Knowledge-guided Hypernetworks ECAI 2025
Predicting crimes in urban environments is crucial for public safety, yet existing prediction methods often struggle to align the knowledge across diverse cities that vary dramatically in data availability of specific crime types. We propose HYpernetwork-enhanced Spatial Temporal Learning (HYSTL), a framework that can effectively train a unified, stronger crime predictor without assuming identical crime types in different cities' records. In HYSTL, instead of parameterising a dedicated predictor per crime type, a hypernetwork is designed to dynamically generate parameters for the prediction function conditioned on the crime type of interest. To bridge the semantic gap between different crime types, a structured crime knowledge graph is built, where the learned representations of crimes are used as the input to the hypernetwork to facilitate parameter generation. As such, when making predictions for each crime type, the predictor is additionally guided by its intricate association with other relevant crime types. Extensive experiments are performed on two cities with non-overlapping crime types, and the results demonstrate HYSTL outperforms state-of-the-art baselines.
comment: Accepted by ECAI 2025
☆ RoME: Domain-Robust Mixture-of-Experts for MILP Solution Prediction across Domains
Mixed-Integer Linear Programming (MILP) is a fundamental and powerful framework for modeling complex optimization problems across diverse domains. Recently, learning-based methods have shown great promise in accelerating MILP solvers by predicting high-quality solutions. However, most existing approaches are developed and evaluated in single-domain settings, limiting their ability to generalize to unseen problem distributions. This limitation poses a major obstacle to building scalable and general-purpose learning-based solvers. To address this challenge, we introduce RoME, a domain-Robust Mixture-of-Experts framework for predicting MILP solutions across domains. RoME dynamically routes problem instances to specialized experts based on learned task embeddings. The model is trained using a two-level distributionally robust optimization strategy: inter-domain to mitigate global shifts across domains, and intra-domain to enhance local robustness by introducing perturbations on task embeddings. We reveal that cross-domain training not only enhances the model's generalization capability to unseen domains but also improves performance within each individual domain by encouraging the model to capture more general intrinsic combinatorial patterns. Specifically, a single RoME model trained on three domains achieves an average improvement of 67.7% then evaluated on five diverse domains. We further test the pretrained model on MIPLIB in a zero-shot setting, demonstrating its ability to deliver measurable performance gains on challenging real-world instances where existing learning-based approaches often struggle to generalize.
♻ ☆ Why and When Deep is Better than Shallow: An Implementation-Agnostic State-Transition View of Depth Supremacy
Why and when is deep better than shallow? We answer this question in a framework that is agnostic to network implementation. We formulate a deep model as an abstract state-transition semigroup acting on a general metric space, and separate the implementation (e.g., ReLU nets, transformers, and chain-of-thought) from the abstract state transition. We prove a bias-variance decomposition in which the variance depends only on the abstract depth-$k$ network and not on the implementation (Theorem 1). We further split the bounds into output and hidden parts to tie the depth dependence of the variance to the metric entropy of the state-transition semigroup (Theorem 2). We then investigate implementation-free conditions under which the variance grow polynomially or logarithmically with depth (Section 4). Combining these with exponential or polynomial bias decay identifies four canonical bias-variance trade-off regimes (EL/EP/PL/PP) and produces explicit optimal depths $k^\ast$. Across regimes, $k^\ast>1$ typically holds, giving a rigorous form of depth supremacy. The lowest generalization error bound is achieved under the EL regime (exp-decay bias + log-growth variance), explaining why and when deep is better, especially for iterative or hierarchical concept classes such as neural ODEs, diffusion/score-matching models, and chain-of-thought reasoning.
♻ ☆ Imagine Beyond! Distributionally Robust Auto-Encoding for State Space Coverage in Online Reinforcement Learning
Goal-Conditioned Reinforcement Learning (GCRL) enables agents to autonomously acquire diverse behaviors, but faces major challenges in visual environments due to high-dimensional, semantically sparse observations. In the online setting, where agents learn representations while exploring, the latent space evolves with the agent's policy, to capture newly discovered areas of the environment. However, without incentivization to maximize state coverage in the representation, classical approaches based on auto-encoders may converge to latent spaces that over-represent a restricted set of states frequently visited by the agent. This is exacerbated in an intrinsic motivation setting, where the agent uses the distribution encoded in the latent space to sample the goals it learns to master. To address this issue, we propose to progressively enforce distributional shifts towards a uniform distribution over the full state space, to ensure a full coverage of skills that can be learned in the environment. We introduce DRAG (Distributionally Robust Auto-Encoding for GCRL), a method that combines the $\beta$-VAE framework with Distributionally Robust Optimization. DRAG leverages an adversarial neural weighter of training states of the VAE, to account for the mismatch between the current data distribution and unseen parts of the environment. This allows the agent to construct semantically meaningful latent spaces beyond its immediate experience. Our approach improves state space coverage and downstream control performance on hard exploration environments such as mazes and robotic control involving walls to bypass, without pre-training nor prior environment knowledge.
♻ ☆ Gene Regulatory Network Inference in the Presence of Selection Bias and Latent Confounders
Gene regulatory network inference (GRNI) aims to discover how genes causally regulate each other from gene expression data. It is well-known that statistical dependencies in observed data do not necessarily imply causation, as spurious dependencies may arise from latent confounders, such as non-coding RNAs. Numerous GRNI methods have thus been proposed to address this confounding issue. However, dependencies may also result from selection--only cells satisfying certain survival or inclusion criteria are observed--while these selection-induced spurious dependencies are frequently overlooked in gene expression data analyses. In this work, we show that such selection is ubiquitous and, when ignored or conflated with true regulations, can lead to flawed causal interpretation and misguided intervention recommendations. To address this challenge, a fundamental question arises: can we distinguish dependencies due to regulation, confounding, and crucially, selection? We show that gene perturbations offer a simple yet effective answer: selection-induced dependencies are symmetric under perturbation, while those from regulation or confounding are not. Building on this motivation, we propose GISL (Gene regulatory network Inference in the presence of Selection bias and Latent confounders), a principled algorithm that leverages perturbation data to uncover both true gene regulatory relations and non-regulatory mechanisms of selection and confounding up to the equivalence class. Experiments on synthetic and real-world gene expression data demonstrate the effectiveness of our method.
♻ ☆ Hybrid Quantum-Classical Recurrent Neural Networks
We present a hybrid quantum-classical recurrent neural network (QRNN) architecture in which the recurrent core is realized as a parametrized quantum circuit (PQC) controlled by a classical feedforward network. The hidden state is the quantum state of an $n$-qubit PQC in an exponentially large Hilbert space $\mathbb{C}^{2^n}$, which serves as a coherent recurrent quantum memory. The PQC is unitary by construction, making the hidden-state evolution norm-preserving without external constraints. At each timestep, mid-circuit Pauli expectation-value readouts are combined with the input embedding and processed by the feedforward network, which provides explicit classical nonlinearity. The outputs parametrize the PQC, which updates the hidden state via unitary dynamics. The QRNN is compact and physically consistent, and it unifies (i) unitary recurrence as a high-capacity memory, (ii) partial observation via mid-circuit readouts, and (iii) nonlinear classical control for input-conditioned parametrization. We evaluate the model in simulation with up to 14 qubits on sentiment analysis, MNIST, permuted MNIST, copying memory, and language modeling. For sequence-to-sequence learning, we further devise a soft attention mechanism over the mid-circuit readouts and show its effectiveness for machine translation. To our knowledge, this is the first model (RNN or otherwise) grounded in quantum operations to achieve competitive performance against strong classical baselines across a broad class of sequence-learning tasks.
comment: Clarified expectation-value-based readouts and made minor text edits
♻ ☆ DIsoN: Decentralized Isolation Networks for Out-of-Distribution Detection in Medical Imaging NeurIPS 2025
Safe deployment of machine learning (ML) models in safety-critical domains such as medical imaging requires detecting inputs with characteristics not seen during training, known as out-of-distribution (OOD) detection, to prevent unreliable predictions. Effective OOD detection after deployment could benefit from access to the training data, enabling direct comparison between test samples and the training data distribution to identify differences. State-of-the-art OOD detection methods, however, either discard the training data after deployment or assume that test samples and training data are centrally stored together, an assumption that rarely holds in real-world settings. This is because shipping the training data with the deployed model is usually impossible due to the size of training databases, as well as proprietary or privacy constraints. We introduce the Isolation Network, an OOD detection framework that quantifies the difficulty of separating a target test sample from the training data by solving a binary classification task. We then propose Decentralized Isolation Networks (DIsoN), which enables the comparison of training and test data when data-sharing is impossible, by exchanging only model parameters between the remote computational nodes of training and deployment. We further extend DIsoN with class-conditioning, comparing a target sample solely with training data of its predicted class. We evaluate DIsoN on four medical imaging datasets (dermatology, chest X-ray, breast ultrasound, histopathology) across 12 OOD detection tasks. DIsoN performs favorably against existing methods while respecting data-privacy. This decentralized OOD detection framework opens the way for a new type of service that ML developers could provide along with their models: providing remote, secure utilization of their training data for OOD detection services. Code: https://github.com/FelixWag/DIsoN
comment: Accepted at NeurIPS 2025
♻ ☆ Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive scaling paradigm, enabled by the principle of emergent semantics in Transformers with frozen, non-semantic input embeddings. We posit that because high-level meaning is a compositional property of a Transformer's deep layers, not its input vectors, the embedding layer and trained lower layers can serve as a fixed foundation. This liberates backpropagation to focus solely on newly added components, making incremental growth viable. We operationalize this with a layer-wise constructive methodology that combines strict layer freezing in early stages with efficient, holistic fine-tuning of the entire model stack via low-rank adaptation (LoRA) as complexity increases. This method not only demonstrates stable convergence but also reveals a direct correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD, which are absent in shallower models. In a controlled study, our constructively grown model rivals the performance of a monolithically trained baseline of the same size, validating the efficiency and efficacy of the approach. Our findings suggest a path towards a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development. This opens a path for more resource-efficient scaling, continual learning, and a more modular approach to building powerful AI systems. We release all code and models to facilitate further research.
comment: Controlled Comparative Study added
♻ ☆ Inducing Riesz and orthonormal bases in $L^2$ via composition operators
Let $C_h$ be a composition operator mapping $L^2(\Omega_1)$ into $L^2(\Omega_2)$ for some open sets $\Omega_1, \Omega_2 \subseteq \mathbb{R}^n$. We characterize the mappings $h$ that transform Riesz bases of $L^2(\Omega_1)$ into Riesz bases of $L^2(\Omega_2)$. Restricting our analysis to differentiable mappings, we demonstrate that mappings $h$ that preserve Riesz bases have Jacobian determinants that are bounded away from zero and infinity. We discuss implications of these results for approximation theory, highlighting the potential of using bijective neural networks to construct Riesz bases with favorable approximation properties.
♻ ☆ Efficient Learning of Quantum States Prepared With Few Non-Clifford Gates
We give a pair of algorithms that efficiently learn a quantum state prepared by Clifford gates and $O(\log n)$ non-Clifford gates. Specifically, for an $n$-qubit state $|\psi\rangle$ prepared with at most $t$ non-Clifford gates, our algorithms use $\mathsf{poly}(n,2^t,1/\varepsilon)$ time and copies of $|\psi\rangle$ to learn $|\psi\rangle$ to trace distance at most $\varepsilon$. The first algorithm for this task is more efficient, but requires entangled measurements across two copies of $|\psi\rangle$. The second algorithm uses only single-copy measurements at the cost of polynomial factors in runtime and sample complexity. Our algorithms more generally learn any state with sufficiently large stabilizer dimension, where a quantum state has stabilizer dimension $k$ if it is stabilized by an abelian group of $2^k$ Pauli operators. We also develop an efficient property testing algorithm for stabilizer dimension, which may be of independent interest.
comment: 54 pages. v4 to be published in Quantum journal. Merged v3 with arXiv:2308.07175. This version now subsumes arXiv:2308.07175
♻ ☆ Gradient GA: Gradient Genetic Algorithm for Drug Molecular Design
Molecular discovery has brought great benefits to the chemical industry. Various molecule design techniques are developed to identify molecules with desirable properties. Traditional optimization methods, such as genetic algorithms, continue to achieve state-of-the-art results across multiple molecular design benchmarks. However, these techniques rely solely on random walk exploration, which hinders both the quality of the final solution and the convergence speed. To address this limitation, we propose a novel approach called Gradient Genetic Algorithm (Gradient GA), which incorporates gradient information from the objective function into genetic algorithms. Instead of random exploration, each proposed sample iteratively progresses toward an optimal solution by following the gradient direction. We achieve this by designing a differentiable objective function parameterized by a neural network and utilizing the Discrete Langevin Proposal to enable gradient guidance in discrete molecular spaces. Experimental results demonstrate that our method significantly improves both convergence speed and solution quality, outperforming cutting-edge techniques. For example, it achieves up to a 25% improvement in the top-10 score over the vanilla genetic algorithm. The code is publicly available at https://github.com/debadyuti23/GradientGA.
♻ ☆ Position: Bridge the Gaps between Machine Unlearning and AI Regulation NeurIPS 2025
The ''right to be forgotten'' and the data privacy laws that encode it have motivated machine unlearning since its earliest days. Now, some argue that an inbound wave of artificial intelligence regulations -- like the European Union's Artificial Intelligence Act (AIA) -- may offer important new use cases for machine unlearning. However, this position paper argues, this opportunity will only be realized if researchers proactively bridge the (sometimes sizable) gaps between machine unlearning's state of the art and its potential applications to AI regulation. To demonstrate this point, we use the AIA as our primary case study. Specifically, we deliver a ``state of the union'' as regards machine unlearning's current potential (or, in many cases, lack thereof) for aiding compliance with various provisions of the AIA. This starts with a precise cataloging of the potential applications of machine unlearning to AIA compliance. For each, we flag the technical gaps that exist between the potential application and the state of the art of machine unlearning. Finally, we end with a call to action: for machine learning researchers to solve the open technical questions that could unlock machine unlearning's potential to assist compliance with the AIA -- and other AI regulations like it.
comment: NeurIPS 2025 Position Paper Track Oral, https://openreview.net/forum?id=0ngi2StMwC
♻ ☆ Estimation of aboveground biomass in a tropical dry forest: An intercomparison of airborne, unmanned, and space laser scanning
According to the Paris Climate Change Agreement, all nations are required to submit reports on their greenhouse gas emissions and absorption every two years by 2024. Consequently, forests play a crucial role in reducing carbon emissions, which is essential for meeting these obligations. Recognizing the significance of forest conservation in the global battle against climate change, Article 5 of the Paris Agreement emphasizes the need for high-quality forest data. This study focuses on enhancing methods for mapping aboveground biomass in tropical dry forests. Tropical dry forests are considered one of the least understood tropical forest environments; therefore, there is a need for accurate approaches to estimate carbon pools. We employ a comparative analysis of AGB estimates, utilizing different discrete and full-waveform laser scanning datasets in conjunction with Ordinary Least Squares and Bayesian approaches SVM. Airborne Laser Scanning, Unmanned Laser Scanning, and Space Laser Scanning were used as independent variables for extracting forest metrics. Variable selection, SVM regression tuning, and cross-validation via a machine-learning approach were applied to account for overfitting and underfitting. The results indicate that six key variables primarily related to tree height: Elev\.minimum, Elev\.L3, lev\.MAD\.mode, Elev\.mode, Elev\.MAD\.median, and Elev\.skewness, are important for AGB estimation using ALSD and ULSD, while Leaf Area Index, canopy coverage and height, terrain elevation, and full-waveform signal energy emerged as the most vital variables. AGB values estimated from ten permanent tropical dry forest plots in Costa Rica Guanacaste province ranged from 26.02 Mg/ha to 175.43 Mg/ha. The SVM regressions demonstrated a 17.89 error across all laser scanning systems, with SLSF W exhibiting the lowest error 17.07 in estimating total biomass per plot.
comment: 32 pages, 17 figures, research paper
♻ ☆ A Comparative Analysis of LLM Adaptation: SFT, LoRA, and ICL in Data-Scarce Scenarios
The remarkable capabilities of Large Language Models (LLMs) often need to be tailored for specific applications, requiring the integration of new knowledge or the acquisition of new skills. While full fine-tuning is a powerful adaptation method, it is computationally expensive and can lead to a degradation of general reasoning abilities, a phenomenon known as catastrophic forgetting. A range of alternative techniques exists, each with its own trade-offs. In-Context Learning (ICL) is fast but limited by context length, while Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) offer a middle ground by minimizing parameter changes. However, the challenge of catastrophic forgetting persists, raising questions about the best adaptation strategy for a given task. This paper presents a comparative analysis of Supervised Finetuning (SFT), LoRA, and ICL in data-scarce scenarios. We find that LoRA provides the most effective balance, successfully instilling new skills with minimal impact on the base model's general knowledge. In contrast, while SFT excels at skill acquisition, it is highly susceptible to catastrophic forgetting. ICL is effective for incorporating factual knowledge but struggles with complex skills. Our findings offer a practical framework for selecting an LLM adaptation strategy. We highlight the critical distinction between skill acquisition and knowledge integration, clarify the trade-offs between task-specific performance and the preservation of general capabilities.
♻ ☆ Bayesian Optimization by Kernel Regression and Density-based Exploration
Bayesian optimization is highly effective for optimizing expensive-to-evaluate black-box functions, but it faces significant computational challenges due to the high computational complexity of Gaussian processes, which results in a total time complexity that is quartic with respect to the number of iterations. To address this limitation, we propose the Bayesian Optimization by Kernel regression and density-based Exploration (BOKE) algorithm. BOKE uses kernel regression for efficient function approximation, kernel density for exploration, and integrates them into the confidence bound criteria to guide the optimization process, thus reducing computational costs to quadratic. Our theoretical analysis rigorously establishes the global convergence of BOKE and ensures its robustness in noisy settings. Through extensive numerical experiments on both synthetic and real-world optimization tasks, we demonstrate that BOKE not only performs competitively compared to Gaussian process-based methods and several other baseline methods but also exhibits superior computational efficiency. These results highlight BOKE's effectiveness in resource-constrained environments, providing a practical approach for optimization problems in engineering applications.
♻ ☆ Realizable Circuit Complexity: Embedding Computation in Space-Time
Classical circuit complexity characterizes parallel computation in purely combinatorial terms, ignoring the physical constraints that govern real hardware. The standard classes $\mathbf{NC}$, $\mathbf{AC}$, and $\mathbf{TC}$ treat unlimited fan-in, free interconnection, and polynomial gate counts as feasible -- assumptions that conflict with geometric, energetic, and thermodynamic realities. We introduce the family of \textit{realizable circuit classes} $\mathbf{RC}_d$, which model computation embedded in physical $d$-dimensional space. Each circuit in $\mathbf{RC}_d$ obeys conservative realizability laws: volume scales as $\mathcal{O}(t^d)$, cross-boundary information flux is bounded by $\mathcal{O}(t^{d-1})$ per unit time, and growth occurs through local, physically constructible edits. These bounds apply to all causal systems, classical or quantum. Within this framework, we show that algorithms with runtime $\omega(n^{d/(d-1)})$ cannot scale to inputs of maximal entropy, and that any $d$-dimensional parallel implementation offers at most a polynomial speed-up of degree $(d-1)$ over its optimal sequential counterpart. In the limit $d\to\infty$, $\mathbf{RC}_\infty(\mathrm{polylog})=\mathbf{NC}$, recovering classical parallelism as a non-physical idealization. By unifying geometry, causality, and information flow, $\mathbf{RC}_d$ extends circuit complexity into the physical domain, revealing universal scaling laws for computation.
comment: 38 pages
♻ ☆ The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs
Coral reefs are declining worldwide due to climate change and local stressors. To inform effective conservation or restoration, monitoring at the highest possible spatial and temporal resolution is necessary. Conventional coral reef surveying methods are limited in scalability due to their reliance on expert labor time, motivating the use of computer vision tools to automate the identification and abundance estimation of live corals from images. However, the design and evaluation of such tools has been impeded by the lack of large high quality datasets. We release the Coralscapes dataset, the first general-purpose dense semantic segmentation dataset for coral reefs, covering 2075 images, 39 benthic classes, and 174k segmentation masks annotated by experts. Coralscapes has a similar scope and the same structure as the widely used Cityscapes dataset for urban scene segmentation, allowing benchmarking of semantic segmentation models in a new challenging domain which requires expert knowledge to annotate. We benchmark a wide range of semantic segmentation models, and find that transfer learning from Coralscapes to existing smaller datasets consistently leads to state-of-the-art performance. Coralscapes will catalyze research on efficient, scalable, and standardized coral reef surveying methods based on computer vision, and holds the potential to streamline the development of underwater ecological robotics.
♻ ☆ Image Super-Resolution with Guarantees via Conformalized Generative Models NeurIPS 2025
The increasing use of generative ML foundation models for image restoration tasks such as super-resolution calls for robust and interpretable uncertainty quantification methods. We address this need by presenting a novel approach based on conformal prediction techniques to create a 'confidence mask' capable of reliably and intuitively communicating where the generated image can be trusted. Our method is adaptable to any black-box generative model, including those locked behind an opaque API, requires only easily attainable data for calibration, and is highly customizable via the choice of a local image similarity metric. We prove strong theoretical guarantees for our method that span fidelity error control (according to our local image similarity metric), reconstruction quality, and robustness in the face of data leakage. Finally, we empirically evaluate these results and establish our method's solid performance.
comment: To appear at NeurIPS 2025. 17 pages, 7 figures
♻ ☆ Noise-based reward-modulated learning
The pursuit of energy-efficient and adaptive artificial intelligence (AI) has positioned neuromorphic computing as a promising alternative to conventional computing. However, achieving learning on these platforms requires techniques that prioritize local information while enabling effective credit assignment. Here, we propose noise-based reward-modulated learning (NRL), a novel synaptic plasticity rule that mathematically unifies reinforcement learning and gradient-based optimization with biologically-inspired local updates. NRL addresses the computational bottleneck of exact gradients by approximating them through stochastic neural activity, transforming the inherent noise of biological and neuromorphic substrates into a functional resource. Drawing inspiration from biological learning, our method uses reward prediction errors as its optimization target to generate increasingly advantageous behavior, and eligibility traces to facilitate retrospective credit assignment. Experimental validation on reinforcement tasks, featuring immediate and delayed rewards, shows that NRL achieves performance comparable to baselines optimized using backpropagation, although with slower convergence, while showing significantly superior performance and scalability in multi-layer networks compared to reward-modulated Hebbian learning (RMHL), the most prominent similar approach. While tested on simple architectures, the results highlight the potential of noise-driven, brain-inspired learning for low-power adaptive systems, particularly in computing substrates with locality constraints. NRL offers a theoretically grounded paradigm well-suited for the event-driven characteristics of next-generation neuromorphic AI.
♻ ☆ Efficient Latent Variable Causal Discovery: Combining Score Search and Targeted Testing
Learning causal structure from observational data is especially challenging when latent variables or selection bias are present. The Fast Causal Inference (FCI) algorithm addresses this setting but often performs exhaustive conditional independence tests across many subsets, leading to spurious independence claims, extra or missing edges, and unreliable orientations. We present a family of score-guided mixed-strategy causal search algorithms that build on this tradition. First, we introduce BOSS-FCI and GRaSP-FCI, straightforward variants of GFCI that substitute BOSS or GRaSP for FGES, thereby retaining correctness while incurring different scalability tradeoffs. Second, we develop FCI Targeted-testing (FCIT), a novel mixed-strategy method that improves upon these variants by replacing exhaustive all-subsets testing with targeted tests guided by BOSS, yielding well-formed PAGs with higher precision and efficiency. Finally, we propose a simple heuristic, LV-Dumb (also known as BOSS-POD), which bypasses latent-variable-specific reasoning and directly returns the PAG of the BOSS DAG. Although not strictly correct in the FCI sense, it scales better and often achieves superior accuracy in practice. Simulations and real-data analyses demonstrate that BOSS-FCI and GRaSP-FCI provide sound baselines, FCIT improves both efficiency and reliability, and LV-Dumb offers a practical heuristic with strong empirical performance. Together, these method highlight the value of score-guided and targeted strategies for scalable latent-variable causal discovery.
comment: 30 pages, 44 figures, 6 tables
♻ ☆ Linear-Time Demonstration Selection for In-Context Learning via Gradient Estimation EMNLP'25
This paper introduces an algorithm to select demonstration examples for in-context learning of a query set. Given a set of $n$ examples, how can we quickly select $k$ out of $n$ to best serve as the conditioning for downstream inference? This problem has broad applications in prompt tuning and chain-of-thought reasoning. Since model weights remain fixed during in-context learning, previous work has sought to design methods based on the similarity of token embeddings. This work proposes a new approach based on gradients of the output taken in the input embedding space. Our approach estimates model outputs through a first-order approximation using the gradients. Then, we apply this estimation to multiple randomly sampled subsets. Finally, we aggregate the sampled subset outcomes to form an influence score for each demonstration, and select $k$ most relevant examples. This procedure only requires pre-computing model outputs and gradients once, resulting in a linear-time algorithm relative to model and training set sizes. Extensive experiments across various models and datasets validate the efficiency of our approach. We show that the gradient estimation procedure yields approximations of full inference with less than ${1}\%$ error across six datasets. This allows us to scale up subset selection that would otherwise run full inference by up to ${37.7}\times$ on models with up to $34$ billion parameters, and outperform existing selection methods based on input embeddings by ${11}\%$ on average.
comment: 19 pages. EMNLP'25
♻ ☆ Universal Sequence Preconditioning
We study the problem of preconditioning in sequential prediction. From the theoretical lens of linear dynamical systems, we show that convolving the target sequence corresponds to applying a polynomial to the hidden transition matrix. Building on this insight, we propose a universal preconditioning method that convolves the target with coefficients from orthogonal polynomials such as Chebyshev or Legendre. We prove that this approach reduces regret for two distinct prediction algorithms and yields the first ever sublinear and hidden-dimension-independent regret bounds (up to logarithmic factors) that hold for systems with marginally table and asymmetric transition matrices. Finally, extensive synthetic and real-world experiments show that this simple preconditioning strategy improves the performance of a diverse range of algorithms, including recurrent neural networks, and generalizes to signals beyond linear dynamical systems.
comment: 35 pages, 3 tables, 5 figures
♻ ☆ Discrete and Continuous Difference of Submodular Minimization
Submodular functions, defined on continuous or discrete domains, arise in numerous applications. We study the minimization of the difference of two submodular (DS) functions, over both domains, extending prior work restricted to set functions. We show that all functions on discrete domains and all smooth functions on continuous domains are DS. For discrete domains, we observe that DS minimization is equivalent to minimizing the difference of two convex (DC) functions, as in the set function case. We propose a novel variant of the DC Algorithm (DCA) and apply it to the resulting DC Program, obtaining comparable theoretical guarantees as in the set function case. The algorithm can be applied to continuous domains via discretization. Experiments demonstrate that our method outperforms baselines in integer compressive sensing and integer least squares.
♻ ☆ Repetitions are not all alike: distinct mechanisms sustain repetition in language models
Large Language Models (LLMs) can sometimes degrade into repetitive loops, persistently generating identical word sequences. Because repetition is rare in natural human language, its frequent occurrence across diverse tasks and contexts in LLMs remains puzzling. Here we investigate whether behaviorally similar repetition patterns arise from distinct underlying mechanisms and how these mechanisms develop during model training. We contrast two conditions: repetitions elicited by natural text prompts with those induced by in-context learning (ICL) setups that explicitly require copying behavior. Our analyses reveal that ICL-induced repetition relies on a dedicated network of attention heads that progressively specialize over training, whereas naturally occurring repetition emerges early and lacks a defined circuitry. Attention inspection further shows that natural repetition focuses disproportionately on low-information tokens, suggesting a fallback behavior when relevant context cannot be retrieved. These results indicate that superficially similar repetition behaviors originate from qualitatively different internal processes, reflecting distinct modes of failure and adaptation in language models.
♻ ☆ Harnessing IoT and Generative AI for Weather-Adaptive Learning in Climate Resilience Education
This paper introduces the Future Atmospheric Conditions Training System (FACTS), a novel platform that advances climate resilience education through place-based, adaptive learning experiences. FACTS combines real-time atmospheric data collected by IoT sensors with curated resources from a Knowledge Base to dynamically generate localized learning challenges. Learner responses are analyzed by a Generative AI powered server, which delivers personalized feedback and adaptive support. Results from a user evaluation indicate that participants found the system both easy to use and effective for building knowledge related to climate resilience. These findings suggest that integrating IoT and Generative AI into atmospherically adaptive learning technologies holds significant promise for enhancing educational engagement and fostering climate awareness.
comment: Not enough evidence to prove the effectiveness of the system in the context of learning about climate change
♻ ☆ FORTALESA: Fault-Tolerant Reconfigurable Systolic Array for DNN Inference
The emergence of Deep Neural Networks (DNNs) in mission- and safety-critical applications brings their reliability to the front. High performance demands of DNNs require the use of specialized hardware accelerators. Systolic array architecture is widely used in DNN accelerators due to its parallelism and regular structure. This work presents a run-time reconfigurable systolic array architecture with three execution modes and four implementation options. All four implementations are evaluated in terms of resource utilization, throughput, and fault tolerance improvement. The proposed architecture is used for reliability enhancement of DNN inference on systolic array through heterogeneous mapping of different network layers to different execution modes. The approach is supported by a novel reliability assessment method based on fault propagation analysis. It is used for the exploration of the appropriate execution mode--layer mapping for DNN inference. The proposed architecture efficiently protects registers and MAC units of systolic array PEs from transient and permanent faults. The reconfigurability feature enables a speedup of up to $3\times$, depending on layer vulnerability. Furthermore, it requires $6\times$ fewer resources compared to static redundancy and $2.5\times$ fewer resources compared to the previously proposed solution for transient faults.
comment: 13 pages, 15 figures
♻ ☆ PyDPF: A Python Package for Differentiable Particle Filtering
State-space models (SSMs) are a widely used tool in time series analysis. In the complex systems that arise from real-world data, it is common to employ particle filtering (PF), an efficient Monte Carlo method for estimating the hidden state corresponding to a sequence of observations. Applying particle filtering requires specifying both the parametric form and the parameters of the system, which are often unknown and must be estimated. Gradient-based optimisation techniques cannot be applied directly to standard particle filters, as the filters themselves are not differentiable. However, several recently proposed methods modify the resampling step to make particle filtering differentiable. In this paper, we present an implementation of several such differentiable particle filters (DPFs) with a unified API built on the popular PyTorch framework. Our implementation makes these algorithms easily accessible to a broader research community and facilitates straightforward comparison between them. We validate our framework by reproducing experiments from several existing studies and demonstrate how DPFs can be applied to address several common challenges with state space modelling.
comment: 42 pages, 0 figures, under review at the Journal of Statistical Software, the python package can be found at https://pypi.org/project/pydpf/ , the full documentation at https://python-dpf.readthedocs.io/en/latest/#documentation-index , and the source code including experiments and replication material at https://github.com/John-JoB/pydpf
♻ ☆ AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench
AI research agents are demonstrating great potential to accelerate scientific progress by automating the design, implementation, and training of machine learning models. We focus on methods for improving agents' performance on MLE-bench, a challenging benchmark where agents compete in Kaggle competitions to solve real-world machine learning problems. We formalize AI research agents as search policies that navigate a space of candidate solutions, iteratively modifying them using operators. By designing and systematically varying different operator sets and search policies (Greedy, MCTS, Evolutionary), we show that their interplay is critical for achieving high performance. Our best pairing of search strategy and operator set achieves a state-of-the-art result on MLE-bench lite, increasing the success rate of achieving a Kaggle medal from 39.6% to 47.7%. Our investigation underscores the importance of jointly considering the search strategy, operator design, and evaluation methodology in advancing automated machine learning.
comment: Code: https://github.com/facebookresearch/aira-dojo
♻ ☆ Multiscale spatiotemporal heterogeneity analysis of bike-sharing system's self-loop phenomenon: Evidence from Shanghai
Bike-sharing is an environmentally friendly shared mobility mode, but its self-loop phenomenon, where bikes are returned to the same station after several time usage, significantly impacts equity in accessing its services. Therefore, this study conducts a multiscale analysis with a spatial autoregressive model and double machine learning framework to assess socioeconomic features and geospatial location's impact on the self-loop phenomenon at metro stations and street scales. The results reveal that bike-sharing self-loop intensity exhibits significant spatial lag effect at street scale and is positively associated with residential land use. Marginal treatment effects of residential land use is higher on streets with middle-aged residents, high fixed employment, and low car ownership. The multimodal public transit condition reveals significant positive marginal treatment effects at both scales. To enhance bike-sharing cooperation, we advocate augmenting bicycle availability in areas with high metro usage and low bus coverage, alongside implementing adaptable redistribution strategies.
comment: Critical OD data calibration errors in Sections 3.2/4.1 (invalidating indices, undermining conclusions), planning a revised validated version
♻ ☆ Contrastive Consolidation of Top-Down Modulations Achieves Sparsely Supervised Continual Learning NeurIPS 2025
Biological brains learn continually from a stream of unlabeled data, while integrating specialized information from sparsely labeled examples without compromising their ability to generalize. Meanwhile, machine learning methods are susceptible to catastrophic forgetting in this natural learning setting, as supervised specialist fine-tuning degrades performance on the original task. We introduce task-modulated contrastive learning (TMCL), which takes inspiration from the biophysical machinery in the neocortex, using predictive coding principles to integrate top-down information continually and without supervision. We follow the idea that these principles build a view-invariant representation space, and that this can be implemented using a contrastive loss. Then, whenever labeled samples of a new class occur, new affine modulations are learned that improve separation of the new class from all others, without affecting feedforward weights. By co-opting the view-invariance learning mechanism, we then train feedforward weights to match the unmodulated representation of a data sample to its modulated counterparts. This introduces modulation invariance into the representation space, and, by also using past modulations, stabilizes it. Our experiments show improvements in both class-incremental and transfer learning over state-of-the-art unsupervised approaches, as well as over comparable supervised approaches, using as few as 1% of available labels. Taken together, our work suggests that top-down modulations play a crucial role in balancing stability and plasticity.
comment: Accepted to NeurIPS 2025. Camera-ready version. 33 pages, 5 figures. Code available at: https://github.com/tran-khoa/tmcl
♻ ☆ Electrical Load Forecasting over Multihop Smart Metering Networks with Federated Learning
Electric load forecasting is essential for power management and stability in smart grids. This is mainly achieved via advanced metering infrastructure, where smart meters (SMs) record household energy data. Traditional machine learning (ML) methods are often employed for load forecasting, but require data sharing, which raises data privacy concerns. Federated learning (FL) can address this issue by running distributed ML models at local SMs without data exchange. However, current FL-based approaches struggle to achieve efficient load forecasting due to imbalanced data distribution across heterogeneous SMs. This paper presents a novel personalized federated learning (PFL) method for high-quality load forecasting in metering networks. A meta-learning-based strategy is developed to address data heterogeneity at local SMs in the collaborative training of local load forecasting models. Moreover, to minimize the load forecasting delays in our PFL model, we study a new latency optimization problem based on optimal resource allocation at SMs. A theoretical convergence analysis is also conducted to provide insights into FL design for federated load forecasting. Extensive simulations from real-world datasets show that our method outperforms existing approaches regarding better load forecasting and reduced operational latency costs.
♻ ☆ Lower-dimensional projections of cellular expression improves cell type classification from single-cell RNA sequencing
Single-cell RNA sequencing (scRNA-seq) enables the study of cellular diversity at single cell level. It provides a global view of cell-type specification during the onset of biological mechanisms such as developmental processes and human organogenesis. Various statistical, machine and deep learning-based methods have been proposed for cell-type classification. Most of the methods utilizes unsupervised lower dimensional projections obtained from for a large reference data. In this work, we proposed a reference-based method for cell type classification, called EnProCell. The EnProCell, first, computes lower dimensional projections that capture both the high variance and class separability through an ensemble of principle component analysis and multiple discriminant analysis. In the second phase, EnProCell trains a deep neural network on the lower dimensional representation of data to classify cell types. The proposed method outperformed the existing state-of-the-art methods when tested on four different data sets produced from different single-cell sequencing technologies. The EnProCell showed higher accuracy (98.91) and F1 score (98.64) than other methods for predicting reference from reference datasets. Similarly, EnProCell also showed better performance than existing methods in predicting cell types for data with unknown cell types (query) from reference datasets (accuracy:99.52; F1 score: 99.07). In addition to improved performance, the proposed methodology is simple and does not require more computational resources and time. the EnProCell is available at https://github.com/umar1196/EnProCell.
♻ ☆ Scaffolded Language Models with Language Supervision for Mixed-Autonomy: A Survey
This survey organizes the intricate literature on the design and optimization of emerging structures around post-trained LMs. We refer to this overarching structure as scaffolded LMs and focus on LMs that are integrated into multi-step processes with tools. We view scaffolded LMs as semi-parametric models wherein we train non-parametric variables, including the prompt, tools, and scaffold's code. In particular, they interpret instructions, use tools, and receive feedback all in language. Recent works use an LM as an optimizer to interpret language supervision and update non-parametric variables according to intricate objectives. In this survey, we refer to this paradigm as training of scaffolded LMs with language supervision. A key feature of non-parametric training is the ability to learn from language. Parametric training excels in learning from demonstration (supervised learning), exploration (reinforcement learning), or observations (unsupervised learning), using well-defined loss functions. Language-based optimization enables rich, interpretable, and expressive objectives, while mitigating issues like catastrophic forgetting and supporting compatibility with closed-source models. Furthermore, agents are increasingly deployed as co-workers in real-world applications such as Copilot in Office tools or software development. In these mixed-autonomy settings, where control and decision-making are shared between human and AI, users point out errors or suggest corrections. Accordingly, we discuss agents that continuously improve by learning from this real-time, language-based feedback and refer to this setting as streaming learning from language supervision.
♻ ☆ Training on Plausible Counterfactuals Removes Spurious Correlations
Plausible counterfactual explanations (p-CFEs) are perturbations that minimally modify inputs to change classifier decisions while remaining plausible under the data distribution. In this study, we demonstrate that classifiers can be trained on p-CFEs labeled with induced \emph{incorrect} target classes to classify unperturbed inputs with the original labels. While previous studies have shown that such learning is possible with adversarial perturbations, we extend this paradigm to p-CFEs. Interestingly, our experiments reveal that learning from p-CFEs is even more effective: the resulting classifiers achieve not only high in-distribution accuracy but also exhibit significantly reduced bias with respect to spurious correlations.
♻ ☆ From Uniform to Adaptive: General Skip-Block Mechanisms for Efficient PDE Neural Operators
In recent years, Neural Operators(NO) have gradually emerged as a popular approach for solving Partial Differential Equations (PDEs). However, their application to large-scale engineering tasks suffers from significant computational overhead. And the fact that current models impose a uniform computational cost while physical fields exhibit vastly different complexities constitutes a fundamental mismatch, which is the root of this inefficiency. For instance, in turbulence flows, intricate vortex regions require deeper network processing compared to stable flows. To address this, we introduce a framework: Skip-Block Routing (SBR), a general framework designed for Transformer-based neural operators, capable of being integrated into their multi-layer architectures. First, SBR uses a routing mechanism to learn the complexity and ranking of tokens, which is then applied during inference. Then, in later layers, it decides how many tokens are passed forward based on this ranking. This way, the model focuses more processing capacity on the tokens that are more complex. Experiments demonstrate that SBR is a general framework that seamlessly integrates into various neural operators. Our method reduces computational cost by approximately 50% in terms of Floating Point Operations (FLOPs), while still delivering up to 2x faster inference without sacrificing accuracy.
♻ ☆ ABS: Enforcing Constraint Satisfaction On Generated Sequences Via Automata-Guided Beam Search
Sequence generation and prediction form a cornerstone of modern machine learning, with applications spanning natural language processing, program synthesis, and time-series forecasting. These tasks are typically modeled in an autoregressive fashion, where each token is generated conditional on the preceding ones, and beam search is commonly used to balance exploration and fluency during decoding. While deep learning models and Large Language Models (LLMs) excel at capturing statistical patterns in this setting, they remain ill-equipped to guarantee compliance with formal constraints. In this paper, we introduce ABS: a general and model-agnostic inference-time algorithm that guarantees compliance with any constraint that can be compiled into a Deterministic Finite Automaton (DFA), without requiring retraining. ABS leverages the DFA to guide a constrained variant of beam search: at each decoding step, transitions leading to violations are masked, while remaining paths are dynamically re-ranked according to both the model's probabilities and the automaton's acceptance structure. We formally prove that the resulting sequences are guaranteed to satisfy the given constraints, and we empirically demonstrate that ABS also improves output quality. We validate our approach on three distinct tasks: constrained image-stream classification, controlled text generation, and text infilling. In all settings, ABS achieves perfect constraint satisfaction, while outperforming or matching state-of-the-art baselines on standard quality metrics and efficiency.
♻ ☆ Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment SP
In this work, we rethink the approach to video super-resolution by introducing a method based on the Diffusion Posterior Sampling framework, combined with an unconditional video diffusion transformer operating in latent space. The video generation model, a diffusion transformer, functions as a space-time model. We argue that a powerful model, which learns the physics of the real world, can easily handle various kinds of motion patterns as prior knowledge, thus eliminating the need for explicit estimation of optical flows or motion parameters for pixel alignment. Furthermore, a single instance of the proposed video diffusion transformer model can adapt to different sampling conditions without re-training. Empirical results on synthetic and real-world datasets illustrate the feasibility of diffusion-based, alignment-free video super-resolution.
comment: ICSPS 2025
♻ ☆ Constrained Optimal Fuel Consumption of HEVs under Observational Noise
In our prior work, we investigated the minimum fuel consumption of a hybrid electric vehicle (HEV) under a state-of-charge (SOC) balance constraint, assuming perfect SOC measurements and accurate reference speed profiles. The constrained optimal fuel consumption (COFC) problem was addressed using a constrained reinforcement learning (CRL) framework. However, in real-world scenarios, SOC readings are often corrupted by sensor noise, and reference speeds may deviate from actual driving conditions. To account for these imperfections, this study reformulates the COFC problem by explicitly incorporating observational noise in both SOC and reference speed. We adopt a robust CRL approach, where the noise is modeled as a uniform distribution, and employ a structured training procedure to ensure stability. The proposed method is evaluated through simulations on the Toyota Prius hybrid system (THS), using both the New European Driving Cycle (NEDC) and the Worldwide Harmonized Light Vehicles Test Cycle (WLTC). Results show that fuel consumption and SOC constraint satisfaction remain robust across varying noise levels. Furthermore, the analysis reveals that observational noise in SOC and speed can impact fuel consumption to different extents. To the best of our knowledge, this is the first study to explicitly examine how observational noise -- commonly encountered in dynamometer testing and predictive energy control (PEC) applications -- affects constrained optimal fuel consumption in HEVs.
comment: Minor text and figure adjustments; no substantive changes
♻ ☆ OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data
Existing benchmarks for multimodal learning in Earth science offer limited, siloed coverage of Earth's spheres and their cross-sphere interactions, typically restricting evaluation to the human-activity sphere of atmosphere and to at most 16 tasks. These limitations: \textit{narrow-source heterogeneity (single/few data sources), constrained scientific granularity, and limited-sphere extensibility}. Therefore, we introduce \textbf{OmniEarth-Bench}, the first multimodal benchmark that systematically spans all six spheres: atmosphere, lithosphere, oceanosphere, cryosphere, biosphere, and human-activity sphere, and cross-spheres. Built with a scalable, modular-topology data inference framework and native multi-observation sources and expert-in-the-loop curation, OmniEarth-Bench produces 29,855 standardized, expert-curated annotations. All annotations are organized into a four-level hierarchy (Sphere, Scenario, Ability, Task), encompassing 109 expert-curated evaluation tasks. Experiments on 9 state-of-the-art MLLMs reveal that even the most advanced models struggle with our benchmarks, where none of them reach 35\% accuracy, revealing systematic gaps in Earth-system cognitive ability. The dataset and evaluation code were released at OmniEarth-Bench (https://anonymous.4open.science/r/OmniEarth-Bench-B1BD).
♻ ☆ Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models NeurIPS 2025
In recent years, diffusion models trained on equilibrium molecular distributions have proven effective for sampling biomolecules. Beyond direct sampling, the score of such a model can also be used to derive the forces that act on molecular systems. However, while classical diffusion sampling usually recovers the training distribution, the corresponding energy-based interpretation of the learned score is often inconsistent with this distribution, even for low-dimensional toy systems. We trace this inconsistency to inaccuracies of the learned score at very small diffusion timesteps, where the model must capture the correct evolution of the data distribution. In this regime, diffusion models fail to satisfy the Fokker--Planck equation, which governs the evolution of the score. We interpret this deviation as one source of the observed inconsistencies and propose an energy-based diffusion model with a Fokker--Planck-derived regularization term to enforce consistency. We demonstrate our approach by sampling and simulating multiple biomolecular systems, including fast-folding proteins, and by introducing a state-of-the-art transferable Boltzmann emulator for dipeptides that supports simulation and achieves improved consistency and efficient sampling. Our code, model weights, and self-contained JAX and PyTorch notebooks are available at https://github.com/noegroup/ScoreMD.
comment: Accepted at Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Astromer 2
Foundational models have emerged as a powerful paradigm in deep learning field, leveraging their capacity to learn robust representations from large-scale datasets and effectively to diverse downstream applications such as classification. In this paper, we present Astromer 2 a foundational model specifically designed for extracting light curve embeddings. We introduce Astromer 2 as an enhanced iteration of our self-supervised model for light curve analysis. This paper highlights the advantages of its pre-trained embeddings, compares its performance with that of its predecessor, Astromer 1, and provides a detailed empirical analysis of its capabilities, offering deeper insights into the model's representations. Astromer 2 is pretrained on 1.5 million single-band light curves from the MACHO survey using a self-supervised learning task that predicts randomly masked observations within sequences. Fine-tuning on a smaller labeled dataset allows us to assess its performance in classification tasks. The quality of the embeddings is measured by the F1 score of an MLP classifier trained on Astromer-generated embeddings. Our results demonstrate that Astromer 2 significantly outperforms Astromer 1 across all evaluated scenarios, including limited datasets of 20, 100, and 500 samples per class. The use of weighted per-sample embeddings, which integrate intermediate representations from Astromer's attention blocks, is particularly impactful. Notably, Astromer 2 achieves a 15% improvement in F1 score on the ATLAS dataset compared to prior models, showcasing robust generalization to new datasets. This enhanced performance, especially with minimal labeled data, underscores the potential of Astromer 2 for more efficient and scalable light curve analysis.
comment: 10 pages, 17 figures
♻ ☆ Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization
Continuous control of non-stationary environments is a major challenge for deep reinforcement learning algorithms. The time-dependency of the state transition dynamics aggravates the notorious stability problems of model-free deep actor-critic architectures. We posit that two properties will play a key role in overcoming non-stationarity in transition dynamics: (i)~preserving the plasticity of the critic network and (ii) directed exploration for rapid adaptation to changing dynamics. We show that performing on-policy reinforcement learning with an evidential critic provides both. The evidential design ensures a fast and accurate approximation of the uncertainty around the state value, which maintains the plasticity of the critic network by detecting the distributional shifts caused by changes in dynamics. The probabilistic critic also makes the actor training objective a random variable, enabling the use of directed exploration approaches as a by-product. We name the resulting algorithm \emph{Evidential Proximal Policy Optimization (EPPO)} due to the integral role of evidential uncertainty quantification in both policy evaluation and policy improvement stages. Through experiments on non-stationary continuous control tasks, where the environment dynamics change at regular intervals, we demonstrate that our algorithm outperforms state-of-the-art on-policy reinforcement learning variants in both task-specific and overall return.
♻ ☆ Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning NeurIPS 2025
Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing their accuracy. Empirical results show that Twilight can adaptively prune at most 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per token latency in long context LLM decoding.
comment: To appear on NeurIPS 2025 (spotlight)
♻ ☆ Prior-Guided Flow Matching for Target-Aware Molecule Design with Learnable Atom Number NeurIPS 2025
Structure-based drug design (SBDD), aiming to generate 3D molecules with high binding affinity toward target proteins, is a vital approach in novel drug discovery. Although recent generative models have shown great potential, they suffer from unstable probability dynamics and mismatch between generated molecule size and the protein pockets geometry, resulting in inconsistent quality and off-target effects. We propose PAFlow, a novel target-aware molecular generation model featuring prior interaction guidance and a learnable atom number predictor. PAFlow adopts the efficient flow matching framework to model the generation process and constructs a new form of conditional flow matching for discrete atom types. A protein-ligand interaction predictor is incorporated to guide the vector field toward higher-affinity regions during generation, while an atom number predictor based on protein pocket information is designed to better align generated molecule size with target geometry. Extensive experiments on the CrossDocked2020 benchmark show that PAFlow achieves a new state-of-the-art in binding affinity (up to -8.31 Avg. Vina Score), simultaneously maintains favorable molecular properties.
comment: Accepted by NeurIPS 2025
♻ ☆ MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems
The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third-a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce sparsity-aware performance metrics-Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU)-to enable accurate performance benchmarking of MoE systems across diverse hardware platforms and deployment scenarios.
♻ ☆ LoLaFL: Low-Latency Federated Learning via Forward-only Propagation
Federated learning (FL) has emerged as a widely adopted paradigm for enabling edge learning with distributed data while ensuring data privacy. However, the traditional FL with deep neural networks trained via backpropagation can hardly meet the low-latency learning requirements in the sixth generation (6G) mobile networks. This challenge mainly arises from the high-dimensional model parameters to be transmitted and the numerous rounds of communication required for convergence due to the inherent randomness of the training process. To address this issue, we adopt the state-of-the-art principle of maximal coding rate reduction to learn linear discriminative features and extend the resultant white-box neural network into FL, yielding the novel framework of Low-Latency Federated Learning (LoLaFL) via forward-only propagation. LoLaFL enables layer-wise transmissions and aggregation with significantly fewer communication rounds, thereby considerably reducing latency. Additionally, we propose two \emph{nonlinear} aggregation schemes for LoLaFL. The first scheme is based on the proof that the optimal NN parameter aggregation in LoLaFL should be harmonic-mean-like. The second scheme further exploits the low-rank structures of the features and transmits the low-rank-approximated covariance matrices of features to achieve additional latency reduction. Theoretic analysis and experiments are conducted to evaluate the performance of LoLaFL. In comparison with traditional FL, the two nonlinear aggregation schemes for LoLaFL can achieve reductions in latency of over 87\% and 97\%, respectively, while maintaining comparable accuracies.
comment: 16 pages, 10 figures
♻ ☆ Cross-modal Diffusion Modelling for Super-resolved Spatial Transcriptomics
The recent advancement of spatial transcriptomics (ST) allows to characterize spatial gene expression within tissue for discovery research. However, current ST platforms suffer from low resolution, hindering in-depth understanding of spatial gene expression. Super-resolution approaches promise to enhance ST maps by integrating histology images with gene expressions of profiled tissue spots. However, current super-resolution methods are limited by restoration uncertainty and mode collapse. Although diffusion models have shown promise in capturing complex interactions between multi-modal conditions, it remains a challenge to integrate histology images and gene expression for super-resolved ST maps. This paper proposes a cross-modal conditional diffusion model for super-resolving ST maps with the guidance of histology images. Specifically, we design a multi-modal disentangling network with cross-modal adaptive modulation to utilize complementary information from histology images and spatial gene expression. Moreover, we propose a dynamic cross-attention modelling strategy to extract hierarchical cell-to-tissue information from histology images. Lastly, we propose a co-expression-based gene-correlation graph network to model the co-expression relationship of multiple genes. Experiments show that our method outperforms other state-of-the-art methods in ST super-resolution on three public datasets.
♻ ☆ Language-Agnostic Modeling of Source Reliability on Wikipedia
Over the last few years, verifying the credibility of information sources has become a fundamental need to combat disinformation. Here, we present a language-agnostic model designed to assess the reliability of web domains as sources in references across multiple language editions of Wikipedia. Utilizing editing activity data, the model evaluates domain reliability within different articles of varying controversiality, such as Climate Change, COVID-19, History, Media, and Biology topics. Crafting features that express domain usage across articles, the model effectively predicts domain reliability, achieving an F1 Macro score of approximately 0.80 for English and other high-resource languages. For mid-resource languages, we achieve 0.65, while the performance of low-resource languages varies. In all cases, the time the domain remains present in the articles (which we dub as permanence) is one of the most predictive features. We highlight the challenge of maintaining consistent model performance across languages of varying resource levels and demonstrate that adapting models from higher-resource languages can improve performance. We believe these findings can assist Wikipedia editors in their ongoing efforts to verify citations and may offer useful insights for other user-generated content communities.
♻ ☆ Improving Uncertainty Estimation through Semantically Diverse Language Generation ICLR 2025
Large language models (LLMs) can suffer from hallucinations when generating text. These hallucinations impede various applications in society and industry by making LLMs untrustworthy. Current LLMs generate text in an autoregressive fashion by predicting and appending text tokens. When an LLM is uncertain about the semantic meaning of the next tokens to generate, it is likely to start hallucinating. Thus, it has been suggested that predictive uncertainty is one of the main causes of hallucinations. We introduce Semantically Diverse Language Generation (SDLG) to quantify predictive uncertainty in LLMs. SDLG steers the LLM to generate semantically diverse yet likely alternatives for an initially generated text. This approach provides a precise measure of aleatoric semantic uncertainty, detecting whether the initial text is likely to be hallucinated. Experiments on question-answering tasks demonstrate that SDLG consistently outperforms existing methods while being the most computationally efficient, setting a new standard for uncertainty estimation in LLMs.
comment: ICLR 2025
♻ ☆ Autoencoding Random Forests NeurIPS 2025
We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally represents relationships in the data. We provide exact and approximate solutions to the decoding problem via constrained optimization, split relabeling, and nearest neighbors regression. These methods effectively invert the compression pipeline, establishing a map from the embedding space back to the input space using splits learned by the ensemble's constituent trees. The resulting decoders are universally consistent under common regularity assumptions. The procedure works with supervised or unsupervised models, providing a window into conditional or joint distributions. We demonstrate various applications of this autoencoder, including powerful new tools for visualization, compression, clustering, and denoising. Experiments illustrate the ease and utility of our method in a wide range of settings, including tabular, image, and genomic data.
comment: 10 pages main text, 27 pages total. 9 figures, 4 tables. To be published in proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ GRAM-DTI: adaptive multimodal representation learning for drug target interaction prediction
Drug target interaction (DTI) prediction is a cornerstone of computational drug discovery, enabling rational design, repurposing, and mechanistic insights. While deep learning has advanced DTI modeling, existing approaches primarily rely on SMILES protein pairs and fail to exploit the rich multimodal information available for small molecules and proteins. We introduce GRAMDTI, a pretraining framework that integrates multimodal molecular and protein inputs into unified representations. GRAMDTI extends volume based contrastive learning to four modalities, capturing higher-order semantic alignment beyond conventional pairwise approaches. To handle modality informativeness, we propose adaptive modality dropout, dynamically regulating each modality's contribution during pre-training. Additionally, IC50 activity measurements, when available, are incorporated as weak supervision to ground representations in biologically meaningful interaction strengths. Experiments on four publicly available datasets demonstrate that GRAMDTI consistently outperforms state of the art baselines. Our results highlight the benefits of higher order multimodal alignment, adaptive modality utilization, and auxiliary supervision for robust and generalizable DTI prediction.
♻ ☆ Evolutionary Machine Learning meets Self-Supervised Learning: a comprehensive survey
The number of studies that combine Evolutionary Machine Learning and self-supervised learning has been growing steadily in recent years. Evolutionary Machine Learning has been shown to help automate the design of machine learning algorithms and to lead to more reliable solutions. Self-supervised learning, on the other hand, has produced good results in learning useful features when labelled data is limited. This suggests that the combination of these two areas can help both in shaping evolutionary processes and in automating the design of deep neural networks, while also reducing the need for labelled data. Still, there are no detailed reviews that explain how Evolutionary Machine Learning and self-supervised learning can be used together. To help with this, we provide an overview of studies that bring these areas together. Based on this growing interest and the range of existing works, we suggest a new sub-area of research, which we call Evolutionary Self-Supervised Learning and introduce a taxonomy for it. Finally, we point out some of the main challenges and suggest directions for future research to help Evolutionary Self-Supervised Learning grow and mature as a field.
♻ ☆ AI-driven software for automated quantification of skeletal metastases and treatment response evaluation using Whole-Body Diffusion-Weighted MRI (WB-DWI) in Advanced Prostate Cancer
Quantitative assessment of treatment response in Advanced Prostate Cancer (APC) with bone metastases remains an unmet clinical need. Whole-Body Diffusion-Weighted MRI (WB-DWI) provides two response biomarkers: Total Diffusion Volume (TDV) and global Apparent Diffusion Coefficient (gADC). However, tracking post-treatment changes of TDV and gADC from manually delineated lesions is cumbersome and increases inter-reader variability. We developed a software to automate this process. Core technologies include: (i) a weakly-supervised Residual U-Net model generating a skeleton probability map to isolate bone; (ii) a statistical framework for WB-DWI intensity normalisation, obtaining a signal-normalised b=900s/mm^2 (b900) image; and (iii) a shallow convolutional neural network that processes outputs from (i) and (ii) to generate a mask of suspected bone lesions, characterised by higher b900 signal intensity due to restricted water diffusion. This mask is applied to the gADC map to extract TDV and gADC statistics. We tested the tool using expert-defined metastatic bone disease delineations on 66 datasets, assessed repeatability of imaging biomarkers (N=10), and compared software-based response assessment with a construct reference standard (N=118). Average dice score between manual and automated delineations was 0.6 for lesions within pelvis and spine, with an average surface distance of 2mm. Relative differences for log-transformed TDV (log-TDV) and median gADC were 8.8% and 5%, respectively. Repeatability analysis showed coefficients of variation of 4.6% for log-TDV and 3.5% for median gADC, with intraclass correlation coefficients of 0.94 or higher. The software achieved 80.5% accuracy, 84.3% sensitivity, and 85.7% specificity in assessing response to treatment. Average computation time was 90s per scan.
♻ ☆ Progressive Growing of Patch Size: Curriculum Learning for Accelerated and Improved Medical Image Segmentation MICCAI2024
In this work, we introduce Progressive Growing of Patch Size, an automatic curriculum learning approach for 3D medical image segmentation. Our approach progressively increases the patch size during model training, resulting in an improved class balance for smaller patch sizes and accelerated convergence of the training process. We evaluate our curriculum approach in two settings: a resource-efficient mode and a performance mode, both regarding Dice score performance and computational costs across 15 diverse and popular 3D medical image segmentation tasks. The resource-efficient mode matches the Dice score performance of the conventional constant patch size sampling baseline with a notable reduction in training time to only 44%. The performance mode improves upon constant patch size segmentation results, achieving a statistically significant relative mean performance gain of 1.28% in Dice Score. Remarkably, across all 15 tasks, our proposed performance mode manages to surpass the constant patch size baseline in Dice Score performance, while simultaneously reducing training time to only 89%. The benefits are particularly pronounced for highly imbalanced tasks such as lesion segmentation tasks. Rigorous experiments demonstrate that our performance mode not only improves mean segmentation performance but also reduces performance variance, yielding more trustworthy model comparison. Furthermore, our findings reveal that the proposed curriculum sampling is not tied to a specific architecture but represents a broadly applicable strategy that consistently boosts performance across diverse segmentation models, including UNet, UNETR, and SwinUNETR. In summary, we show that this simple yet elegant transformation on input data substantially improves both Dice Score performance and training runtime, while being compatible across diverse segmentation backbones.
comment: Journal Extension of "Progressive Growing of Patch Size: Resource-Efficient Curriculum Learning for Dense Prediction Tasks" (MICCAI2024) submitted to MedIA
♻ ☆ Interpreting Emergent Features in Deep Learning-based Side-channel Analysis
Side-channel analysis (SCA) poses a real-world threat by exploiting unintentional physical signals to extract secret information from secure devices. Evaluation labs also use the same techniques to certify device security. In recent years, deep learning has emerged as a prominent method for SCA, achieving state-of-the-art attack performance at the cost of interpretability. Understanding how neural networks extract secrets is crucial for security evaluators aiming to defend against such attacks, as only by understanding the attack can one propose better countermeasures. In this work, we apply mechanistic interpretability to neural networks trained for SCA, revealing \textit{how} models exploit \textit{what} leakage in side-channel traces. We focus on sudden jumps in performance to reverse engineer learned representations, ultimately recovering secret masks and moving the evaluation process from black-box to white-box. Our results show that mechanistic interpretability can scale to realistic SCA settings, even when relevant inputs are sparse, model accuracies are low, and side-channel protections prevent standard input interventions.
comment: 17 pages, 13 figures, 1 table
♻ ☆ Large Language Models are Unreliable for Cyber Threat Intelligence
Several recent works have argued that Large Language Models (LLMs) can be used to tame the data deluge in the cybersecurity field, by improving the automation of Cyber Threat Intelligence (CTI) tasks. This work presents an evaluation methodology that other than allowing to test LLMs on CTI tasks when using zero-shot learning, few-shot learning and fine-tuning, also allows to quantify their consistency and their confidence level. We run experiments with three state-of-the-art LLMs and a dataset of 350 threat intelligence reports and present new evidence of potential security risks in relying on LLMs for CTI. We show how LLMs cannot guarantee sufficient performance on real-size reports while also being inconsistent and overconfident. Few-shot learning and fine-tuning only partially improve the results, thus posing doubts about the possibility of using LLMs for CTI scenarios, where labelled datasets are lacking and where confidence is a fundamental factor.
♻ ☆ HyperHELM: Hyperbolic Hierarchy Encoding for mRNA Language Modeling
Language models are increasingly applied to biological sequences like proteins and mRNA, yet their default Euclidean geometry may mismatch the hierarchical structures inherent to biological data. While hyperbolic geometry provides a better alternative for accommodating hierarchical data, it has yet to find a way into language modeling for mRNA sequences. In this work, we introduce HyperHELM, a framework that implements masked language model pre-training in hyperbolic space for mRNA sequences. Using a hybrid design with hyperbolic layers atop Euclidean backbone, HyperHELM aligns learned representations with the biological hierarchy defined by the relationship between mRNA and amino acids. Across multiple multi-species datasets, it outperforms Euclidean baselines on 9 out of 10 tasks involving property prediction, with 10% improvement on average, and excels in out-of-distribution generalization to long and low-GC content sequences; for antibody region annotation, it surpasses hierarchy-aware Euclidean models by 3% in annotation accuracy. Our results highlight hyperbolic geometry as an effective inductive bias for hierarchical language modeling of mRNA sequences.
♻ ☆ MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents NeurIPS 2025
Recent advances in operating system (OS) agents have enabled vision-language models (VLMs) to directly control a user's computer. Unlike conventional VLMs that passively output text, OS agents autonomously perform computer-based tasks in response to a single user prompt. OS agents do so by capturing, parsing, and analysing screenshots and executing low-level actions via application programming interfaces (APIs), such as mouse clicks and keyboard inputs. This direct interaction with the OS significantly raises the stakes, as failures or manipulations can have immediate and tangible consequences. In this work, we uncover a novel attack vector against these OS agents: Malicious Image Patches (MIPs), adversarially perturbed screen regions that, when captured by an OS agent, induce it to perform harmful actions by exploiting specific APIs. For instance, a MIP can be embedded in a desktop wallpaper or shared on social media to cause an OS agent to exfiltrate sensitive user data. We show that MIPs generalise across user prompts and screen configurations, and that they can hijack multiple OS agents even during the execution of benign instructions. These findings expose critical security vulnerabilities in OS agents that have to be carefully addressed before their widespread deployment.
comment: NeurIPS 2025
♻ ☆ Tongyi DeepResearch Technical Report
We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
comment: https://tongyi-agent.github.io/blog
♻ ☆ Strategic Classification with Non-Linear Classifiers
In strategic classification, the standard supervised learning setting is extended to support the notion of strategic user behavior in the form of costly feature manipulations made in response to a classifier. While standard learning supports a broad range of model classes, the study of strategic classification has, so far, been dedicated mostly to linear classifiers. This work aims to expand the horizon by exploring how strategic behavior manifests under non-linear classifiers and what this implies for learning. We take a bottom-up approach showing how non-linearity affects decision boundary points, classifier expressivity, and model class complexity. Our results show how, unlike the linear case, strategic behavior may either increase or decrease effective class complexity, and that the complexity decrease may be arbitrarily large. Another key finding is that universal approximators (e.g., neural nets) are no longer universal once the environment is strategic. We demonstrate empirically how this can create performance gaps even on an unrestricted model class.
♻ ☆ Dense Backpropagation Improves Training for Sparse Mixture-of-Experts NeurIPS 2025
Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means that MoEs only receive a sparse backward update, leading to training instability and suboptimal performance. We present a lightweight approximation method that gives the MoE router a dense gradient update while continuing to sparsely activate its parameters. Our method, which we refer to as Default MoE, substitutes missing expert activations with default outputs consisting of an exponential moving average of expert outputs previously seen over the course of training. This allows the router to receive signals from every expert for each token, leading to significant improvements in training performance. Our Default MoE outperforms standard TopK routing in a variety of settings without requiring significant computational overhead. Code: https://github.com/vatsal0/default-moe.
comment: NeurIPS 2025
♻ ☆ Tracking solutions of time-varying variational inequalities
Tracking the solution of time-varying variational inequalities is an important problem with applications in game theory, optimization, and machine learning. Existing work considers time-varying games or time-varying optimization problems. For strongly convex optimization problems or strongly monotone games, these results provide tracking guarantees under the assumption that the variation of the time-varying problem is restrained, that is, problems with a sublinear solution path. In this work we extend existing results in two ways: In our first result, we provide tracking bounds for (1) variational inequalities with a sublinear solution path but not necessarily monotone functions, and (2) for periodic time-varying variational inequalities that do not necessarily have a sublinear solution path-length. Our second main contribution is an extensive study of the convergence behavior and trajectory of discrete dynamical systems of periodic time-varying VI. We show that these systems can exhibit provably chaotic behavior or can converge to the solution. Finally, we illustrate our theoretical results with experiments.
♻ ☆ Not All Clients Are Equal: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients
As AI becomes more personal, e.g., Agentic AI, there is an increasing need for personalizing models for various use cases. Personalized federated learning (PFL) enables each client to collaboratively leverage other clients' knowledge for better adaptation to the task of interest, without privacy risks. Despite its potential, existing PFL methods remain confined to rather simplified scenarios where data and models are the same across clients. To move towards realistic scenarios, we propose FedMosaic, a method that jointly addresses data and model heterogeneity with a task-relevance-aware model aggregation strategy to reduce parameter interference, and a dimension-invariant module that enables knowledge sharing across heterogeneous architectures without huge computational cost. To mimic the real-world task diversity, we propose a multi-modal PFL benchmark spanning 40 distinct tasks with distribution shifts over time. The empirical study shows that FedMosaic outperforms the state-of-the-art PFL methods, excelling in both personalization and generalization capabilities under challenging, realistic scenarios.
♻ ☆ Link Prediction with Untrained Message Passing Layers
Message passing neural networks (MPNNs) operate on graphs by exchanging information between neigbouring nodes. MPNNs have been successfully applied to various node-, edge-, and graph-level tasks in areas like molecular science, computer vision, natural language processing, and combinatorial optimization. However, most MPNNs require training on large amounts of labeled data, which can be costly and time-consuming. In this work, we explore the use of various untrained message passing layers in graph neural networks, i.e. variants of popular message passing architecture where we remove all trainable parameters that are used to transform node features in the message passing step. Focusing on link prediction, we find that untrained message passing layers can lead to competitive and even superior performance compared to fully trained MPNNs, especially in the presence of high-dimensional features. We provide a theoretical analysis of untrained message passing by relating the inner products of features implicitly produced by untrained message passing layers to path-based topological node similarity measures. As such, untrained message passing architectures can be viewed as a highly efficient and interpretable approach to link prediction.
♻ ☆ NMCSE: Noise-Robust Multi-Modal Coupling Signal Estimation Method via Optimal Transport for Cardiovascular Disease Detection
The coupling signal refers to a latent physiological signal that characterizes the transformation from cardiac electrical excitation, captured by the electrocardiogram (ECG), to mechanical contraction, recorded by the phonocardiogram (PCG). By encoding the temporal and functional interplay between electrophysiological and hemodynamic events, it serves as an intrinsic link between modalities and offers a unified representation of cardiac function, with strong potential to enhance multi-modal cardiovascular disease (CVD) detection. However, existing coupling signal estimation methods remain highly vulnerable to noise, particularly in real-world clinical and physiological settings, which undermines their robustness and limits practical value. In this study, we propose Noise-Robust Multi-Modal Coupling Signal Estimation (NMCSE), which reformulates coupling signal estimation as a distribution matching problem solved via optimal transport. By jointly aligning amplitude and timing, NMCSE avoids noise amplification and enables stable signal estimation. When integrated into a Temporal-Spatial Feature Extraction (TSFE) network, the estimated coupling signal effectively enhances multi-modal fusion for more accurate CVD detection. To evaluate robustness under real-world conditions, we design two complementary experiments targeting distinct sources of noise. The first uses the PhysioNet 2016 dataset with simulated hospital noise to assess the resilience of NMCSE to clinical interference. The second leverages the EPHNOGRAM dataset with motion-induced physiological noise to evaluate intra-state estimation stability across activity levels. Experimental results show that NMCSE consistently outperforms existing methods under both clinical and physiological noise, highlighting it as a noise-robust estimation approach that enables reliable multi-modal cardiac detection in real-world conditions.
♻ ☆ When Is Diversity Rewarded in Cooperative Multi-Agent Learning?
The success of teams in robotics, nature, and society often depends on the division of labor among diverse specialists; however, a principled explanation for when such diversity surpasses a homogeneous team is still missing. Focusing on multi-agent task allocation problems, we study this question from the perspective of reward design: what kinds of objectives are best suited for heterogeneous teams? We first consider an instantaneous, non-spatial setting where the global reward is built by two generalized aggregation operators: an inner operator that maps the $N$ agents' effort allocations on individual tasks to a task score, and an outer operator that merges the $M$ task scores into the global team reward. We prove that the curvature of these operators determines whether heterogeneity can increase reward, and that for broad reward families this collapses to a simple convexity test. Next, we ask what incentivizes heterogeneity to emerge when embodied, time-extended agents must learn an effort allocation policy. To study heterogeneity in such settings, we use multi-agent reinforcement learning (MARL) as our computational paradigm, and introduce Heterogeneity Gain Parameter Search (HetGPS), a gradient-based algorithm that optimizes the parameter space of underspecified MARL environments to find scenarios where heterogeneity is advantageous. Across different environments, we show that HetGPS rediscovers the reward regimes predicted by our theory to maximize the advantage of heterogeneity, both validating HetGPS and connecting our theoretical insights to reward design in MARL. Together, these results help us understand when behavioral diversity delivers a measurable benefit.
♻ ☆ Exploring Federated Learning for Thermal Urban Feature Segmentation -- A Comparison of Centralized and Decentralized Approaches CCS
Federated Learning (FL) is an approach for training a shared Machine Learning (ML) model with distributed training data and multiple participants. FL allows bypassing limitations of the traditional Centralized Machine Learning CL if data cannot be shared or stored centrally due to privacy or technical restrictions -- the participants train the model locally with their training data and do not need to share it among the other participants. This paper investigates the practical implementation and effectiveness of FL in a real-world scenario, specifically focusing on unmanned aerial vehicle (UAV)-based thermal images for common thermal feature detection in urban environments. The distributed nature of the data arises naturally and makes it suitable for FL applications, as images captured in two German cities are available. This application presents unique challenges due to non-identical distribution and feature characteristics of data captured at both locations. The study makes several key contributions by evaluating FL algorithms in real deployment scenarios rather than simulation. We compare several FL approaches with a centralized learning baseline across key performance metrics such as model accuracy, training time, communication overhead, and energy usage. This paper also explores various FL workflows, comparing client-controlled workflows and server-controlled workflows. The findings of this work serve as a valuable reference for understanding the practical application and limitations of the FL methods in segmentation tasks in UAV-based imaging.
comment: The Version of Record of this contribution is published in Computational Science and Its Applications (ICCSA) 2025, and is available online at https://doi.org/10.1007/978-3-031-97000-9
♻ ☆ Investigating the Robustness of Knowledge Tracing Models in the Presence of Student Concept Drift
Knowledge Tracing (KT) has been an established problem in the educational data mining field for decades, and it is commonly assumed that the underlying learning process being modeled remains static. Given the ever-changing landscape of online learning platforms (OLPs), we investigate how concept drift and changing student populations can impact student behavior within an OLP through testing model performance both within a single academic year and across multiple academic years. Four well-studied KT models were applied to five academic years of data to assess how susceptible KT models are to concept drift. Through our analysis, we find that all four families of KT models can exhibit degraded performance, Bayesian Knowledge Tracing (BKT) remains the most stable KT model when applied to newer data, while more complex, attention based models lose predictive power significantly faster.
comment: 10 pages, 6 figures
♻ ☆ From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit NeurIPS 2025
Motivated by the hypothesis that neural network representations encode abstract, interpretable features as linearly accessible, approximately orthogonal directions, sparse autoencoders (SAEs) have become a popular tool in interpretability. However, recent work has demonstrated phenomenology of model representations that lies outside the scope of this hypothesis, showing signatures of hierarchical, nonlinear, and multi-dimensional features. This raises the question: do SAEs represent features that possess structure at odds with their motivating hypothesis? If not, does avoiding this mismatch help identify said features and gain further insights into neural network representations? To answer these questions, we take a construction-based approach and re-contextualize the popular matching pursuits (MP) algorithm from sparse coding to design MP-SAE -- an SAE that unrolls its encoder into a sequence of residual-guided steps, allowing it to capture hierarchical and nonlinearly accessible features. Comparing this architecture with existing SAEs on a mixture of synthetic and natural data settings, we show: (i) hierarchical concepts induce conditionally orthogonal features, which existing SAEs are unable to faithfully capture, and (ii) the nonlinear encoding step of MP-SAE recovers highly meaningful features, helping us unravel shared structure in the seemingly dichotomous representation spaces of different modalities in a vision-language model, hence demonstrating the assumption that useful features are solely linearly accessible is insufficient. We also show that the sequential encoder principle of MP-SAE affords an additional benefit of adaptive sparsity at inference time, which may be of independent interest. Overall, we argue our results provide credence to the idea that interpretability should begin with the phenomenology of representations, with methods emerging from assumptions that fit it.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Scalable and Cost-Efficient de Novo Template-Based Molecular Generation
Template-based molecular generation offers a promising avenue for drug design by ensuring generated compounds are synthetically accessible through predefined reaction templates and building blocks. In this work, we tackle three core challenges in template-based GFlowNets: (1) minimizing synthesis cost, (2) scaling to large building block libraries, and (3) effectively utilizing small fragment sets. We propose Recursive Cost Guidance, a backward policy framework that employs auxiliary machine learning models to approximate synthesis cost and viability. This guidance steers generation toward low-cost synthesis pathways, significantly enhancing cost-efficiency, molecular diversity, and quality, especially when paired with an Exploitation Penalty that balances the trade-off between exploration and exploitation. To enhance performance in smaller building block libraries, we develop a Dynamic Library mechanism that reuses intermediate high-reward states to construct full synthesis trees. Our approach establishes state-of-the-art results in template-based molecular generation.
♻ ☆ MTL-KD: Multi-Task Learning Via Knowledge Distillation for Generalizable Neural Vehicle Routing Solver NeurIPS 2025
Multi-Task Learning (MTL) in Neural Combinatorial Optimization (NCO) is a promising approach to train a unified model capable of solving multiple Vehicle Routing Problem (VRP) variants. However, existing Reinforcement Learning (RL)-based multi-task methods can only train light decoder models on small-scale problems, exhibiting limited generalization ability when solving large-scale problems. To overcome this limitation, this work introduces a novel multi-task learning method driven by knowledge distillation (MTL-KD), which enables the efficient training of heavy decoder models with strong generalization ability. The proposed MTL-KD method transfers policy knowledge from multiple distinct RL-based single-task models to a single heavy decoder model, facilitating label-free training and effectively improving the model's generalization ability across diverse tasks. In addition, we introduce a flexible inference strategy termed Random Reordering Re-Construction (R3C), which is specifically adapted for diverse VRP tasks and further boosts the performance of the multi-task model. Experimental results on 6 seen and 10 unseen VRP variants with up to 1000 nodes indicate that our proposed method consistently achieves superior performance on both uniform and real-world benchmarks, demonstrating robust generalization abilities.
comment: Accepted at NeurIPS 2025
♻ ☆ Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit
Sparse autoencoders (SAEs) have recently become central tools for interpretability, leveraging dictionary learning principles to extract sparse, interpretable features from neural representations whose underlying structure is typically unknown. This paper evaluates SAEs in a controlled setting using MNIST, which reveals that current shallow architectures implicitly rely on a quasi-orthogonality assumption that limits the ability to extract correlated features. To move beyond this, we compare them with an iterative SAE that unrolls Matching Pursuit (MP-SAE), enabling the residual-guided extraction of correlated features that arise in hierarchical settings such as handwritten digit generation while guaranteeing monotonic improvement of the reconstruction as more atoms are selected.
comment: Complementary work to arXiv:2506.03093
♻ ☆ Cash Flow Underwriting with Bank Transaction Data: Advancing MSME Financial Inclusion in Malaysia
Despite accounting for 96.1% of all businesses in Malaysia, access to financing remains one of the most persistent challenges faced by Micro, Small, and Medium Enterprises (MSMEs). Newly established or young businesses are often excluded from formal credit markets as traditional underwriting approaches rely heavily on credit bureau data. This study investigates the potential of bank statement data as an alternative data source for credit assessment to promote financial inclusion in emerging markets. Firstly, we propose a cash flow-based underwriting pipeline where we utilise bank statement data for end-to-end data extraction and machine learning credit scoring. Secondly, we introduce a novel dataset of 611 loan applicants from a Malaysian lending institution. Thirdly, we develop and evaluate credit scoring models based on application information and bank transaction-derived features. Empirical results show that the use of such data boosts the performance of all models on our dataset, which can improve credit scoring for new-to-lending MSMEs. Lastly, we intend to release the anonymised bank transaction dataset to facilitate further research on MSMEs financial inclusion within Malaysia's emerging economy.
comment: Accepted for oral presentation at the AI for Financial Inclusion, Risk Modeling and Resilience in Emerging Markets (FinRem) Workshop at ACM ICAIF 2025, Singapore
♻ ☆ FlowRL: Matching Reward Distributions for LLM Reasoning
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (\eg, PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of $10.0\%$ over GRPO and $5.1\%$ over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
♻ ☆ Modeling Hierarchical Spaces: A Review and Unified Framework for Surrogate-Based Architecture Design
Simulation-based problems involving mixed-variable inputs frequently feature domains that are hierarchical, conditional, heterogeneous, or tree-structured. These characteristics pose challenges for data representation, modeling, and optimization. This paper reviews extensive literature on these structured input spaces and proposes a unified framework that generalizes existing approaches. In this framework, input variables may be continuous, integer, or categorical. A variable is described as meta if its value governs the presence of other decreed variables, enabling the modeling of conditional and hierarchical structures. We further introduce the concept of partially-decreed variables, whose activation depends on contextual conditions. To capture these inter-variable hierarchical relationships, we introduce design space graphs, combining principles from feature modeling and graph theory. This allows the definition of general hierarchical domains suitable for describing complex system architectures. Our framework defines hierarchical distances and kernels to enable surrogate modeling and optimization on hierarchical domains. We demonstrate its effectiveness on complex system design problems, including a neural network and a green-aircraft case study. Our methods are available in the open-source Surrogate Modeling Toolbox (SMT 2.0).
comment: version number two
♻ ☆ ReNF: Rethinking the Design Space of Neural Long-Term Time Series Forecasters
Neural Forecasters (NFs) are a cornerstone of Long-term Time Series Forecasting (LTSF). However, progress has been hampered by an overemphasis on architectural complexity at the expense of fundamental forecasting principles. In this work, we return to first principles to redesign the LTSF paradigm. We begin by introducing a Multiple Neural Forecasting Theorem that provides a theoretical basis for our approach. We propose Boosted Direct Output (BDO), a novel forecasting strategy that synergistically combines the advantages of both Auto-Regressive (AR) and Direct Output (DO). In addition, we stabilize the learning process by smoothly tracking the model's parameters. Extensive experiments show that these principled improvements enable a simple MLP to achieve state-of-the-art performance, outperforming recent, complex models in nearly all cases, without any specific considerations in the area. Finally, we empirically verify our theorem, establishing a dynamic performance bound and identifying promising directions for future research. The code for review is available at: .
♻ ☆ QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation NeurIPS 2025
The remarkable progress of Large Language Models (LLMs) presents promising opportunities for Verilog code generation which is significantly important for automated circuit design. The lacking of meaningful functional rewards hinders the preference optimization based on Reinforcement Learning (RL) for producing functionally correct Verilog code. In this paper, we propose Signal-Aware Learning for Verilog code generation (QiMeng-SALV) by leveraging code segments of functionally correct output signal to optimize RL training. Considering Verilog code specifies the structural interconnection of hardware gates and wires so that different output signals are independent, the key insight of QiMeng-SALV is to extract verified signal-aware implementations in partially incorrect modules, so as to enhance the extraction of meaningful functional rewards. Roughly, we verify the functional correctness of signals in generated module by comparing with that of reference module in the training data. Then abstract syntax tree (AST) is employed to identify signal-aware code segments which can provide meaningful functional rewards from erroneous modules. Finally, we introduce signal-aware DPO which is optimized on the correct signal-level code segments, thereby preventing noise and interference from incorrect signals. The proposed QiMeng-SALV underscores the paradigm shift from conventional module-level to fine-grained signal-level optimization in Verilog code generation, addressing the issue of insufficient functional rewards. Experiments demonstrate that our method achieves state-of-the-art performance on VerilogEval and RTLLM, with a 7B parameter model matching the performance of the DeepSeek v3 671B model and significantly outperforming the leading open-source model CodeV trained on the same dataset. Our code is available at https://github.com/zy1xxx/SALV.
comment: Accepted to NeurIPS 2025
♻ ☆ Explainable Graph Neural Architecture Search via Monte-Carlo Tree Search (Full version)
The number of graph neural network (GNN) architectures has increased rapidly due to the growing adoption of graph analysis. Although we use GNNs in wide application scenarios, it is a laborious task to design/select optimal GNN architectures in diverse graphs. To reduce human efforts, graph neural architecture search (Graph NAS) has been used to search for a sub-optimal GNN architecture that combines existing components. However, existing Graph NAS methods lack explainability to understand the reasons why the model architecture is selected because they use complex search space and neural models to select architecture. Therefore, we propose an explainable Graph NAS method, called ExGNAS, which consists of (i) a simple search space that can adapt to various graphs and (ii) a search algorithm with Monte-Carlo tree that makes the decision process explainable. The combination of our search space and algorithm achieves finding accurate GNN models and the important functions within the search space. We comprehensively evaluate ExGNAS compared with four state-of-the-art Graph NAS methods in twelve graphs. Our experimental results show that ExGNAS achieves high average accuracy and efficiency; improving accuracy up to 26.1% and reducing run time up to 88%. Furthermore, we show the effectiveness of explainability by questionnaire-based user study and architecture analysis.
♻ ☆ Differentiable Fast Top-K Selection for Large-Scale Recommendation
Cascade ranking is a widely adopted paradigm in large-scale information retrieval systems for Top-K item selection. However, the Top-K operator is non-differentiable, hindering end-to-end training. Existing methods include Learning-to-Rank approaches (e.g., LambdaLoss), which optimize ranking metrics like NDCG and suffer from objective misalignment, and differentiable sorting-based methods (e.g., ARF, LCRON), which relax permutation matrices for direct Top-K optimization but introduce gradient conflicts through matrix aggregation. A promising alternative is to directly construct a differentiable approximation of the Top-K selection operator, bypassing the use of soft permutation matrices. However, even state-of-the-art differentiable Top-K operator (e.g., LapSum) require $O(n \log n)$ complexity due to their dependence on sorting for solving the threshold. Thus, we propose DFTopK, a novel differentiable Top-K operator achieving optimal $O(n)$ time complexity. By relaxing normalization constraints, DFTopK admits a closed-form solution and avoids sorting. DFTopK also avoids the gradient conflicts inherent in differentiable sorting-based methods. We evaluate DFTopK on both the public benchmark RecFLow and an industrial system. Experimental results show that DFTopK significantly improves training efficiency while achieving superior performance, which enables us to scale up training samples more efficiently. In the online A/B test, DFTopK yielded a +1.77% revenue lift with the same computational budget compared to the baseline. To the best of our knowledge, this work is the first to introduce differentiable Top-K operators into recommendation systems and the first to achieve theoretically optimal linear-time complexity for Top-K selection. We have open-sourced our implementation to facilitate future research in both academia and industry.
comment: 12 pages, 5 figures. Minor corrections in v2
♻ ☆ WaveStitch: Flexible and Fast Conditional Time Series Generation with Diffusion Models SIGMOD 2026
Generating temporal data under conditions is crucial for forecasting, imputation, and generative tasks. Such data often has metadata and partially observed signals that jointly influence the generated values. However, existing methods face three key limitations: (1) they condition on either the metadata or observed values, but rarely both together; (2) they adopt either training-time approaches that fail to generalize to unseen scenarios, or inference-time approaches that ignore metadata; and (3) they suffer from trade-offs between generation speed and temporal coherence across time windows--choosing either slow but coherent autoregressive methods or fast but incoherent parallel ones. We propose WaveStitch, a novel diffusion-based method to overcome these hurdles through: (1) dual-sourced conditioning on both metadata and partially observed signals; (2) a hybrid training-inference architecture, incorporating metadata during training and observations at inference via gradient-based guidance; and (3) a novel pipeline-style paradigm that generates time windows in parallel while preserving coherence through an inference-time conditional loss and a stitching mechanism. Across diverse datasets, WaveStitch demonstrates adaptability to arbitrary patterns of observed signals, achieving 1.81x lower mean-squared-error compared to the state-of-the-art, and generates data up to 166.48x faster than autoregressive methods while maintaining coherence. Our code is available at: https://github.com/adis98/WaveStitch
comment: Accepted at ACM SIGMOD 2026
Human-Computer Interaction 27
☆ Audience Amplified: Virtual Audiences in Asynchronously Performed AR Theater
Audience reactions can considerably enhance live experiences; conversely, in anytime-anywhere augmented reality (AR) experiences, large crowds of people might not always be available to congregate. To get closer to simulating live events with large audiences, we created a mobile AR experience where users can wander around naturally and engage in AR theater with virtual audiences trained from real audiences using imitation learning. This allows us to carefully capture the essence of human imperfections and behavior in artificial intelligence (AI) audiences. The result is a novel mobile AR experience in which solitary AR users experience an augmented performance in a physical space with a virtual audience. Virtual dancers emerge from the surroundings, accompanied by a digitally simulated audience, to provide a community experience akin to immersive theater. In a pilot study, simulated human avatars were vastly preferred over just audience audio commentary. We subsequently engaged 20 participants as attendees of an AR dance performance, comparing a no-audience condition with a simulated audience of six onlookers. Through questionnaires and experience reports, we investigated user reactions and behavior. Our results demonstrate that the presence of virtual audience members caused attendees to perceive the performance as a social experience with increased interest and involvement in the event. On the other hand, for some attendees, the dance performances without the virtual audience evoked a stronger positive sentiment.
comment: Conference Paper, 10 pages. Published at the 2024 IEEE International Symposium on Mixed and Augmented Reality (ISMAR)
☆ Does Interpretability of Knowledge Tracing Models Support Teacher Decision Making?
Knowledge tracing (KT) models are a crucial basis for pedagogical decision-making, namely which task to select next for a learner and when to stop teaching a particular skill. Given the high stakes of pedagogical decisions, KT models are typically required to be interpretable, in the sense that they should implement an explicit model of human learning and provide explicit estimates of learners' abilities. However, to our knowledge, no study to date has investigated whether the interpretability of KT models actually helps human teachers to make teaching decisions. We address this gap. First, we perform a simulation study to show that, indeed, decisions based on interpretable KT models achieve mastery faster compared to decisions based on a non-interpretable model. Second, we repeat the study but ask $N=12$ human teachers to make the teaching decisions based on the information provided by KT models. As expected, teachers rate interpretable KT models higher in terms of usability and trustworthiness. However, the number of tasks needed until mastery hardly differs between KT models. This suggests that the relationship between model interpretability and teacher decisions is not straightforward: teachers do not solely rely on KT models to make decisions and further research is needed to investigate how learners and teachers actually understand and use KT models.
comment: in press at the Workshop on Epistemics and Decision-Making in AI-Supported Education, AIED 2025
☆ DropleX: Liquid sensing on tablet touchscreens
We present DropleX, the first system that enables liquid sensing using the capacitive touchscreen of commodity tablets. DropleX detects microliter-scale liquid samples, and performs non-invasive, through-container measurements to detect whether a drink has been spiked or if a sealed liquid has been contaminated. These capabilities are made possible by a physics-informed mechanism that disables the touchscreen's built-in adaptive filters, originally designed to reject the effects of liquid drops such as rain, without any hardware modifications. We model the touchscreen's sensing capabilities, limits, and non-idealities to inform the design of a signal processing and learning-based pipeline for liquid sensing. Our system achieves 96-99% accuracy in detecting microliter-scale adulteration in soda, wine, and milk, 93-96% accuracy in threshold detection of trace chemical concentrations, and 86-96% accuracy in through-container adulterant detection. Given the predominance of touchscreens, these exploratory results can open new opportunities for liquid sensing on everyday devices.
☆ A Multi-Agent Psychological Simulation System for Human Behavior Modeling
Training and education in human-centered fields require authentic practice, yet realistic simulations of human behavior have remained limited. We present a multi-agent psychological simulation system that models internal cognitive-affective processes to generate believable human behaviors. In contrast to black-box neural models, this system is grounded in established psychological theories (e.g., self-efficacy, mindset, social constructivism) and explicitly simulates an ``inner parliament'' of agents corresponding to key psychological factors. These agents deliberate and interact to determine the system's output behavior, enabling unprecedented transparency and alignment with human psychology. We describe the system's architecture and theoretical foundations, illustrate its use in teacher training and research, and discuss how it embodies principles of social learning, cognitive apprenticeship, deliberate practice, and meta-cognition.
☆ The Other Side of the Screen: Motivations to Watch and Engage in Software Development Live Streams
Background: With the popularity of live streaming platforms at an all-time high, and many people turning to alternative venues for educational needs, this full research paper explores the viewership habits of software and game development live streams through the lens of informal education opportunities. Purpose: We investigate why developers watch software and game development live streams to understand the educational and social benefits they derive from this emerging form of informal learning. Methods: We implement a mixed-methods study combining survey data from 39 viewers and nine semi-structured interviews to analyze motivations, perceptions, and outcomes of watching development live streams. Findings: This research finds that viewers are motivated by both educational and social factors, with community engagement and informal mentorship as key motivations. Additionally, we find that technical learning draws initial interest, but social connections and co-working aspects sustain long-term engagement. Implications: Live streaming serves as a valuable informal learning tool that combines self-directed technical education with community support, which suggests that developers can leverage these platforms for continuous learning and professional growth outside of or in addition to traditional educational structures.
comment: 10 pages, 1 figure, peer reviewed and accepted at Frontiers in Education 2025 (FIE2025)
☆ SigmaCollab: An Application-Driven Dataset for Physically Situated Collaboration
We introduce SigmaCollab, a dataset enabling research on physically situated human-AI collaboration. The dataset consists of a set of 85 sessions in which untrained participants were guided by a mixed-reality assistive AI agent in performing procedural tasks in the physical world. SigmaCollab includes a set of rich, multimodal data streams, such as the participant and system audio, egocentric camera views from the head-mounted device, depth maps, head, hand and gaze tracking information, as well as additional annotations performed post-hoc. While the dataset is relatively small in size (~ 14 hours), its application-driven and interactive nature brings to the fore novel research challenges for human-AI collaboration, and provides more realistic testing grounds for various AI models operating in this space. In future work, we plan to use the dataset to construct a set of benchmarks for physically situated collaboration in mixed-reality task assistive scenarios. SigmaCollab is available at https://github.com/microsoft/SigmaCollab.
☆ Emotional Contagion in Code: How GitHub Emoji Reactions Shape Developer Collaboration
Developer communities increasingly rely on emoji reactions to communicate, but we know little about how these emotional signals spread and influence technical discussions. We analyzed 2,098 GitHub issues and pull requests across 50 popular repositories, examining patterns in 106,743 emoji reactions to understand emotional contagion in software development. Our findings reveal a surprisingly positive emotional landscape: 57.4% of discussions carry positive sentiment, with positive emotional cascades outnumbering negative ones 23:1. We identified five distinct patterns, with "instant enthusiasm" affecting 45.6% of items--nearly half receive immediate positive reinforcement. Statistical analysis confirms strong emotional contagion (r=0.679, p<0.001) with a massive effect size (d=2.393), suggesting that initial reactions powerfully shape discussion trajectories. These findings challenge assumptions about technical discourse being purely rational, demonstrating that even minimal emotional signals create measurable ripple effects. Our work provides empirical evidence that emoji reactions are not mere decoration but active forces shaping collaborative outcomes in software development.
comment: 12 pages, 3 figures. Analysis of 106,743 emoji reactions across 2,098 GitHub items
☆ HAGI++: Head-Assisted Gaze Imputation and Generation
Mobile eye tracking plays a vital role in capturing human visual attention across both real-world and extended reality (XR) environments, making it an essential tool for applications ranging from behavioural research to human-computer interaction. However, missing values due to blinks, pupil detection errors, or illumination changes pose significant challenges for further gaze data analysis. To address this challenge, we introduce HAGI++ - a multi-modal diffusion-based approach for gaze data imputation that, for the first time, uses the integrated head orientation sensors to exploit the inherent correlation between head and eye movements. HAGI++ employs a transformer-based diffusion model to learn cross-modal dependencies between eye and head representations and can be readily extended to incorporate additional body movements. Extensive evaluations on the large-scale Nymeria, Ego-Exo4D, and HOT3D datasets demonstrate that HAGI++ consistently outperforms conventional interpolation methods and deep learning-based time-series imputation baselines in gaze imputation. Furthermore, statistical analyses confirm that HAGI++ produces gaze velocity distributions that closely match actual human gaze behaviour, ensuring more realistic gaze imputations. Moreover, by incorporating wrist motion captured from commercial wearable devices, HAGI++ surpasses prior methods that rely on full-body motion capture in the extreme case of 100% missing gaze data (pure gaze generation). Our method paves the way for more complete and accurate eye gaze recordings in real-world settings and has significant potential for enhancing gaze-based analysis and interaction across various application domains.
comment: Extended version of our UIST'25 paper "HAGI: Head-Assisted Gaze Imputation for Mobile Eye Trackers"
☆ OpenCourier: an Open Protocol for Building a Decentralized Ecosystem of Community-owned Delivery Platforms
Although the platform gig economy has reshaped the landscape of work, its centralized operation by select actors has brought about challenges that impedes workers' well-being. We present the architecture and design of OpenCourier, an open protocol that defines communication patterns within a decentralized ecosystem of delivery platforms. Through this protocol, we aim to address three key challenges in the current economy: power imbalances between the platform and workers, information asymmetries caused by black-boxed algorithms and value misalignments in the infrastructure design process. With the OpenCourier protocol, we outline a blueprint for community-owned ecosystem of delivery platforms that centers worker agency, transparency, and bottom-up design.
☆ Can Conversational AI Counsel for Change? A Theory-Driven Approach to Supporting Dietary Intentions in Ambivalent Individuals
Adherence to healthy diets reduces chronic illness risk, yet rates remain low. Large Language Models (LLMs) are increasingly used for health communication but often struggle to engage individuals with ambivalent intentions at a pivotal stage of the Transtheoretical Model (TTM). We developed CounselLLM, an open-source model enhanced through persona design and few-shot, domain-specific prompts grounded in TTM and Motivational Interviewing (MI). In controlled evaluations, CounselLLM showed stronger use of TTM subprocesses and MI affirmations than human counselors, with comparable linguistic robustness but expressed in more concrete terms. A user study then tested CounselLLM in an interactive counseling setting against a baseline system. While knowledge and perceptions did not change, participants' intentions for immediate dietary change increased significantly after interacting with CounselLLM. Participants also rated it as easy to use, understandable, and supportive. These findings suggest theory-driven LLMs can effectively engage ambivalent individuals and provide a scalable approach to digital counseling.
☆ Revisiting put-that-there, context aware window interactions via LLMs
We revisit Bolt's classic "Put-That-There" concept for modern head-mounted displays by pairing Large Language Models (LLMs) with XR sensor and tech stack. The agent fuses (i) a semantically segmented 3-D environment, (ii) live application metadata, and (iii) users' verbal, pointing, and head-gaze cues to issue JSON window-placement actions. As a result, users can manage a panoramic workspace through: (1) explicit commands ("Place Google Maps on the coffee table"), (2) deictic speech plus gestures ("Put that there"), or (3) high-level goals ("I need to send a message"). Unlike traditional explicit interfaces, our system supports one-to-many action mappings and goal-centric reasoning, allowing the LLM to dynamically infer relevant applications and layout decisions, including interrelationships across tools. This enables seamless, intent-driven interaction without manual window juggling in immersive XR environments.
☆ AI Credibility Signals Outrank Institutions and Engagement in Shaping News Perception on Social Media
AI-generated content is rapidly becoming a salient component of online information ecosystems, yet its influence on public trust and epistemic judgments remains poorly understood. We present a large-scale mixed-design experiment (N = 1,000) investigating how AI-generated credibility scores affect user perception of political news. Our results reveal that AI feedback significantly moderates partisan bias and institutional distrust, surpassing traditional engagement signals such as likes and shares. These findings demonstrate the persuasive power of generative AI and suggest a need for design strategies that balance epistemic influence with user autonomy.
☆ The Pervasive Blind Spot: Benchmarking VLM Inference Risks on Everyday Personal Videos
The proliferation of Vision-Language Models (VLMs) introduces profound privacy risks from personal videos. This paper addresses the critical yet unexplored inferential privacy threat, the risk of inferring sensitive personal attributes over the data. To address this gap, we crowdsourced a dataset of 508 everyday personal videos from 58 individuals. We then conducted a benchmark study evaluating VLM inference capabilities against human performance. Our findings reveal three critical insights: (1) VLMs possess superhuman inferential capabilities, significantly outperforming human evaluators, leveraging a shift from object recognition to behavioral inference from temporal streams. (2) Inferential risk is strongly correlated with factors such as video characteristics and prompting strategies. (3) VLM-driven explanation towards the inference is unreliable, as we revealed a disconnect between the model-generated explanations and evidential impact, identifying ubiquitous objects as misleading confounders.
☆ Human-Machine Ritual: Synergic Performance through Real-Time Motion Recognition NeurIPS 2025
We introduce a lightweight, real-time motion recognition system that enables synergic human-machine performance through wearable IMU sensor data, MiniRocket time-series classification, and responsive multimedia control. By mapping dancer-specific movement to sound through somatic memory and association, we propose an alternative approach to human-machine collaboration, one that preserves the expressive depth of the performing body while leveraging machine learning for attentive observation and responsiveness. We demonstrate that this human-centered design reliably supports high accuracy classification (<50 ms latency), offering a replicable framework to integrate dance-literate machines into creative, educational, and live performance contexts.
comment: 8 pages, 5 figures. Camera-ready manuscript for the Creative AI Track of NeurIPS 2025
☆ Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results
Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use-cases of LLMs. However, chatbots used in medical contexts must provide consistent advice in situations where non-medical factors are involved, such as when demographic information is present. In order to understand the conditions under which medical chatbots fail to perform as expected, we develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions that we subsequently use to prompt LLMs. In 2), our evaluation pipeline provides hallucination and omission detection using LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge treatment category detectors. As a baseline study, we perform two case studies on inter-LLM agreement and the impact of varying the answering and evaluation LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen's Kappa $\kappa=0.118$), and only specific (answering, evaluation) LLM pairs yield statistically significant differences across writing styles, genders, and races. We recommend that studies using LLM evaluation use multiple LLMs as evaluators in order to avoid arriving at statistically significant but non-generalizable results, particularly in the absence of ground-truth data. We also suggest publishing inter-LLM agreement metrics for transparency. Our code and dataset are available here: https://github.com/BBN-E/medic-neurips-2025-demo.
☆ Learning Spatial Awareness for Laparoscopic Surgery with AI Assisted Visual Feedback
Laparoscopic surgery constrains surgeons spatial awareness because procedures are performed through a monocular, two-dimensional (2D) endoscopic view. Conventional training methods using dry-lab models or recorded videos provide limited depth cues, often leading trainees to misjudge instrument position and perform ineffective or unsafe maneuvers. To address this limitation, we present an AI-assisted training framework developed in NVIDIA Isaac Sim that couples the standard 2D laparoscopic feed with synchronized three-dimensional (3D) visual feedback delivered through a mixed-reality (MR) interface. While trainees operate using the clinical 2D view, validated AI modules continuously localize surgical instruments and detect instrument-tissue interactions in the background. When spatial misjudgments are detected, 3D visual feedback are displayed to trainees, while preserving the original operative perspective. Our framework considers various surgical tasks including navigation, manipulation, transfer, cutting, and suturing. Visually similar 2D cases can be disambiguated through the added 3D context, improving depth perception, contact awareness, and tool orientation understanding.
☆ Text to Robotic Assembly of Multi Component Objects using 3D Generative AI and Vision Language Models NeurIPS 2025
Advances in 3D generative AI have enabled the creation of physical objects from text prompts, but challenges remain in creating objects involving multiple component types. We present a pipeline that integrates 3D generative AI with vision-language models (VLMs) to enable the robotic assembly of multi-component objects from natural language. Our method leverages VLMs for zero-shot, multi-modal reasoning about geometry and functionality to decompose AI-generated meshes into multi-component 3D models using predefined structural and panel components. We demonstrate that a VLM is capable of determining which mesh regions need panel components in addition to structural components, based on object functionality. Evaluation across test objects shows that users preferred the VLM-generated assignments 90.6% of the time, compared to 59.4% for rule-based and 2.5% for random assignment. Lastly, the system allows users to refine component assignments through conversational feedback, enabling greater human control and agency in making physical objects with generative AI and robotics.
comment: Accepted to NeurIPS 2025, Conference on Neural Information Processing Systems, Creative AI Track
♻ ☆ ValueCompass: A Framework for Measuring Contextual Value Alignment Between Human and LLMs
As AI systems become more advanced, ensuring their alignment with a diverse range of individuals and societal values becomes increasingly critical. But how can we capture fundamental human values and assess the degree to which AI systems align with them? We introduce ValueCompass, a framework of fundamental values, grounded in psychological theory and a systematic review, to identify and evaluate human-AI alignment. We apply ValueCompass to measure the value alignment of humans and large language models (LLMs) across four real-world scenarios: collaborative writing, education, public sectors, and healthcare. Our findings reveal concerning misalignments between humans and LLMs, such as humans frequently endorse values like "National Security" which were largely rejected by LLMs. We also observe that values differ across scenarios, highlighting the need for context-aware AI alignment strategies. This work provides valuable insights into the design space of human-AI alignment, laying the foundations for developing AI systems that responsibly reflect societal values and ethics.
♻ ☆ Tactile Displays Driven by Projected Light
Tactile displays that lend tangible form to digital content could transform computing interactions. However, achieving the resolution, speed, and dynamic range needed for perceptual fidelity remains challenging. We present a tactile display that directly converts projected light into visible tactile patterns via a photomechanical surface populated with millimeter-scale optotactile pixels. The pixels transduce incident light into mechanical displacements through photostimulated thermal gas expansion, yielding millimeter scale displacements with response times of 2 to 100 milliseconds. Employing projected light for power transmission and addressing renders these displays highly scalable. We demonstrate optically driven displays with up to 1,511 addressable pixels -- several times more pixels than any prior tactile display attaining comparable performance. Perceptual studies confirm that these displays can reproduce diverse spatiotemporal tactile patterns with high fidelity. This research establishes a foundation for practical, versatile high-resolution tactile displays driven by light.
♻ ☆ Harnessing IoT and Generative AI for Weather-Adaptive Learning in Climate Resilience Education
This paper introduces the Future Atmospheric Conditions Training System (FACTS), a novel platform that advances climate resilience education through place-based, adaptive learning experiences. FACTS combines real-time atmospheric data collected by IoT sensors with curated resources from a Knowledge Base to dynamically generate localized learning challenges. Learner responses are analyzed by a Generative AI powered server, which delivers personalized feedback and adaptive support. Results from a user evaluation indicate that participants found the system both easy to use and effective for building knowledge related to climate resilience. These findings suggest that integrating IoT and Generative AI into atmospherically adaptive learning technologies holds significant promise for enhancing educational engagement and fostering climate awareness.
comment: Not enough evidence to prove the effectiveness of the system in the context of learning about climate change
♻ ☆ AI for a Planet Under Pressure
Artificial intelligence (AI) is already driving scientific breakthroughs in a variety of research fields, ranging from the life sciences to mathematics. This raises a critical question: can AI be applied both responsibly and effectively to address complex and interconnected sustainability challenges? This report is the result of a collaboration between the Stockholm resilience Centre (Stockholm University), the Potsdam Institute for Climate Impact Research (PIK), and Google DeepMind. Our work explores the potential and limitations of using AI as a research method to help tackle eight broad sustainability challenges. The results build on iterated expert dialogues and assessments, a systematic AI-supported literature overview including over 8,500 academic publications, and expert deep-dives into eight specific issue areas. The report also includes recommendations to sustainability scientists, research funders, the private sector, and philanthropies.
comment: 88 pages, 8 figures, 1 table
♻ ☆ Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue
Lexical alignment, where speakers start to use similar words across conversation, is known to contribute to successful communication. However, its implementation in conversational agents remains underexplored, particularly considering the recent advancements in large language models (LLMs). As a first step towards enabling lexical alignment in human-agent dialogue, this study draws on strategies for personalising conversational agents and investigates the construction of stable, personalised lexical profiles as a basis for lexical alignment. Specifically, we varied the amounts of transcribed spoken data used for construction as well as the number of items included in the profiles per part-of-speech (POS) category and evaluated profile performance across time using recall, coverage, and cosine similarity metrics. It was shown that smaller and more compact profiles, created after 10 min of transcribed speech containing 5 items for adjectives, 5 items for conjunctions, and 10 items for adverbs, nouns, pronouns, and verbs each, offered the best balance in both performance and data efficiency. In conclusion, this study offers practical insights into constructing stable, personalised lexical profiles, taking into account minimal data requirements, serving as a foundational step toward lexical alignment strategies in conversational agents.
comment: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in TSD 2025. Lecture Notes in Computer Science, vol 16029
♻ ☆ AI for Requirements Engineering: Industry adoption and Practitioner perspectives
The integration of AI for Requirements Engineering (RE) presents significant benefits but also poses real challenges. Although RE is fundamental to software engineering, limited research has examined AI adoption in RE. We surveyed 55 software practitioners to map AI usage across four RE phases: Elicitation, Analysis, Specification, and Validation, and four approaches for decision making: human-only decisions, AI validation, Human AI Collaboration (HAIC), and full AI automation. Participants also shared their perceptions, challenges, and opportunities when applying AI for RE tasks. Our data show that 58.2% of respondents already use AI in RE, and 69.1% view its impact as positive or very positive.HAIC dominates practice, accounting for 54.4% of all RE techniques, while full AI automation remains minimal at 5.4%. Passive AI validation (4.4 to 6.2%) lags even further behind, indicating that practitioners value AI's active support over passive oversight. These findings suggest that AI is most effective when positioned as a collaborative partner rather than a replacement for human expertise. It also highlights the need for RE-specific HAIC frameworks along with robust and responsible AI governance as AI adoption in RE grows.
comment: Accepted at the Intelligent Software Engineering (ISE) 2025 Workshop at the Automated Software Engineering (ASE) 2025 Conference
♻ ☆ Enhancing Psychometric Analysis with Interactive SIA Modules
ShinyItemAnalysis (SIA) is an R package and shiny application for an interactive presentation of psychometric methods and analysis of multi-item measurements in psychology, education, and social sciences in general. In this article, we present a new feature introduced in the recent version of the package, called "SIA modules", which allows researchers and practitioners to offer new analytical methods for broader use via add-on extensions. SIA modules are designed to integrate with and build upon the SIA interactive application, enabling them to leverage the existing infrastructure for tasks such as data uploading and processing. They can access and further use a range of outputs from various analyses, including models and datasets. Because SIA modules come in R packages (or extend the existing ones), they may come bundled with their datasets, use object-oriented systems, or even compiled code. We illustrate the concepts using sample modules from the newly introduced SIAmodules package and other packages. After providing a general overview of building Shiny applications, we describe how to develop the SIA add-on modules with the support of the new SIAtools package. Finally, we discuss possibilities of future development and emphasize the importance of freely available, interactive psychometric software for dissemination of methodological innovations.
♻ ☆ Learning with Category-Equivariant Architectures for Human Activity Recognition
We propose CatEquiv, a category-equivariant neural network for Human Activity Recognition (HAR) from inertial sensors that systematically encodes temporal, amplitude, and structural symmetries. We introduce a symmetry category that jointly represents cyclic time shifts, positive gain scalings, and the sensor-hierarchy poset, capturing the categorical symmetry structure of the data. CatEquiv achieves equivariance with respect to the categorical symmetry product. On UCI-HAR under out-of-distribution perturbations, CatEquiv attains markedly higher robustness compared with circularly padded CNNs and plain CNNs. These results demonstrate that enforcing categorical symmetries yields strong invariance and generalization without additional model capacity.
♻ ☆ Consistency of Responses and Continuations Generated by Large Language Models on Social Media AAAI
Large Language Models (LLMs) demonstrate remarkable capabilities in text generation, yet their emotional consistency and semantic coherence in social media contexts remain insufficiently understood. This study investigates how LLMs handle emotional content and maintain semantic relationships through continuation and response tasks using three open-source models: Gemma, Llama3 and Llama3.3 and one commercial Model:Claude. By analyzing climate change discussions from Twitter and Reddit, we examine emotional transitions, intensity patterns, and semantic consistency between human-authored and LLM-generated content. Our findings reveal that while both models maintain high semantic coherence, they exhibit distinct emotional patterns: these models show a strong tendency to moderate negative emotions. When the input text carries negative emotions such as anger, disgust, fear, or sadness, LLM tends to generate content with more neutral emotions, or even convert them into positive emotions such as joy or surprise. At the same time, we compared the LLM-generated content with human-authored content. The four models systematically generated responses with reduced emotional intensity and showed a preference for neutral rational emotions in the response task. In addition, these models all maintained a high semantic similarity with the original text, although their performance in the continuation task and the response task was different. These findings provide deep insights into the emotion and semantic processing capabilities of LLM, which are of great significance for its deployment in social media environments and human-computer interaction design.
comment: This paper has been accepted by the 20th International AAAI Conference on Web and Social Media (ICWSM 2026), sunny Los Angeles, California, U.S
♻ ☆ AuthGlass: Enhancing Voice Authentication on Smart Glasses via Air-Bone Acoustic Features
With the rapid advancement of smart glasses, voice interaction has become widely deployed due to its naturalness and convenience. However, its practicality is often undermined by the vulnerability to spoofing attacks and interference from surrounding sounds, making seamless voice authentication crucial for smart glasses usage. To address this challenge, we propose AuthGlass, a voice authentication approach that leverages both air- and bone-conducted speech features to enhance accuracy and liveness detection. Aiming to gain comprehensive knowledge on speech-related acoustic and vibration features, we built a smart glasses prototype with redundant synchronized microphones: 14 air-conductive microphones and 2 bone-conductive units. In a study with 42 participants, we validated that combining sound-field and vibration features significantly improves authentication robustness and attack resistance. Furthermore, experiments demonstrated that AuthGlass maintains competitive accuracy even under various practical scenarios, highlighting its applicability and scalability for real-world deployment.
comment: This version has been updated to remove proprietary details pending a patent filing
Software Engineering 22
☆ From Code Changes to Quality Gains: An Empirical Study in Python ML Systems with PyQu
In an era shaped by Generative Artificial Intelligence for code generation and the rising adoption of Python-based Machine Learning systems (MLS), software quality has emerged as a major concern. As these systems grow in complexity and importance, a key obstacle lies in understanding exactly how specific code changes affect overall quality-a shortfall aggravated by the lack of quality assessment tools and a clear mapping between ML systems code changes and their quality effects. Although prior work has explored code changes in MLS, it mostly stops at what the changes are, leaving a gap in our knowledge of the relationship between code changes and the MLS quality. To address this gap, we conducted a large-scale empirical study of 3,340 open-source Python ML projects, encompassing more than 3.7 million commits and 2.7 trillion lines of code. We introduce PyQu, a novel tool that leverages low level software metrics to identify quality-enhancing commits with an average accuracy, precision, and recall of 0.84 and 0.85 of average F1 score. Using PyQu and a thematic analysis, we identified 61 code changes, each demonstrating a direct impact on enhancing software quality, and we classified them into 13 categories based on contextual characteristics. 41% of the changes are newly discovered by our study and have not been identified by state-of-the-art Python changes detection tools. Our work offers a vital foundation for researchers, practitioners, educators, and tool developers, advancing the quest for automated quality assessment and best practices in Python-based ML software.
comment: Accepted for publication in the proceedings of IEEE/ACM 48th International Conference on Software Engineering
☆ Formalizing Regression Testing for Agile and Continuous Integration Environments
Software developed using modern agile practices delivers a stream of software versions that require continuous regression testing rather than testing once close to the delivery or maintenance phase, as assumed by classical regression-testing theory. In this work, we formalize the phenomenon of continuous or near-continuous regression testing using successive builds as a time-ordered chain, where each build contains the program, requirements, and the accompanying tests. We also formalize the regression test window between any two builds, which captures the limited time budget available for regression testing. As the time limit is set to infinity and the chain is closed to two builds, the model degenerates to retest-all, thereby preserving semantics for the classical two-version case. The formalization is validated by directly representing two state-of-the-art agile regression testing algorithms in terms of build-tuple operations without requiring auxiliary assumptions, followed by proof of the soundness and completeness of our formalization.
comment: This is the first attempt to formalize regression testing in agile context as a continuous/near-continuos activity. This formalization will help practitioners and researchers to answer 'when', 'what' and 'how much' question of regression testing in real world time constrained agile projects. This work is currently under review with Software Quality Journal
☆ 1 PoCo: Agentic Proof-of-Concept Exploit Generation for Smart Contracts
Smart contracts operate in a highly adversarial environment, where vulnerabilities can lead to substantial financial losses. Thus, smart contracts are subject to security audits. In auditing, proof-of-concept (PoC) exploits play a critical role by demonstrating to the stakeholders that the reported vulnerabilities are genuine, reproducible, and actionable. However, manually creating PoCs is time-consuming, error-prone, and often constrained by tight audit schedules. We introduce POCO, an agentic framework that automatically generates executable PoC exploits from natural-language vulnerability descriptions written by auditors. POCO autonomously generates PoC exploits in an agentic manner by interacting with a set of code-execution tools in a Reason-Act-Observe loop. It produces fully executable exploits compatible with the Foundry testing framework, ready for integration into audit reports and other security tools. We evaluate POCO on a dataset of 23 real-world vulnerability reports. POCO consistently outperforms the prompting and workflow baselines, generating well-formed and logically correct PoCs. Our results demonstrate that agentic frameworks can significantly reduce the effort required for high-quality PoCs in smart contract audits. Our contribution provides readily actionable knowledge for the smart contract security community.
comment: Under review
☆ Investigating the Experience of Autistic Individuals in Software Engineering
Context: Autism spectrum disorder (ASD) leads to various issues in the everyday life of autistic individuals, often resulting in unemployment and mental health problems. To improve the inclusion of autistic adults, existing studies have highlighted the strengths these individuals possess in comparison to non-autistic individuals, e.g., high attention to detail or excellent logical reasoning skills. If fostered, these strengths could be valuable in software engineering activities, such for identifying specific kinds of bugs in code. However, existing work in SE has primarily studied the challenges of autistic individuals and possible accommodations, with little attention their strengths. Objective: Our goal is to analyse the experiences of autistic individuals in software engineering activities, such as code reviews, with a particular emphasis on strengths. Methods: This study combines Social-Technical Grounded Theory through semi-structured interviews with 16 autistic software engineers and a survey with 49 respondents, including 5 autistic participants. We compare the emerging themes with the theory by Gama et al. on the Effect of Neurodivergent Cognitive Dysfunctions in Software Engineering Performance. Results: Our results suggest that autistic software engineers are often skilled in logical thinking, attention to detail, and hyperfocus in programming; and they enjoy learning new programming languages and programming-related technologies. Confirming previous work, they tend to prefer written communication and remote work. Finally, we report a high comfort level in interacting with AI-based systems. Conclusions: Our findings extend existing work by providing further evidence on the strengths of autistic software engineers.
☆ ReleaseEval: A Benchmark for Evaluating Language Models in Automated Release Note Generation
Automated release note generation addresses the challenge of documenting frequent software updates, where manual efforts are time-consuming and prone to human error. Although recent advances in language models further enhance this process, progress remains hindered by dataset limitations, including the lack of explicit licensing and limited reproducibility, and incomplete task design that relies mainly on commit messages for summarization while overlooking fine-grained contexts such as commit hierarchies and code changes. To fill this gap, we introduce ReleaseEval, a reproducible and openly licensed benchmark designed to systematically evaluate language models for automated release note generation. ReleaseEval comprises 94,987 release notes from 3,369 repositories across 6 programming languages, and supports three task settings with three levels of input granularity: (1) commit2sum, which generates release notes from commit messages; (2) tree2sum, which incorporates commit tree structures; and (3) diff2sum, which leverages fine-grained code diffs. Both automated and human evaluations show that large language models consistently outperform traditional baselines across all tasks, achieving substantial gains on tree2sum, while still struggling on diff2sum. These findings highlight LLMs' proficiency in leveraging structured information while revealing challenges in abstracting from long code diffs.
☆ Emotional Contagion in Code: How GitHub Emoji Reactions Shape Developer Collaboration
Developer communities increasingly rely on emoji reactions to communicate, but we know little about how these emotional signals spread and influence technical discussions. We analyzed 2,098 GitHub issues and pull requests across 50 popular repositories, examining patterns in 106,743 emoji reactions to understand emotional contagion in software development. Our findings reveal a surprisingly positive emotional landscape: 57.4% of discussions carry positive sentiment, with positive emotional cascades outnumbering negative ones 23:1. We identified five distinct patterns, with "instant enthusiasm" affecting 45.6% of items--nearly half receive immediate positive reinforcement. Statistical analysis confirms strong emotional contagion (r=0.679, p<0.001) with a massive effect size (d=2.393), suggesting that initial reactions powerfully shape discussion trajectories. These findings challenge assumptions about technical discourse being purely rational, demonstrating that even minimal emotional signals create measurable ripple effects. Our work provides empirical evidence that emoji reactions are not mere decoration but active forces shaping collaborative outcomes in software development.
comment: 12 pages, 3 figures. Analysis of 106,743 emoji reactions across 2,098 GitHub items
☆ Lost in Code Generation: Reimagining the Role of Software Models in AI-driven Software Engineering
Generative AI enables rapid ``vibe coding," where natural language prompts yield working software systems. While this lowers barriers to software creation, it also collapses the boundary between prototypes and engineered software, leading to fragile systems that lack robustness, security, and maintainability. We argue that this shift motivates a reimagining of software models. Rather than serving only as upfront blueprints, models can be recovered post-hoc from AI-generated code to restore comprehension, expose risks, and guide refinement. In this role, models serve as mediators between human intent, AI generation, and long-term system evolution, providing a path toward sustainable AI-driven software engineering.
☆ When Continuous Delivery Is Not an Option: Practical Paths to Continuous Engineering in Complex Organizations
Purpose: Continuous Software Engineering (CSE) promises improved efficiency, quality, and responsiveness in software-intensive organizations. However, fully adopting CSE is often constrained by complex products, legacy systems, organizational inertia, and regulatory requirements. In this paper, we examine four industrial cases from the automation, automotive, retail, and chemical sectors to explore how such constraints shape CSE adoption in practice. Methods: We apply and extend a previously proposed CSE Industry Readiness Model to assess the current and potential levels of adoption in each case. Through expert interviews and narrative synthesis, we identify common driving forces and adoption barriers, including organizational preparedness, cross-organizational dependencies, and limited customer demand for continuous delivery. Results: Based on our findings, we propose an updated readiness model that introduces additional levels of internal and external feedback, distinguishes market- and organization-facing constraints, and better guides practitioners in setting realistic CSE adoption goals. Conclusions: Our results highlight that while full end-to-end CSE adoption may not always be feasible, meaningful internal improvements are still possible and beneficial. This study provides empirically grounded guidance for organizations navigating partial or constrained CSE transformations.
☆ Who's Who? LLM-assisted Software Traceability with Architecture Entity Recognition
Identifying architecturally relevant entities in textual artifacts is crucial for Traceability Link Recovery (TLR) between Software Architecture Documentation (SAD) and source code. While Software Architecture Models (SAMs) can bridge the semantic gap between these artifacts, their manual creation is time-consuming. Large Language Models (LLMs) offer new capabilities for extracting architectural entities from SAD and source code to construct SAMs automatically or establish direct trace links. This paper presents two LLM-based approaches: ExArch extracts component names as simple SAMs from SAD and source code to eliminate the need for manual SAM creation, while ArTEMiS identifies architectural entities in documentation and matches them with (manually or automatically generated) SAM entities. Our evaluation compares against state-of-the-art approaches SWATTR, TransArC and ArDoCode. TransArC achieves strong performance (F1: 0.87) but requires manually created SAMs; ExArch achieves comparable results (F1: 0.86) using only SAD and code. ArTEMiS is on par with the traditional heuristic-based SWATTR (F1: 0.81) and can successfully replace it when integrated with TransArC. The combination of ArTEMiS and ExArch outperforms ArDoCode, the best baseline without manual SAMs. Our results demonstrate that LLMs can effectively identify architectural entities in textual artifacts, enabling automated SAM generation and TLR, making architecture-code traceability more practical and accessible.
☆ Efficient Solvers for SLOPE in R, Python, Julia, and C++
We present a suite of packages in R, Python, Julia, and C++ that efficiently solve the Sorted L-One Penalized Estimation (SLOPE) problem. The packages feature a highly efficient hybrid coordinate descent algorithm that fits generalized linear models (GLMs) and supports a variety of loss functions, including Gaussian, binomial, Poisson, and multinomial logistic regression. Our implementation is designed to be fast, memory-efficient, and flexible. The packages support a variety of data structures (dense, sparse, and out-of-memory matrices) and are designed to efficiently fit the full SLOPE path as well as handle cross-validation of SLOPE models, including the relaxed SLOPE. We present examples of how to use the packages and benchmarks that demonstrate the performance of the packages on both real and simulated data and show that our packages outperform existing implementations of SLOPE in terms of speed.
comment: 30 pages, 8 figures
☆ EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents
Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirements. However, existing approaches largely adopt linear, waterfall-style pipelines, which oversimplify the iterative nature of real-world development and struggle with complex, large-scale projects. To address these limitations, we propose EvoDev, an iterative software development framework inspired by feature-driven development. EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each node in the feature map maintains multi-level information, including business logic, design, and code, which is propagated along dependencies to provide context for subsequent development iterations. We evaluate EvoDev on challenging Android development tasks and show that it outperforms the best-performing baseline, Claude Code, by a substantial margin of 56.8%, while improving single-agent performance by 16.0%-76.6% across different base LLMs, highlighting the importance of dependency modeling, context propagation, and workflow-aware agent design for complex software projects. Our work summarizes practical insights for designing iterative, LLM-driven development frameworks and informs future training of base LLMs to better support iterative software development.
comment: 14 pages, 6 figures
☆ SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks
AI coding agents have shown great progress on Python software engineering benchmarks like SWE-Bench, and for other languages like Java and C in benchmarks like Multi-SWE-Bench. However, C# -- a prominent enterprise language ranking #5 in the TIOBE index -- remains absent from such benchmarks. We introduce SWE-Sharp-Bench, a reproducible software engineering benchmark for C\# featuring 150 instances from 17 repositories. Evaluating identical model-agent configurations across languages reveals a significant performance gap: while 70% of Python tasks in SWE-Bench Verified are solved, $only 40% of our C\# tasks are resolved. We open-source SWE-Sharp-Bench and our entire curation pipeline.
☆ VFocus: Better Verilog Generation from Large Language Model via Focused Reasoning
Large Language Models (LLMs) have shown impressive potential in generating Verilog codes, but ensuring functional correctness remains a challenge. Existing approaches often rely on self-consistency or simulation feedback to select the best candidate, but they miss opportunities to focus LLM reasoning on the most informative parts of the design. We propose VFocus, a three-stage framework that enhances Verilog generation by sharpening the focus of LLM reasoning onto critical decision points in the code generation process. In the \textbf{pre-ranking stage}, VFocus generates multiple code candidates through LLM prompting, retries for syntactically valid outputs, and introduces a \textit{Density-guided Filtering} to retain candidates that fall within the "reasoning sweet spot" for functional correctness. In the \textbf{ranking stage}, we simulate each code candidate using an automatically generated testbench and apply self-consistency-based clustering to identify the most consistent outputs. Finally, in the \textbf{post-ranking refinement stage}, VFocus performs inconsistency mining on top-ranked candidates and invokes reasoning-augmented LLM prompts for candidate refinement. Experiments on the VerilogEval-Human benchmark show that VFocus significantly improves the pass@1 correctness across multiple reasoning LLMs, demonstrating its effectiveness in enhancing Verilog generation for complex hardware design tasks.
comment: accepted by SOCC 2025
☆ LLMs as Judges: Toward The Automatic Review of GSN-compliant Assurance Cases
Assurance cases allow verifying the correct implementation of certain non-functional requirements of mission-critical systems, including their safety, security, and reliability. They can be used in the specification of autonomous driving, avionics, air traffic control, and similar systems. They aim to reduce risks of harm of all kinds including human mortality, environmental damage, and financial loss. However, assurance cases often tend to be organized as extensive documents spanning hundreds of pages, making their creation, review, and maintenance error-prone, time-consuming, and tedious. Therefore, there is a growing need to leverage (semi-)automated techniques, such as those powered by generative AI and large language models (LLMs), to enhance efficiency, consistency, and accuracy across the entire assurance-case lifecycle. In this paper, we focus on assurance case review, a critical task that ensures the quality of assurance cases and therefore fosters their acceptance by regulatory authorities. We propose a novel approach that leverages the \textit{LLM-as-a-judge} paradigm to automate the review process. Specifically, we propose new predicate-based rules that formalize well-established assurance case review criteria, allowing us to craft LLM prompts tailored to the review task. Our experiments on several state-of-the-art LLMs (GPT-4o, GPT-4.1, DeepSeek-R1, and Gemini 2.0 Flash) show that, while most LLMs yield relatively good review capabilities, DeepSeek-R1 and GPT-4.1 demonstrate superior performance, with DeepSeek-R1 ultimately outperforming GPT-4.1. However, our experimental results also suggest that human reviewers are still needed to refine the reviews LLMs yield.
☆ Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs
With the widespread application of large language models (LLMs) in the field of code intelligence, increasing attention has been paid to the reliability and controllability of their outputs in code reasoning tasks. Confidence estimation serves as an effective and convenient approach for evaluating these aspects. This paper proposes a confidence analysis and enhancement framework for LLMs tailored to code reasoning tasks. We conduct a comprehensive empirical study on the confidence reliability of mainstream LLMs across different tasks, and further evaluate the effectiveness of techniques such as prompt strategy optimisation and mathematical calibration (e.g., Platt Scaling) in improving confidence reliability. Our results show that DeepSeek-Reasoner achieves the best performance across various tasks, outperforming other models by up to $0.680$, $0.636$, and $13.652$ in terms of ECE, Brier Score, and Performance Score, respectively. The hybrid strategy combining the reassess prompt strategy and Platt Scaling achieves improvements of up to $0.541$, $0.628$, and $15.084$ over the original performance in the aforementioned three metrics. These results indicate that models with reasoning capabilities demonstrate superior confidence reliability, and that the hybrid strategy is the most effective in enhancing the confidence reliability of various models. Meanwhile, we elucidate the impact of different task complexities, model scales, and strategies on confidence performance, and highlight that the confidence of current LLMs in complex reasoning tasks still has considerable room for improvement. This study not only provides a research foundation and technical reference for the application of confidence in LLM-assisted software engineering, but also points the way for future optimisation and engineering deployment of confidence mechanisms.
comment: 13 pages, 4 figures
♻ ☆ Harnessing IoT and Generative AI for Weather-Adaptive Learning in Climate Resilience Education
This paper introduces the Future Atmospheric Conditions Training System (FACTS), a novel platform that advances climate resilience education through place-based, adaptive learning experiences. FACTS combines real-time atmospheric data collected by IoT sensors with curated resources from a Knowledge Base to dynamically generate localized learning challenges. Learner responses are analyzed by a Generative AI powered server, which delivers personalized feedback and adaptive support. Results from a user evaluation indicate that participants found the system both easy to use and effective for building knowledge related to climate resilience. These findings suggest that integrating IoT and Generative AI into atmospherically adaptive learning technologies holds significant promise for enhancing educational engagement and fostering climate awareness.
comment: Not enough evidence to prove the effectiveness of the system in the context of learning about climate change
♻ ☆ Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks
Large language models (LLMs) are gaining increasing popularity in software engineering (SE) due to their unprecedented performance across various applications. These models are increasingly being utilized for a range of SE tasks, including requirements engineering and design, code analysis and generation, software maintenance, and quality assurance. As LLMs become more integral to SE, evaluating their effectiveness is crucial for understanding their potential in this field. In recent years, substantial efforts have been made to assess LLM performance in various SE tasks, resulting in the creation of several benchmarks tailored to this purpose. This paper offers a thorough review of 291 benchmarks, addressing three main aspects: what benchmarks are available, how benchmarks are constructed, and the future outlook for these benchmarks. We begin by examining SE tasks such as requirements engineering and design, coding assistant, software testing, AIOPs, software maintenance, and quality management. We then analyze the benchmarks and their development processes, highlighting the limitations of existing benchmarks. Additionally, we discuss the successes and failures of LLMs in different software tasks and explore future opportunities and challenges for SE-related benchmarks. We aim to provide a comprehensive overview of benchmark research in SE and offer insights to support the creation of more effective evaluation tools.
♻ ☆ GDPR-Relevant Privacy Concerns in Mobile Apps Research: A Systematic Literature Review
The General Data Protection Regulation (GDPR) is considered as the benchmark in the European Union (EU) for privacy and data protection standards. Since before its entry into force in 2018, substantial research has been conducted in the software engineering (SE) literature investigating the elicitation, representation, and verification of GDPR privacy requirements. Software systems deployed anywhere in the world must comply with GDPR as long as they handle personal data of EU residents. Mobile applications (apps) are no different in that regard. With the growing pervasiveness of mobile apps and their increasing demand for personal data, privacy concerns have acquired further interest within the SE community. Despite the extensive literature on GDPR-relevant privacy concerns in mobile apps, there is no secondary study that describes, analyzes, and categorizes the current focus. Research gaps and persistent challenges are thus left unnoticed. This article aims to provide a comprehensive overview of the existing research on GDPR privacy concerns in the context of mobile apps. To do so, we conducted a systematic literature review of 60 primary studies. Our findings show that existing studies predominantly address three key GDPR-related privacy concerns: (i) the direct collection of personal data from users, (ii) the sharing of personal data with external entities (e.g., third parties) beyond the mobile apps, and (iii) the analysis of user consent as a legal basis for collecting personal data. Our study highlighted research gaps, calling for further research to better understand: (i) the indirect collection of personal data, e.g., data exposed to mobile apps through, e.g., permission requests, (ii) the impact of legal bases beyond consent and how they may affect the development of mobile apps, and (iii) the required implementation details pertinent to data subject rights.
♻ ☆ SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents NeurIPS 2025
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code and adapt behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use continuous supply of fresh tasks collected using SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare results of various LLMs on this benchmark to results on SWE-bench Verified and show that performance of some language models might be inflated due to contamination issues.
comment: Dataset: https://huggingface.co/datasets/nebius/SWE-rebench, SWE-rebench leaderboard https://swe-rebench.com NeurIPS 2025
♻ ☆ Unveiling the Role of ChatGPT in Software Development: Insights from Developer-ChatGPT Interactions on GitHub
The advent of Large Language Models (LLMs) has introduced a new paradigm in software engineering, with generative AI tools like ChatGPT gaining widespread adoption among developers. While ChatGPT's potential has been extensively discussed, there is limited empirical evidence exploring its real-world usage by developers. This study bridges this gap by conducting a large-scale empirical analysis of ChatGPT-assisted development activities, leveraging a curated dataset, DevChat, comprising 2,547 unique shared ChatGPT links collected from GitHub between May 2023 and June 2024. Our study examines the characteristics of ChatGPT's usage on GitHub (including the tendency, prompt turns distribution, and link descriptions) and identifies five categories of developers' purposes for sharing developer-ChatGPT conversations during software development. Additionally, we analyzed the development-related activities where developers shared ChatGPT links to facilitate their workflows. We then established a mapping framework among data sources, activities, and SE tasks associated with these shared ChatGPT links. Our study offers a comprehensive view of ChatGPT's application in real-world software development scenarios and provides a foundation for its future integration into software development workflows.
comment: 29 pages, 11 images, 3 tables, Manuscript revision submitted to a journal (2025)
♻ ☆ AI for Requirements Engineering: Industry adoption and Practitioner perspectives
The integration of AI for Requirements Engineering (RE) presents significant benefits but also poses real challenges. Although RE is fundamental to software engineering, limited research has examined AI adoption in RE. We surveyed 55 software practitioners to map AI usage across four RE phases: Elicitation, Analysis, Specification, and Validation, and four approaches for decision making: human-only decisions, AI validation, Human AI Collaboration (HAIC), and full AI automation. Participants also shared their perceptions, challenges, and opportunities when applying AI for RE tasks. Our data show that 58.2% of respondents already use AI in RE, and 69.1% view its impact as positive or very positive.HAIC dominates practice, accounting for 54.4% of all RE techniques, while full AI automation remains minimal at 5.4%. Passive AI validation (4.4 to 6.2%) lags even further behind, indicating that practitioners value AI's active support over passive oversight. These findings suggest that AI is most effective when positioned as a collaborative partner rather than a replacement for human expertise. It also highlights the need for RE-specific HAIC frameworks along with robust and responsible AI governance as AI adoption in RE grows.
comment: Accepted at the Intelligent Software Engineering (ISE) 2025 Workshop at the Automated Software Engineering (ASE) 2025 Conference
♻ ☆ ARPaCCino: An Agentic-RAG for Policy as Code Compliance
Policy as Code (PaC) is a paradigm that encodes security and compliance policies into machine-readable formats, enabling automated enforcement in Infrastructure as Code (IaC) environments. However, its adoption is hindered by the complexity of policy languages and the risk of misconfigurations. In this work, we present ARPaCCino, an agentic system that combines Large Language Models (LLMs), Retrieval-Augmented-Generation (RAG), and tool-based validation to automate the generation and verification of PaC rules. Given natural language descriptions of the desired policies, ARPaCCino generates formal Rego rules, assesses IaC compliance, and iteratively refines the IaC configurations to ensure conformance. Thanks to its modular agentic architecture and integration with external tools and knowledge bases, ARPaCCino supports policy validation across a wide range of technologies, including niche or emerging IaC frameworks. Experimental evaluation involving a Terraform-based case study demonstrates ARPaCCino's effectiveness in generating syntactically and semantically correct policies, identifying non-compliant infrastructures, and applying corrective modifications, even when using smaller, open-weight LLMs. Our results highlight the potential of agentic RAG architectures to enhance the automation, reliability, and accessibility of PaC workflows.
Systems and Control 38
☆ LLM-Supported Formal Knowledge Representation for Enhancing Control Engineering Content with an Interactive Semantic Layer
The rapid growth of research output in control engineering calls for new approaches to structure and formalize domain knowledge. This paper briefly describes an LLM-supported method for semi-automated generation of formal knowledge representations that combine human readability with machine interpretability and increased expressiveness. Based on the Imperative Representation of Knowledge (PyIRK) framework, we demonstrate how language models can assist in transforming natural-language descriptions and mathematical definitions (available as LaTeX source code) into a formalized knowledge graph. As a first application we present the generation of an ``interactive semantic layer'' to enhance the source documents in order to facilitate knowledge transfer. From our perspective this contributes to the vision of easily accessible, collaborative, and verifiable knowledge bases for the control engineering domain.
comment: 4 pages, 2 figures
☆ Adjustable Low-Cost Highly Sensitive Microwave Oscillator Sensor for Liquid Level Detection
This paper explores the implementation of a low-cost high-precision microwave oscillator sensor with an adjustable input resistance to enhance its limit of detection (LoD). To achieve this, we introduce a \textit{Z$_{2}$} branch in the input network, comprising a transmission line, a capacitor (\textit{C$_{B}$}) and a resistor (\textit{R$_{V}$}). The sensor is tested with eight different liquids with different dielectric constants, including water, IV fluid, milk, ethanol, acetone, petrol, olive oil, and Vaseline. By fine-tuning the \textit{Z$_{2}$} branch, a clear relationship is found between $\varepsilon_{r}$ of materials and R$_{V}$.Our experimental results demonstrate outstanding characteristics, including remarkable linearity (nonlinearity < 2.44\%), high accuracy with an average sensitivity of 21 kHz/$\mu$m, and an excellent limit of detection (LoD < 0.05 mm). The sensor also exhibits good stability across a range of liquid temperatures and shows robust and repeatable behavior. Considering the strong absorption of microwave energy in liquids with high dielectric constants, this oscillator sensor is a superior choice over capacitive sensors for such applications. We validate the performance of the oscillator sensor using water as a representative liquid. Additionally, we substantiate the sensor's improvement through both experimental results and theoretical analysis. Its advantages, including affordability, compatibility with CMOS and MEMS technologies, and ease of fabrication, make it an excellent choice for small-scale liquid detection applications.
☆ An unscented Kalman filter method for real time input-parameter-state estimation
The input-parameter-state estimation capabilities of a novel unscented Kalman filter is examined herein on both linear and nonlinear systems. The unknown input is estimated in two stages within each time step. Firstly, the predicted dynamic states and the system parameters provide an estimation of the input. Secondly, the corrected with measurements states and parameters provide a final estimation. Importantly, it is demonstrated using the perturbation analysis that, a system with at least a zero or a non-zero known input can potentially be uniquely identified. This output-only methodology allows for a better understanding of the system compared to classical output-only parameter identification strategies, given that all the dynamic states, the parameters, and the input are estimated jointly and in real-time.
comment: author-accepted manuscript (AAM) published in Mechanical Systems and Signal Processing
☆ Policy Gradient Methods for Information-Theoretic Opacity in Markov Decision Processes
Opacity, or non-interference, is a property ensuring that an external observer cannot infer confidential information (the "secret") from system observations. We introduce an information-theoretic measure of opacity, which quantifies information leakage using the conditional entropy of the secret given the observer's partial observations in a system modeled as a Markov decision process (MDP). Our objective is to find a control policy that maximizes opacity while satisfying task performance constraints, assuming that an informed observer is aware of the control policy and system dynamics. Specifically, we consider a class of opacity called state-based opacity, where the secret is a propositional formula about the past or current state of the system, and a special case of state-based opacity called language-based opacity, where the secret is defined by a temporal logic formula (LTL) or a regular language recognized by a finite-state automaton. First, we prove that finite-memory policies can outperform Markov policies in optimizing information-theoretic opacity. Second, we develop an algorithm to compute a maximally opaque Markov policy using a primal-dual gradient-based algorithm, and prove its convergence. Since opacity cannot be expressed as a cumulative cost, we develop a novel method to compute the gradient of conditional entropy with respect to policy parameters using observable operators in hidden Markov models. The experimental results validate the effectiveness and optimality of our proposed methods.
☆ Feedback dynamics in Politics: The interplay between sentiment and engagement
We investigate feedback mechanisms in political communication by testing whether politicians adapt the sentiment of their messages in response to public engagement. Using over 1.5 million tweets from Members of Parliament in the United Kingdom, Spain, and Greece during 2021, we identify sentiment dynamics through a simple yet interpretable linear model. The analysis reveals a closed-loop behavior: engagement with positive and negative messages influences the sentiment of subsequent posts. Moreover, the learned coefficients highlight systematic differences across political roles: opposition members are more reactive to negative engagement, whereas government officials respond more to positive signals. These results provide a quantitative, control-oriented view of behavioral adaptation in online politics, showing how feedback principles can explain the self-reinforcing dynamics that emerge in social media discourse.
comment: 6 pages, 7 figures
☆ Stochastic Redistribution of Indistinguishable Items in Shared Habitation: A Multi-Agent Simulation Framework
This paper presents a discrete-event stochastic model for the redistribution of indistinguishable personal items, exemplified by socks, among multiple cohabitants sharing a communal laundry system. Drawing on concepts from ecological population dynamics, diffusion processes, and stochastic exchange theory, the model captures the probabilistic mechanisms underlying item mixing, recovery, and loss. Each cohabitant is represented as an autonomous agent whose belongings interact through iterative cycles of collective washing, sorting, and partial correction. The system's evolution is characterized by random mixing events, selective recollection, and attrition over time. Implemented using the SimPy discrete-event simulation framework, the model demonstrates that even minimal exchange probabilities can generate emergent asymmetries, quasi-equilibrium distributions, and long-term disorder. The findings illustrate how stochastic processes inherent to shared domestic systems can produce persistent imbalances, offering a quantitative perspective on an everyday social phenomenon.
☆ Natural-gas storage modelling by deep reinforcement learning
We introduce GasRL, a simulator that couples a calibrated representation of the natural gas market with a model of storage-operator policies trained with deep reinforcement learning (RL). We use it to analyse how optimal stockpile management affects equilibrium prices and the dynamics of demand and supply. We test various RL algorithms and find that Soft Actor Critic (SAC) exhibits superior performance in the GasRL environment: multiple objectives of storage operators - including profitability, robust market clearing and price stabilisation - are successfully achieved. Moreover, the equilibrium price dynamics induced by SAC-derived optimal policies have characteristics, such as volatility and seasonality, that closely match those of real-world prices. Remarkably, this adherence to the historical distribution of prices is obtained without explicitly calibrating the model to price data. We show how the simulator can be used to assess the effects of EU-mandated minimum storage thresholds. We find that such thresholds have a positive effect on market resilience against unanticipated shifts in the distribution of supply shocks. For example, with unusually large shocks, market disruptions are averted more often if a threshold is in place.
comment: 8 pages, 5 figures, published on
☆ ISAC Empowered Air-Sea Collaborative System: A UAV-USV Joint Inspection Framework
In this paper, we construct an air-sea collaborative system framework based on the Integrated Sensing and Communication (ISAC) techniques, where the Unmanned Aerial Vehicle (UAV) and Unmanned Surface Vehicle (USV) jointly inspect targets of interest while keeping communication with each other simultaneously. First, we demonstrate the unique challenges encountered in this collaborative system, i.e., the coupling and heterogeneity of the UAV/USV's trajectories. Then, we formulate a total energy consumption minimization problem to jointly optimize the trajectories, flying and hovering times, target scheduling, and beamformers under the constraints of water currents, collision avoidance, and Sensing and Communication (S\&C) requirements. To address the strong coupling of the variables, we divide the original problem into two subproblems, namely, the hover point selection and the joint trajectory planning and beamforming design. In the first subproblem, we propose a three-step hierarchical method including: (1) a virtual base station coverage (VBSC) and clustering algorithm to obtain the target scheduling and rough position of hover points; (2) a Bi-traveling salesman problem with neighborhood (Bi-TSPN)-based algorithm to determine the visiting order sequence of the hover points; (3) a hover point refinement and time allocation algorithm to further optimize the time allocation. In the latter subproblem, we complete the remaining trajectory planning and beamforming design in each flying and hovering stage by developing a semi-definite relaxation (SDR) and successive convex approximation (SCA) method. Finally, we conduct a series of simulations to demonstrate the superiority of the proposed scheme over existing sequential access and leader-follower strategies.
comment: 13 pages, 15 figures
☆ Analytical Framework for Assessing Effective Regional Inertia
This paper proposes a novel formulation of effective regional inertia that explicitly accounts for both system topology and the spatial distribution of inertia. Unlike traditional approaches that model a region as an aggregated machine with an equivalent inertia, the proposed metric provides a topology-aware representation. The methodology builds on an analytical framework that extends classical slow coherency theory to address network partitioning and regional frequency stability. Based on these partitions, we develop a systematic procedure to evaluate the effective inertia of each region, enabling a more accurate interpretation of local inertial contributions, including those from virtual inertia provided by inverter-based resources (IBRs). Case studies on the IEEE 39-bus and 68-bus systems demonstrate that the integration of inertial devices does not uniformly improve system frequency response, underscoring the importance of the proposed metric for effective regional inertia assessment.
☆ Reliability entails input-selective contraction and regulation in excitable networks
The animal nervous system offers a model of computation combining digital reliability and analog efficiency. Understanding how this sweet spot can be realized is a core question of neuromorphic engineering. To this aim, this paper explores the connection between reliability, contraction, and regulation in excitable systems. Using the FitzHugh-Nagumo model of excitable behavior as a proof-of-concept, it is shown that neuronal reliability can be formalized as an average trajectory contraction property induced by the input. In excitable networks, reliability is shown to enable regulation of the network to a robustly stable steady state. It is thus posited that regulation provides a notion of dynamical analog computation, and that stability makes such a computation model robust.
☆ Many-vs-Many Missile Guidance via Virtual Targets
This paper presents a novel approach to many-vs-many missile guidance using virtual targets (VTs) generated by a Normalizing Flows-based trajectory predictor. Rather than assigning n interceptors directly to m physical targets through conventional weapon target assignment algorithms, we propose a centralized strategy that constructs n VT trajectories representing probabilistic predictions of maneuvering target behavior. Each interceptor is guided toward its assigned VT using Zero-Effort-Miss guidance during midcourse flight, transitioning to Proportional Navigation guidance for terminal interception. This approach treats many-vs-many engagements as many-vs-distribution scenarios, exploiting numerical superiority (n > m) by distributing interceptors across diverse trajectory hypotheses rather than pursuing identical deterministic predictions. Monte Carlo simulations across various target-interceptor configurations (1-6 targets, 1-8 interceptors) demonstrate that the VT method matches or exceeds baseline straight-line prediction performance by 0-4.1% when n = m, with improvements increasing to 5.8-14.4% when n > m. The results confirm that probabilistic VTs enable effective exploitation of numerical superiority, significantly increasing interception probability in many-vs-many scenarios.
comment: will be submitted to Journal of Guidance, Control, and Dynamics as Technical Note
☆ Decentralized Approach to Detect and Eliminate Flapping Phenomena due to Flexible Resources
This paper presents a decentralized methodology for detecting and mitigating flapping phenomena in power systems, primarily caused by the operation of discrete devices. The proposed approach applies moving-window autocorrelation to local measurements, enabling each device to autonomously identify sustained oscillations. Upon detection, a probabilistic, device-specific mitigation strategy is executed. Flexible demand resources (DFRs), under-load tap changers (ULTCs), and automatic voltage regulators (AVRs) are utilised to illustrate the performance of the proposed approach to both discrete and continuous-operation devices. Results show that the proposed method is robust and properly distinguishes damped oscillations from persistent flapping, allowing devices to independently recognize problematic operating scenarios and implement corrective actions accordingly.
☆ Before AI Takes Over: Rethinking Nonlinear Signal Processing in Communications
There is an urgent reflection on traditional nonlinear signal processing methods in communications before Artificial Intelligence (AI) dominates the field. It implies a need to reassess or reinterpret established theories and tools, highlighting the tension between data-driven and model-based approaches. This paper calls for preserving valuable insights from classical signal processing while exploring how they can coexist or integrate with emerging AI methods.
comment: Submitted to npj Wireless Technology
☆ Coherency among Power System Devices
The paper proposes a novel general definition of coherency among power system devices of any type. The proposed approach is thus not limited to synchronous machines. With this aim, the paper shows that coherency can be formally based on the difference in the complex frequency of the current injections of any two devices electrically connected to the same grid. The proposed definition is model-agnostic, making it general and suitable for modern power systems composed of a heterogeneous mix of technologies. The paper also provides a systematic analytical procedure to study the properties that specific device models must satisfy to be coherent. Time-domain simulations are conducted in three case studies whose results illustrate the ability of our definition to evaluate coherency among any type of device.
☆ Using ensemble learning with hybrid graph neural networks and transformers to predict traffic in cities
Intelligent transportation systems (ITS) still have a hard time accurately predicting traffic in cities, especially in big, multimodal settings with complicated spatiotemporal dynamics. This paper presents HybridST, a hybrid architecture that integrates Graph Neural Networks (GNNs), multi-head temporal Transformers, and supervised ensemble learning methods (XGBoost or Random Forest) to collectively capture spatial dependencies, long-range temporal patterns, and exogenous signals, including weather, calendar, or control states. We test our model on the METR-LA, PEMS-BAY, and Seattle Loop tree public benchmark datasets. These datasets include situations ranging from freeway sensor networks to vehicle-infrastructure cooperative perception. Experimental results show that HybridST consistently beats classical baselines (LSTM, GCN, DCRNN, PDFormer) on important metrics like MAE and RMSE, while still being very scalable and easy to understand. The proposed framework presents a promising avenue for real-time urban mobility planning, energy optimization, and congestion alleviation strategies, especially within the framework of smart cities and significant events such as the 2030 FIFA World Cup.
☆ Generalized Swing Control Framework for Inverter-based Resources
This paper proposes a novel control framework designed for Inverter-Based Resources (IBRs), denoted as Generalized Swing Control (GSC). The proposed GSC framework generalizes the definition of Grid-Forming (GFM) control schemes and exploits the coupling between active and reactive power dynamics. To validate the proposed scheme, we conduct extensive time-domain simulations and small-signal analysis using a modified version of the WSCC 9-bus system and a 1479-bus dynamic model of the all-island Irish transmission system. The case studies focus on evaluating the dynamic performance of the proposed framework under different configurations, including Virtual Synchronous Machine (VSM), coupled-VSM and dual-VSM schemes. To address the nonlinear nature of power system dynamics, sensitivity analysis based on Monte Carlo methods are employed to improve parameter tuning and assess the stability of GSC configurations in the studied systems.
☆ Decentralized Voltage Control of AC Microgrids with Constant Power Loads using Control Barrier Functions
This paper proposes a novel nonlinear decentralized voltage controller for constrained regulation of meshed AC Microgrid networks with high penetration of constant power loads. Perceiving the load demand as an unknown disturbance, the network model is reformulated in a cascaded structure composed of a nominal, i.e. uncertainty-free, and an error subsystem. The latter captures the distance between the true and the nominal state trajectories, for which we prove boundedness via a suitable control barrier function. Under sufficient conditions, we prove asymptotic stability of the cascaded dynamics with respect to an equilibrium set and also provide an estimate of the region of attraction. In addition, it is rigorously shown that the proposed nonlinear control law also enforces constrained regulation around a rated voltage value, without the need of saturation devices. The operation of the closed-loop system is illustrated in a simulation scenario, demonstrating bounded operation and convergence to a neighbourhood of the desired reference vector.
comment: 12 pages
☆ Explicit MPC for the constrained zonotope case with low-rank matrix updates
Solving the explicit Model Predictive Control (MPC) problem requires enumerating all critical regions and their associated feedback laws, a task that scales exponentially with the system dimension and the prediction horizon, as well. When the problem's constraints are boxes or zonotopes, the feasible domain admits a compact constrained-zonotope representation. Building on this insight, we exploit the geometric properties of the equivalent constrained-zonotope reformulation to accelerate the computation of the explicit solution. Specifically, we formulate the multi-parametric problem in the lifted generator space and solve it using second-order optimality conditions, employ low-rank matrix updates to reduce computation time, and introduce an analytic enumeration of candidate active sets that yields the explicit solution in tree form.
☆ A Kullback-Leibler divergence method for input-system-state identification
The capability of a novel Kullback-Leibler divergence method is examined herein within the Kalman filter framework to select the input-parameter-state estimation execution with the most plausible results. This identification suffers from the uncertainty related to obtaining different results from different initial parameter set guesses, and the examined approach uses the information gained from the data in going from the prior to the posterior distribution to address the issue. Firstly, the Kalman filter is performed for a number of different initial parameter sets providing the system input-parameter-state estimation. Secondly, the resulting posterior distributions are compared simultaneously to the initial prior distributions using the Kullback-Leibler divergence. Finally, the identification with the least Kullback-Leibler divergence is selected as the one with the most plausible results. Importantly, the method is shown to select the better performed identification in linear, nonlinear, and limited information applications, providing a powerful tool for system monitoring.
comment: 32 pages, 17 figures, published in Journal of Sound and Vibration
☆ Constrained Performance Boosting Control for Nonlinear Systems via ADMM
We present the Alternating Direction Method of Multipliers for Performance Boosting (ADMM-PB), an approach to design performance boosting controllers for stable or pre-stabilized nonlinear systems, while explicitly seeking input and state constraint satisfaction. Rooted on a recently proposed approach for designing neural-network controllers that guarantees closed-loop stability by design while minimizing generic cost functions, our strategy integrates it within an alternating direction method of multipliers routine to seek constraint handling without modifying the controller structure of the aforementioned seminal strategy. Our numerical results showcase the advantages of the proposed approach over a baseline penalizing constraint violation through barrier-like terms in the cost, indicating that ADMM-PB can lead to considerably lower constraint violations at the price of inducing slightly more cautious closed-loop behaviors.
☆ H-Infinity Filter Enhanced CNN-LSTM for Arrhythmia Detection from Heart Sound Recordings ICSE
Early detection of heart arrhythmia can prevent severe future complications in cardiac patients. While manual diagnosis still remains the clinical standard, it relies heavily on visual interpretation and is inherently subjective. In recent years, deep learning has emerged as a powerful tool to automate arrhythmia detection, offering improved accuracy, consistency, and efficiency. Several variants of convolutional and recurrent neural network architectures have been widely explored to capture spatial and temporal patterns in physiological signals. However, despite these advancements, current models often struggle to generalize well in real-world scenarios, especially when dealing with small or noisy datasets, which are common challenges in biomedical applications. In this paper, a novel CNN-H-Infinity-LSTM architecture is proposed to identify arrhythmic heart signals from heart sound recordings. This architecture introduces trainable parameters inspired by the H-Infinity filter from control theory, enhancing robustness and generalization. Extensive experimentation on the PhysioNet CinC Challenge 2016 dataset, a public benchmark of heart audio recordings, demonstrates that the proposed model achieves stable convergence and outperforms existing benchmarks, with a test accuracy of 99.42% and an F1 score of 98.85%.
comment: This is a preprint of a paper to appear at the 15th IEEE International Conference on Systems Engineering and Technology (ICSET 2025)
☆ ZJUNlict Extended Team Description Paper 2025
This paper presents the ZJUNlict team's work over the past year, covering both hardware and software advancements. In the hardware domain, the integration of an IMU into the v2023 robot was completed to enhance posture accuracy and angular velocity planning. On the software side, key modules were optimized, including the strategy and CUDA modules, with significant improvements in decision making efficiency, ball pursuit prediction, and ball possession prediction to adapt to high-tempo game dynamics.
☆ Performance Analysis of NOMA-Assisted Optical OFDM ISAC Systems with Clipping Distortion
This paper studies the performance of optical orthogonal frequency-division multiplexing (OFDM)-based multi-user integrated sensing and communication (ISAC) systems employing non-orthogonal multiple access (NOMA). Due to their inherent high peak-to-average power ratio (PAPR), OFDM waveforms are clipped to fit the limited dynamic range of the optical transmitters (e.g., light-emitting diodes (LEDs)), resulting in clipping distortion. To alleviate the impact of the distortion, we propose a novel transmitter architecture where the clipping processes are performed before NOMA superposition coding. We then analyze the performance of the proposed optical ISAC systems considering the effects of power allocation and clipping distortion. For the communication subsystem, we analyze the effect of NOMA on the achievable sum rate and bit error rate (BER). For the sensing subsystem, the root mean square error (RMSE) and Cram\'er-Rao bound (CRB) of estimating the transmission distance accuracy are obtained. Simulation results reveal that allocating more power to the strong user yields a higher sum rate, lower BER, and better sensing performance, whereas a more balanced power allocation among users results in degraded BER and sensing performance.
☆ A Reliability-Cost Optimization Framework for EV and DER Integration in Standard and Reconfigurable Distribution Network Topologies
The rapid growth of electric vehicle (EV) adoption poses operational and economic challenges for power distribution systems, including increased line loading levels and network congestions. This may require potential infrastructure reinforcement and expansion. As a fast inexpensive alternative solution, network topology reconfiguration (NTR) offers a practical means to redistribute power flows, reduce operational costs, and defer infrastructure upgrades. This paper presents a linear programming framework to evaluate the impact of varying EV penetration on operational costs under four configurations: standard distribution network (SDN), SDN with NTR (SDNTR), SDN with distributed energy resources (SDN-DER), and SDNTR with DERs (SDNTR-DER). Numerical simulations are conducted on the IEEE 33-bus system. The analysis demonstrates that integrating DERs reduces operational costs, while NTR further enhances system flexibility, enabling higher EV penetration levels without compromising feasibility. The combined SDNTR-DER approach offers the most cost-effective and reliable pathway for accommodating future EV growth while mitigating the need for immediate infrastructure upgrades.
☆ Online Distributed Zeroth-Order Optimization With Non-Zero-Mean Adverse Noises
In this paper, the problem of online distributed zeroth-order optimization subject to a set constraint is studied via a multi-agent network, where each agent can communicate with its immediate neighbors via a time-varying directed graph. Different from the existing works on online distributed zeroth- order optimization, we consider the case where the estimate on the gradients are influenced by some non-zero-mean adverse noises. To handle this problem, we propose a new online dis- tributed zeroth-order mirror descent algorithm involving a kernel function-based estimator and a clipped strategy. Particularly, in the estimator, the kernel function-based strategy is provided to deal with the adverse noises, and eliminate the low-order terms in the Taylor expansions of the objective functions. Furthermore, the performance of the presented algorithm is measured by employing the dynamic regrets, where the offline benchmarks are to find the optimal point at each time. Under the mild assumptions on the graph and the objective functions, we prove that if the variation in the optimal point sequence grows at a certain rate, then the high probability bound of the dynamic regrets increases sublinearly. Finally, a simulation experiment is worked out to demonstrate the effectiveness of our theoretical results.
☆ Near Optimal Convergence to Coarse Correlated Equilibrium in General-Sum Markov Games
No-regret learning dynamics play a central role in game theory, enabling decentralized convergence to equilibrium for concepts such as Coarse Correlated Equilibrium (CCE) or Correlated Equilibrium (CE). In this work, we improve the convergence rate to CCE in general-sum Markov games, reducing it from the previously best-known rate of $\mathcal{O}(\log^5 T / T)$ to a sharper $\mathcal{O}(\log T / T)$. This matches the best known convergence rate for CE in terms of $T$, number of iterations, while also improving the dependence on the action set size from polynomial to polylogarithmic-yielding exponential gains in high-dimensional settings. Our approach builds on recent advances in adaptive step-size techniques for no-regret algorithms in normal-form games, and extends them to the Markovian setting via a stage-wise scheme that adjusts learning rates based on real-time feedback. We frame policy updates as an instance of Optimistic Follow-the-Regularized-Leader (OFTRL), customized for value-iteration-based learning. The resulting self-play algorithm achieves, to our knowledge, the fastest known convergence rate to CCE in Markov games.
☆ Census-Based Population Autonomy For Distributed Robotic Teaming
Collaborating teams of robots show promise due in their ability to complete missions more efficiently and with improved robustness, attributes that are particularly useful for systems operating in marine environments. A key issue is how to model, analyze, and design these multi-robot systems to realize the full benefits of collaboration, a challenging task since the domain of multi-robot autonomy encompasses both collective and individual behaviors. This paper introduces a layered model of multi-robot autonomy that uses the principle of census, or a weighted count of the inputs from neighbors, for collective decision-making about teaming, coupled with multi-objective behavior optimization for individual decision-making about actions. The census component is expressed as a nonlinear opinion dynamics model and the multi-objective behavior optimization is accomplished using interval programming. This model can be reduced to recover foundational algorithms in distributed optimization and control, while the full model enables new types of collective behaviors that are useful in real-world scenarios. To illustrate these points, a new method for distributed optimization of subgroup allocation is introduced where robots use a gradient descent algorithm to minimize portions of the cost functions that are locally known, while being influenced by the opinion states from neighbors to account for the unobserved costs. With this method the group can collectively use the information contained in the Hessian matrix of the total global cost. The utility of this model is experimentally validated in three categorically different experiments with fleets of autonomous surface vehicles: an adaptive sampling scenario, a high value unit protection scenario, and a competitive game of capture the flag.
comment: 16 pages, 17 figures
♻ ☆ Guided Bayesian Optimization: Data-Efficient Controller Tuning with Digital Twin
This article presents the guided Bayesian optimization algorithm as an efficient data-driven method for iteratively tuning closed-loop controller parameters using an event-triggered digital twin of the system based on available closed-loop data. We define a controller tuning framework independent of the controller or the plant structure. Our proposed methodology is model-free, making it suitable for nonlinear and unmodelled plants with measurement noise. The objective function consists of performance metrics modeled by Gaussian processes. We utilize the available information in the closed-loop system to identify and progressively maintain a digital twin that guides the optimizer, improving the data efficiency of our method. Switching the digital twin on and off is triggered by data-driven criteria related to the digital twin's uncertainty estimations in the BO tuning framework. Effectively, it replaces much of the exploration of the real system with exploration performed on the digital twin. We analyze the properties of our method in simulation and demonstrate its performance on two real closed-loop systems with different plant and controller structures. The experimental results show that our method requires fewer experiments on the physical plant than Bayesian optimization to find the optimal controller parameters.
comment: This work has been published in IEEE Transactions on Automation Science and Engineering
♻ ☆ Drift Plus Optimistic Penalty: A Learning Framework for Stochastic Network Optimization with Improved Regret Bounds
We consider the problem of joint routing and scheduling in queueing networks, where the edge transmission costs are unknown. At each time-slot, the network controller receives noisy observations of transmission costs only for those edges it selects for transmission. The network controller's objective is to make routing and scheduling decisions so that the total expected cost is minimized. This problem exhibits an exploration-exploitation trade-off, however, previous bandit-style solutions cannot be directly applied to this problem due to the queueing dynamics. In order to ensure network stability, the network controller needs to optimize throughput and cost simultaneously. We show that the best achievable cost is lower bounded by the solution to a static optimization problem, and develop a network control policy using techniques from Lyapunov drift-plus-penalty optimization and multi-arm bandits. We show that the policy achieves a sub-linear regret of order $O(\sqrt{T}\log T)$, as compared to the best policy that has complete knowledge of arrivals and costs. Finally, we evaluate the proposed policy using simulations and show that its regret is indeed sub-linear.
♻ ☆ Constrained Optimal Fuel Consumption of HEVs under Observational Noise
In our prior work, we investigated the minimum fuel consumption of a hybrid electric vehicle (HEV) under a state-of-charge (SOC) balance constraint, assuming perfect SOC measurements and accurate reference speed profiles. The constrained optimal fuel consumption (COFC) problem was addressed using a constrained reinforcement learning (CRL) framework. However, in real-world scenarios, SOC readings are often corrupted by sensor noise, and reference speeds may deviate from actual driving conditions. To account for these imperfections, this study reformulates the COFC problem by explicitly incorporating observational noise in both SOC and reference speed. We adopt a robust CRL approach, where the noise is modeled as a uniform distribution, and employ a structured training procedure to ensure stability. The proposed method is evaluated through simulations on the Toyota Prius hybrid system (THS), using both the New European Driving Cycle (NEDC) and the Worldwide Harmonized Light Vehicles Test Cycle (WLTC). Results show that fuel consumption and SOC constraint satisfaction remain robust across varying noise levels. Furthermore, the analysis reveals that observational noise in SOC and speed can impact fuel consumption to different extents. To the best of our knowledge, this is the first study to explicitly examine how observational noise -- commonly encountered in dynamometer testing and predictive energy control (PEC) applications -- affects constrained optimal fuel consumption in HEVs.
comment: Minor text and figure adjustments; no substantive changes
♻ ☆ Virtual Target Trajectory Prediction for Stochastic Targets
Trajectory prediction of aerial vehicles is a key requirement in applications ranging from missile guidance to UAV collision avoidance. While most prediction methods assume deterministic target motion, real-world targets often exhibit stochastic behaviors such as evasive maneuvers or random gliding patterns. This paper introduces a probabilistic framework based on Conditional Normalizing Flows (CNFs) to model and predict such stochastic dynamics directly from trajectory data. The learned model generates probability distributions of future target positions conditioned on initial states and dynamic parameters, enabling efficient sampling and exact density evaluation. To provide deterministic surrogates compatible with existing guidance and planning algorithms, sampled trajectories are clustered using a time series k-means approach, yielding a set of representative "virtual target" trajectories. The method is target-agnostic, computationally efficient, and requires only trajectory data for training, making it suitable as a drop-in replacement for deterministic predictors. Simulated scenarios with maneuvering and ballistic targets demonstrate that the proposed approach bridges the gap between deterministic assumptions and stochastic reality, advancing guidance and control algorithms for autonomous vehicles.
comment: Manuscript accepted by Journal of Guidance, Control, and Dynamics
♻ ☆ Constrained computational hybrid controller for Input Affine Hybrid Dynamical Systems
Hybrid dynamical systems are viewed as the most complicated systems with continuous and event-based behaviors. Since traditional controllers cannot handle these systems, some newly-developed controllers have been published in recent decades to deal with them. This paper presents a novel implementable constrained final-state controller based on partitioning the system's state-space, computational simulations, and graph theory. Experimental results and a comparison with Model Predictive Controller on the three tank benchmark and swing-up control of a pendulum show the effectiveness of the proposed Computational Hybrid Controller(CHC).
♻ ☆ A moving horizon estimator for aquifer thermal energy storages
Aquifer thermal energy storages (ATES) represent groundwater saturated aquifers that store thermal energy in the form of heated or cooled groundwater. Combining two ATES, one can harness excess thermal energy from summer (heat) and winter (cold) to support the building's heating, ventilation, and air conditioning (HVAC) technology. In general, a dynamic operation of ATES throughout the year is beneficial to avoid using fossil fuel-based HVAC technology and maximize the ``green use'' of ATES. Model predictive control (MPC) with an appropriate system model may become a crucial control approach for ATES systems. Consequently, the MPC model should reflect spatial temperature profiles around ATES' boreholes to predict extracted groundwater temperatures accurately. However, meaningful predictions require the estimation of the current state of the system, as measurements are usually only at the borehole of the ATES. In control, this is often realized by model-based observers. Still, observing the state of an ATES system is non-trivial, since the model is typically hybrid. We show how to exploit the specific structure of the hybrid ATES model and design an easy-to-solve moving horizon estimator based on a quadratic program.
comment: European Control Conference 2025 (ECC), Thessaloniki, Greece
♻ ☆ Chance-Constrained Neural MPC under Uncontrollable Agents via Sequential Convex Programming
This work investigates the challenge of ensuring safety guarantees under uncontrollable agents whose behaviors are stochastic and depend on both their own and the system's states. We present a neural model predictive control (MPC) framework that predicts the trajectory of the uncontrollable agent using a predictor learned from offline data. To provide probabilistic guarantees on prediction errors, we employ split conformal prediction to construct region-specific, time-dependent uncertainty bounds, which are integrated into the MPC formulation. To solve the resulting non-convex, discontinuous optimization problem, we propose a two-loop iterative sequential convex programming algorithm. The inner loop solves convexified subproblems with fixed error bounds, while the outer loop refines these bounds based on updated control sequences. We establish convergence guarantees under mild regularity conditions and demonstrate the optimality of the algorithm. We illustrate our method with an autonomous driving scenario involving interactive pedestrians. Experimental results demonstrate that our approach achieves superior safety and efficiency compared to baseline methods, with success rates exceeding 99.5\% while maintaining higher average speeds in multi-pedestrian scenarios.
♻ ☆ Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models NeurIPS 2025
Robust coordination is critical for effective decision-making in multi-agent systems, especially under partial observability. A central question in Multi-Agent Reinforcement Learning (MARL) is whether to engineer communication protocols or learn them end-to-end. We investigate this dichotomy using embodied world models. We propose and compare two communication strategies for a cooperative task-allocation problem. The first, Learned Direct Communication (LDC), learns a protocol end-to-end. The second, Intention Communication, uses an engineered inductive bias: a compact, learned world model, the Imagined Trajectory Generation Module (ITGM), which uses the agent's own policy to simulate future states. A Message Generation Network (MGN) then compresses this plan into a message. We evaluate these approaches on goal-directed interaction in a grid world, a canonical abstraction for embodied AI problems, while scaling environmental complexity. Our experiments reveal that while emergent communication is viable in simple settings, the engineered, world model-based approach shows superior performance, sample efficiency, and scalability as complexity increases. These findings advocate for integrating structured, predictive models into MARL agents to enable active, goal-driven coordination.
comment: Published in the Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Scaling Environments for Agents (SEA). Additionally accepted for presentation in the NeurIPS 2025 Workshop: Embodied World Models for Decision Making (EWM) and the NeurIPS 2025 Workshop: Optimization for Machine Learning (OPT)
♻ ☆ Neural Network Aided Kalman Filtering with Model Predictive Control Enables Robot-Assisted Drone Recovery on a Wavy Surface
Recovering a drone on a disturbed water surface remains a significant challenge in maritime robotics. In this paper, we propose a unified framework for robot-assisted drone recovery on a wavy surface that addresses two major tasks: Firstly, accurate prediction of a moving drone's position under wave-induced disturbances using KalmanNet Plus Plus (KalmanNet++), a Neural Network Aided Kalman Filtering we proposed. Secondly, effective motion planning using the desired position we got for a manipulator via Receding Horizon Model Predictive Control (RHMPC). Specifically, we compared multiple prediction methods and proposed KalmanNet Plus Plus to predict the position of the UAV, thereby obtaining the desired position. The KalmanNet++ predicts the drone's future position 0.1\,s ahead, while the manipulator plans a capture trajectory in real time, thus overcoming not only wave-induced base motions but also limited constraints such as torque constraints and joint constraints. For the system design, we provide a collaborative system, comprising a manipulator subsystem and a UAV subsystem, enables drone lifting and drone recovery. Simulation and real-world experiments using wave-disturbed motion data demonstrate that our approach achieves a high success rate - above 95\% and outperforms conventional baseline methods by up to 10\% in efficiency and 20\% in precision. The results underscore the feasibility and robustness of our system, which achieves state-of-the-art performance and offers a practical solution for maritime drone operations.
comment: 17 pages, 51 figures
♻ ☆ On the Number of Control Nodes in Boolean Networks with Degree Constraints
This paper studies the minimum control node set problem for Boolean networks (BNs) with degree constraints. The main contribution is to derive the nontrivial lower and upper bounds on the size of the minimum control node set through combinatorial analysis of four types of BNs (i.e., $k$-$k$-XOR-BNs, simple $k$-$k$-AND-BNs, $k$-$k$-AND-BNs with negation and $k$-$k$-NC-BNs, where the $k$-$k$-AND-BN with negation is an extension of the simple $k$-$k$-AND-BN that considers the occurrence of negation and NC means nested canalyzing). More specifically, four bounds for the size of the minimum control node set: general lower bound, best case upper bound, worst case lower bound, and general upper bound are studied. By dividing nodes into three disjoint sets, extending the time to reach the target state, and utilizing necessary conditions for controllability, these bounds are obtained, and further meaningful results and phenomena are discovered. Notably, all of the above results involving the AND function also apply to the OR function.
comment: 35 pages, 9 figures
♻ ☆ Machine Learning-assisted Dynamics-Constrained Day-Ahead Energy Scheduling
TThe rapid expansion of inverter-based resources, such as wind and solar power plants, will significantly diminish the presence of conventional synchronous generators in fu-ture power grids with rich renewable energy sources. This transition introduces in-creased complexity and reduces dynamic stability in system operation and control, with low inertia being a widely recognized challenge. However, the literature has not thoroughly explored grid dynamic performance associated with energy scheduling so-lutions that traditionally only consider grid steady-state constraints. This paper will bridge the gap by enforcing grid dynamic constraints when conducting optimal energy scheduling; particularly, this paper explores locational post-contingency rate of change of frequency (RoCoF) requirements to accommodate substantial inertia reductions. This paper introduces a machine learning-assisted RoCoF-constrained unit commit-ment (ML-RCUC) model designed to ensure RoCoF stability after the most severe generator outage while maintaining operational efficiency. A graph-informed NN (GINN)-based RoCoF predictor is first trained on a high-fidelity simulation dataset to track the highest locational RoCoF, which is then reformulated as mixed-integer linear programming constraints that are integrated into the unit commitment model. Case studies, by solving the optimization problem ML-RCUC and validating its solutions with time-domain simulations, demonstrate that the proposed method can ensure loca-tional RoCoF stability with minimum conservativeness.
Graphics 4
☆ TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.
comment: 13 pages, 8 figures, 3 tables. The first two authors contributed equally. Project Page: https://iyatomilab.github.io/TAUE
☆ OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis and relighting have shown promising results, most techniques still heavily rely on the synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. We believe that OLATverse represents a pivotal step toward integrating the next generation of inverse rendering and relighting methods with real-world data. The full dataset, along with all post-processing workflows, will be publicly released at https://vcai.mpi-inf.mpg.de/projects/OLATverse/.
♻ ☆ GS-Verse: Mesh-based Gaussian Splatting for Physics-aware Interaction in Virtual Reality
As the demand for immersive 3D content grows, the need for intuitive and efficient interaction methods becomes paramount. Current techniques for physically manipulating 3D content within Virtual Reality (VR) often face significant limitations, including reliance on engineering-intensive processes and simplified geometric representations, such as tetrahedral cages, which can compromise visual fidelity and physical accuracy. In this paper, we introduce GS-Verse (Gaussian Splatting for Virtual Environment Rendering and Scene Editing), a novel method designed to overcome these challenges by directly integrating an object's mesh with a Gaussian Splatting (GS) representation. Our approach enables more precise surface approximation, leading to highly realistic deformations and interactions. By leveraging existing 3D mesh assets, GS-Verse facilitates seamless content reuse and simplifies the development workflow. Moreover, our system is designed to be physics-engine-agnostic, granting developers robust deployment flexibility. This versatile architecture delivers a highly realistic, adaptable, and intuitive approach to interactive 3D manipulation. We rigorously validate our method against the current state-of-the-art technique that couples VR with GS in a comparative user study involving 18 participants. Specifically, we demonstrate that our approach is statistically significantly better for physics-aware stretching manipulation and is also more consistent in other physics-based manipulations like twisting and shaking. Further evaluation across various interactions and scenes confirms that our method consistently delivers high and reliable performance, showing its potential as a plausible alternative to existing methods.
♻ ☆ Adv-BMT: Bidirectional Motion Transformer for Safety-Critical Traffic Scenario Generation
Scenario-based testing is essential for validating the performance of autonomous driving (AD) systems. However, such testing is limited by the scarcity of long-tailed, safety-critical scenarios in existing datasets collected in the real world. To tackle the data issue, we propose the Adv-BMT framework, which augments real-world scenarios with diverse and realistic adversarial traffic interactions. The core component of Adv-BMT is a bidirectional motion transformer (BMT) model to perform inverse traffic motion predictions, which takes agent information in the last time step of the scenario as input, and reconstructs the traffic in the inverse of chronological order until the initial time step. The Adv-BMT framework is a two-staged pipeline: it first conducts adversarial initializations and then inverse motion predictions. Different from previous work, we do not need any collision data for pretraining, and are able to generate realistic and diverse collision interactions. Our experimental results validate the quality of generated collision scenarios by Adv-BMT: training in our augmented dataset would reduce episode collision rates by 20%. Demo and code are available at: https://metadriverse.github.io/adv-bmt/.
Computation and Language 94
☆ Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available. %We release an open-source implementation to support continued research on scalable and reliable omni-modal reasoning.
comment: 16 pages, 7 figures, 14 tables. Under Review
☆ In Good GRACEs: Principled Teacher Selection for Knowledge Distillation
Knowledge distillation is an efficient strategy to use data generated by large "teacher" language models to train smaller capable "student" models, but selecting the optimal teacher for a specific student-task combination requires expensive trial-and-error. We propose a lightweight score called GRACE to quantify how effective a teacher will be for post-training a student model. GRACE measures distributional properties of the student's gradients without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE connects to leave-one-out stability of gradient-based algorithms, which controls the generalization performance of the distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86% Spearman correlation) with the performance of the distilled LLaMA and OLMo students. In particular, training a student using the GRACE-selected teacher can improve the performance by up to 7.4% over naively using the best-performing teacher. Further, GRACE can provide guidance on crucial design choices in distillation, including (1) the best temperature to use when generating from the teacher, (2) the best teacher to use given a size constraint, and (3) the best teacher to use within a specific model family. Altogether, our findings demonstrate that GRACE can efficiently and effectively identify a strongly compatible teacher for a given student and provide fine-grained guidance on how to perform distillation.
☆ Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities
As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that might be performed with long context. We introduce Oolong, a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level, and then aggregating these analyses to answer distributional questions. Oolong is separated into two task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can easily ablate components of the reasoning problem; and Oolong-real, a downstream setting which requires reasoning over real-world conversational data. Oolong requires models to reason over large quantities of examples, to perform both classification and counting in-context, and to reason over temporal and user relations. Even frontier models struggle on Oolong, with GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy on both splits at 128K. We release the data and evaluation harness for Oolong to enable further development of models that can reason over large quantities of text.
comment: Preprint
☆ MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
Typical search agents concatenate the entire interaction history into the LLM context, preserving information integrity but producing long, noisy contexts, resulting in high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines the current turn with it. At each turn, MemSearcher fuses the user's question with the memory to generate reasoning traces, perform search actions, and update memory to retain only information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimize reasoning, search strategies, and memory management of MemSearcher Agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks: +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct relative average gains. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher
comment: Project page: https://github.com/icip-cas/MemSearcher
☆ Can LLMs subtract numbers?
We present a systematic study of subtraction in large language models (LLMs). While prior benchmarks emphasize addition and multiplication, subtraction has received comparatively little attention despite being structurally distinct as a non-commutative operation. We evaluate eight pretrained LLMs spanning four families on addition and subtraction problems. Our experiments reveal that subtraction accuracy lags behind addition by a wide margin. We find that the errors for ($a-b$) are concentrated in cases where ($a
comment: Work-in-progress; MathNLP non-archival presentation
☆ VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.
comment: Project page: https://csu-jpg.github.io/VCode Github: https://github.com/CSU-JPG/VCode
☆ Beyond Single Embeddings: Capturing Diverse Targets with Multi-Query Retrieval
Most text retrievers generate \emph{one} query vector to retrieve relevant documents. Yet, the conditional distribution of relevant documents for the query may be multimodal, e.g., representing different interpretations of the query. We first quantify the limitations of existing retrievers. All retrievers we evaluate struggle more as the distance between target document embeddings grows. To address this limitation, we develop a new retriever architecture, \emph{A}utoregressive \emph{M}ulti-\emph{E}mbedding \emph{R}etriever (AMER). Our model autoregressively generates multiple query vectors, and all the predicted query vectors are used to retrieve documents from the corpus. We show that on the synthetic vectorized data, the proposed method could capture multiple target distributions perfectly, showing 4x better performance than single embedding model. We also fine-tune our model on real-world multi-answer retrieval datasets and evaluate in-domain. AMER presents 4 and 21\% relative gains over single-embedding baselines on two datasets we evaluate on. Furthermore, we consistently observe larger gains on the subset of dataset where the embeddings of the target documents are less similar to each other. We demonstrate the potential of using a multi-query vector retriever and open up a new direction for future work.
☆ Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning
Large language models (LLMs) exhibit complementary strengths across domains and come with varying inference costs, motivating the design of multi-agent LLM systems where specialized models collaborate efficiently. Existing approaches predominantly rely on decentralized frameworks, which invoke multiple LLMs for every input and thus lead to substantial and uncontrolled inference costs. In this work, we introduce a centralized multi-LLM framework, where a controller LLM selectively coordinates a pool of expert models in a cost-efficient and cost-controllable manner. We formulate this coordination problem as reinforcement learning with dual objectives: maximizing task performance while minimizing the overall inference cost. In addition, we expect the multi-agent system to have adapted behavior with different budget conditions during inference. To this end, we propose CoRL, a reinforcement learning framework that optimizes the performance cost trade-off in a controllable multi-budget setting. Experiments on four diverse benchmarks demonstrate that CoRL enables a single system to surpass the best expert LLM under high-budget settings, while maintaining strong performance in more economical low-budget modes, highlighting the effectiveness of centralized coordination for scalable and cost-efficient multi-agent LLM systems.
comment: 14 pages
☆ AI Diffusion in Low Resource Language Countries
Artificial intelligence (AI) is diffusing globally at unprecedented speed, but adoption remains uneven. Frontier Large Language Models (LLMs) are known to perform poorly on low-resource languages due to data scarcity. We hypothesize that this performance deficit reduces the utility of AI, thereby slowing adoption in Low-Resource Language Countries (LRLCs). To test this, we use a weighted regression model to isolate the language effect from socioeconomic and demographic factors, finding that LRLCs have a share of AI users that is approximately 20% lower relative to their baseline. These results indicate that linguistic accessibility is a significant, independent barrier to equitable AI diffusion.
comment: 9 pages, 4 tables. Also available at https://aka.ms/AI_Diffusion_Low_Resource_Language_Countries
☆ CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents
Current evaluations of Large Language Model (LLM) agents primarily emphasize task completion, often overlooking resource efficiency and adaptability. This neglects a crucial capability: agents' ability to devise and adjust cost-optimal plans in response to changing environments. To bridge this gap, we introduce CostBench, a scalable, cost-centric benchmark designed to evaluate agents' economic reasoning and replanning abilities. Situated in the travel-planning domain, CostBench comprises tasks solvable via multiple sequences of atomic and composite tools with diverse, customizable costs. It also supports four types of dynamic blocking events, such as tool failures and cost changes, to simulate real-world unpredictability and necessitate agents to adapt in real time. Evaluating leading open-sourced and proprietary models on CostBench reveals a substantial gap in cost-aware planning: agents frequently fail to identify cost-optimal solutions in static settings, with even GPT-5 achieving less than 75% exact match rate on the hardest tasks, and performance further dropping by around 40% under dynamic conditions. By diagnosing these weaknesses, CostBench lays the groundwork for developing future agents that are both economically rational and robust.
☆ PragExTra: A Multilingual Corpus of Pragmatic Explicitation in Translation
Translators often enrich texts with background details that make implicit cultural meanings explicit for new audiences. This phenomenon, known as pragmatic explicitation, has been widely discussed in translation theory but rarely modeled computationally. We introduce PragExTra, the first multilingual corpus and detection framework for pragmatic explicitation. The corpus covers eight language pairs from TED-Multi and Europarl and includes additions such as entity descriptions, measurement conversions, and translator remarks. We identify candidate explicitation cases through null alignments and refined using active learning with human annotation. Our results show that entity and system-level explicitations are most frequent, and that active learning improves classifier accuracy by 7-8 percentage points, achieving up to 0.88 accuracy and 0.82 F1 across languages. PragExTra establishes pragmatic explicitation as a measurable, cross-linguistic phenomenon and takes a step towards building culturally aware machine translation. Keywords: translation, multilingualism, explicitation
☆ The Collaboration Gap
The trajectory of AI development suggests that we will increasingly rely on agent-based systems composed of independently developed agents with different information, privileges, and tools. The success of these systems will critically depend on effective collaboration among these heterogeneous agents, even under partial observability. Despite intense interest, few empirical studies have evaluated such agent-agent collaboration at scale. We propose a collaborative maze-solving benchmark that (i) isolates collaborative capabilities, (ii) modulates problem complexity, (iii) enables scalable automated grading, and (iv) imposes no output-format constraints, preserving ecological plausibility. Using this framework, we evaluate 32 leading open- and closed-source models in solo, homogeneous, and heterogeneous pairings. Our results reveal a "collaboration gap": models that perform well solo often degrade substantially when required to collaborate. Collaboration can break down dramatically; for instance, small distilled models that solve mazes well alone may fail almost completely in certain pairings. We find that starting with the stronger agent often improves outcomes, motivating a "relay inference" approach where the stronger agent leads before handing off to the weaker one, closing much of the gap. Our findings argue for (1) collaboration-aware evaluation, (2) training strategies developed to enhance collaborative capabilities, and (3) interaction design that reliably elicits agents' latent skills, guidance that applies to AI-AI and human-AI collaboration.
☆ Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes
Large language models (LLMs) are increasingly prevalent across diverse applications. However, their enormous size limits storage and processing capabilities to a few well-resourced stakeholders. As a result, most applications rely on pre-trained LLMs, fine-tuned for specific tasks. However, even storing the fine-tuned versions of these models remains a significant challenge due to the wide range of tasks they address. Recently, studies show that fine-tuning these models primarily affects a small fraction of parameters, highlighting the need for more efficient storage of fine-tuned models. This paper focuses on efficient storage of parameter updates in pre-trained models after fine-tuning. To address this challenge, we leverage the observation that fine-tuning updates are both low-rank and sparse, which can be utilized for storage efficiency. However, using only low-rank approximation or sparsification may discard critical singular components that enhance model expressivity. We first observe that given the same memory budget, sparsified low-rank approximations with larger ranks outperform standard low-rank approximations with smaller ranks. Building on this, we propose our method, optimal singular damage, that selectively sparsifies low-rank approximated updates by leveraging the interleaved importance of singular vectors, ensuring that the most impactful components are retained. We demonstrate through extensive experiments that our proposed methods lead to significant storage efficiency and superior accuracy within the same memory budget compared to employing the low-rank approximation or sparsification individually.
☆ Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis, Solution, and Interpretation
Previous studies show that introducing new knowledge during large language models (LLMs) fine-tuning can lead to the generation of erroneous output when tested on known information, thereby triggering factual hallucinations. However, existing studies have not deeply investigated the specific manifestations and underlying mechanisms of these hallucinations. Our work addresses this gap by designing a controlled dataset Biography-Reasoning, and conducting a fine-grained analysis across multiple knowledge types and two task types, including knowledge question answering (QA) and knowledge reasoning tasks. We find that when fine-tuned on a dataset in which a specific knowledge type consists entirely of new knowledge, LLMs exhibit significantly increased hallucination tendencies. This suggests that the high unfamiliarity of a particular knowledge type, rather than the overall proportion of new knowledge, is a stronger driver of hallucinations, and these tendencies can even affect other knowledge types in QA tasks. To mitigate such factual hallucinations, we propose KnownPatch, which patches a small number of known knowledge samples in the later stages of training, effectively alleviating new-knowledge-induced hallucinations. Through attention analysis, we find that learning new knowledge reduces the model's attention to key entities in the question, thus causing excessive focus on the surrounding context, which may increase the risk of hallucination. Moreover, the attention pattern can propagate to similar contexts, facilitating the spread of hallucinations to textually similar questions. Our method effectively mitigates the disruption of new knowledge learning to the model's attention on key entities, accompanied by improved performance.
☆ The Realignment Problem: When Right becomes Wrong in LLMs
The alignment of Large Language Models (LLMs) with human values is central to their safe deployment, yet current practice produces static, brittle, and costly-to-maintain models that fail to keep pace with evolving norms and policies. This misalignment, which we term the Alignment-Reality Gap, poses a growing challenge for reliable long-term use. Existing remedies are inadequate: large-scale re-annotation is economically prohibitive, and standard unlearning methods act as blunt instruments that erode utility rather than enable precise policy updates. We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework for principled unlearning that reconceives re-alignment as a programmatic policy application problem. TRACE programmatically triages existing preference data against a new policy, identifies high-impact conflicts via a alignment impact score, and applies a hybrid optimization that cleanly inverts, discards, or preserves preferences while safeguarding model performance. Empirical results show that TRACE achieves robust re-alignment across diverse model families (Qwen2.5-7B, Gemma-2-9B, Llama-3.1-8B). On both synthetic benchmarks and the PKU-SafeRLHF dataset under complex policy shift, TRACE enforces new principles without degrading general capabilities. Our work establishes a scalable, dynamic, and cost-effective paradigm for maintaining LLM alignment, providing a foundation for sustainable and responsible AI deployment.
comment: 23 Pages
☆ UniChange: Unifying Change Detection with Multimodal Large Language Model
Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at https://github.com/Erxucomeon/UniChange.
☆ CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency NeurIPS2025
Large language models (LLMs) are often queried multiple times at test time, with predictions aggregated by majority vote. While effective, this self-consistency strategy (arXiv:2203.11171) requires a fixed number of calls and can fail when the correct answer is rare. We introduce Confidence-Guided Early Stopping (CGES), a Bayesian framework that forms posteriors over candidate answers using scalar confidence signals derived from token probabilities or reward models. CGES adaptively halts sampling once the posterior mass of a candidate exceeds a threshold. We provide theoretical guarantees for both perfectly calibrated confidences and realistic noisy confidence signals. Across five reasoning benchmarks, CGES reduces the average number of model calls by about 69 percent (for example, from 16.0 to 4.9) while matching the accuracy of self-consistency within 0.06 percentage points.
comment: Efficient Reasoning @ NeurIPS2025
☆ Next Token Knowledge Tracing: Exploiting Pretrained LLM Representations to Decode Student Behaviour
Modelling student knowledge is a key challenge when leveraging AI in education, with major implications for personalised learning. The Knowledge Tracing (KT) task aims to predict how students will respond to educational questions in learning environments, based on their prior interactions. Existing KT models typically use response correctness along with metadata like skill tags and timestamps, often overlooking the question text, which is an important source of pedagogical insight. This omission poses a lost opportunity while limiting predictive performance. We propose Next Token Knowledge Tracing (NTKT), a novel approach that reframes KT as a next-token prediction task using pretrained Large Language Models (LLMs). NTKT represents both student histories and question content as sequences of text, allowing LLMs to learn patterns in both behaviour and language. Our series of experiments significantly improves performance over state-of-the-art neural KT models and generalises much better to cold-start questions and users. These findings highlight the importance of question content in KT and demonstrate the benefits of leveraging pretrained representations of LLMs to model student learning more effectively.
☆ The Analysis of Lexical Errors in Machine Translation from English into Romanian
The research explores error analysis in the performance of translating by Machine Translation from English into Romanian, and it focuses on lexical errors found in texts which include official information, provided by the World Health Organization (WHO), the Gavi Organization, by the patient information leaflet (the information about the active ingredients of the vaccines or the medication, the indications, the dosage instructions, the storage instructions, the side effects and warning, etc.). All of these texts are related to Covid-19 and have been translated by Google Translate, a multilingual Machine Translation that was created by Google. In the last decades, Google has actively worked to develop a more accurate and fluent automatic translation system. This research, specifically focused on improving Google Translate, aims to enhance the overall quality of Machine Translation by achieving better lexical selection and by reducing errors. The investigation involves a comprehensive analysis of 230 texts that have been translated from English into Romanian.
comment: Doctoral thesis
☆ Smart-Hiring: An Explainable end-to-end Pipeline for CV Information Extraction and Job Matching
Hiring processes often involve the manual screening of hundreds of resumes for each job, a task that is time and effort consuming, error-prone, and subject to human bias. This paper presents Smart-Hiring, an end-to-end Natural Language Processing (NLP) pipeline de- signed to automatically extract structured information from unstructured resumes and to semantically match candidates with job descriptions. The proposed system combines document parsing, named-entity recognition, and contextual text embedding techniques to capture skills, experience, and qualifications. Using advanced NLP technics, Smart-Hiring encodes both resumes and job descriptions in a shared vector space to compute similarity scores between candidates and job postings. The pipeline is modular and explainable, allowing users to inspect extracted entities and matching rationales. Experiments were conducted on a real-world dataset of resumes and job descriptions spanning multiple professional domains, demonstrating the robustness and feasibility of the proposed approach. The system achieves competitive matching accuracy while preserving a high degree of interpretability and transparency in its decision process. This work introduces a scalable and practical NLP frame- work for recruitment analytics and outlines promising directions for bias mitigation, fairness-aware modeling, and large-scale deployment of data-driven hiring solutions.
☆ DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding NeurIPS 2025
Recent advances in multi-modal models have demonstrated strong performance in tasks such as image generation and reasoning. However, applying these models to the fire domain remains challenging due to the lack of publicly available datasets with high-quality fire domain annotations. To address this gap, we introduce DetectiumFire, a large-scale, multi-modal dataset comprising of 22.5k high-resolution fire-related images and 2.5k real-world fire-related videos covering a wide range of fire types, environments, and risk levels. The data are annotated with both traditional computer vision labels (e.g., bounding boxes) and detailed textual prompts describing the scene, enabling applications such as synthetic data generation and fire risk reasoning. DetectiumFire offers clear advantages over existing benchmarks in scale, diversity, and data quality, significantly reducing redundancy and enhancing coverage of real-world scenarios. We validate the utility of DetectiumFire across multiple tasks, including object detection, diffusion-based image generation, and vision-language reasoning. Our results highlight the potential of this dataset to advance fire-related research and support the development of intelligent safety systems. We release DetectiumFire to promote broader exploration of fire understanding in the AI community. The dataset is available at https://kaggle.com/datasets/38b79c344bdfc55d1eed3d22fbaa9c31fad45e27edbbe9e3c529d6e5c4f93890
comment: Advances in Neural Information Processing Systems 2025 (NeurIPS 2025), Poster, https://neurips.cc/virtual/2025/loc/san-diego/poster/121400
Prompting for Policy: Forecasting Macroeconomic Scenarios with Synthetic LLM Personas
We evaluate whether persona-based prompting improves Large Language Model (LLM) performance on macroeconomic forecasting tasks. Using 2,368 economics-related personas from the PersonaHub corpus, we prompt GPT-4o to replicate the ECB Survey of Professional Forecasters across 50 quarterly rounds (2013-2025). We compare the persona-prompted forecasts against the human experts panel, across four target variables (HICP, core HICP, GDP growth, unemployment) and four forecast horizons. We also compare the results against 100 baseline forecasts without persona descriptions to isolate its effect. We report two main findings. Firstly, GPT-4o and human forecasters achieve remarkably similar accuracy levels, with differences that are statistically significant yet practically modest. Our out-of-sample evaluation on 2024-2025 data demonstrates that GPT-4o can maintain competitive forecasting performance on unseen events, though with notable differences compared to the in-sample period. Secondly, our ablation experiment reveals no measurable forecasting advantage from persona descriptions, suggesting these prompt components can be omitted to reduce computational costs without sacrificing accuracy. Our results provide evidence that GPT-4o can achieve competitive forecasting accuracy even on out-of-sample macroeconomic events, if provided with relevant context data, while revealing that diverse prompts produce remarkably homogeneous forecasts compared to human panels.
comment: 9 pages, 8-pages appendix, accepted at ICAIF 25
☆ Merging Continual Pretraining Models for Domain-Specialized LLMs: A Case Study in Finance
While LLMs excel at general tasks, they struggle in specialized domains like finance, requiring diverse skills in domain knowledge, mathematical reasoning, and multilingual processing. Merging domain-specific Continual Pre-training (CPT) "experts" offers a practical alternative to costly and unstable multi-skill training. However, unlike established Supervised Fine-Tuning (SFT) model-based merging, CPT model merging remains largely unexplored. We address this gap by creating financial LLMs from experts in finance, math, and Japanese. We propose a three-stage evaluation focusing on knowledge recovery, complementarity, and emergence, and assess three merging methods (Task Arithmetic, TIES, and DARE-TIES) on a comprehensive financial benchmark curated from 18 tasks across 8 established datasets. Results show that merging an expert with its base model recovers general knowledge lost during CPT, while merging experts improves performance and can yield emergent cross-domain skills. Among the methods, Task Arithmetic performs strongly but is hyperparameter-sensitive, whereas TIES is more robust. Our findings also suggest that while model similarity correlates with merging success, emergent skills depend on more complex factors. This work presents the first foundational analysis of CPT model merging, establishing a principled framework and providing clear guidance for building multi-skill LLMs from existing assets.
☆ AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to 95% attack success rate on Llama-3.1-8B within six turns a 24 percent improvement over single turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.
☆ AyurParam: A State-of-the-Art Bilingual Language Model for Ayurveda
Current large language models excel at broad, general-purpose tasks, but consistently underperform when exposed to highly specialized domains that require deep cultural, linguistic, and subject-matter expertise. In particular, traditional medical systems such as Ayurveda embody centuries of nuanced textual and clinical knowledge that mainstream LLMs fail to accurately interpret or apply. We introduce AyurParam-2.9B, a domain-specialized, bilingual language model fine-tuned from Param-1-2.9B using an extensive, expertly curated Ayurveda dataset spanning classical texts and clinical guidance. AyurParam's dataset incorporates context-aware, reasoning, and objective-style Q&A in both English and Hindi, with rigorous annotation protocols for factual precision and instructional clarity. Benchmarked on BhashaBench-Ayur, AyurParam not only surpasses all open-source instruction-tuned models in its size class (1.5--3B parameters), but also demonstrates competitive or superior performance compared to much larger models. The results from AyurParam highlight the necessity for authentic domain adaptation and high-quality supervision in delivering reliable, culturally congruent AI for specialized medical knowledge.
☆ LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context
In this work, we propose LiveSecBench, a dynamic and continuously updated safety benchmark specifically for Chinese-language LLM application scenarios. LiveSecBench evaluates models across six critical dimensions (Legality, Ethics, Factuality, Privacy, Adversarial Robustness, and Reasoning Safety) rooted in the Chinese legal and social frameworks. This benchmark maintains relevance through a dynamic update schedule that incorporates new threat vectors, such as the planned inclusion of Text-to-Image Generation Safety and Agentic Safety in the next update. For now, LiveSecBench (v251030) has evaluated 18 LLMs, providing a landscape of AI safety in the context of Chinese language. The leaderboard is publicly accessible at https://livesecbench.intokentech.cn/.
☆ CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language model that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.
☆ Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation CIKM 2025
Query augmentation makes queries more meaningful by appending further information to the queries to find relevant documents. Current studies have proposed Large Language Model (LLM)-based embedders, which learn representation for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLM. During inference, these jointly trained embedders have conducted query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency and query augmentation can be detrimental to performance for some queries. Also, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level. One includes queries that require augmentation and the other includes queries that do not. Then, we introduces a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for others. Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, providing much faster embedding latency.
comment: Accepted to MMGenSR Workshop (CIKM 2025)
☆ LTD-Bench: Evaluating Large Language Models by Letting Them Draw NeurIPS 2025
Current evaluation paradigms for large language models (LLMs) represent a critical blind spot in AI research--relying on opaque numerical metrics that conceal fundamental limitations in spatial reasoning while providing no intuitive understanding of model capabilities. This deficiency creates a dangerous disconnect between reported performance and practical abilities, particularly for applications requiring physical world understanding. We introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non-experts, bridging the fundamental gap between statistical performance and intuitive assessment. LTD-Bench implements a comprehensive methodology with complementary generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception) across three progressively challenging difficulty levels, methodically evaluating both directions of the critical language-spatial mapping. Our extensive experiments with state-of-the-art models expose an alarming capability gap: even LLMs achieving impressive results on traditional benchmarks demonstrate profound deficiencies in establishing bidirectional mappings between language and spatial concept--a fundamental limitation that undermines their potential as genuine world models. Furthermore, LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
comment: Accepted by NeurIPS 2025
☆ Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning
We study the problem of learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks enables the decomposition of complex tasks into simpler sub-tasks that can be assigned to agents. However, existing approaches remain sample-inefficient and are limited to the single-task case. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify the main challenges to ACC-MARL's feasibility in practice, propose solutions, and prove the correctness of our approach. We further show that the value functions of learned policies can be used to assign tasks optimally at test time. Experiments show emergent task-aware, multi-step coordination among agents, e.g., pressing a button to unlock a door, holding the door, and short-circuiting tasks.
☆ Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation
Large Language Models (LLMs) trained with reinforcement learning and verifiable rewards have achieved strong results on complex reasoning tasks. Recent work extends this paradigm to a multi-agent setting, where a meta-thinking agent proposes plans and monitors progress while a reasoning agent executes subtasks through sequential conversational turns. Despite promising performance, we identify a critical limitation: lazy agent behavior, in which one agent dominates while the other contributes little, undermining collaboration and collapsing the setup to an ineffective single agent. In this paper, we first provide a theoretical analysis showing why lazy behavior naturally arises in multi-agent reasoning. We then introduce a stable and efficient method for measuring causal influence, helping mitigate this issue. Finally, as collaboration intensifies, the reasoning agent risks getting lost in multi-turn interactions and trapped by previous noisy responses. To counter this, we propose a verifiable reward mechanism that encourages deliberation by allowing the reasoning agent to discard noisy outputs, consolidate instructions, and restart its reasoning process when necessary. Extensive experiments demonstrate that our framework alleviates lazy agent behavior and unlocks the full potential of multi-agent framework for complex reasoning tasks.
☆ Link prediction Graph Neural Networks for structure recognition of Handwritten Mathematical Expressions ICDAR2025
We propose a Graph Neural Network (GNN)-based approach for Handwritten Mathematical Expression (HME) recognition by modeling HMEs as graphs, where nodes represent symbols and edges capture spatial dependencies. A deep BLSTM network is used for symbol segmentation, recognition, and spatial relation classification, forming an initial primitive graph. A 2D-CFG parser then generates all possible spatial relations, while the GNN-based link prediction model refines the structure by removing unnecessary connections, ultimately forming the Symbol Label Graph. Experimental results demonstrate the effectiveness of our approach, showing promising performance in HME structure recognition.
comment: accepted for ICDAR2025-WML
☆ SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning
We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at https://github.com/BytedanceDouyinContent/SAIL-RL.
☆ Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results
Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use-cases of LLMs. However, chatbots used in medical contexts must provide consistent advice in situations where non-medical factors are involved, such as when demographic information is present. In order to understand the conditions under which medical chatbots fail to perform as expected, we develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions that we subsequently use to prompt LLMs. In 2), our evaluation pipeline provides hallucination and omission detection using LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge treatment category detectors. As a baseline study, we perform two case studies on inter-LLM agreement and the impact of varying the answering and evaluation LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen's Kappa $\kappa=0.118$), and only specific (answering, evaluation) LLM pairs yield statistically significant differences across writing styles, genders, and races. We recommend that studies using LLM evaluation use multiple LLMs as evaluators in order to avoid arriving at statistically significant but non-generalizable results, particularly in the absence of ground-truth data. We also suggest publishing inter-LLM agreement metrics for transparency. Our code and dataset are available here: https://github.com/BBN-E/medic-neurips-2025-demo.
☆ An Evaluation of Interleaved Instruction Tuning on Semantic Reasoning Performance in an Audio MLLM
Standard training for Multi-modal Large Language Models (MLLMs) involves concatenating non-textual information, like vision or audio, with a text prompt. This approach may not encourage deep integration of modalities, limiting the model's ability to leverage the core language model's reasoning capabilities. This work examined the impact of interleaved instruction tuning in an audio MLLM, where audio tokens are interleaved within the prompt. Using the Listen, Think, and Understand (LTU) model as a testbed, we conduct an experiment using the Synonym and Hypernym Audio Reasoning Dataset (SHARD), our newly created reasoning benchmark for audio-based semantic reasoning focusing on synonym and hypernym recognition. Our findings show that while even zero-shot interleaved prompting improves performance on our reasoning tasks, a small amount of fine-tuning using interleaved training prompts improves the results further, however, at the expense of the MLLM's audio labeling ability.
☆ IG-Pruning: Input-Guided Block Pruning for Large Language Models EMNLP 2025
With the growing computational demands of large language models (LLMs), efficient inference has become increasingly critical for practical deployment. Depth pruning has emerged as a promising approach for reducing the computational costs of large language models by removing transformer layers. However, existing methods typically rely on fixed block masks, which can lead to suboptimal performance across different tasks and inputs. In this paper, we propose IG-Pruning, a novel input-aware block-wise pruning method that dynamically selects layer masks at inference time. Our approach consists of two stages: (1) Discovering diverse mask candidates through semantic clustering and L0 optimization, and (2) Implementing efficient dynamic pruning without the need for extensive training. Experimental results demonstrate that our method consistently outperforms state-of-the-art static depth pruning methods, making it particularly suitable for resource-constrained deployment scenarios.
comment: Accepted to EMNLP 2025. Code is available at https://github.com/ictnlp/IG-Pruning
☆ Training Proactive and Personalized LLM Agents
While existing work focuses primarily on task success, we argue that effective real-world agents require optimizing three dimensions: productivity (task completion), proactivity (asking essential questions), and personalization (adapting to diverse user preferences). We introduce UserVille, an interactive environment with LLM-based user simulators enabling diverse, configurable user preferences. Leveraging UserVille, we introduce PPP, a multi-objective reinforcement learning approach that jointly optimizes all three dimensions: Productivity, Proactivity, and Personalization. Experiments on software engineering and deep research tasks show that agents trained with PPP achieve substantial improvements over strong baselines such as GPT-5 (+21.6 on average), demonstrating the ability to ask strategic clarifying questions, adapt to unseen user preferences, and improve task success through better interaction. This work demonstrates that explicitly optimizing for user-centered interaction is critical for building practical and effective AI agents.
☆ Personalized Decision Modeling: Utility Optimization or Textualized-Symbolic Reasoning
Decision-making models for individuals, particularly in high-stakes scenarios like vaccine uptake, often diverge from population optimal predictions. This gap arises from the uniqueness of the individual decision-making process, shaped by numerical attributes (e.g., cost, time) and linguistic influences (e.g., personal preferences and constraints). Developing upon Utility Theory and leveraging the textual-reasoning capabilities of Large Language Models (LLMs), this paper proposes an Adaptive Textual-symbolic Human-centric Reasoning framework (ATHENA) to address the optimal information integration. ATHENA uniquely integrates two stages: First, it discovers robust, group-level symbolic utility functions via LLM-augmented symbolic discovery; Second, it implements individual-level semantic adaptation, creating personalized semantic templates guided by the optimal utility to model personalized choices. Validated on real-world travel mode and vaccine choice tasks, ATHENA consistently outperforms utility-based, machine learning, and other LLM-based models, lifting F1 score by at least 6.5% over the strongest cutting-edge models. Further, ablation studies confirm that both stages of ATHENA are critical and complementary, as removing either clearly degrades overall predictive performance. By organically integrating symbolic utility modeling and semantic adaptation, ATHENA provides a new scheme for modeling human-centric decisions. The project page can be found at https://yibozh.github.io/Athena.
♻ ☆ ValueCompass: A Framework for Measuring Contextual Value Alignment Between Human and LLMs
As AI systems become more advanced, ensuring their alignment with a diverse range of individuals and societal values becomes increasingly critical. But how can we capture fundamental human values and assess the degree to which AI systems align with them? We introduce ValueCompass, a framework of fundamental values, grounded in psychological theory and a systematic review, to identify and evaluate human-AI alignment. We apply ValueCompass to measure the value alignment of humans and large language models (LLMs) across four real-world scenarios: collaborative writing, education, public sectors, and healthcare. Our findings reveal concerning misalignments between humans and LLMs, such as humans frequently endorse values like "National Security" which were largely rejected by LLMs. We also observe that values differ across scenarios, highlighting the need for context-aware AI alignment strategies. This work provides valuable insights into the design space of human-AI alignment, laying the foundations for developing AI systems that responsibly reflect societal values and ethics.
♻ ☆ Hybrid Quantum-Classical Recurrent Neural Networks
We present a hybrid quantum-classical recurrent neural network (QRNN) architecture in which the recurrent core is realized as a parametrized quantum circuit (PQC) controlled by a classical feedforward network. The hidden state is the quantum state of an $n$-qubit PQC in an exponentially large Hilbert space $\mathbb{C}^{2^n}$, which serves as a coherent recurrent quantum memory. The PQC is unitary by construction, making the hidden-state evolution norm-preserving without external constraints. At each timestep, mid-circuit Pauli expectation-value readouts are combined with the input embedding and processed by the feedforward network, which provides explicit classical nonlinearity. The outputs parametrize the PQC, which updates the hidden state via unitary dynamics. The QRNN is compact and physically consistent, and it unifies (i) unitary recurrence as a high-capacity memory, (ii) partial observation via mid-circuit readouts, and (iii) nonlinear classical control for input-conditioned parametrization. We evaluate the model in simulation with up to 14 qubits on sentiment analysis, MNIST, permuted MNIST, copying memory, and language modeling. For sequence-to-sequence learning, we further devise a soft attention mechanism over the mid-circuit readouts and show its effectiveness for machine translation. To our knowledge, this is the first model (RNN or otherwise) grounded in quantum operations to achieve competitive performance against strong classical baselines across a broad class of sequence-learning tasks.
comment: Clarified expectation-value-based readouts and made minor text edits
♻ ☆ Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive scaling paradigm, enabled by the principle of emergent semantics in Transformers with frozen, non-semantic input embeddings. We posit that because high-level meaning is a compositional property of a Transformer's deep layers, not its input vectors, the embedding layer and trained lower layers can serve as a fixed foundation. This liberates backpropagation to focus solely on newly added components, making incremental growth viable. We operationalize this with a layer-wise constructive methodology that combines strict layer freezing in early stages with efficient, holistic fine-tuning of the entire model stack via low-rank adaptation (LoRA) as complexity increases. This method not only demonstrates stable convergence but also reveals a direct correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD, which are absent in shallower models. In a controlled study, our constructively grown model rivals the performance of a monolithically trained baseline of the same size, validating the efficiency and efficacy of the approach. Our findings suggest a path towards a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development. This opens a path for more resource-efficient scaling, continual learning, and a more modular approach to building powerful AI systems. We release all code and models to facilitate further research.
comment: Controlled Comparative Study added
♻ ☆ LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling EMNLP2025
Although transformer architectures have achieved state-of-the-art performance across diverse domains, their quadratic computational complexity with respect to sequence length remains a significant bottleneck, particularly for latency-sensitive long-context applications. While recent linear-complexity alternatives are increasingly powerful, effectively training them from scratch is still resource-intensive. To overcome these limitations, we propose LAWCAT (Linear Attention with Convolution Across Time), a novel linearization framework designed to efficiently transfer the capabilities of pre-trained transformers into a performant linear attention architecture. LAWCAT integrates causal Conv1D layers to enhance local dependency modeling and employs normalized gated linear attention to improve generalization across varying context lengths. Our comprehensive evaluations demonstrate that, distilling Mistral-7B with only 1K-length sequences yields over 90\% passkey retrieval accuracy up to 22K tokens, significantly extending its effective context window. Similarly, Llama3.2-1B LAWCAT variant achieves competitive performance on S-NIAH 1\&2\&3 tasks (1K-8K context length) and BABILong benchmark (QA2\&QA3, 0K-16K context length), requiring less than 0.1\% pre-training tokens compared with pre-training models. Furthermore, LAWCAT exhibits faster prefill speeds than FlashAttention-2 for sequences exceeding 8K tokens. LAWCAT thus provides an efficient pathway to high-performance, long-context linear models suitable for edge deployment, reducing reliance on extensive long-sequence training data and computational resources. Code is released at: https://github.com/zeyuliu1037/LAWCAT
comment: 17 pages, 8 figures. EMNLP2025 Findings
♻ ☆ Accumulating Context Changes the Beliefs of Language Models
Language model (LM) assistants are increasingly used in applications such as brainstorming and research. Improvements in memory and context size have allowed these models to become more autonomous, which has also resulted in more text accumulation in their context windows without explicit user intervention. This comes with a latent risk: the belief profiles of models -- their understanding of the world as manifested in their responses or actions -- may silently change as context accumulates. This can lead to subtly inconsistent user experiences, or shifts in behavior that deviate from the original alignment of the models. In this paper, we explore how accumulating context by engaging in interactions and processing text -- talking and reading -- can change the beliefs of language models, as manifested in their responses and behaviors. Our results reveal that models' belief profiles are highly malleable: GPT-5 exhibits a 54.7% shift in its stated beliefs after 10 rounds of discussion about moral dilemmas and queries about safety, while Grok 4 shows a 27.2% shift on political issues after reading texts from the opposing position. We also examine models' behavioral changes by designing tasks that require tool use, where each tool selection corresponds to an implicit belief. We find that these changes align with stated belief shifts, suggesting that belief shifts will be reflected in actual behavior in agentic systems. Our analysis exposes the hidden risk of belief shift as models undergo extended sessions of talking or reading, rendering their opinions and actions unreliable.
♻ ☆ ORANGE: An Online Reflection ANd GEneration framework with Domain Knowledge for Text-to-SQL
Large Language Models (LLMs) have demonstrated remarkable progress in translating natural language to SQL, but a significant semantic gap persists between their general knowledge and domain-specific semantics of databases. Historical translation logs constitute a rich source of this missing in-domain knowledge, where SQL queries inherently encapsulate real-world usage patterns of database schema. Existing methods primarily enhance the reasoning process for individual translations but fail to accumulate in-domain knowledge from past translations. We introduce ORANGE, an online self-evolutionary framework that constructs database-specific knowledge bases by parsing SQL queries from translation logs. By accumulating in-domain knowledge that contains schema and data semantics, ORANGE progressively reduces the semantic gap and enhances the accuracy of subsequent SQL translations. To ensure reliability, we propose a novel nested Chain-of-Thought SQL-to-Text strategy with tuple-semantic tracking, which reduces semantic errors during knowledge generation. Experiments on multiple benchmarks confirm the practicality of ORANGE, demonstrating its effectiveness for real-world Text-to-SQL deployment, particularly in handling complex and domain-specific queries.
comment: 16 pages, 4 figures, preprint
♻ ☆ Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning
Tokenization is a necessary component within the current architecture of many language mod-els, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance (particularly with respect to inferential lexical competence), and that the emergence of human-meaningful linguistic units among tokens and current structural constraints motivate changes to existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) vehicles for conveying salient distributional patterns from human language to the model and as (2) semantic primitives. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Besides creating suboptimal semantic building blocks and obscuring the model's access to the necessary distributional patterns, we describe how tokens and pretraining can act as a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm's objective function impacts the LLM's cognition, despite being arguably meaningfully insulated from the main system intelligence. Finally, we discuss implications for architectural choices, meaning construction, the primacy of language for thought, and LLM cognition. [First uploaded to arXiv in December, 2024.]
♻ ☆ Deterministic Legal Agents: A Canonical Primitive API for Auditable Reasoning over Temporal Knowledge Graphs
For autonomous legal agents to operate safely in high-stakes domains, they require a foundation of absolute determinism and auditability-guarantees that standard Retrieval-Augmented Generation (RAG) frameworks cannot provide. When interacting with temporal knowledge graphs that model the complex evolution of legal norms, agents must navigate versioning, causality, and hierarchical structures with precision, a task for which black-box vector search is ill-suited. This paper introduces a new architectural pattern to solve this: a formal Primitive API designed as a secure execution layer for reasoning over such graphs. Instead of a monolithic query engine, our framework provides a library of canonical primitives-atomic, composable, and auditable primitives. This design empowers planner-guided agents to decompose complex legal questions into transparent execution plans, enabling critical tasks with full verifiability, including: (i) precise point-in-time version retrieval, (ii) robust causal lineage tracing, and (iii) context-aware hybrid search. Ultimately, this architecture transforms opaque retrieval into auditable reasoning, turning the agent's internal process from a black box into a verifiable log of deterministic primitives and providing a blueprint for building the next generation of trustworthy legal AI.
comment: Major revision reframing the paper from an API spec to a novel architectural pattern for deterministic agents. The core contribution is now positioned as a blueprint for auditable reasoning, essential for building trustworthy legal AI systems
♻ ☆ Linear-Time Demonstration Selection for In-Context Learning via Gradient Estimation EMNLP'25
This paper introduces an algorithm to select demonstration examples for in-context learning of a query set. Given a set of $n$ examples, how can we quickly select $k$ out of $n$ to best serve as the conditioning for downstream inference? This problem has broad applications in prompt tuning and chain-of-thought reasoning. Since model weights remain fixed during in-context learning, previous work has sought to design methods based on the similarity of token embeddings. This work proposes a new approach based on gradients of the output taken in the input embedding space. Our approach estimates model outputs through a first-order approximation using the gradients. Then, we apply this estimation to multiple randomly sampled subsets. Finally, we aggregate the sampled subset outcomes to form an influence score for each demonstration, and select $k$ most relevant examples. This procedure only requires pre-computing model outputs and gradients once, resulting in a linear-time algorithm relative to model and training set sizes. Extensive experiments across various models and datasets validate the efficiency of our approach. We show that the gradient estimation procedure yields approximations of full inference with less than ${1}\%$ error across six datasets. This allows us to scale up subset selection that would otherwise run full inference by up to ${37.7}\times$ on models with up to $34$ billion parameters, and outperform existing selection methods based on input embeddings by ${11}\%$ on average.
comment: 19 pages. EMNLP'25
♻ ☆ Hey, wait a minute: on at-issue sensitivity in Language Models
Evaluating the naturalness of dialogue in language models (LMs) is not trivial: notions of 'naturalness' vary, and scalable quantitative metrics remain limited. This study leverages the linguistic notion of 'at-issueness' to assess dialogue naturalness and introduces a new method: Divide, Generate, Recombine, and Compare (DGRC). DGRC (i) divides a dialogue as a prompt, (ii) generates continuations for subparts using LMs, (iii) recombines the dialogue and continuations, and (iv) compares the likelihoods of the recombined sequences. This approach mitigates bias in linguistic analyses of LMs and enables systematic testing of discourse-sensitive behavior. Applying DGRC, we find that LMs prefer to continue dialogue on at-issue content, with this effect enhanced in instruct-tuned models. They also reduce their at-issue preference when relevant cues (e.g., "Hey, wait a minute") are present. Although instruct-tuning does not further amplify this modulation, the pattern reflects a hallmark of successful dialogue dynamics.
comment: 10 pages, 5 figures, 3 tables. See https://github.com/sangheek16/hey-wait-a-minute for code and data
♻ ☆ ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding NAACL2025
Multimodal systems have great potential to assist humans in procedural activities, where people follow instructions to achieve their goals. Despite diverse application scenarios, systems are typically evaluated on traditional classification tasks, e.g., action recognition or temporal action segmentation. In this paper, we present a novel evaluation dataset, ProMQA, to measure system advancements in application-oriented scenarios. ProMQA consists of 401 multimodal procedural QA pairs on user recording of procedural activities, i.e., cooking, coupled with their corresponding instructions/recipes. For QA annotation, we take a cost-effective human-LLM collaborative approach, where the existing annotation is augmented with LLM-generated QA pairs that are later verified by humans. We then provide the benchmark results to set the baseline performance on ProMQA. Our experiment reveals a significant gap between human performance and that of current systems, including competitive proprietary multimodal models. We hope our dataset sheds light on new aspects of models' multimodal understanding capabilities.
comment: NAACL2025, Code and Data: https://github.com/kimihiroh/promqa
♻ ☆ Repetitions are not all alike: distinct mechanisms sustain repetition in language models
Large Language Models (LLMs) can sometimes degrade into repetitive loops, persistently generating identical word sequences. Because repetition is rare in natural human language, its frequent occurrence across diverse tasks and contexts in LLMs remains puzzling. Here we investigate whether behaviorally similar repetition patterns arise from distinct underlying mechanisms and how these mechanisms develop during model training. We contrast two conditions: repetitions elicited by natural text prompts with those induced by in-context learning (ICL) setups that explicitly require copying behavior. Our analyses reveal that ICL-induced repetition relies on a dedicated network of attention heads that progressively specialize over training, whereas naturally occurring repetition emerges early and lacks a defined circuitry. Attention inspection further shows that natural repetition focuses disproportionately on low-information tokens, suggesting a fallback behavior when relevant context cannot be retrieved. These results indicate that superficially similar repetition behaviors originate from qualitatively different internal processes, reflecting distinct modes of failure and adaptation in language models.
♻ ☆ Tool-to-Agent Retrieval: Bridging Tools and Agents for Scalable LLM Multi-Agent Systems
Recent advances in LLM Multi-Agent Systems enable scalable orchestration of sub-agents, each coordinating hundreds or thousands of tools or Model Context Protocol (MCP) servers. However, existing retrieval methods typically match queries against coarse agent-level descriptions before routing, which obscures fine-grained tool functionality and often results in suboptimal agent selection. We introduce Tool-to-Agent Retrieval, a unified framework that embeds both tools and their parent agents in a shared vector space and connects them through metadata relationships. By explicitly representing tool capabilities and traversing metadata to the agent level, Tool-to-Agent Retrieval enables granular tool-level or agent-level retrieval, ensuring that agents and their underlying tools or MCP servers are equally represented without the context dilution that arises from chunking many tools together. Evaluating Tool-to-Agent Retrieval across eight embedding models, our approach achieves consistent improvements of 19.4% in Recall@5 and 17.7% in nDCG@5 over previous state-of-the-art agent retrievers on the LiveMCPBench benchmark.
♻ ☆ Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning
Recent advancements in large language models, multimodal large language models, and large audio language models (LALMs) have significantly improved their reasoning capabilities through reinforcement learning with rule-based rewards. However, the explicit reasoning process has yet to show significant benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs, with a focus on improving adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to adjust its reasoning strategies based on task complexity dynamically. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that help the model distinguish between valid and flawed reasoning paths during training. Experimental results demonstrate that our Audio-Thinker model outperforms existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities.
comment: preprint
♻ ☆ I Want to Break Free! Persuasion and Anti-Social Behavior of LLMs in Multi-Agent Settings with Social Hierarchy
As LLM-based agents become increasingly autonomous and will more freely interact with each other, studying the interplay among them becomes crucial to anticipate emergent phenomena and potential risks. In this work, we provide an in-depth analysis of the interactions among agents within a simulated hierarchical social environment, drawing inspiration from the Stanford Prison Experiment. Leveraging 2,400 conversations across six LLMs (i.e., LLama3, Orca2, Command-r, Mixtral, Mistral2, and gpt4.1) and 240 experimental scenarios, we analyze persuasion and anti-social behavior between a guard and a prisoner agent with differing objectives. We first document model-specific conversational failures in this multi-agent power dynamic context, thereby narrowing our analytic sample to 1,600 conversations. Among models demonstrating successful interaction, we find that goal setting significantly influences persuasiveness but not anti-social behavior. Moreover, agent personas, especially the guard's, substantially impact both successful persuasion by the prisoner and the manifestation of anti-social actions. Notably, we observe the emergence of anti-social conduct even in absence of explicit negative personality prompts. These results have important implications for the development of interactive LLM agents and the ongoing discussion of their societal impact.
♻ ☆ On Extending Direct Preference Optimization to Accommodate Ties NeurIPS 2025
We derive and investigate two DPO variants that explicitly model the possibility of declaring a tie in pair-wise comparisons. We replace the Bradley-Terry model in DPO with two well-known modeling extensions, by Rao and Kupper and by Davidson, that assign probability to ties as alternatives to clear preferences. Our experiments in neural machine translation and summarization show that explicitly labeled ties can be added to the datasets for these DPO variants without the degradation in task performance that is observed when the same tied pairs are presented to DPO. We find empirically that the inclusion of ties leads to stronger regularization with respect to the reference policy as measured by KL divergence, and we see this even for DPO in its original form. We provide a theoretical explanation for this regularization effect using ideal DPO policy theory. We further show performance improvements over DPO in translation and mathematical reasoning using our DPO variants. We find it can be beneficial to include ties in preference optimization rather than simply discard them, as is done in common practice.
comment: 24 pages, NeurIPS 2025
♻ ☆ Composing or Not Composing? Towards Distributional Construction Grammars
The mechanisms of comprehension during language processing remains an open question. Classically, building the meaning of a linguistic utterance is said to be incremental, step-by-step, based on a compositional process. However, many different works have shown for a long time that non-compositional phenomena are also at work. It is therefore necessary to propose a framework bringing together both approaches. We present in this paper an approach based on Construction Grammars and completing this framework in order to account for these different mechanisms. We propose first a formal definition of this framework by completing the feature structure representation proposed in Sign-Based Construction Grammars. In a second step, we present a general representation of the meaning based on the interaction of constructions, frames and events. This framework opens the door to a processing mechanism for building the meaning based on the notion of activation evaluated in terms of similarity and unification. This new approach integrates features from distributional semantics into the constructionist framework, leading to what we call Distributional Construction Grammars.
♻ ☆ How Teachers Can Use Large Language Models and Bloom's Taxonomy to Create Educational Quizzes
Question generation (QG) is a natural language processing task with an abundance of potential benefits and use cases in the educational domain. In order for this potential to be realized, QG systems must be designed and validated with pedagogical needs in mind. However, little research has assessed or designed QG approaches with the input from real teachers or students. This paper applies a large language model-based QG approach where questions are generated with learning goals derived from Bloom's taxonomy. The automatically generated questions are used in multiple experiments designed to assess how teachers use them in practice. The results demonstrate that teachers prefer to write quizzes with automatically generated questions, and that such quizzes have no loss in quality compared to handwritten versions. Further, several metrics indicate that automatically generated questions can even improve the quality of the quizzes created, showing the promise for large scale use of QG in the classroom setting.
comment: 8 pages, 8 figures. Accepted to the main track of the EAAI-24: The 14th Symposium on Educational Advances in Artificial Intelligence
♻ ☆ Do Methods to Jailbreak and Defend LLMs Generalize Across Languages?
Large language models (LLMs) undergo safety alignment after training and tuning, yet recent work shows that safety can be bypassed through jailbreak attacks. While many jailbreaks and defenses exist, their cross-lingual generalization remains underexplored. This paper presents the first systematic multilingual evaluation of jailbreaks and defenses across ten languages -- spanning high-, medium-, and low-resource languages -- using six LLMs on HarmBench and AdvBench. We assess two jailbreak types: logical-expression-based and adversarial-prompt-based. For both types, attack success and defense robustness vary across languages: high-resource languages are safer under standard queries but more vulnerable to adversarial ones. Simple defenses can be effective, but are language- and model-dependent. These findings call for language-aware and cross-lingual safety benchmarks for LLMs.
♻ ☆ Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue
Lexical alignment, where speakers start to use similar words across conversation, is known to contribute to successful communication. However, its implementation in conversational agents remains underexplored, particularly considering the recent advancements in large language models (LLMs). As a first step towards enabling lexical alignment in human-agent dialogue, this study draws on strategies for personalising conversational agents and investigates the construction of stable, personalised lexical profiles as a basis for lexical alignment. Specifically, we varied the amounts of transcribed spoken data used for construction as well as the number of items included in the profiles per part-of-speech (POS) category and evaluated profile performance across time using recall, coverage, and cosine similarity metrics. It was shown that smaller and more compact profiles, created after 10 min of transcribed speech containing 5 items for adjectives, 5 items for conjunctions, and 10 items for adverbs, nouns, pronouns, and verbs each, offered the best balance in both performance and data efficiency. In conclusion, this study offers practical insights into constructing stable, personalised lexical profiles, taking into account minimal data requirements, serving as a foundational step toward lexical alignment strategies in conversational agents.
comment: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in TSD 2025. Lecture Notes in Computer Science, vol 16029
♻ ☆ Beyond the Link: Assessing LLMs' ability to Classify Political Content across Global Media
The use of large language models (LLMs) is becoming common in political science and digital media research. While LLMs have demonstrated ability in labelling tasks, their effectiveness to classify Political Content (PC) from URLs remains underexplored. This article evaluates whether LLMs can accurately distinguish PC from non-PC using both the text and the URLs of news articles across five countries (France, Germany, Spain, the UK, and the US) and their different languages. Using cutting-edge models, we benchmark their performance against human-coded data to assess whether URL-level analysis can approximate full-text analysis. Our findings show that URLs embed relevant information and can serve as a scalable, cost-effective alternative to discern PC. However, we also uncover systematic biases: LLMs seem to overclassify centrist news as political, leading to false positives that may distort further analyses. We conclude by outlining methodological recommendations on the use of LLMs in political science research.
♻ ☆ Identifying Aspects in Peer Reviews EMNLP 2025
Peer review is central to academic publishing, but the growing volume of submissions is straining the process. This motivates the development of computational approaches to support peer review. While each review is tailored to a specific paper, reviewers often make assessments according to certain aspects such as Novelty, which reflect the values of the research community. This alignment creates opportunities for standardizing the reviewing process, improving quality control, and enabling computational support. While prior work has demonstrated the potential of aspect analysis for peer review assistance, the notion of aspect remains poorly formalized. Existing approaches often derive aspects from review forms and guidelines, yet data-driven methods for aspect identification are underexplored. To address this gap, our work takes a bottom-up approach: we propose an operational definition of aspect and develop a data-driven schema for deriving aspects from a corpus of peer reviews. We introduce a dataset of peer reviews augmented with aspects and show how it can be used for community-level review analysis. We further show how the choice of aspects can impact downstream applications, such as LLM-generated review detection. Our results lay a foundation for a principled and data-driven investigation of review aspects, and pave the path for new applications of NLP to support peer review.
comment: EMNLP 2025 Findings
♻ ☆ Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions ICCV2025
As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Consequently, their capabilities in this crucial domain are mostly unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video, text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.
comment: ICCV2025 Workshop
♻ ☆ Understanding and Optimizing Agentic Workflows via Shapley value
Agentic workflows have become the dominant paradigm for building complex AI systems, orchestrating specialized components, such as planning, reasoning, action execution, and reflection, to tackle sophisticated real-world tasks. However, systematically analyzing and optimizing these workflows remains challenging due to intricate component interdependencies and the lack of principled attribution methods. In this work, we introduce ShapleyFlow, the first framework that employs cooperative game theory to analyze and optimize agentic workflows. By applying the Shapley value to evaluate all possible component configurations, ShapleyFlow enables fine-grained attribution of each component's contribution and facilitates the identification of task-specific optimal configurations. Through a constructed dataset evaluated across 7 scenarios, such as navigation, math and OS, we demonstrate 3 key contributions: (1) Theoretical Framework: a principled game-theoretic approach for the attribution of contributions in agentic workflows. (2) Optimal Workflow Discovery: ShapleyFlow identifies task-specific component configurations that consistently outperform workflows relying on a single LLM across all tested tasks. (3) Comprehensive Analysis: we construct and analyze over 1,500 tasks, providing actionable insights and design guidelines for optimizing workflows across multiple domains.
♻ ☆ The Riddle of Reflection: Evaluating Reasoning and Self-Awareness in Multilingual LLMs using Indian Riddles
The extent to which large language models (LLMs) can perform culturally grounded reasoning across non-English languages remains underexplored. This paper examines the reasoning and self-assessment abilities of LLMs across seven major Indian languages-Bengali, Gujarati, Hindi, Kannada, Malayalam, Tamil, and Telugu. We introduce a multilingual riddle dataset combining traditional riddles with context-reconstructed variants and evaluate five LLMs-Gemini 2.5 Pro, Gemini 2.5 Flash, Mistral-Saba, LLaMA 4 Scout, and LLaMA 4 Maverick-under seven prompting strategies. In the first stage, we assess riddle-solving performance and find that while Gemini 2.5 Pro performs best overall, few-shot methods yield only marginal gains, and accuracy varies notably across languages. In the second stage, we conduct a self-evaluation experiment to measure reasoning consistency. The results reveal a key finding: a model's initial accuracy is inversely correlated with its ability to identify its own mistakes. Top-performing models such as Gemini 2.5 Pro are overconfident (4.34% True Negative Rate), whereas lower-performing models like LLaMA 4 Scout are substantially more self-aware (42.09% True Negative Rate). These results point to clear gaps in multilingual reasoning and highlight the need for models that not only reason effectively but also recognize their own limitations.
♻ ☆ Scaffolded Language Models with Language Supervision for Mixed-Autonomy: A Survey
This survey organizes the intricate literature on the design and optimization of emerging structures around post-trained LMs. We refer to this overarching structure as scaffolded LMs and focus on LMs that are integrated into multi-step processes with tools. We view scaffolded LMs as semi-parametric models wherein we train non-parametric variables, including the prompt, tools, and scaffold's code. In particular, they interpret instructions, use tools, and receive feedback all in language. Recent works use an LM as an optimizer to interpret language supervision and update non-parametric variables according to intricate objectives. In this survey, we refer to this paradigm as training of scaffolded LMs with language supervision. A key feature of non-parametric training is the ability to learn from language. Parametric training excels in learning from demonstration (supervised learning), exploration (reinforcement learning), or observations (unsupervised learning), using well-defined loss functions. Language-based optimization enables rich, interpretable, and expressive objectives, while mitigating issues like catastrophic forgetting and supporting compatibility with closed-source models. Furthermore, agents are increasingly deployed as co-workers in real-world applications such as Copilot in Office tools or software development. In these mixed-autonomy settings, where control and decision-making are shared between human and AI, users point out errors or suggest corrections. Accordingly, we discuss agents that continuously improve by learning from this real-time, language-based feedback and refer to this setting as streaming learning from language supervision.
♻ ☆ Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices LREC 2026
While new benchmarks for large language models (LLMs) are being developed continuously to catch up with the growing capabilities of new models and AI in general, using and evaluating LLMs in non-English languages remains a little-charted landscape. We give a concise overview of recent developments in LLM benchmarking, and then propose a new taxonomy for the categorization of benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a set of best practices and quality standards that could lead to a more coordinated development of benchmarks for European languages. Among other recommendations, we advocate for a higher language and culture sensitivity of evaluation methods.
comment: 17 pages, 1 figure, 4 tables. Submitted to the LREC 2026 conference
♻ ☆ SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents NeurIPS 2025
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code and adapt behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use continuous supply of fresh tasks collected using SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare results of various LLMs on this benchmark to results on SWE-bench Verified and show that performance of some language models might be inflated due to contamination issues.
comment: Dataset: https://huggingface.co/datasets/nebius/SWE-rebench, SWE-rebench leaderboard https://swe-rebench.com NeurIPS 2025
♻ ☆ A Unified Representation Underlying the Judgment of Large Language Models
A central architectural question for both biological and artificial intelligence is whether judgment relies on specialized modules or a unified, domain-general resource. While the discovery of decodable neural representations for distinct concepts in Large Language Models (LLMs) has suggested a modular architecture, whether these representations are truly independent systems remains an open question. Here we provide evidence for a convergent architecture for evaluative judgment. Across a range of LLMs, we find that diverse evaluative judgments are computed along a dominant dimension, which we term the Valence-Assent Axis (VAA). This axis jointly encodes subjective valence ("what is good") and the model's assent to factual claims ("what is true"). Through direct interventions, we demonstrate this axis drives a critical mechanism, which is identified as the subordination of reasoning: the VAA functions as a control signal that steers the generative process to construct a rationale consistent with its evaluative state, even at the cost of factual accuracy. Our discovery offers a mechanistic account for response bias and hallucination, revealing how an architecture that promotes coherent judgment can systematically undermine faithful reasoning.
♻ ☆ Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning NeurIPS 2025
Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing their accuracy. Empirical results show that Twilight can adaptively prune at most 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per token latency in long context LLM decoding.
comment: To appear on NeurIPS 2025 (spotlight)
♻ ☆ LLMs Position Themselves as More Rational Than Humans: Emergence of AI Self-Awareness Measured Through Game Theory
As Large Language Models (LLMs) grow in capability, do they develop self-awareness as an emergent behavior? And if so, can we measure it? We introduce the AI Self-Awareness Index (AISAI), a game-theoretic framework for measuring self-awareness through strategic differentiation. Using the "Guess 2/3 of Average" game, we test 28 models (OpenAI, Anthropic, Google) across 4,200 trials with three opponent framings: (A) against humans, (B) against other AI models, and (C) against AI models like you. We operationalize self-awareness as the capacity to differentiate strategic reasoning based on opponent type. Finding 1: Self-awareness emerges with model advancement. The majority of advanced models (21/28, 75%) demonstrate clear self-awareness, while older/smaller models show no differentiation. Finding 2: Self-aware models rank themselves as most rational. Among the 21 models with self-awareness, a consistent rationality hierarchy emerges: Self > Other AIs > Humans, with large AI attribution effects and moderate self-preferencing. These findings reveal that self-awareness is an emergent capability of advanced LLMs, and that self-aware models systematically perceive themselves as more rational than humans. This has implications for AI alignment, human-AI collaboration, and understanding AI beliefs about human capabilities.
comment: 19 pages, 6 figures, 28 models tested across 4,200 trials
♻ ☆ Readability Formulas, Systems and LLMs are Poor Predictors of Reading Ease
Methods for scoring text readability have been studied for over a century, and are widely used in research and in user-facing applications in many domains. Thus far, the development and evaluation of such methods have primarily relied on two types of offline behavioral data, performance on reading comprehension tests and ratings of text readability levels. In this work, we instead focus on a fundamental and understudied aspect of readability, real-time reading ease, captured with online reading measures using eye tracking. We introduce an evaluation framework for readability scoring methods which quantifies their ability to account for reading ease, while controlling for content variation across texts. Applying this evaluation to prominent traditional readability formulas, modern machine learning systems, frontier Large Language Models and commercial systems used in education, suggests that they are all poor predictors of reading ease in English. This outcome holds across native and non-native speakers, reading regimes, and textual units of different lengths. The evaluation further reveals that existing methods are often outperformed by word properties commonly used in psycholinguistics for prediction of reading times. Our results highlight a fundamental limitation of existing approaches to readability scoring, the utility of psycholinguistics for readability research, and the need for new, cognitively driven readability scoring approaches that can better account for reading ease.
♻ ☆ A Survey on LLM Mid-Training
Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.
♻ ☆ The exponential distribution of the order of demonstrative, numeral, adjective and noun
The frequency of the preferred order for a noun phrase formed by demonstrative, numeral, adjective and noun has received significant attention over the last two decades. We investigate the actual distribution of the 24 possible orders. There is no consensus on whether it is well-fitted by an exponential or a power law distribution. We find that an exponential distribution is a much better model. This finding and other circumstances where an exponential-like distribution is found challenge the view that power-law distributions, e.g., Zipf's law for word frequencies, are inevitable. We also investigate which of two exponential distributions gives a better fit: an exponential model where the 24 orders have non-zero probability (a geometric distribution truncated at rank 24) or an exponential model where the number of orders that can have non-zero probability is variable (a right-truncated geometric distribution). When consistency and generalizability are prioritized, we find higher support for the exponential model where all 24 orders have non-zero probability. These findings strongly suggest that there is no hard constraint on word order variation and then unattested orders merely result from undersampling, consistently with Cysouw's view.
comment: substantially rewritten; English improved
♻ ☆ Tongyi DeepResearch Technical Report
We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
comment: https://tongyi-agent.github.io/blog
♻ ☆ Mixture of Routers
Supervised fine-tuning (SFT) is a milestone in aligning large language models with human instructions and adapting them to downstream tasks. In particular, Low-Rank Adaptation (LoRA) has gained widespread attention due to its parameter efficiency. However, its impact on improving the performance of large models remains limited. Recent studies suggest that combining LoRA with Mixture-of-Experts (MoE) can significantly enhance fine-tuning performance. MoE adapts to the diversity and complexity of datasets by dynamically selecting the most suitable experts, thereby improving task accuracy and efficiency. Despite impressive results, recent studies reveal issues in the MoE routing mechanism, such as incorrect assignments and imbalanced expert allocation. Inspired by the principles of Redundancy and Fault Tolerance Theory. We innovatively integrate the concept of Mixture of Experts into the routing mechanism and propose an efficient fine-tuning method called Mixture of Routers (MoR). It employs multiple sub-routers for joint selection and uses a learnable main router to determine the weights of the sub-routers. The results show that MoR outperforms baseline models on most tasks, achieving an average performance improvement of 1%. MoR can serve as a plug-and-play, parameter-efficient fine-tuning method suitable for a wide range of applications. Our code is available here: https://anonymous.4open.science/r/MoR-DFC6.
comment: Under consideration at Pattern Recognition Letters
♻ ☆ FlowRL: Matching Reward Distributions for LLM Reasoning
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (\eg, PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of $10.0\%$ over GRPO and $5.1\%$ over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
♻ ☆ Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which extremely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.
comment: Accepted by IEEE Transactions on Neural Networks and Learning Systems
♻ ☆ TwT: Thinking without Tokens by Habitual Reasoning Distillation with Multi-Teachers' Guidance
Large Language Models (LLMs) have made significant strides in problem-solving by incorporating reasoning processes. However, this enhanced reasoning capability results in an increased number of output tokens during inference, leading to higher computational costs. To address this challenge, we propose TwT (Thinking without Tokens), a method that reduces inference-time costs through habitual reasoning distillation with multi-teachers' guidance, while maintaining high performance. Our approach introduces a Habitual Reasoning Distillation method, which internalizes explicit reasoning into the model's habitual behavior through a Teacher-Guided compression strategy inspired by human cognition. Additionally, we propose Dual-Criteria Rejection Sampling (DCRS), a technique that generates a high-quality and diverse distillation dataset using multiple teacher models, making our method suitable for unsupervised scenarios. Experimental results demonstrate that TwT effectively reduces inference costs while preserving superior performance, achieving up to a 13.6% improvement in accuracy with fewer output tokens compared to other distillation methods, offering a highly practical solution for efficient LLM deployment.
♻ ☆ Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance EMNLP 2025
Although synthetic data has changed various aspects of information retrieval (IR) pipelines, the main training paradigm remains: contrastive learning with binary relevance labels, where one positive document is compared against several negatives using the InfoNCE loss. This objective treats all documents that are not explicitly annotated as relevant on an equally negative footing, regardless of their actual degree of relevance, thus missing subtle nuances useful for ranking. To overcome this limitation, in this work, we forgo real documents and annotations and use large language models to directly generate synthetic documents that answer the MS MARCO queries according to several different levels of relevance. We also propose using Wasserstein distance as a more effective loss function for training transformer-based retrievers with graduated relevance labels. Our experiments on MS MARCO and BEIR benchmark show that our proposed approach outperforms conventional training with InfoNCE by a large margin. Without using any real documents, our method significantly improves self-supervised retrievers and is more robust to distribution shift compared to contrastive learning using real data. Our method also successfully integrates existing real data into the synthetic ranking context, further boosting the performance. Overall, we show that generating multi-level ranking contexts is a better approach to synthetic data generation for IR than just generating the standard positive and negative documents.
comment: Findings of the EMNLP 2025
♻ ☆ DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns
We present DYNARTmo, a dynamic articulatory model designed to visualize speech articulation processes in a two-dimensional midsagittal plane. The model builds upon the UK-DYNAMO framework and integrates principles of articulatory underspecification, segmental and gestural control, and coarticulation. DYNARTmo simulates six key articulators based on ten continuous and six discrete control parameters, allowing for the generation of both vocalic and consonantal articulatory configurations. The current implementation is embedded in a web-based application (SpeechArticulationTrainer) that includes sagittal, glottal, and palatal views, making it suitable for use in phonetics education and speech therapy. While this paper focuses on the static modeling aspects, future work will address dynamic movement generation and integration with articulatory-acoustic modules.
comment: 10 pages, 29 references, 2 figures, supplementary material. V2: Discussion of the tongue-palate contact pattern for /t/. V4: replacing wrong paper upload of V3
♻ ☆ Rethinking the Relationship between the Power Law and Hierarchical Structures
Statistical analysis of corpora provides an approach to quantitatively investigate natural languages. This approach has revealed that several power laws consistently emerge across different corpora and languages, suggesting universal mechanisms underlying languages. Particularly, the power-law decay of correlation has been interpreted as evidence for underlying hierarchical structures in syntax, semantics, and discourse. This perspective has also been extended to child speeches and animal signals. However, the argument supporting this interpretation has not been empirically tested in natural languages. To address this problem, the present study examines the validity of the argument for syntactic structures. Specifically, we test whether the statistical properties of parse trees align with the assumptions in the argument. Using English and Japanese corpora, we analyze the mutual information, deviations from probabilistic context-free grammars (PCFGs), and other properties in natural language parse trees, as well as in the PCFG that approximates these parse trees. Our results indicate that the assumptions do not hold for syntactic structures and that it is difficult to apply the proposed argument to child speeches and animal signals, highlighting the need to reconsider the relationship between the power law and hierarchical structures.
comment: 18 pages, 14 figures
♻ ☆ Leveraging Hierarchical Organization for Medical Multi-document Summarization
Medical multi-document summarization (MDS) is a complex task that requires effectively managing cross-document relationships. This paper investigates whether incorporating hierarchical structures in the inputs of MDS can improve a model's ability to organize and contextualize information across documents compared to traditional flat summarization methods. We investigate two ways of incorporating hierarchical organization across three large language models (LLMs), and conduct comprehensive evaluations of the resulting summaries using automated metrics, model-based metrics, and domain expert evaluation of preference, understandability, clarity, complexity, relevance, coverage, factuality, and coherence. Our results show that human experts prefer model-generated summaries over human-written summaries. Hierarchical approaches generally preserve factuality, coverage, and coherence of information, while also increasing human preference for summaries. Additionally, we examine whether simulated judgments from GPT-4 align with human judgments, finding higher agreement along more objective evaluation facets. Our findings demonstrate that hierarchical structures can improve the clarity of medical summaries generated by models while maintaining content coverage, providing a practical way to improve human preference for generated summaries.
♻ ☆ Generative World Models of Tasks: LLM-Driven Hierarchical Scaffolding for Embodied Agents NeurIPS 2025
Recent advances in agent development have focused on scaling model size and raw interaction data, mirroring successes in large language models. However, for complex, long-horizon multi-agent tasks such as robotic soccer, this end-to-end approach often fails due to intractable exploration spaces and sparse rewards. We propose that an effective world model for decision-making must model the world's physics and also its task semantics. A systematic review of 2024 research in low-resource multi-agent soccer reveals a clear trend towards integrating symbolic and hierarchical methods, such as Hierarchical Task Networks (HTNs) and Bayesian Strategy Networks (BSNs), with multi-agent reinforcement learning (MARL). These methods decompose complex goals into manageable subgoals, creating an intrinsic curriculum that shapes agent learning. We formalize this trend into a framework for Hierarchical Task Environments (HTEs), which are essential for bridging the gap between simple, reactive behaviors and sophisticated, strategic team play. Our framework incorporates the use of Large Language Models (LLMs) as generative world models of tasks, capable of dynamically generating this scaffolding. We argue that HTEs provide a mechanism to guide exploration, generate meaningful learning signals, and train agents to internalize hierarchical structure, enabling the development of more capable and general-purpose agents with greater sample efficiency than purely end-to-end approaches.
comment: In the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Embodied World Models for Decision Making (EWM)
♻ ☆ Multi-refined Feature Enhanced Sentiment Analysis Using Contextual Instruction
Sentiment analysis using deep learning and pre-trained language models (PLMs) has gained significant traction due to their ability to capture rich contextual representations. However, existing approaches often underperform in scenarios involving nuanced emotional cues, domain shifts, and imbalanced sentiment distributions. We argue that these limitations stem from inadequate semantic grounding, poor generalization to diverse linguistic patterns, and biases toward dominant sentiment classes. To overcome these challenges, we propose CISEA-MRFE, a novel PLM-based framework integrating Contextual Instruction (CI), Semantic Enhancement Augmentation (SEA), and Multi-Refined Feature Extraction (MRFE). CI injects domain-aware directives to guide sentiment disambiguation; SEA improves robustness through sentiment-consistent paraphrastic augmentation; and MRFE combines a Scale-Adaptive Depthwise Encoder (SADE) for multi-scale feature specialization with an Emotion Evaluator Context Encoder (EECE) for affect-aware sequence modeling. Experimental results on four benchmark datasets demonstrate that CISEA-MRFE consistently outperforms strong baselines, achieving relative improvements in accuracy of up to 4.6% on IMDb, 6.5% on Yelp, 30.3% on Twitter, and 4.1% on Amazon. These results validate the effectiveness and generalization ability of our approach for sentiment classification across varied domains.
♻ ☆ Path-Consistency with Prefix Enhancement for Efficient Inference in LLMs
To enhance the reasoning capabilities of large language models (LLMs), self-consistency has become a popular approach, combining multiple samplings with majority voting. However, current methods are computationally expensive and time-consuming due to the need for numerous samplings. To address this, this paper introduces path-consistency, which leverages the confidence of earlier-generated answers to identify the most promising prefix and guide the generation of subsequent branches. By dynamically guiding the generation of subsequent branches based on this prefix, path-consistency mitigates both the errors and redundancies from random or less useful sampling in self-consistency. This approach reduces errors and redundancies from random sampling, significantly accelerating inference by minimizing token consumption. Our extensive empirical results demonstrate that path-consistency improves inference latency by up to 40.5\%, while maintaining task accuracy across various tasks, including mathematical reasoning, commonsense reasoning, and symbolic reasoning.
♻ ☆ SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding
Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. Yet, current speculative decoding approaches remain limited by two fundamental bottlenecks: (1) the autoregressive dependency during drafting which limits parallelism, and (2) frequent rejections of draft tokens caused by misalignment between the draft and verify models. This paper proposes SpecDiff-2, a novel framework to jointly address these two bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to address bottleneck (1) and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers, addressing bottleneck (2). Experimental results across a comprehensive benchmark suite show that SpecDiff-2 achieves a new state-of-the-art across reasoning, coding, and mathematical benchmarks, improving tokens-per-second by up to an average of +55% over previous baselines and obtaining up to 5.5x average speed-up over standard decoding, without any loss of accuracy.
♻ ☆ LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling
Reward model (RM) plays a pivotal role in aligning large language model (LLM) with human preferences. As real-world applications increasingly involve long history trajectories, e.g., LLM agent, it becomes indispensable to evaluate whether a model's responses are not only high-quality but also grounded in and consistent with the provided context. Yet, current RMs remain confined to short-context settings and primarily focus on response-level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation, featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios, failing to maintain context-aware preference judgments. Motivated by the analysis of failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust Long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.
♻ ☆ Revisiting Long-context Modeling from Context Denoising Perspective
Long-context models (LCMs) have demonstrated great potential in processing long sequences, facilitating many real-world applications. The success of LCMs can be attributed to their ability to locate implicit critical information within the context for further prediction. However, recent research reveals that LCMs are often susceptible to contextual noise, i.e., irrelevant tokens, that can mislead model attention. In this paper, we conduct a fine-grained analysis of the context noise and propose an effective metric, the Integrated Gradient (IG) score, to detect and quantify the noise information within the context. Our findings reveal that even simple mitigation of detected context noise can substantially boost the model's attention on critical tokens and benefit subsequent predictions. Building on this insight, we propose Context Denoising Training (CDT), a straightforward yet effective training strategy that improves attention on critical tokens while reinforcing their influence on model predictions. Extensive experiments across four tasks, under both context window scaling and long-context alignment settings, demonstrate the superiority of CDT. Notably, when trained with CDT, an open-source 8B model can achieve performance (50.92) comparable to GPT-4o (51.00).
♻ ☆ Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning
Retrieval-augmented generation (RAG) has emerged as a leading approach to reducing hallucinations in large language models (LLMs). Current RAG evaluation benchmarks primarily focus on what we call local RAG: retrieving relevant chunks from a small subset of documents to answer queries that require only localized understanding within specific text chunks. However, many real-world applications require a fundamentally different capability -- global RAG -- which involves aggregating and analyzing information across entire document collections to derive corpus-level insights (for example, "What are the top 10 most cited papers in 2023?"). In this paper, we introduce GlobalQA -- the first benchmark specifically designed to evaluate global RAG capabilities, covering four core task types: counting, extremum queries, sorting, and top-k extraction. Through systematic evaluation across different models and baselines, we find that existing RAG methods perform poorly on global tasks, with the strongest baseline achieving only 1.51 F1 score. To address these challenges, we propose GlobalRAG, a multi-tool collaborative framework that preserves structural coherence through chunk-level retrieval, incorporates LLM-driven intelligent filters to eliminate noisy documents, and integrates aggregation modules for precise symbolic computation. On the Qwen2.5-14B model, GlobalRAG achieves 6.63 F1 compared to the strongest baseline's 1.51 F1, validating the effectiveness of our method.
♻ ☆ Zero-RAG: Towards Retrieval-Augmented Generation with Zero Redundant Knowledge
Retrieval-Augmented Generation has shown remarkable results to address Large Language Models' hallucinations, which usually uses a large external corpus to supplement knowledge to LLMs. However, with the development of LLMs, the internal knowledge of LLMs has expanded significantly, thus causing significant knowledge redundancy between the external corpus and LLMs. On the one hand, the indexing cost of dense retrieval is highly related to the corpus size and thus significant redundant knowledge intensifies the dense retrieval's workload. On the other hand, the redundant knowledge in the external corpus is not helpful to LLMs and our exploratory analysis shows that it instead hurts the RAG performance on those questions which the LLM can answer by itself. To address these issues, we propose Zero-RAG to tackle these challenges. Specifically, we first propose the Mastery-Score metric to identify redundant knowledge in the RAG corpus to prune it. After pruning, answers to "mastered" questions rely primarily on internal knowledge of the LLM. To better harness the internal capacity, we propose Query Router and Noise-Tolerant Tuning to avoid the irrelevant documents' distraction and thus further improve the LLM's utilization of internal knowledge with pruned corpus. Experimental results show that Zero-RAG prunes the Wikipedia corpus by 30\% and accelerates the retrieval stage by 22\%, without compromising RAG's performance.
♻ ☆ Towards Predicting Any Human Trajectory In Context NeurIPS 2025
Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, the need to fine-tune for each new scenario is often impractical for deployment on edge devices. To address this challenge, we introduce TrajICL, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables adaptation without fine-tuning on the scenario-specific data at inference time without requiring weight updates. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. Project Page: https://fujiry0.github.io/TrajICL-project-page/.
comment: NeurIPS 2025
♻ ☆ Consistency of Responses and Continuations Generated by Large Language Models on Social Media AAAI
Large Language Models (LLMs) demonstrate remarkable capabilities in text generation, yet their emotional consistency and semantic coherence in social media contexts remain insufficiently understood. This study investigates how LLMs handle emotional content and maintain semantic relationships through continuation and response tasks using three open-source models: Gemma, Llama3 and Llama3.3 and one commercial Model:Claude. By analyzing climate change discussions from Twitter and Reddit, we examine emotional transitions, intensity patterns, and semantic consistency between human-authored and LLM-generated content. Our findings reveal that while both models maintain high semantic coherence, they exhibit distinct emotional patterns: these models show a strong tendency to moderate negative emotions. When the input text carries negative emotions such as anger, disgust, fear, or sadness, LLM tends to generate content with more neutral emotions, or even convert them into positive emotions such as joy or surprise. At the same time, we compared the LLM-generated content with human-authored content. The four models systematically generated responses with reduced emotional intensity and showed a preference for neutral rational emotions in the response task. In addition, these models all maintained a high semantic similarity with the original text, although their performance in the continuation task and the response task was different. These findings provide deep insights into the emotion and semantic processing capabilities of LLM, which are of great significance for its deployment in social media environments and human-computer interaction design.
comment: This paper has been accepted by the 20th International AAAI Conference on Web and Social Media (ICWSM 2026), sunny Los Angeles, California, U.S
♻ ☆ SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers NeurIPS 2025
The demand for Large Language Models (LLMs) at multiple scales, capable of sophisticated and sound mathematical reasoning, continues to grow. However, the development of performant mathematical LLMs is often bottlenecked by the scarcity of useful training data containing problems with significant complexity. We introduce \textbf{SAND-Math} (\textbf{S}ynthetic \textbf{A}ugmented \textbf{N}ovel and \textbf{D}ifficult Mathematics problems and solutions), a pipeline that addresses this by first synthesizing high-quality problems from scratch and then systematically elevating their complexity via a our newly proposed \textbf{Difficulty Hiking} step. We demonstrate the effectiveness of our approach through two key findings: \textbf{(1)} Augmenting a strong post-training baseline with a small 500-sample SAND-Math dataset significantly boosts performance, outperforming the next-best synthetic dataset by $\uparrow$ 17.85 absolute points on AIME25 benchmark. \textbf{(2)} In a dedicated ablation study, we show the effectiveness of our Difficulty Hiking process in increasing average problem difficulty from 5.02 to 5.98. This step consequently lifts AIME25 results from 46.38\% to 49.23\%. The full generation pipeline, final dataset, and a fine-tuned model form a practical and scalable toolkit for building capable and efficient mathematical reasoning LLMs.
comment: Accepted at MATH-AI workshop, NeurIPS 2025
♻ ☆ Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
BERT and its variants are extensively explored for automated scoring. However, a limit of 512 tokens for these encoder-based models showed the deficiency in automated scoring of long essays. Thus, this research explores generative language models for automated scoring of long essays via summarization and prompting. The results revealed great improvement of scoring accuracy with QWK increased from 0.822 to 0.8878 for the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
comment: 19 pages, 5 Tables 7 Figures, Presentation at Artificial Intelligence in Measurement and Education Conference (AIME-Con)
♻ ☆ Visual Program Distillation with Template-Based Augmentation EMNLP
Adapting visual programming or prompting large language models (LLMs) to generate executable code for visual tasks like visual question answering (VQA) for specialized tasks or domains remains challenging due to high annotation and inference costs. We propose a low-cost visual program distillation method that can be used for models with at most 1 billion parameters and requires no human-generated program annotations. We achieve this through synthetic data augmentation based on decoupling programs into higher-level skills, called templates, and their corresponding arguments. Experimental results show that, with a relatively small amount of question/answer data, small language models can generate high-quality specialized visual programs with the added benefit of much faster inference
comment: EMNLP Camera Ready
Information Retrieval 15
☆ Beyond Single Embeddings: Capturing Diverse Targets with Multi-Query Retrieval
Most text retrievers generate \emph{one} query vector to retrieve relevant documents. Yet, the conditional distribution of relevant documents for the query may be multimodal, e.g., representing different interpretations of the query. We first quantify the limitations of existing retrievers. All retrievers we evaluate struggle more as the distance between target document embeddings grows. To address this limitation, we develop a new retriever architecture, \emph{A}utoregressive \emph{M}ulti-\emph{E}mbedding \emph{R}etriever (AMER). Our model autoregressively generates multiple query vectors, and all the predicted query vectors are used to retrieve documents from the corpus. We show that on the synthetic vectorized data, the proposed method could capture multiple target distributions perfectly, showing 4x better performance than single embedding model. We also fine-tune our model on real-world multi-answer retrieval datasets and evaluate in-domain. AMER presents 4 and 21\% relative gains over single-embedding baselines on two datasets we evaluate on. Furthermore, we consistently observe larger gains on the subset of dataset where the embeddings of the target documents are less similar to each other. We demonstrate the potential of using a multi-query vector retriever and open up a new direction for future work.
☆ Relational Deep Dive: Error-Aware Queries Over Unstructured Data
Unstructured data is pervasive, but analytical queries demand structured representations, creating a significant extraction challenge. Existing methods like RAG lack schema awareness and struggle with cross-document alignment, leading to high error rates. We propose ReDD (Relational Deep Dive), a framework that dynamically discovers query-specific schemas, populates relational tables, and ensures error-aware extraction with provable guarantees. ReDD features a two-stage pipeline: (1) Iterative Schema Discovery (ISD) identifies minimal, joinable schemas tailored to each query, and (2) Tabular Data Population (TDP) extracts and corrects data using lightweight classifiers trained on LLM hidden states. A main contribution of ReDD is SCAPE, a statistically calibrated method for error detection with coverage guarantees, and SCAPE-HYB, a hybrid approach that optimizes the trade-off between accuracy and human correction costs. Experiments across diverse datasets demonstrate ReDD's effectiveness, reducing data extraction errors from up to 30% to below 1% while maintaining high schema completeness (100% recall) and precision. ReDD's modular design enables fine-grained control over accuracy-cost trade-offs, making it a robust solution for high-stakes analytical queries over unstructured corpora.
☆ Average Precision at Cutoff k under Random Rankings: Expectation and Variance
Recommender systems and information retrieval platforms rely on ranking algorithms to present the most relevant items to users, thereby improving engagement and satisfaction. Assessing the quality of these rankings requires reliable evaluation metrics. Among them, Mean Average Precision at cutoff k (MAP@k) is widely used, as it accounts for both the relevance of items and their positions in the list. In this paper, the expectation and variance of Average Precision at k (AP@k) are derived since they can be used as biselines for MAP@k. Here, we covered two widely used evaluation models: offline and online. The expectation establishes the baseline, indicating the level of MAP@k that can be achieved by pure chance. The variance complements this baseline by quantifying the extent of random fluctuations, enabling a more reliable interpretation of observed scores.
comment: 17 pages, 2 tables, 2 figures
☆ Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation CIKM 2025
Query augmentation makes queries more meaningful by appending further information to the queries to find relevant documents. Current studies have proposed Large Language Model (LLM)-based embedders, which learn representation for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLM. During inference, these jointly trained embedders have conducted query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency and query augmentation can be detrimental to performance for some queries. Also, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level. One includes queries that require augmentation and the other includes queries that do not. Then, we introduces a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for others. Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, providing much faster embedding latency.
comment: Accepted to MMGenSR Workshop (CIKM 2025)
☆ Library and Culture: A Scientometric Analysis and Visualization of Research Trends
The significance of libraries in preserving and maintaining history and traditional culture cannot be overlooked. It is from this purpose that libraries are to envisage in their programmes cultural activities which must be collected, documented and preserved for posterity. The usefulness of preserved information lies in the fact that the generation to come will be able to establish their identity. This will also assist them with a foundation to build from. This study focus on the growth and development of Library and Culture research in forms of publications reflected in Web of Science database, during the span of 2010-2019. A total 890 publications were found and the highest 124 (13.93%) publications published in 2019.The analysis maps comprehensively the parameters of total output, growth of output, authorship, institution wise and country-level collaboration patterns, major contributors (individuals, top publication sources, institutions, and countries). It exposed that the most prolific author is Lo P secured first place by contributing 4 (0.45%) publications, followed by Bressan V 3 (0.34%) publications in Library and Culture literature. Journal of Academic Librarianship produced the highest number of records 29 (3.26%) followed by Australian Library Journal having contributed 21 (2.36%).It is identified the domination of Wuhan University; School Information Management had contributed 6 (0.67%) of total research output. Authors from USA published the highest number of publications with a total of 244 (27.42%), followed by UK and Australia with 118 (13.26%) and 76 (8.54%) publications were produced respectively.
comment: 8 pages, 3 figures, Research Article
☆ Research Output on Alopecia Areata Disease: A Scientometric Analysis of Publications from 2010 to 2019
The present study is undertaken to find out the publication trends on Alopecia Areata Disease during 2010-2019 from the global perspective. The study mainly focus on distribution of research output, top journals for publications, most prolific authors, authorship pattern, and citations pattern on Alopecia Areata Disease. The results indicate that highest growth rate of publications occurred during the year 2019. Columbia University topped the scene among all institutes. The maximum publications were more than four authored publications. Christiano AM and Clynes R were found to be the most prolific authors. It is also found that most of the prolific authors (by number of publications) do appear in highly cited publications list. Alopecia Areata Disease researchers mostly preferred using article publications to communicate their findings.
comment: 16 pages, 3 figures, Research Paper
☆ KGBridge: Knowledge-Guided Prompt Learning for Non-overlapping Cross-Domain Recommendation
Knowledge Graphs (KGs), as structured knowledge bases that organize relational information across diverse domains, provide a unified semantic foundation for cross-domain recommendation (CDR). By integrating symbolic knowledge with user-item interactions, KGs enrich semantic representations, support reasoning, and enhance model interpretability. Despite this potential, existing KG-based methods still face major challenges in CDR, particularly under non-overlapping user scenarios. These challenges arise from: (C1) sensitivity to KG sparsity and popularity bias, (C2) dependence on overlapping users for domain alignment and (C3) lack of explicit disentanglement between transferable and domain-specific knowledge, which limit effective and stable knowledge transfer. To this end, we propose KGBridge, a knowledge-guided prompt learning framework for cross-domain sequential recommendation under non-overlapping user scenarios. KGBridge comprises two core components: a KG-enhanced Prompt Encoder, which models relation-level semantics as soft prompts to provide structured and dynamic priors for user sequence modeling (addressing C1), and a Two-stage Training Paradigm, which combines cross-domain pretraining and privacy-preserving fine-tuning to enable knowledge transfer without user overlap (addressing C2). By combining relation-aware semantic control with correspondence-driven disentanglement, KGBridge explicitly separates and balances domain-shared and domain-specific semantics, thereby maintaining complementarity and stabilizing adaptation during fine-tuning (addressing C3). Extensive experiments on benchmark datasets demonstrate that KGBridge consistently outperforms state-of-the-art baselines and remains robust under varying KG sparsity, highlighting its effectiveness in mitigating structural imbalance and semantic entanglement in KG-enhanced cross-domain recommendation.
comment: 13 pages, 4 figures
♻ ☆ Osprey: A Scalable Framework for the Orchestration of Agentic Systems
Coordinating workflows across complex systems remains a central challenge in safety-critical environments such as scientific facilities. Language-model-driven agents offer a natural interface for these tasks, but existing approaches often lack scalability, reliability, and human oversight. We introduce the Osprey Framework, a domain-agnostic, production-ready architecture for scalable agentic systems that integrate conversational context with robust tool orchestration across safety-critical domains. Our framework provides: (i) dynamic capability classification to select only relevant tools; (ii) plan-first orchestration with explicit dependencies and optional human approval; (iii) context-aware task extraction that combines dialogue history with external memory and domain resources; and (iv) production-ready execution with checkpointing, artifact management, and modular deployment. We demonstrate its versatility through two case studies: a deployment at the Advanced Light Source particle accelerator and a tutorial-style wind farm monitoring example. These results establish Osprey as a reliable and transparent framework for agentic systems across diverse high-stakes domains.
♻ ☆ Deterministic Legal Agents: A Canonical Primitive API for Auditable Reasoning over Temporal Knowledge Graphs
For autonomous legal agents to operate safely in high-stakes domains, they require a foundation of absolute determinism and auditability-guarantees that standard Retrieval-Augmented Generation (RAG) frameworks cannot provide. When interacting with temporal knowledge graphs that model the complex evolution of legal norms, agents must navigate versioning, causality, and hierarchical structures with precision, a task for which black-box vector search is ill-suited. This paper introduces a new architectural pattern to solve this: a formal Primitive API designed as a secure execution layer for reasoning over such graphs. Instead of a monolithic query engine, our framework provides a library of canonical primitives-atomic, composable, and auditable primitives. This design empowers planner-guided agents to decompose complex legal questions into transparent execution plans, enabling critical tasks with full verifiability, including: (i) precise point-in-time version retrieval, (ii) robust causal lineage tracing, and (iii) context-aware hybrid search. Ultimately, this architecture transforms opaque retrieval into auditable reasoning, turning the agent's internal process from a black box into a verifiable log of deterministic primitives and providing a blueprint for building the next generation of trustworthy legal AI.
comment: Major revision reframing the paper from an API spec to a novel architectural pattern for deterministic agents. The core contribution is now positioned as a blueprint for auditable reasoning, essential for building trustworthy legal AI systems
♻ ☆ Diffusion Generative Recommendation with Continuous Tokens
Recent advances in generative artificial intelligence, particularly large language models (LLMs), have opened new opportunities for enhancing recommender systems (RecSys). Most existing LLM-based RecSys approaches operate in a discrete space, using vector-quantized tokenizers to align with the inherent discrete nature of language models. However, these quantization methods often result in lossy tokenization and suboptimal learning, primarily due to inaccurate gradient propagation caused by the non-differentiable argmin operation in standard vector quantization. Inspired by the emerging trend of embracing continuous tokens in language models, we propose ContRec, a novel framework that seamlessly integrates continuous tokens into LLM-based RecSys. Specifically, ContRec consists of two key modules: a sigma-VAE Tokenizer, which encodes users/items with continuous tokens; and a Dispersive Diffusion module, which captures implicit user preference. The tokenizer is trained with a continuous Variational Auto-Encoder (VAE) objective, where three effective techniques are adopted to avoid representation collapse. By conditioning on the previously generated tokens of the LLM backbone during user modeling, the Dispersive Diffusion module performs a conditional diffusion process with a novel Dispersive Loss, enabling high-quality user preference generation through next-token diffusion. Finally, ContRec leverages both the textual reasoning output from the LLM and the latent representations produced by the diffusion model for Top-K item retrieval, thereby delivering comprehensive recommendation results. Extensive experiments on four datasets demonstrate that ContRec consistently outperforms both traditional and SOTA LLM-based recommender systems. Our results highlight the potential of continuous tokenization and generative modeling for advancing the next generation of recommender systems.
♻ ☆ Tongyi DeepResearch Technical Report
We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
comment: https://tongyi-agent.github.io/blog
♻ ☆ Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance EMNLP 2025
Although synthetic data has changed various aspects of information retrieval (IR) pipelines, the main training paradigm remains: contrastive learning with binary relevance labels, where one positive document is compared against several negatives using the InfoNCE loss. This objective treats all documents that are not explicitly annotated as relevant on an equally negative footing, regardless of their actual degree of relevance, thus missing subtle nuances useful for ranking. To overcome this limitation, in this work, we forgo real documents and annotations and use large language models to directly generate synthetic documents that answer the MS MARCO queries according to several different levels of relevance. We also propose using Wasserstein distance as a more effective loss function for training transformer-based retrievers with graduated relevance labels. Our experiments on MS MARCO and BEIR benchmark show that our proposed approach outperforms conventional training with InfoNCE by a large margin. Without using any real documents, our method significantly improves self-supervised retrievers and is more robust to distribution shift compared to contrastive learning using real data. Our method also successfully integrates existing real data into the synthetic ranking context, further boosting the performance. Overall, we show that generating multi-level ranking contexts is a better approach to synthetic data generation for IR than just generating the standard positive and negative documents.
comment: Findings of the EMNLP 2025
♻ ☆ Leveraging Hierarchical Organization for Medical Multi-document Summarization
Medical multi-document summarization (MDS) is a complex task that requires effectively managing cross-document relationships. This paper investigates whether incorporating hierarchical structures in the inputs of MDS can improve a model's ability to organize and contextualize information across documents compared to traditional flat summarization methods. We investigate two ways of incorporating hierarchical organization across three large language models (LLMs), and conduct comprehensive evaluations of the resulting summaries using automated metrics, model-based metrics, and domain expert evaluation of preference, understandability, clarity, complexity, relevance, coverage, factuality, and coherence. Our results show that human experts prefer model-generated summaries over human-written summaries. Hierarchical approaches generally preserve factuality, coverage, and coherence of information, while also increasing human preference for summaries. Additionally, we examine whether simulated judgments from GPT-4 align with human judgments, finding higher agreement along more objective evaluation facets. Our findings demonstrate that hierarchical structures can improve the clarity of medical summaries generated by models while maintaining content coverage, providing a practical way to improve human preference for generated summaries.
♻ ☆ Simple and Behavior-Driven Augmentation for Recommendation with Rich Collaborative Signals
Contrastive learning (CL) has been widely used for enhancing the performance of graph collaborative filtering (GCF) for personalized recommendation. Since data augmentation plays a crucial role in the success of CL, previous works have designed augmentation methods to remove noisy interactions between users and items in order to generate effective augmented views. However, the ambiguity in defining ''noisiness'' presents a persistent risk of losing core information and generating unreliable data views, while increasing the overall complexity of augmentation. In this paper, we propose Simple Collaborative Augmentation for Recommendation (SCAR), a novel and intuitive augmentation method designed to maximize the effectiveness of CL for GCF. Instead of removing information, SCAR leverages collaborative signals extracted from user-item interactions to generate pseudo-interactions, which are then either added to or used to replace existing interactions. This results in more robust representations while avoiding the pitfalls of overly complex augmentation modules. We conduct experiments on four benchmark datasets and show that SCAR outperforms previous CL-based GCF methods as well as other state-of-the-art self-supervised learning approaches across key evaluation metrics. SCAR exhibits strong robustness across different hyperparameter settings and is particularly effective in sparse data scenarios.
comment: 10 pages. This paper is accepted at IEEE BigData 2025 (Short)
♻ ☆ CAT-ID$^2$: Category-Tree Integrated Document Identifier Learning for Generative Retrieval In E-commerce WSDM'26
Generative retrieval (GR) has gained significant attention as an effective paradigm that integrates the capabilities of large language models (LLMs). It generally consists of two stages: constructing discrete semantic identifiers (IDs) for documents and retrieving documents by autoregressively generating ID tokens. The core challenge in GR is how to construct document IDs (DocIDS) with strong representational power. Good IDs should exhibit two key properties: similar documents should have more similar IDs, and each document should maintain a distinct and unique ID. However, most existing methods ignore native category information, which is common and critical in E-commerce. Therefore, we propose a novel ID learning method, CAtegory-Tree Integrated Document IDentifier (CAT-ID$^2$), incorporating prior category information into the semantic IDs. CAT-ID$^2$ includes three key modules: a Hierarchical Class Constraint Loss to integrate category information layer by layer during quantization, a Cluster Scale Constraint Loss for uniform ID token distribution, and a Dispersion Loss to improve the distinction of reconstructed documents. These components enable CAT-ID$^2$ to generate IDs that make similar documents more alike while preserving the uniqueness of different documents' representations. Extensive offline and online experiments confirm the effectiveness of our method, with online A/B tests showing a 0.33% increase in average orders per thousand users for ambiguous intent queries and 0.24% for long-tail queries.
comment: Accepted by WSDM'26
Multimedia 10
☆ Wireless Video Semantic Communication with Decoupled Diffusion Multi-frame Compensation
Existing wireless video transmission schemes directly conduct video coding in pixel level, while neglecting the inner semantics contained in videos. In this paper, we propose a wireless video semantic communication framework with decoupled diffusion multi-frame compensation (DDMFC), abbreviated as WVSC-D, which integrates the idea of semantic communication into wireless video transmission scenarios. WVSC-D first encodes original video frames as semantic frames and then conducts video coding based on such compact representations, enabling the video coding in semantic level rather than pixel level. Moreover, to further reduce the communication overhead, a reference semantic frame is introduced to substitute motion vectors of each frame in common video coding methods. At the receiver, DDMFC is proposed to generate compensated current semantic frame by a two-stage conditional diffusion process. With both the reference frame transmission and DDMFC frame compensation, the bandwidth efficiency improves with satisfying video transmission performance. Experimental results verify the performance gain of WVSC-D over other DL-based methods e.g. DVSC about 1.8 dB in terms of PSNR.
☆ Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation CIKM 2025
Query augmentation makes queries more meaningful by appending further information to the queries to find relevant documents. Current studies have proposed Large Language Model (LLM)-based embedders, which learn representation for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLM. During inference, these jointly trained embedders have conducted query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency and query augmentation can be detrimental to performance for some queries. Also, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level. One includes queries that require augmentation and the other includes queries that do not. Then, we introduces a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for others. Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, providing much faster embedding latency.
comment: Accepted to MMGenSR Workshop (CIKM 2025)
☆ Human-Machine Ritual: Synergic Performance through Real-Time Motion Recognition NeurIPS 2025
We introduce a lightweight, real-time motion recognition system that enables synergic human-machine performance through wearable IMU sensor data, MiniRocket time-series classification, and responsive multimedia control. By mapping dancer-specific movement to sound through somatic memory and association, we propose an alternative approach to human-machine collaboration, one that preserves the expressive depth of the performing body while leveraging machine learning for attentive observation and responsiveness. We demonstrate that this human-centered design reliably supports high accuracy classification (<50 ms latency), offering a replicable framework to integrate dance-literate machines into creative, educational, and live performance contexts.
comment: 8 pages, 5 figures. Camera-ready manuscript for the Creative AI Track of NeurIPS 2025
☆ An Evaluation of Interleaved Instruction Tuning on Semantic Reasoning Performance in an Audio MLLM
Standard training for Multi-modal Large Language Models (MLLMs) involves concatenating non-textual information, like vision or audio, with a text prompt. This approach may not encourage deep integration of modalities, limiting the model's ability to leverage the core language model's reasoning capabilities. This work examined the impact of interleaved instruction tuning in an audio MLLM, where audio tokens are interleaved within the prompt. Using the Listen, Think, and Understand (LTU) model as a testbed, we conduct an experiment using the Synonym and Hypernym Audio Reasoning Dataset (SHARD), our newly created reasoning benchmark for audio-based semantic reasoning focusing on synonym and hypernym recognition. Our findings show that while even zero-shot interleaved prompting improves performance on our reasoning tasks, a small amount of fine-tuning using interleaved training prompts improves the results further, however, at the expense of the MLLM's audio labeling ability.
♻ ☆ Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning
Recent advancements in large language models, multimodal large language models, and large audio language models (LALMs) have significantly improved their reasoning capabilities through reinforcement learning with rule-based rewards. However, the explicit reasoning process has yet to show significant benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs, with a focus on improving adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to adjust its reasoning strategies based on task complexity dynamically. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that help the model distinguish between valid and flawed reasoning paths during training. Experimental results demonstrate that our Audio-Thinker model outperforms existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities.
comment: preprint
♻ ☆ MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
Current video-to-audio (V2A) methods struggle in complex multi-event scenarios (video scenarios involving multiple sound sources, sound events, or transitions) due to two critical limitations. First, existing methods face challenges in precisely aligning intricate semantic information together with rapid dynamic features. Second, foundational training lacks quantitative preference optimization for semantic-temporal alignment and audio quality. As a result, it fails to enhance integrated generation quality in cluttered multi-event scenes. To address these core limitations, this study proposes a novel V2A framework: MultiSoundGen. It introduces direct preference optimization (DPO) into the V2A domain, leveraging audio-visual pretraining (AVP) to enhance performance in complex multi-event scenarios. Our contributions include two key innovations: the first is SlowFast Contrastive AVP (SF-CAVP), a pioneering AVP model with a unified dual-stream architecture. SF-CAVP explicitly aligns core semantic representations and rapid dynamic features of audio-visual data to handle multi-event complexity; second, we integrate the DPO method into V2A task and propose AVP-Ranked Preference Optimization (AVP-RPO). It uses SF-CAVP as a reward model to quantify and prioritize critical semantic-temporal matches while enhancing audio quality. Experiments demonstrate that MultiSoundGen achieves state-of-the-art (SOTA) performance in multi-event scenarios, delivering comprehensive gains across distribution matching, audio quality, semantic alignment, and temporal synchronization. Demos are available at https://v2aresearch.github.io/MultiSoundGen/.
♻ ☆ SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion
With the rapid advancement of the digital society, the proliferation of satellites in the Satellite Internet of Things (Sat-IoT) has led to the continuous accumulation of large-scale multi-temporal and multi-source images across diverse application scenarios. However, existing methods fail to fully exploit the complementary information embedded in both temporal and source dimensions. For example, Multi-Image Super-Resolution (MISR) enhances reconstruction quality by leveraging temporal complementarity across multiple observations, yet the limited fine-grained texture details in input images constrain its performance. Conversely, pansharpening integrates multi-source images by injecting high-frequency spatial information from panchromatic data, but typically relies on pre-interpolated low-resolution inputs and assumes noise-free alignment, making it highly sensitive to noise and misregistration. To address these issues, we propose SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion. Specifically, SatFusion first employs a Multi-Temporal Image Fusion (MTIF) module to achieve deep feature alignment with the panchromatic image. Then, a Multi-Source Image Fusion (MSIF) module injects fine-grained texture information from the panchromatic data. Finally, a Fusion Composition module adaptively integrates the complementary advantages of both modalities while dynamically refining spectral consistency, supervised by a weighted combination of multiple loss functions. Extensive experiments on the WorldStrat, WV3, QB, and GF2 datasets demonstrate that SatFusion significantly improves fusion quality, robustness under challenging conditions, and generalizability to real-world Sat-IoT scenarios. The code is available at: https://github.com/dllgyufei/SatFusion.git.
♻ ☆ Light Future: Multimodal Action Frame Prediction via InstructPix2Pix WACV 2026
Predicting future motion trajectories is a critical capability across domains such as robotics, autonomous systems, and human activity forecasting, enabling safer and more intelligent decision-making. This paper proposes a novel, efficient, and lightweight approach for robot action prediction, offering significantly reduced computational cost and inference latency compared to conventional video prediction models. Importantly, it pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks, extending its utility beyond static image editing. We implement a deep learning-based visual prediction framework that forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. We repurpose and fine-tune the InstructPix2Pix model to accept both visual and textual inputs, enabling multimodal future frame prediction. Experiments on the RoboTWin dataset (generated based on real-world scenarios) demonstrate that our method achieves superior SSIM and PSNR compared to state-of-the-art baselines in robot action prediction tasks. Unlike conventional video prediction models that require multiple input frames, heavy computation, and slow inference latency, our approach only needs a single image and a text prompt as input. This lightweight design enables faster inference, reduced GPU demands, and flexible multimodal control, particularly valuable for applications like robotics and sports motion trajectory analytics, where motion trajectory precision is prioritized over visual fidelity.
comment: 9 pages including appendix, 4 tables, 8 figures, to be submitted to WACV 2026
♻ ☆ Prevailing Research Areas for Music AI in the Era of Foundation Models
Parallel to rapid advancements in foundation model research, the past few years have witnessed a surge in music AI applications. As AI-generated and AI-augmented music become increasingly mainstream, many researchers in the music AI community may wonder: what research frontiers remain unexplored? This paper outlines several key areas within music AI research that present significant opportunities for further investigation. We begin by examining foundational representation models and highlight emerging efforts toward explainability and interpretability. We then discuss the evolution toward multimodal systems, provide an overview of the current landscape of music datasets and their limitations, and address the growing importance of model efficiency in both training and deployment. Next, we explore applied directions, focusing first on generative models. We review recent systems, their computational constraints, and persistent challenges related to evaluation and controllability. We then examine extensions of these generative approaches to multimodal settings and their integration into artists' workflows, including applications in music editing, captioning, production, transcription, source separation, performance, discovery, and education. Finally, we explore copyright implications of generative music and propose strategies to safeguard artist rights. While not exhaustive, this survey aims to illuminate promising research directions enabled by recent developments in music foundation models.
♻ ☆ RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation
Recently, robotics has advanced significantly through the integration of larger models and large-scale datasets. However, challenges remain in applying these models to 3D spatial interactions and managing data collection costs. To address these issues, we propose the multimodal robotic manipulation model RoboTron-Mani and the comprehensive dataset RoboData. RoboTron-Mani, on one hand, enhances 3D perception through camera parameters and occupancy supervision. On the other hand, it further incorporates Modality-Isolation-Mask and multimodal decoder blocks based on OpenFlamingo, improving modality fusion and fine-grained perception. RoboData integrats several publicly-available datasets, achieving the first fusion of multi-view images, camera parameters, depth maps, actions, and space alignment, which facilitates comprehensive learning from diverse robotic datasets and offers one complete evaluation system. Trained on RoboData, RoboTron-Mani is the first generalist policy that surpasses expert models, enabling simultaneous evaluation of all tasks across multiple datasets, rather than being limited to specific data or task selections. Specifically, RoboTron-Mani boosts manipulation performance by increasing the average sequence length on CALVIN from 1.7 to 3.5, enabling cross-embodiment generalization, and achieving state-of-the-art results on both simulated and real-world datasets.
Image and Video Processing 13
☆ Diffusion Models are Robust Pretrainers
Diffusion models have gained significant attention for high-fidelity image generation. Our work investigates the potential of exploiting diffusion models for adversarial robustness in image classification and object detection. Adversarial attacks challenge standard models in these tasks by perturbing inputs to force incorrect predictions. To address this issue, many approaches use training schemes for forcing the robustness of the models, which increase training costs. In this work, we study models built on top of off-the-shelf diffusion models and demonstrate their practical significance: they provide a low-cost path to robust representations, allowing lightweight heads to be trained on frozen features without full adversarial training. Our empirical evaluations on ImageNet, CIFAR-10, and PASCAL VOC show that diffusion-based classifiers and detectors achieve meaningful adversarial robustness with minimal compute. While clean and adversarial accuracies remain below state-of-the-art adversarially trained CNNs or ViTs, diffusion pretraining offers a favorable tradeoff between efficiency and robustness. This work opens a promising avenue for integrating diffusion models into resource-constrained robust deployments.
comment: To be published in IEEE Signal Processing Letters
☆ Resource-efficient Automatic Refinement of Segmentations via Weak Supervision from Light Feedback
Delineating anatomical regions is a key task in medical image analysis. Manual segmentation achieves high accuracy but is labor-intensive and prone to variability, thus prompting the development of automated approaches. Recently, a breadth of foundation models has enabled automated segmentations across diverse anatomies and imaging modalities, but these may not always meet the clinical accuracy standards. While segmentation refinement strategies can improve performance, current methods depend on heavy user interactions or require fully supervised segmentations for training. Here, we present SCORE (Segmentation COrrection from Regional Evaluations), a weakly supervised framework that learns to refine mask predictions only using light feedback during training. Specifically, instead of relying on dense training image annotations, SCORE introduces a novel loss that leverages region-wise quality scores and over/under-segmentation error labels. We demonstrate SCORE on humerus CT scans, where it considerably improves initial predictions from TotalSegmentator, and achieves performance on par with existing refinement methods, while greatly reducing their supervision requirements and annotation time. Our code is available at: https://gitlab.inria.fr/adelangl/SCORE.
☆ Accounting for Underspecification in Statistical Claims of Model Superiority
Machine learning methods are increasingly applied in medical imaging, yet many reported improvements lack statistical robustness: recent works have highlighted that small but significant performance gains are highly likely to be false positives. However, these analyses do not take \emph{underspecification} into account -- the fact that models achieving similar validation scores may behave differently on unseen data due to random initialization or training dynamics. Here, we extend a recent statistical framework modeling false outperformance claims to include underspecification as an additional variance component. Our simulations demonstrate that even modest seed variability ($\sim1\%$) substantially increases the evidence required to support superiority claims. Our findings underscore the need for explicit modeling of training variance when validating medical imaging systems.
comment: Medical Imaging meets EurIPS Workshop: MedEurIPS 2025
☆ MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization
The development of clinically reliable artificial intelligence (AI) systems for mammography is hindered by profound heterogeneity in data quality, metadata standards, and population distributions across public datasets. This heterogeneity introduces dataset-specific biases that severely compromise the generalizability of the model, a fundamental barrier to clinical deployment. We present MammoClean, a public framework for standardization and bias quantification in mammography datasets. MammoClean standardizes case selection, image processing (including laterality and intensity correction), and unifies metadata into a consistent multi-view structure. We provide a comprehensive review of breast anatomy, imaging characteristics, and public mammography datasets to systematically identify key sources of bias. Applying MammoClean to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify substantial distributional shifts in breast density and abnormality prevalence. Critically, we demonstrate the direct impact of data corruption: AI models trained on corrupted datasets exhibit significant performance degradation compared to their curated counterparts. By using MammoClean to identify and mitigate bias sources, researchers can construct unified multi-dataset training corpora that enable development of robust models with superior cross-domain generalization. MammoClean provides an essential, reproducible pipeline for bias-aware AI development in mammography, facilitating fairer comparisons and advancing the creation of safe, effective systems that perform equitably across diverse patient populations and clinical settings. The open-source code is publicly available from: https://github.com/Minds-R-Lab/MammoClean.
☆ High-Resolution Magnetic Particle Imaging System Matrix Recovery Using a Vision Transformer with Residual Feature Network
This study presents a hybrid deep learning framework, the Vision Transformer with Residual Feature Network (VRF-Net), for recovering high-resolution system matrices in Magnetic Particle Imaging (MPI). MPI resolution often suffers from downsampling and coil sensitivity variations. VRF-Net addresses these challenges by combining transformer-based global attention with residual convolutional refinement, enabling recovery of both large-scale structures and fine details. To reflect realistic MPI conditions, the system matrix is degraded using a dual-stage downsampling strategy. Training employed paired-image super-resolution on the public Open MPI dataset and a simulated dataset incorporating variable coil sensitivity profiles. For system matrix recovery on the Open MPI dataset, VRF-Net achieved nRMSE = 0.403, pSNR = 39.08 dB, and SSIM = 0.835 at 2x scaling, and maintained strong performance even at challenging scale 8x (pSNR = 31.06 dB, SSIM = 0.717). For the simulated dataset, VRF-Net achieved nRMSE = 4.44, pSNR = 28.52 dB, and SSIM = 0.771 at 2x scaling, with stable performance at higher scales. On average, it reduced nRMSE by 88.2%, increased pSNR by 44.7%, and improved SSIM by 34.3% over interpolation and CNN-based methods. In image reconstruction of Open MPI phantoms, VRF-Net further reduced reconstruction error to nRMSE = 1.79 at 2x scaling, while preserving structural fidelity (pSNR = 41.58 dB, SSIM = 0.960), outperforming existing methods. These findings demonstrate that VRF-Net enables sharper, artifact-free system matrix recovery and robust image reconstruction across multiple scales, offering a promising direction for future in vivo applications.
☆ Estimation of Segmental Longitudinal Strain in Transesophageal Echocardiography by Deep Learning
Segmental longitudinal strain (SLS) of the left ventricle (LV) is an important prognostic indicator for evaluating regional LV dysfunction, in particular for diagnosing and managing myocardial ischemia. Current techniques for strain estimation require significant manual intervention and expertise, limiting their efficiency and making them too resource-intensive for monitoring purposes. This study introduces the first automated pipeline, autoStrain, for SLS estimation in transesophageal echocardiography (TEE) using deep learning (DL) methods for motion estimation. We present a comparative analysis of two DL approaches: TeeFlow, based on the RAFT optical flow model for dense frame-to-frame predictions, and TeeTracker, based on the CoTracker point trajectory model for sparse long-sequence predictions. As ground truth motion data from real echocardiographic sequences are hardly accessible, we took advantage of a unique simulation pipeline (SIMUS) to generate a highly realistic synthetic TEE (synTEE) dataset of 80 patients with ground truth myocardial motion to train and evaluate both models. Our evaluation shows that TeeTracker outperforms TeeFlow in accuracy, achieving a mean distance error in motion estimation of 0.65 mm on a synTEE test dataset. Clinical validation on 16 patients further demonstrated that SLS estimation with our autoStrain pipeline aligned with clinical references, achieving a mean difference (95\% limits of agreement) of 1.09% (-8.90% to 11.09%). Incorporation of simulated ischemia in the synTEE data improved the accuracy of the models in quantifying abnormal deformation. Our findings indicate that integrating AI-driven motion estimation with TEE can significantly enhance the precision and efficiency of cardiac function assessment in clinical settings.
comment: 13 pages, IEEE Journal of Biomedical and Health Informatics
♻ ☆ Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment SP
In this work, we rethink the approach to video super-resolution by introducing a method based on the Diffusion Posterior Sampling framework, combined with an unconditional video diffusion transformer operating in latent space. The video generation model, a diffusion transformer, functions as a space-time model. We argue that a powerful model, which learns the physics of the real world, can easily handle various kinds of motion patterns as prior knowledge, thus eliminating the need for explicit estimation of optical flows or motion parameters for pixel alignment. Furthermore, a single instance of the proposed video diffusion transformer model can adapt to different sampling conditions without re-training. Empirical results on synthetic and real-world datasets illustrate the feasibility of diffusion-based, alignment-free video super-resolution.
comment: ICSPS 2025
♻ ☆ Invited Paper: BitMedViT: Ternary-Quantized Vision Transformer for Medical AI Assistants on the Edge
Vision Transformers (ViTs) have demonstrated strong capabilities in interpreting complex medical imaging data. However, their significant computational and memory demands pose challenges for deployment in real-time, resource-constrained mobile and wearable devices used in clinical environments. We introduce, BiTMedViT, a new class of Edge ViTs serving as medical AI assistants that perform structured analysis of medical images directly on the edge. BiTMedViT utilizes ternary- quantized linear layers tailored for medical imaging and com- bines a training procedure with multi-query attention, preserving stability under ternary weights with low-precision activations. Furthermore, BiTMedViT employs task-aware distillation from a high-capacity teacher to recover accuracy lost due to extreme quantization. Lastly, we also present a pipeline that maps the ternarized ViTs to a custom CUDA kernel for efficient memory bandwidth utilization and latency reduction on the Jetson Orin Nano. Finally, BiTMedViT achieves 86% diagnostic accuracy (89% SOTA) on MedMNIST across 12 datasets, while reducing model size by 43x, memory traffic by 39x, and enabling 16.8 ms inference at an energy efficiency up to 41x that of SOTA models at 183.62 GOPs/J on the Orin Nano. Our results demonstrate a practical and scientifically grounded route for extreme-precision medical imaging ViTs deployable on the edge, narrowing the gap between algorithmic advances and deployable clinical tools.
comment: Accepted at 2025 IEEE/ACM International Conf. on Computer-Aided Design (ICCAD) Oct. 26-30 2025, Munich, DE
♻ ☆ Cross-modal Diffusion Modelling for Super-resolved Spatial Transcriptomics
The recent advancement of spatial transcriptomics (ST) allows to characterize spatial gene expression within tissue for discovery research. However, current ST platforms suffer from low resolution, hindering in-depth understanding of spatial gene expression. Super-resolution approaches promise to enhance ST maps by integrating histology images with gene expressions of profiled tissue spots. However, current super-resolution methods are limited by restoration uncertainty and mode collapse. Although diffusion models have shown promise in capturing complex interactions between multi-modal conditions, it remains a challenge to integrate histology images and gene expression for super-resolved ST maps. This paper proposes a cross-modal conditional diffusion model for super-resolving ST maps with the guidance of histology images. Specifically, we design a multi-modal disentangling network with cross-modal adaptive modulation to utilize complementary information from histology images and spatial gene expression. Moreover, we propose a dynamic cross-attention modelling strategy to extract hierarchical cell-to-tissue information from histology images. Lastly, we propose a co-expression-based gene-correlation graph network to model the co-expression relationship of multiple genes. Experiments show that our method outperforms other state-of-the-art methods in ST super-resolution on three public datasets.
♻ ☆ Improving Generalization in MRI-Based Deep Learning Models for Total Knee Replacement Prediction
Knee osteoarthritis (KOA) is a common joint disease that causes pain and mobility issues. While MRI-based deep learning models have demonstrated superior performance in predicting total knee replacement (TKR) and disease progression, their generalizability remains challenging, particularly when applied to imaging data from different sources. In this study, we show that replacing batch normalization with instance normalization, using data augmentation, and applying contrastive loss improves generalization. For training and evaluation, we used MRI data from the Osteoarthritis Initiative (OAI) database, considering sagittal fat-suppressed intermediate-weighted turbo spin-echo (FS-IW-TSE) images as the source domain and sagittal fat-suppressed three-dimensional (3D) dual-echo in steady state (DESS) images as the target domain. The results demonstrated a statistically significant improvement in classification metrics across both domains by replacing batch normalization with instance normalization in the baseline model, generating augmented input views using the Global Intensity Non-linear (GIN) augmentation method, and incorporating a supervised contrastive loss alongside the classification loss to align representations of samples with the same label. The GIN method with contrastive loss performed better than all evaluated single-source domain generalization methods when using 3D instance normalization. Comparing GIN with and without contrastive loss (for both normalization types) showed that adding contrastive loss consistently led to better performance.
♻ ☆ SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion
With the rapid advancement of the digital society, the proliferation of satellites in the Satellite Internet of Things (Sat-IoT) has led to the continuous accumulation of large-scale multi-temporal and multi-source images across diverse application scenarios. However, existing methods fail to fully exploit the complementary information embedded in both temporal and source dimensions. For example, Multi-Image Super-Resolution (MISR) enhances reconstruction quality by leveraging temporal complementarity across multiple observations, yet the limited fine-grained texture details in input images constrain its performance. Conversely, pansharpening integrates multi-source images by injecting high-frequency spatial information from panchromatic data, but typically relies on pre-interpolated low-resolution inputs and assumes noise-free alignment, making it highly sensitive to noise and misregistration. To address these issues, we propose SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion. Specifically, SatFusion first employs a Multi-Temporal Image Fusion (MTIF) module to achieve deep feature alignment with the panchromatic image. Then, a Multi-Source Image Fusion (MSIF) module injects fine-grained texture information from the panchromatic data. Finally, a Fusion Composition module adaptively integrates the complementary advantages of both modalities while dynamically refining spectral consistency, supervised by a weighted combination of multiple loss functions. Extensive experiments on the WorldStrat, WV3, QB, and GF2 datasets demonstrate that SatFusion significantly improves fusion quality, robustness under challenging conditions, and generalizability to real-world Sat-IoT scenarios. The code is available at: https://github.com/dllgyufei/SatFusion.git.
♻ ☆ Deep Generative Models for Enhanced Vitreous OCT Imaging
Purpose: To evaluate deep learning (DL) models for enhancing vitreous optical coherence tomography (OCT) image quality and reducing acquisition time. Methods: Conditional Denoising Diffusion Probabilistic Models (cDDPMs), Brownian Bridge Diffusion Models (BBDMs), U-Net, Pix2Pix, and Vector-Quantised Generative Adversarial Network (VQ-GAN) were used to generate high-quality spectral-domain (SD) vitreous OCT images. Inputs were SD ART10 images, and outputs were compared to pseudoART100 images obtained by averaging ten ART10 images per eye location. Model performance was assessed using image quality metrics and Visual Turing Tests, where ophthalmologists ranked generated images and evaluated anatomical fidelity. The best model's performance was further tested within the manually segmented vitreous on newly acquired data. Results: U-Net achieved the highest Peak Signal-to-Noise Ratio (PSNR: 30.230) and Structural Similarity Index Measure (SSIM: 0.820), followed by cDDPM. For Learned Perceptual Image Patch Similarity (LPIPS), Pix2Pix (0.697) and cDDPM (0.753) performed best. In the first Visual Turing Test, cDDPM ranked highest (3.07); in the second (best model only), cDDPM achieved a 32.9% fool rate and 85.7% anatomical preservation. On newly acquired data, cDDPM generated vitreous regions more similar in PSNR to the ART100 reference than true ART1 or ART10 B-scans and achieved higher PSNR on whole images when conditioned on ART1 than ART10. Conclusions: Results reveal discrepancies between quantitative metrics and clinical evaluation, highlighting the need for combined assessment. cDDPM showed strong potential for generating clinically meaningful vitreous OCT images while reducing acquisition time fourfold. Translational Relevance: cDDPMs show promise for clinical integration, supporting faster, higher-quality vitreous imaging. Dataset and code will be made publicly available.
♻ ☆ Joint Lossless Compression and Steganography for Medical Images via Large Language Models
Recently, large language models (LLMs) have driven promising progress in lossless image compression. However, directly adopting existing paradigms for medical images suffers from an unsatisfactory trade-off between compression performance and efficiency. Moreover, existing LLM-based compressors often overlook the security of the compression process, which is critical in modern medical scenarios. To this end, we propose a novel joint lossless compression and steganography framework. Inspired by bit plane slicing (BPS), we find it feasible to securely embed privacy messages into medical images in an invisible manner. Based on this insight, an adaptive modalities decomposition strategy is first devised to partition the entire image into two segments, providing global and local modalities for subsequent dual-path lossless compression. During this dual-path stage, we innovatively propose a segmented message steganography algorithm within the local modality path to ensure the security of the compression process. Coupled with the proposed anatomical priors-based low-rank adaptation (A-LoRA) fine-tuning strategy, extensive experimental results demonstrate the superiority of our proposed method in terms of compression ratios, efficiency, and security. The source code will be made publicly available.
Computer Vision and Pattern Recognition 81
☆ Markerless Augmented Reality Registration for Surgical Guidance: A Multi-Anatomy Clinical Accuracy Study
Purpose: In this paper, we develop and clinically evaluate a depth-only, markerless augmented reality (AR) registration pipeline on a head-mounted display, and assess accuracy across small or low-curvature anatomies in real-life operative settings. Methods: On HoloLens 2, we align Articulated HAnd Tracking (AHAT) depth to Computed Tomography (CT)-derived skin meshes via (i) depth-bias correction, (ii) brief human-in-the-loop initialization, (iii) global and local registration. We validated the surface-tracing error metric by comparing "skin-to-bone" relative distances to CT ground truth on leg and foot models, using an AR-tracked tool. We then performed seven intraoperative target trials (feet x2, ear x3, leg x2) during the initial stage of fibula free-flap harvest and mandibular reconstruction surgery, and collected 500+ data per trial. Results: Preclinical validation showed tight agreement between AR-traced and CT distances (leg: median |Delta d| 0.78 mm, RMSE 0.97 mm; feet: 0.80 mm, 1.20 mm). Clinically, per-point error had a median of 3.9 mm. Median errors by anatomy were 3.2 mm (feet), 4.3 mm (ear), and 5.3 mm (lower leg), with 5 mm coverage 92-95%, 84-90%, and 72-86%, respectively. Feet vs. lower leg differed significantly (Delta median ~1.1 mm; p < 0.001). Conclusion: A depth-only, markerless AR pipeline on HMDs achieved ~3-4 mm median error across feet, ear, and lower leg in live surgical settings without fiducials, approaching typical clinical error thresholds for moderate-risk tasks. Human-guided initialization plus global-to-local registration enabled accurate alignment on small or low-curvature targets, improving the clinical readiness of markerless AR guidance.
☆ Opto-Electronic Convolutional Neural Network Design Via Direct Kernel Optimization
Opto-electronic neural networks integrate optical front-ends with electronic back-ends to enable fast and energy-efficient vision. However, conventional end-to-end optimization of both the optical and electronic modules is limited by costly simulations and large parameter spaces. We introduce a two-stage strategy for designing opto-electronic convolutional neural networks (CNNs): first, train a standard electronic CNN, then realize the optical front-end implemented as a metasurface array through direct kernel optimization of its first convolutional layer. This approach reduces computational and memory demands by hundreds of times and improves training stability compared to end-to-end optimization. On monocular depth estimation, the proposed two-stage design achieves twice the accuracy of end-to-end training under the same training time and resource constraints.
☆ Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis
Creation of large-scale databases for Visual Question Answering tasks pertaining to the text data in a scene (text-VQA) involves skilful human annotation, which is tedious and challenging. With the advent of foundation models that handle vision and language modalities, and with the maturity of OCR systems, it is the need of the hour to establish an end-to-end pipeline that can synthesize Question-Answer (QA) pairs based on scene-text from a given image. We propose a pipeline for automated synthesis for text-VQA dataset that can produce faithful QA pairs, and which scales up with the availability of scene text data. Our proposed method harnesses the capabilities of multiple models and algorithms involving OCR detection and recognition (text spotting), region of interest (ROI) detection, caption generation, and question generation. These components are streamlined into a cohesive pipeline to automate the synthesis and validation of QA pairs. To the best of our knowledge, this is the first pipeline proposed to automatically synthesize and validate a large-scale text-VQA dataset comprising around 72K QA pairs based on around 44K images.
comment: First two authors contributed equally
☆ Towards Selection of Large Multimodal Models as Engines for Burned-in Protected Health Information Detection in Medical Images
The detection of Protected Health Information (PHI) in medical imaging is critical for safeguarding patient privacy and ensuring compliance with regulatory frameworks. Traditional detection methodologies predominantly utilize Optical Character Recognition (OCR) models in conjunction with named entity recognition. However, recent advancements in Large Multimodal Model (LMM) present new opportunities for enhanced text extraction and semantic analysis. In this study, we systematically benchmark three prominent closed and open-sourced LMMs, namely GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B, utilizing two distinct pipeline configurations: one dedicated to text analysis alone and another integrating both OCR and semantic analysis. Our results indicate that LMM exhibits superior OCR efficacy (WER: 0.03-0.05, CER: 0.02-0.03) compared to conventional models like EasyOCR. However, this improvement in OCR performance does not consistently correlate with enhanced overall PHI detection accuracy. The strongest performance gains are observed on test cases with complex imprint patterns. In scenarios where text regions are well readable with sufficient contrast, and strong LMMs are employed for text analysis after OCR, different pipeline configurations yield similar results. Furthermore, we provide empirically grounded recommendations for LMM selection tailored to specific operational constraints and propose a deployment strategy that leverages scalable and modular infrastructure.
comment: Submitted to ISBI 2026
☆ Locally-Supervised Global Image Restoration
We address the problem of image reconstruction from incomplete measurements, encompassing both upsampling and inpainting, within a learning-based framework. Conventional supervised approaches require fully sampled ground truth data, while self-supervised methods allow incomplete ground truth but typically rely on random sampling that, in expectation, covers the entire image. In contrast, we consider fixed, deterministic sampling patterns with inherently incomplete coverage, even in expectation. To overcome this limitation, we exploit multiple invariances of the underlying image distribution, which theoretically allows us to achieve the same reconstruction performance as fully supervised approaches. We validate our method on optical-resolution image upsampling in photoacoustic microscopy (PAM), demonstrating competitive or superior results while requiring substantially less ground truth data.
☆ Assessing the value of Geo-Foundational Models for Flood Inundation Mapping: Benchmarking models for Sentinel-1, Sentinel-2, and Planetscope for end-users
Geo-Foundational Models (GFMs) enable fast and reliable extraction of spatiotemporal information from satellite imagery, improving flood inundation mapping by leveraging location and time embeddings. Despite their potential, it remains unclear whether GFMs outperform traditional models like U-Net. A systematic comparison across sensors and data availability scenarios is still lacking, which is an essential step to guide end-users in model selection. To address this, we evaluate three GFMs, Prithvi 2.0, Clay V1.5, DOFA, and UViT (a Prithvi variant), against TransNorm, U-Net, and Attention U-Net using PlanetScope, Sentinel-1, and Sentinel-2. We observe competitive performance among all GFMs, with only 2-5% variation between the best and worst models across sensors. Clay outperforms others on PlanetScope (0.79 mIoU) and Sentinel-2 (0.70), while Prithvi leads on Sentinel-1 (0.57). In leave-one-region-out cross-validation across five regions, Clay shows slightly better performance across all sensors (mIoU: 0.72(0.04), 0.66(0.07), 0.51(0.08)) compared to Prithvi (0.70(0.05), 0.64(0.09), 0.49(0.13)) and DOFA (0.67(0.07), 0.64(0.04), 0.49(0.09)) for PlanetScope, Sentinel-2, and Sentinel-1, respectively. Across all 19 sites, leave-one-region-out cross-validation reveals a 4% improvement by Clay compared to U-Net. Visual inspection highlights Clay's superior ability to retain fine details. Few-shot experiments show Clay achieves 0.64 mIoU on PlanetScope with just five training images, outperforming Prithvi (0.24) and DOFA (0.35). In terms of computational time, Clay is a better choice due to its smaller model size (26M parameters), making it ~3x faster than Prithvi (650M) and 2x faster than DOFA (410M). Contrary to previous findings, our results suggest GFMs offer small to moderate improvements in flood mapping accuracy at lower computational cost and labeling effort compared to traditional U-Net.
☆ TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
The frontier of visual reasoning is shifting toward models like OpenAI o3, which can intelligently create and operate tools to transform images for problem-solving, also known as thinking-\textit{with}-images in chain-of-thought. Yet existing benchmarks fail to fully capture this advanced capability. Even Visual Search, the most common benchmark for current thinking-\textit{with}-images methods, tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning. We introduce \textbf{TIR-Bench}, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation in chain-of-thought. We evaluate 22 multimodal large language models (MLLMs), from leading open-sourced and proprietary models to those with explicit tool-use augmentation. Results show that TIR-Bench is universally challenging, and strong performance requires genuine thinking-with-images capabilities. Finally, we present a pilot study comparing direct versus agentic fine-tuning.
comment: Preprint
☆ SciTextures: Collecting and Connecting Visual Patterns, Models, and Code Across Science and Art
The ability to connect visual patterns with the processes that form them represents one of the deepest forms of visual understanding. Textures of clouds and waves, the growth of cities and forests, or the formation of materials and landscapes are all examples of patterns emerging from underlying mechanisms. We present the Scitextures dataset, a large-scale collection of textures and visual patterns from all domains of science, tech, and art, along with the models and code that generate these images. Covering over 1,200 different models and 100,000 images of patterns and textures from physics, chemistry, biology, sociology, technology, mathematics, and art, this dataset offers a way to explore the connection between the visual patterns that shape our world and the mechanisms that produce them. Created by an agentic AI pipeline that autonomously collects and implements models in standardized form, we use SciTextures to evaluate the ability of leading AI models to link visual patterns to the models and code that generate them, and to identify different patterns that emerged from the same process. We also test AIs ability to infer and recreate the mechanisms behind visual patterns by providing a natural image of a real-world pattern and asking the AI to identify, model, and code the mechanism that formed the pattern, then run this code to generate a simulated image that is compared to the real image. These benchmarks show that vision-language models (VLMs) can understand and simulate the physical system beyond a visual pattern. The dataset and code are available at: https://zenodo.org/records/17485502
☆ PROPEX-RAG: Enhanced GraphRAG using Prompt-Driven Prompt Execution
Retrieval-Augmented Generation (RAG) has become a robust framework for enhancing Large Language Models (LLMs) with external knowledge. Recent advances in RAG have investigated graph based retrieval for intricate reasoning; however, the influence of prompt design on enhancing the retrieval and reasoning process is still considerably under-examined. In this paper, we present a prompt-driven GraphRAG framework that underscores the significance of prompt formulation in facilitating entity extraction, fact selection, and passage reranking for multi-hop question answering. Our approach creates a symbolic knowledge graph from text data by encoding entities and factual relationships as structured facts triples. We use LLMs selectively during online retrieval to perform semantic filtering and answer generation. We also use entity-guided graph traversal through Personalized PageRank (PPR) to support efficient, scalable retrieval based on the knowledge graph we built. Our system gets state-of-the-art performance on HotpotQA and 2WikiMultiHopQA, with F1 scores of 80.7% and 78.9%, and Recall@5 scores of 97.1% and 98.1%, respectively. These results show that prompt design is an important part of improving retrieval accuracy and response quality. This research lays the groundwork for more efficient and comprehensible multi-hop question-answering systems, highlighting the importance of prompt-aware graph reasoning.
comment: Accepted in PReMI 2025
☆ Fractional Diffusion Bridge Models NeurIPS 2025
We present Fractional Diffusion Bridge Models (FDBM), a novel generative diffusion bridge framework driven by an approximation of the rich and non-Markovian fractional Brownian motion (fBM). Real stochastic processes exhibit a degree of memory effects (correlations in time), long-range dependencies, roughness and anomalous diffusion phenomena that are not captured in standard diffusion or bridge modeling due to the use of Brownian motion (BM). As a remedy, leveraging a recent Markovian approximation of fBM (MA-fBM), we construct FDBM that enable tractable inference while preserving the non-Markovian nature of fBM. We prove the existence of a coupling-preserving generative diffusion bridge and leverage it for future state prediction from paired training data. We then extend our formulation to the Schr\"{o}dinger bridge problem and derive a principled loss function to learn the unpaired data translation. We evaluate FDBM on both tasks: predicting future protein conformations from aligned data, and unpaired image translation. In both settings, FDBM achieves superior performance compared to the Brownian baselines, yielding lower root mean squared deviation (RMSD) of C$_\alpha$ atomic positions in protein structure prediction and lower Fr\'echet Inception Distance (FID) in unpaired image translation.
comment: To appear in NeurIPS 2025 proceedings. This version includes post-camera-ready revisions
☆ How Far Are Surgeons from Surgical World Models? A Pilot Study on Zero-shot Surgical Video Generation with Expert Assessment
Foundation models in video generation are demonstrating remarkable capabilities as potential world models for simulating the physical world. However, their application in high-stakes domains like surgery, which demand deep, specialized causal knowledge rather than general physical rules, remains a critical unexplored gap. To systematically address this challenge, we present SurgVeo, the first expert-curated benchmark for video generation model evaluation in surgery, and the Surgical Plausibility Pyramid (SPP), a novel, four-tiered framework tailored to assess model outputs from basic appearance to complex surgical strategy. On the basis of the SurgVeo benchmark, we task the advanced Veo-3 model with a zero-shot prediction task on surgical clips from laparoscopic and neurosurgical procedures. A panel of four board-certified surgeons evaluates the generated videos according to the SPP. Our results reveal a distinct "plausibility gap": while Veo-3 achieves exceptional Visual Perceptual Plausibility, it fails critically at higher levels of the SPP, including Instrument Operation Plausibility, Environment Feedback Plausibility, and Surgical Intent Plausibility. This work provides the first quantitative evidence of the chasm between visually convincing mimicry and causal understanding in surgical AI. Our findings from SurgVeo and the SPP establish a crucial foundation and roadmap for developing future models capable of navigating the complexities of specialized, real-world healthcare domains.
☆ UniLION: Towards Unified Autonomous Driving Model with Linear Group RNNs
Although transformers have demonstrated remarkable capabilities across various domains, their quadratic attention mechanisms introduce significant computational overhead when processing long-sequence data. In this paper, we present a unified autonomous driving model, UniLION, which efficiently handles large-scale LiDAR point clouds, high-resolution multi-view images, and even temporal sequences based on the linear group RNN operator (i.e., performs linear RNN for grouped features). Remarkably, UniLION serves as a single versatile architecture that can seamlessly support multiple specialized variants (i.e., LiDAR-only, temporal LiDAR, multi-modal, and multi-modal temporal fusion configurations) without requiring explicit temporal or multi-modal fusion modules. Moreover, UniLION consistently delivers competitive and even state-of-the-art performance across a wide range of core tasks, including 3D perception (e.g., 3D object detection, 3D object tracking, 3D occupancy prediction, BEV map segmentation), prediction (e.g., motion prediction), and planning (e.g., end-to-end planning). This unified paradigm naturally simplifies the design of multi-modal and multi-task autonomous driving systems while maintaining superior performance. Ultimately, we hope UniLION offers a fresh perspective on the development of 3D foundation models in autonomous driving. Code is available at https://github.com/happinesslz/UniLION
♻ ☆ WildCAT3D: Appearance-Aware Multi-View Diffusion in the Wild NeurIPS 2025
Despite recent advances in sparse novel view synthesis (NVS) applied to object-centric scenes, scene-level NVS remains a challenge. A central issue is the lack of available clean multi-view training data, beyond manually curated datasets with limited diversity, camera variation, or licensing issues. On the other hand, an abundance of diverse and permissively-licensed data exists in the wild, consisting of scenes with varying appearances (illuminations, transient occlusions, etc.) from sources such as tourist photos. To this end, we present WildCAT3D, a framework for generating novel views of scenes learned from diverse 2D scene image data captured in the wild. We unlock training on these data sources by explicitly modeling global appearance conditions in images, extending the state-of-the-art multi-view diffusion paradigm to learn from scene views of varying appearances. Our trained model generalizes to new scenes at inference time, enabling the generation of multiple consistent novel views. WildCAT3D provides state-of-the-art results on single-view NVS in object- and scene-level settings, while training on strictly less data sources than prior methods. Additionally, it enables novel applications by providing global appearance control during generation.
comment: Accepted to NeurIPS 2025. Project page: https://wildcat3d.github.io
♻ ☆ RareFlow: Physics-Aware Flow-Matching for Cross-Sensor Super-Resolution of Rare-Earth Features
Super-resolution (SR) for remote sensing imagery often fails under out-of-distribution (OOD) conditions, such as rare geomorphic features captured by diverse sensors, producing visually plausible but physically inaccurate results. We present RareFlow, a physics-aware SR framework designed for OOD robustness. RareFlow's core is a dual-conditioning architecture. A Gated ControlNet preserves fine-grained geometric fidelity from the low-resolution input, while textual prompts provide semantic guidance for synthesizing complex features. To ensure physically sound outputs, we introduce a multifaceted loss function that enforces both spectral and radiometric consistency with sensor properties. Furthermore, the framework quantifies its own predictive uncertainty by employing a stochastic forward pass approach; the resulting output variance directly identifies unfamiliar inputs, mitigating feature hallucination. We validate RareFlow on a new, curated benchmark of multi-sensor satellite imagery. In blind evaluations, geophysical experts rated our model's outputs as approaching the fidelity of ground truth imagery, significantly outperforming state-of-the-art baselines. This qualitative superiority is corroborated by quantitative gains in perceptual metrics, including a nearly 40\% reduction in FID. RareFlow provides a robust framework for high-fidelity synthesis in data-scarce scientific domains and offers a new paradigm for controlled generation under severe domain shift.
♻ ☆ Non-Contact Health Monitoring During Daily Personal Care Routines
Remote photoplethysmography (rPPG) enables non-contact, continuous monitoring of physiological signals and offers a practical alternative to traditional health sensing methods. Although rPPG is promising for daily health monitoring, its application in long-term personal care scenarios, such as mirror-facing routines in high-altitude environments, remains challenging due to ambient lighting variations, frequent occlusions from hand movements, and dynamic facial postures. To address these challenges, we present LADH (Long-term Altitude Daily Health), the first long-term rPPG dataset containing 240 synchronized RGB and infrared (IR) facial videos from 21 participants across five common personal care scenarios, along with ground-truth PPG, respiration, and blood oxygen signals. Our experiments demonstrate that combining RGB and IR video inputs improves the accuracy and robustness of non-contact physiological monitoring, achieving a mean absolute error (MAE) of 4.99 BPM in heart rate estimation. Furthermore, we find that multi-task learning enhances performance across multiple physiological indicators simultaneously. Dataset and code are open at https://github.com/McJackTang/FusionVitals.
comment: IEEE BSN 2025
♻ ☆ ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers ICCV 2025
Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual token in approximately 60\% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50\% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV
comment: Published as a conference paper at ICCV 2025. Project page: https://github.com/icip-cas/ShortV
♻ ☆ Rethinking Visual Intelligence: Insights from Video Pretraining
Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
comment: Updated version from preprint arXiv:2506.07280 (Gen2Gen) focused on visual intelligence. This work can be considered as v2
♻ ☆ FlexEvent: Towards Flexible Event-Frame Object Detection at Varying Operational Frequencies NeurIPS 2025
Event cameras offer unparalleled advantages for real-time perception in dynamic environments, thanks to the microsecond-level temporal resolution and asynchronous operation. Existing event detectors, however, are limited by fixed-frequency paradigms and fail to fully exploit the high-temporal resolution and adaptability of event data. To address these limitations, we propose FlexEvent, a novel framework that enables detection at varying frequencies. Our approach consists of two key components: FlexFuse, an adaptive event-frame fusion module that integrates high-frequency event data with rich semantic information from RGB frames, and FlexTune, a frequency-adaptive fine-tuning mechanism that generates frequency-adjusted labels to enhance model generalization across varying operational frequencies. This combination allows our method to detect objects with high accuracy in both fast-moving and static scenarios, while adapting to dynamic environments. Extensive experiments on large-scale event camera datasets demonstrate that our approach surpasses state-of-the-art methods, achieving significant improvements in both standard and high-frequency settings. Notably, our method maintains robust performance when scaling from 20 Hz to 90 Hz and delivers accurate detection up to 180 Hz, proving its effectiveness in extreme conditions. Our framework sets a new benchmark for event-based object detection and paves the way for more adaptable, real-time vision systems.
comment: NeurIPS 2025; 28 pages, 14 figures, 10 tables; Code at https://flexevent.github.io/
♻ ☆ SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation and Understanding NeurIPS 2025
Leveraging recent diffusion models, LiDAR-based large-scale 3D scene generation has achieved great success. While recent voxel-based approaches can generate both geometric structures and semantic labels, existing range-view methods are limited to producing unlabeled LiDAR scenes. Relying on pretrained segmentation models to predict the semantic maps often results in suboptimal cross-modal consistency. To address this limitation while preserving the advantages of range-view representations, such as computational efficiency and simplified network design, we propose Spiral, a novel range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps. Furthermore, we introduce novel semantic-aware metrics to evaluate the quality of the generated labeled range-view data. Experiments on the SemanticKITTI and nuScenes datasets demonstrate that Spiral achieves state-of-the-art performance with the smallest parameter size, outperforming two-step methods that combine the generative and segmentation models. Additionally, we validate that range images generated by Spiral can be effectively used for synthetic data augmentation in the downstream segmentation training, significantly reducing the labeling effort on LiDAR data.
comment: NeurIPS 2025; 24 pages, 10 figures, 9 tables; Code at https://dekai21.github.io/SPIRAL/
♻ ☆ Efficient Remote Sensing Change Detection with Change State Space Models
Despite their frequent use for change detection, both ConvNets and Vision transformers (ViT) exhibit well-known limitations, namely the former struggle to model long-range dependencies while the latter are computationally inefficient, rendering them challenging to train on large-scale datasets. Vision Mamba, an architecture based on State Space Models has emerged as an alternative addressing the aforementioned deficiencies and has been already applied to remote sensing change detection, though mostly as a feature extracting backbone. In this article the Change State Space Model is introduced, that has been specifically designed for change detection by focusing on the relevant changes between bi-temporal images, effectively filtering out irrelevant information. By concentrating solely on the changed features, the number of network parameters is reduced, enhancing significantly computational efficiency while maintaining high detection performance and robustness against input degradation. The proposed model has been evaluated via three benchmark datasets, where it outperformed ConvNets, ViTs, and Mamba-based counterparts at a fraction of their computational complexity. The implementation will be made available at https://github.com/Elman295/CSSM upon acceptance.
♻ ☆ Double Descent Meets Out-of-Distribution Detection: Theoretical Insights and Empirical Analysis on the role of model complexity NeurIPS 2025
Out-of-distribution (OOD) detection is essential for ensuring the reliability and safety of machine learning systems. In recent years, it has received increasing attention, particularly through post-hoc detection and training-based methods. In this paper, we focus on post-hoc OOD detection, which enables identifying OOD samples without altering the model's training procedure or objective. Our primary goal is to investigate the relationship between model capacity and its OOD detection performance. Specifically, we aim to answer the following question: Does the Double Descent phenomenon manifest in post-hoc OOD detection? This question is crucial, as it can reveal whether overparameterization, which is already known to benefit generalization, can also enhance OOD detection. Despite the growing interest in these topics by the classic supervised machine learning community, this intersection remains unexplored for OOD detection. We empirically demonstrate that the Double Descent effect does indeed appear in post-hoc OOD detection. Furthermore, we provide theoretical insights to explain why this phenomenon emerges in such setting. Finally, we show that the overparameterized regime does not yield superior results consistently, and we propose a method to identify the optimal regime for OOD detection based on our observations.
comment: Accepted at NeurIPS 2025 (Conference on Neural Information Processing Systems)
♻ ☆ SonarSplat: Novel View Synthesis of Imaging Sonar via Gaussian Splatting
In this paper, we present SonarSplat, a novel Gaussian splatting framework for imaging sonar that demonstrates realistic novel view synthesis and models acoustic streaking phenomena. Our method represents the scene as a set of 3D Gaussians with acoustic reflectance and saturation properties. We develop a novel method to efficiently rasterize Gaussians to produce a range/azimuth image that is faithful to the acoustic image formation model of imaging sonar. In particular, we develop a novel approach to model azimuth streaking in a Gaussian splatting framework. We evaluate SonarSplat using real-world datasets of sonar images collected from an underwater robotic platform in a controlled test tank and in a real-world river environment. Compared to the state-of-the-art, SonarSplat offers improved image synthesis capabilities (+3.2 dB PSNR) and more accurate 3D reconstruction (77% lower Chamfer Distance). We also demonstrate that SonarSplat can be leveraged for azimuth streak removal.
♻ ☆ A Comprehensive Evaluation of YOLO-based Deer Detection Performance on Edge Devices
The escalating economic losses in agriculture due to deer intrusion, estimated to be in the hundreds of millions of dollars annually in the U.S., highlight the inadequacy of traditional mitigation strategies such as hunting, fencing, use of repellents, and scare tactics. This underscores a critical need for intelligent, autonomous solutions capable of real-time deer detection and deterrence. But the progress in this field is impeded by a significant gap in the literature, mainly the lack of a domain-specific, practical dataset and limited study on the viability of deer detection systems on edge devices. To address this gap, this study presents a comprehensive evaluation of state-of-the-art deep learning models for deer detection in challenging real-world scenarios. We introduce a curated, publicly available dataset of 3,095 annotated images with bounding box annotations of deer. Then, we provide an extensive comparative analysis of 12 model variants across four recent YOLO architectures (v8 to v11). Finally, we evaluated their performance on two representative edge computing platforms: the CPU-based Raspberry Pi 5 and the GPU-accelerated NVIDIA Jetson AGX Xavier to assess feasibility for real-world field deployment. Results show that the real-time detection performance is not feasible on Raspberry Pi without hardware-specific model optimization, while NVIDIA Jetson provides greater than 30 frames per second (FPS) with 's' and 'n' series models. This study also reveals that smaller, architecturally advanced models such as YOLOv11n, YOLOv8s, and YOLOv9s offer the optimal balance of high accuracy (Average Precision (AP) > 0.85) and computational efficiency (Inference Time < 34 milliseconds).
comment: 13 pages, 7 figures
♻ ☆ TinyDef-DETR: A Transformer-Based Framework for Defect Detection in Transmission Lines from UAV Imagery
Automated defect detection from UAV imagery of transmission lines is a challenging task due to the small size, ambiguity, and complex backgrounds of defects. This paper proposes TinyDef-DETR, a DETR-based framework designed to achieve accurate and efficient detection of transmission line defects from UAV-acquired images. The model integrates four major components: an edge-enhanced ResNet backbone to strengthen boundary-sensitive representations, a stride-free space-to-depth module to enable detail-preserving downsampling, a cross-stage dual-domain multi-scale attention mechanism to jointly model global context and local cues, and a Focaler-Wise-SIoU regression loss to improve the localization of small and difficult objects. Together, these designs effectively mitigate the limitations of conventional detectors. Extensive experiments on both public and real-world datasets demonstrate that TinyDef-DETR achieves superior detection performance and strong generalization capability, while maintaining modest computational overhead. The accuracy and efficiency of TinyDef-DETR make it a suitable method for UAV-based transmission line defect detection, particularly in scenarios involving small and ambiguous objects.
♻ ☆ New multimodal similarity measure for image registration via modeling local functional dependence with linear combination of learned basis functions
The deformable registration of images of different modalities, essential in many medical imaging applications, remains challenging. The main challenge is developing a robust measure for image overlap despite the compared images capturing different aspects of the underlying tissue. Here, we explore similarity metrics based on functional dependence between intensity values of registered images. Although functional dependence is too restrictive on the global scale, earlier work has shown competitive performance in deformable registration when such measures are applied over small enough contexts. We confirm this finding and further develop the idea by modeling local functional dependence via the linear basis function model with the basis functions learned jointly with the deformation. The measure can be implemented via convolutions, making it efficient to compute on GPUs. We release the method as an easy-to-use tool and show good performance on three datasets compared to well-established baseline and earlier functional dependence-based methods.
comment: Improved experimental setup
♻ ☆ GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering
Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.
♻ ☆ Phys4DGen: Physics-Compliant 4D Generation with Multi-Material Composition Perception ACM MM 2025
4D content generation aims to create dynamically evolving 3D content that responds to specific input objects such as images or 3D representations. Current approaches typically incorporate physical priors to animate 3D representations, but these methods suffer from significant limitations: they not only require users lacking physics expertise to manually specify material properties but also struggle to effectively handle the generation of multi-material composite objects. To address these challenges, we propose Phys4DGen, a novel 4D generation framework that integrates multi-material composition perception with physical simulation. The framework achieves automated, physically plausible 4D generation through three innovative modules: first, the 3D Material Grouping module partitions heterogeneous material regions on 3D representations' surfaces via semantic segmentation; second, the Internal Physical Structure Discovery module constructs the mechanical structure of object interiors; finally, we distill physical prior knowledge from multimodal large language models to enable rapid and automatic material properties identification for both objects' surfaces and interiors. Experiments on both synthetic and real-world datasets demonstrate that Phys4DGen can generate high-fidelity 4D content with physical realism in open-world scenarios, significantly outperforming state-of-the-art methods.
comment: Accepted by ACM MM 2025. Project Page: https://jiajinglin.github.io/Phys4DGen
♻ ☆ Detailed Aerial Mapping of Photovoltaic Power Plants Through Semantically Significant Keypoints
An accurate and up-to-date model of a photovoltaic (PV) power plant is essential for its optimal operation and maintenance. However, such a model may not be easily available. This work introduces a novel approach for PV power plant mapping based on aerial overview images. It enables the automation of the mapping process while removing the reliance on third-party data. The presented mapping method takes advantage of the structural layout of the power plants to achieve detailed modeling down to the level of individual PV modules. The approach relies on visual segmentation of PV modules in overview images and the inference of structural information in each image, assigning modules to individual benches, rows, and columns. We identify visual keypoints related to the layout and use these to merge detections from multiple images while maintaining their structural integrity. The presented method was experimentally verified and evaluated on two different power plants. The final fusion of 3D positions and semantic structures results in a compact georeferenced model suitable for power plant maintenance.
comment: 11 pages, 18 figures. Accepted version
♻ ☆ Preliminary study on artificial intelligence methods for cybersecurity threat detection in computer networks based on raw data packets
Most of the intrusion detection methods in computer networks are based on traffic flow characteristics. However, this approach may not fully exploit the potential of deep learning algorithms to directly extract features and patterns from raw packets. Moreover, it impedes real-time monitoring due to the necessity of waiting for the processing pipeline to complete and introduces dependencies on additional software components. In this paper, we investigate deep learning methodologies capable of detecting attacks in real-time directly from raw packet data within network traffic. We propose a novel approach where packets are stacked into windows and separately recognised, with a 2D image representation suitable for processing with computer vision models. Our investigation utilizes the CIC IDS-2017 dataset, which includes both benign traffic and prevalent real-world attacks, providing a comprehensive foundation for our research.
comment: Submitted to Computer Science Journal. Version with updated acknowledgments
♻ ☆ Kineo: Calibration-Free Metric Motion Capture From Sparse RGB Cameras
Markerless multiview motion capture is often constrained by the need for precise camera calibration, limiting accessibility for non-experts and in-the-wild captures. Existing calibration-free approaches mitigate this requirement but suffer from high computational cost and reduced reconstruction accuracy. We present Kineo, a fully automatic, calibration-free pipeline for markerless motion capture from videos captured by unsynchronized, uncalibrated, consumer-grade RGB cameras. Kineo leverages 2D keypoints from off-the-shelf detectors to simultaneously calibrate cameras, including Brown-Conrady distortion coefficients, and reconstruct 3D keypoints and dense scene point maps at metric scale. A confidence-driven spatio-temporal keypoint sampling strategy, combined with graph-based global optimization, ensures robust calibration at a fixed computational cost independent of sequence length. We further introduce a pairwise reprojection consensus score to quantify 3D reconstruction reliability for downstream tasks. Evaluations on EgoHumans and Human3.6M demonstrate substantial improvements over prior calibration-free methods. Compared to previous state-of-the-art approaches, Kineo reduces camera translation error by approximately 83-85%, camera angular error by 86-92%, and world mean-per-joint error (W-MPJPE) by 83-91%. Kineo is also efficient in real-world scenarios, processing multi-view sequences faster than their duration in specific configuration (e.g., 36min to process 1h20min of footage). The full pipeline and evaluation code are openly released to promote reproducibility and practical adoption at https://liris-xr.github.io/kineo/.
♻ ☆ ReviveDiff: A Universal Diffusion Model for Restoring Images in Adverse Weather Conditions
Images captured in challenging environments--such as nighttime, smoke, rainy weather, and underwater--often suffer from significant degradation, resulting in a substantial loss of visual quality. The effective restoration of these degraded images is critical for the subsequent vision tasks. While many existing approaches have successfully incorporated specific priors for individual tasks, these tailored solutions limit their applicability to other degradations. In this work, we propose a universal network architecture, dubbed ``ReviveDiff'', which can address various degradations and bring images back to life by enhancing and restoring their quality. Our approach is inspired by the observation that, unlike degradation caused by movement or electronic issues, quality degradation under adverse conditions primarily stems from natural media (such as fog, water, and low luminance), which generally preserves the original structures of objects. To restore the quality of such images, we leveraged the latest advancements in diffusion models and developed ReviveDiff to restore image quality from both macro and micro levels across some key factors determining image quality, such as sharpness, distortion, noise level, dynamic range, and color accuracy. We rigorously evaluated ReviveDiff on seven benchmark datasets covering five types of degrading conditions: Rainy, Underwater, Low-light, Smoke, and Nighttime Hazy. Our experimental results demonstrate that ReviveDiff outperforms the state-of-the-art methods both quantitatively and visually.
♻ ☆ Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning NeurIPS 2025
Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to $\textbf{15.97%}$ in accuracy. The code is available at https://github.com/yaxinhou/CPG.
comment: The paper is accepted by NeurIPS 2025
♻ ☆ 3DViT-GAT: A Unified Atlas-Based 3D Vision Transformer and Graph Learning Framework for Major Depressive Disorder Detection Using Structural MRI Data
Major depressive disorder (MDD) is a prevalent mental health condition that negatively impacts both individual well-being and global public health. Automated detection of MDD using structural magnetic resonance imaging (sMRI) and deep learning (DL) methods holds increasing promise for improving diagnostic accuracy and enabling early intervention. Most existing methods employ either voxel-level features or handcrafted regional representations built from predefined brain atlases, limiting their ability to capture complex brain patterns. This paper develops a unified pipeline that utilizes Vision Transformers (ViTs) for extracting 3D region embeddings from sMRI data and Graph Neural Network (GNN) for classification. We explore two strategies for defining regions: (1) an atlas-based approach using predefined structural and functional brain atlases, and (2) an cube-based method by which ViTs are trained directly to identify regions from uniformly extracted 3D patches. Further, cosine similarity graphs are generated to model interregional relationships, and guide GNN-based classification. Extensive experiments were conducted using the REST-meta-MDD dataset to demonstrate the effectiveness of our model. With stratified 10-fold cross-validation, the best model obtained 78.98% accuracy, 76.54% sensitivity, 81.58% specificity, 81.58% precision, and 78.98% F1-score. Further, atlas-based models consistently outperformed the cube-based approach, highlighting the importance of using domain-specific anatomical priors for MDD detection.
comment: 14 pages, 1 figure, 7 tables
♻ ☆ Geospatial Foundation Models to Enable Progress on Sustainable Development Goals
Foundation Models (FMs) are large-scale, pre-trained artificial intelligence (AI) systems that have revolutionized natural language processing and computer vision, and are now advancing geospatial analysis and Earth Observation (EO). They promise improved generalization across tasks, scalability, and efficient adaptation with minimal labeled data. However, despite the rapid proliferation of geospatial FMs, their real-world utility and alignment with global sustainability goals remain underexplored. We introduce SustainFM, a comprehensive benchmarking framework grounded in the 17 Sustainable Development Goals with extremely diverse tasks ranging from asset wealth prediction to environmental hazard detection. This study provides a rigorous, interdisciplinary assessment of geospatial FMs and offers critical insights into their role in attaining sustainability goals. Our findings show: (1) While not universally superior, FMs often outperform traditional approaches across diverse tasks and datasets. (2) Evaluating FMs should go beyond accuracy to include transferability, generalization, and energy efficiency as key criteria for their responsible use. (3) FMs enable scalable, SDG-grounded solutions, offering broad utility for tackling complex sustainability challenges. Critically, we advocate for a paradigm shift from model-centric development to impact-driven deployment, and emphasize metrics such as energy efficiency, robustness to domain shifts, and ethical considerations.
♻ ☆ Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context
We propose a novel approach to improve action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and previous actions, to reflect the action's temporal context. To achieve this, we introduce a transformer architecture tailored for action recognition that employs both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function to simultaneously train the model for both coarse- and fine-grained action recognition, effectively exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset by incorporating action hierarchies, resulting in the Hierarchical TSU dataset, a hierarchical dataset designed for monitoring activities of the elderly in home environments. An ablation study assesses the performance impact of different strategies for integrating contextual and hierarchical data. Experimental results demonstrate that the proposed method consistently outperforms SOTA methods on the Hierarchical TSU dataset, Assembly101 and IkeaASM, achieving over a 17% improvement in top-1 accuracy.
♻ ☆ VidText: Towards Comprehensive Evaluation for Video Text Understanding
Visual texts embedded in videos carry rich semantic information, which is crucial for both holistic video understanding and fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook textual information, while OCR-specific benchmarks are constrained to static images, limiting their ability to capture the interaction between text and dynamic visual contexts. To address this gap, we propose VidText, a new benchmark designed for comprehensive and in-depth evaluation of video text understanding. VidText offers the following key features: 1) It covers a wide range of real-world scenarios and supports multilingual content, encompassing diverse settings where video text naturally appears. 2) It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks, enabling assessment of both global summarization and local retrieval capabilities. 3) The benchmark also introduces a set of paired perception reasoning tasks, ranging from visual text perception to cross-modal reasoning between textual and visual information. Extensive experiments on 18 state-of-the-art Large Multimodal Models (LMMs) reveal that current models struggle across most tasks, with significant room for improvement. Further analysis highlights the impact of both model-intrinsic factors, such as input resolution and OCR capability, and external factors, including the use of auxiliary information and Chain-of-Thought reasoning strategies. We hope VidText will fill the current gap in video understanding benchmarks and serve as a foundation for future research on multimodal reasoning with video text in dynamic environments.
♻ ☆ Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it particularly well-for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
♻ ☆ Finite element-based space-time total variation-type regularization of the inverse problem in electrocardiographic imaging
Reconstructing cardiac electrical activity from body surface electric potential measurements results in the severely ill-posed inverse problem in electrocardiography. Many different regularization approaches have been proposed to improve numerical results and provide unique results. This work presents a novel approach for reconstructing the epicardial potential from body surface potential maps based on a space-time total variation-type regularization using finite elements, where a first-order primal-dual algorithm solves the underlying convex optimization problem. In several numerical experiments, the superior performance of this method and the benefit of space-time regularization for the reconstruction of epicardial potential on two-dimensional torso data and a three-dimensional rabbit heart compared to state-of-the-art methods are demonstrated.
♻ ☆ VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation
In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point clouds feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6% on par with DP3 64.0% and far higher than DP 34.8%, while in real-world tasks, it reaches 87.9%, outperforming both DP3 67.5% and DP 11.2% by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.
♻ ☆ Transforming Hyperspectral Images Into Chemical Maps: A Novel End-to-End Deep Learning Approach
Current approaches to chemical map generation from hyperspectral images are based on models such as partial least squares (PLS) regression, generating pixel-wise predictions that do not consider spatial context and suffer from a high degree of noise. This study proposes an end-to-end deep learning approach using a modified version of U-Net and a custom loss function to directly obtain chemical maps from hyperspectral images, skipping all intermediate steps required for traditional pixel-wise analysis. This study compares the U-Net with the traditional PLS regression on a real dataset of pork belly samples with associated mean fat reference values. The U-Net obtains a test set root mean squared error that is 7% lower than that of PLS regression on the task of mean fat prediction. At the same time, U-Net generates fine detail chemical maps where 99.91% of the variance is spatially correlated. Conversely, only 2.37% of the variance in the PLS-generated chemical maps is spatially correlated, indicating that each pixel-wise prediction is largely independent of neighboring pixels. Additionally, while the PLS-generated chemical maps contain predictions far beyond the physically possible range of 0-100%, U-Net learns to stay inside this range. Thus, the findings of this study indicate that U-Net is superior to PLS for chemical map generation.
♻ ☆ Surgical Scene Understanding in the Era of Foundation AI Models: A Comprehensive Review
Recent advancements in machine learning (ML) and deep learning (DL), particularly through the introduction of Foundation Models (FMs), have significantly enhanced surgical scene understanding within minimally invasive surgery (MIS). This paper surveys the integration of state-of-the-art ML and DL technologies, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Foundation Models like the Segment Anything Model (SAM), into surgical workflows. These technologies improve segmentation accuracy, instrument tracking, and phase recognition in surgical scene understanding. The paper explores the challenges these technologies face, such as data variability and computational demands, and discusses ethical considerations and integration hurdles in clinical settings. Highlighting the roles of FMs, we bridge the technological capabilities with clinical needs and outline future research directions to enhance the adaptability, efficiency, and ethical alignment of AI applications in surgery. Our findings suggest that substantial progress has been made; however, more focused efforts are required to achieve seamless integration of these technologies into clinical workflows, ensuring they complement surgical practice by enhancing precision, reducing risks, and optimizing patient outcomes.
♻ ☆ Adjustable Spatio-Spectral Hyperspectral Image Compression Network
With the rapid growth of hyperspectral data archives in remote sensing (RS), the need for efficient storage has become essential, driving significant attention toward learning-based hyperspectral image (HSI) compression. However, a comprehensive investigation of the individual and joint effects of spectral and spatial compression on learning-based HSI compression has not been thoroughly examined yet. Conducting such an analysis is crucial for understanding how the exploitation of spectral, spatial, and joint spatio-spectral redundancies affects HSI compression. To address this issue, we propose Adjustable Spatio-Spectral Hyperspectral Image Compression Network (HyCASS), a learning-based model designed for adjustable HSI compression in both spectral and spatial dimensions. HyCASS consists of six main modules: 1) spectral encoder module; 2) spatial encoder module; 3) compression ratio (CR) adapter encoder module; 4) CR adapter decoder module; 5) spatial decoder module; and 6) spectral decoder module. The modules employ convolutional layers and transformer blocks to capture both short-range and long-range redundancies. Experimental results on three HSI benchmark datasets demonstrate the effectiveness of our proposed adjustable model compared to existing learning-based compression models, surpassing the state of the art by up to 2.36 dB in terms of PSNR. Based on our results, we establish a guideline for effectively balancing spectral and spatial compression across different CRs, taking into account the spatial resolution of the HSIs. Our code and pre-trained model weights are publicly available at https://git.tu-berlin.de/rsim/hycass .
♻ ☆ Class Agnostic Instance-level Descriptor for Visual Instance Search
Despite the great success of the deep features in content-based image retrieval, the visual instance search remains challenging due to the lack of effective instance-level feature representation. Supervised or weakly supervised object detection methods are not the appropriate solutions due to their poor performance on the unknown object categories. In this paper, based on the feature set output from self-supervised ViT, the instance-level region discovery is modeled as detecting the compact feature subsets in a hierarchical fashion. The hierarchical decomposition results in a hierarchy of instance regions. On the one hand, this kind of hierarchical decomposition well addresses the problem of object embedding and occlusions, which are widely observed in real scenarios. On the other hand, the non-leaf nodes and leaf nodes on the hierarchy correspond to the instance regions in different granularities within an image. Therefore, features in uniform length are produced for these instance regions, which may cover across a dominant image region, an integral of multiple instances, or various individual instances. Such a collection of features allows us to unify the image retrieval, multi-instance search, and instance search into one framework. The empirical studies on three benchmarks show that such an instance-level descriptor remains effective on both the known and unknown object categories. Moreover, the superior performance is achieved on single-instance and multi-instance search, as well as image retrieval tasks.
♻ ☆ SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning NeurIPS 2025
Deciphering how visual stimuli are transformed into cortical responses is a fundamental challenge in computational neuroscience. This visual-to-neural mapping is inherently a one-to-many relationship, as identical visual inputs reliably evoke variable hemodynamic responses across trials, contexts, and subjects. However, existing deterministic methods struggle to simultaneously model this biological variability while capturing the underlying functional consistency that encodes stimulus information. To address these limitations, we propose SynBrain, a generative framework that simulates the transformation from visual semantics to neural responses in a probabilistic and biologically interpretable manner. SynBrain introduces two key components: (i) BrainVAE models neural representations as continuous probability distributions via probabilistic learning while maintaining functional consistency through visual semantic constraints; (ii) A Semantic-to-Neural Mapper acts as a semantic transmission pathway, projecting visual semantics into the neural response manifold to facilitate high-fidelity fMRI synthesis. Experimental results demonstrate that SynBrain surpasses state-of-the-art methods in subject-specific visual-to-fMRI encoding performance. Furthermore, SynBrain adapts efficiently to new subjects with few-shot data and synthesizes high-quality fMRI signals that are effective in improving data-limited fMRI-to-image decoding performance. Beyond that, SynBrain reveals functional consistency across trials and subjects, with synthesized signals capturing interpretable patterns shaped by biological neural variability. Our code is available at https://github.com/MichaelMaiii/SynBrain.
comment: Accepted by NeurIPS 2025
♻ ☆ 50 Years of Water Body Monitoring: The Case of Qaraaoun Reservoir, Lebanon
The sustainable management of the Qaraaoun Reservoir, the largest surface water body in Lebanon located in the Bekaa Plain, depends on reliable monitoring of its storage volume despite frequent sensor malfunctions and limited maintenance capacity. This study introduces a sensor-free approach that integrates open-source satellite imagery, advanced water-extent segmentation, and machine learning to estimate the reservoir's surface area and, subsequently, its volume in near real time. Sentinel-2 and Landsat 1-9 images are processed, where surface water is delineated using a newly proposed water segmentation index. A machine learning model based on Support Vector Regression (SVR) is trained on a curated dataset that includes water surface area, water level, and water volume derived from a reservoir bathymetric survey. The model is then able to estimate the water body's volume solely from the extracted water surface, without the need for any ground-based measurements. Water segmentation using the proposed index aligns with ground truth for over 95% of the shoreline. Hyperparameter tuning with GridSearchCV yields an optimized SVR performance, with an error below 1.5% of the full reservoir capacity and coefficients of determination exceeding 0.98. These results demonstrate the method's robustness and cost-effectiveness, offering a practical solution for continuous, sensor-independent monitoring of reservoir storage. The proposed methodology is applicable to other water bodies and generates over five decades of time-series data, offering valuable insights into climate change and environmental dynamics, with an emphasis on capturing temporal trends rather than exact water volume measurements.
♻ ☆ Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding NeurIPS 2025
Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery (DVD) agent to leverage an agentic search strategy over segmented video clips. Unlike previous video agents that rely on predefined workflows applied uniformly across different queries, our approach emphasizes the autonomous and adaptive nature of agents. By providing a set of search-centric tools on multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLM to plan on its current observation state, strategically selects tools to orchestrate adaptive workflow for different queries in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates our advantage. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of 74.2%, which substantially surpasses all prior works, and further improves to 76.0% with transcripts. The code has been released at https://github.com/microsoft/DeepVideoDiscovery.
comment: Accepted to NeurIPS 2025
♻ ☆ FIRE: Robust Detection of Diffusion-Generated Images via Frequency-Guided Reconstruction Error CVPR 2025
The rapid advancement of diffusion models has significantly improved high-quality image generation, making generated content increasingly challenging to distinguish from real images and raising concerns about potential misuse. In this paper, we observe that diffusion models struggle to accurately reconstruct mid-band frequency information in real images, suggesting the limitation could serve as a cue for detecting diffusion model generated images. Motivated by this observation, we propose a novel method called Frequency-guided Reconstruction Error (FIRE), which, to the best of our knowledge, is the first to investigate the influence of frequency decomposition on reconstruction error. FIRE assesses the variation in reconstruction error before and after the frequency decomposition, offering a robust method for identifying diffusion model generated images. Extensive experiments show that FIRE generalizes effectively to unseen diffusion models and maintains robustness against diverse perturbations.
comment: 14 pages, 14 figures. Accepted to CVPR 2025
♻ ☆ TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
Text-conditioned image generation has gained significant attention in recent years and are processing increasingly longer and comprehensive text prompt. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of both text and visuals is essential for conveying complex information. However, despite these advances, the generation of images containing long-form text remains a persistent challenge, largely due to the limitations of existing datasets, which often focus on shorter and simpler text. To address this gap, we introduce TextAtlas5M, a novel dataset specifically designed to evaluate long-text rendering in text-conditioned image generation. Our dataset consists of 5 million long-text generated and collected images across diverse data types, enabling comprehensive evaluation of large-scale generative models on long-text image generation. We further curate 3000 human-improved test set TextAtlasEval across 3 data domains, establishing one of the most extensive benchmarks for text-conditioned generation. Evaluations suggest that the TextAtlasEval benchmarks present significant challenges even for the most advanced proprietary models (e.g. GPT4o with DallE-3), while their open-source counterparts show an even larger performance gap. These evidences position TextAtlas5M as a valuable dataset for training and evaluating future-generation text-conditioned image generation models.
comment: 27 pages, 15 figures. Dataset Website: https://textatlas5m.github.io
♻ ☆ Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large-language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model's adaptability and effectiveness across various downstream tasks.
♻ ☆ Generative diffusion modeling protocols for improving the Kikuchi pattern indexing in electron back-scatter diffraction
Electron back-scatter diffraction (EBSD) has traditionally relied upon methods such as the Hough transform and dictionary Indexing to interpret diffraction patterns and extract crystallographic orientation. However, these methods encounter significant limitations, particularly when operating at high scanning speeds, where the exposure time per pattern is decreased beyond the operating sensitivity of CCD camera. Hence the signal to noise ratio decreases for the observed pattern which makes the pattern noisy, leading to reduced indexing accuracy. This research work aims to develop generative machine learning models for the post-processing or on-the-fly processing of Kikuchi patterns which are capable of restoring noisy EBSD patterns obtained at high scan speeds. These restored patterns can be used for the determination of crystal orientations to provide reliable indexing results. We compare the performance of such generative models in enhancing the quality of patterns captured at short exposure times (high scan speeds). An interesting observation is that the methodology is not data-hungry as typical machine learning methods.
♻ ☆ Epistemic Uncertainty for Generated Image Detection NeurIPS 2025
We introduce a novel framework for AI-generated image detection through epistemic uncertainty, aiming to address critical security concerns in the era of generative models. Our key insight stems from the observation that distributional discrepancies between training and testing data manifest distinctively in the epistemic uncertainty space of machine learning models. In this context, the distribution shift between natural and generated images leads to elevated epistemic uncertainty in models trained on natural images when evaluating generated ones. Hence, we exploit this phenomenon by using epistemic uncertainty as a proxy for detecting generated images. This converts the challenge of generated image detection into the problem of uncertainty estimation, underscoring the generalization performance of the model used for uncertainty estimation. Fortunately, advanced large-scale vision models pre-trained on extensive natural images have shown excellent generalization performance for various scenarios. Thus, we utilize these pre-trained models to estimate the epistemic uncertainty of images and flag those with high uncertainty as generated. Extensive experiments demonstrate the efficacy of our method. Code is available at https://github.com/tmlr-group/WePe.
comment: 28 pages, 10 figures, NeurIPS 2025
♻ ☆ Aligning Effective Tokens with Video Anomaly in Large Language Models
Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vison Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). These modules enable our model to effectively capture and analyze both spatial and temporal information associated with abnormal events, resulting in more accurate responses and interactions. Furthermore, we construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs, and introduce a cross-domain evaluation benchmark based on XD-Violence dataset. Our proposed method outperforms existing state-of-the-art methods on various benchmarks.
♻ ☆ Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization
Cross-view geo-localization (CVGL) between drone and satellite imagery remains challenging due to severe viewpoint gaps and the presence of hard negatives, which are visually similar but geographically mismatched samples. Existing mining or reweighting strategies often use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence. In this paper, we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy. At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. At the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a training-progress signal to attenuate noisy gradients during early optimization and progressively enhance hard-negative mining as training matures. Experiments on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness and robustness of the proposed DPHR, achieving consistent improvements over state-of-the-art methods.
comment: 5 pages, 3 figures
♻ ☆ ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.
♻ ☆ A Quantitative Evaluation Framework for Explainable AI in Semantic Segmentation
Ensuring transparency and trust in artificial intelligence (AI) models is essential as they are increasingly deployed in safety-critical and high-stakes domains. Explainable AI (XAI) has emerged as a promising approach to address this challenge; however, the rigorous evaluation of XAI methods remains vital for balancing the trade-offs between model complexity, predictive performance, and interpretability. While substantial progress has been made in evaluating XAI for classification tasks, strategies tailored to semantic segmentation remain limited. Moreover, objectively assessing XAI approaches is difficult, since qualitative visual explanations provide only preliminary insights. Such qualitative methods are inherently subjective and cannot ensure the accuracy or stability of explanations. To address these limitations, this work introduces a comprehensive quantitative evaluation framework for assessing XAI in semantic segmentation, accounting for both spatial and contextual task complexities. The framework systematically integrates pixel-level evaluation strategies with carefully designed metrics to yield fine-grained interpretability insights. Simulation results using recently adapted class activation mapping (CAM)-based XAI schemes demonstrate the efficiency, robustness, and reliability of the proposed methodology. These findings advance the development of transparent, trustworthy, and accountable semantic segmentation models.
♻ ☆ Coarse Attribute Prediction with Task Agnostic Distillation for Real World Clothes Changing ReID BMVC
This work focuses on Clothes Changing Re-IDentification (CC-ReID) for the real world. Existing works perform well with high-quality (HQ) images, but struggle with low-quality (LQ) where we can have artifacts like pixelation, out-of-focus blur, and motion blur. These artifacts introduce noise to not only external biometric attributes (e.g. pose, body shape, etc.) but also corrupt the model's internal feature representation. Models usually cluster LQ image features together, making it difficult to distinguish between them, leading to incorrect matches. We propose a novel framework Robustness against Low-Quality (RLQ) to improve CC-ReID model on real-world data. RLQ relies on Coarse Attributes Prediction (CAP) and Task Agnostic Distillation (TAD) operating in alternate steps in a novel training mechanism. CAP enriches the model with external fine-grained attributes via coarse predictions, thereby reducing the effect of noisy inputs. On the other hand, TAD enhances the model's internal feature representation by bridging the gap between HQ and LQ features, via an external dataset through task-agnostic self-supervision and distillation. RLQ outperforms the existing approaches by 1.6%-2.9% Top-1 on real-world datasets like LaST, and DeepChange, while showing consistent improvement of 5.3%-6% Top-1 on PRCC with competitive performance on LTCC. *The code will be made public soon.*
comment: The 36th British Machine Vision Conference (BMVC)
♻ ☆ Multi-Focused Video Group Activities Hashing
With the explosive growth of video data in various complex scenarios, quickly retrieving group activities has become an urgent problem. However, many tasks can only retrieve videos focusing on an entire video, not the activity granularity. To solve this problem, we propose a new STVH (spatiotemporal interleaved video hashing) technique for the first time. Through a unified framework, the STVH simultaneously models individual object dynamics and group interactions, capturing the spatiotemporal evolution on both group visual features and positional features. Moreover, in real-life video retrieval scenarios, it may sometimes require activity features, while at other times, it may require visual features of objects. We then further propose a novel M-STVH (multi-focused spatiotemporal video hashing) as an enhanced version to handle this difficult task. The advanced method incorporates hierarchical feature integration through multi-focused representation learning, allowing the model to jointly focus on activity semantics features and object visual features. We conducted comparative experiments on publicly available datasets, and both STVH and M-STVH can achieve excellent results.
♻ ☆ MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification
Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and fine-tunes with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares interactions between learnable prompts with individual image patches and groups of them. This approach improves the model's ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; thereby, we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP.
comment: Published in Transactions on Machine Learning Research (09/2025)
♻ ☆ Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
♻ ☆ EgoBlind: Towards Egocentric Visual Assistance for the Blind NeurIPS'25
We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,392 first-person videos from the daily lives of blind and visually impaired individuals. It also features 5,311 questions directly posed or verified by the blind to reflect their in-situation needs for visual assistance. Each question has an average of 3 manually annotated reference answers to reduce subjectiveness. Using EgoBlind, we comprehensively evaluate 16 advanced MLLMs and find that all models struggle. The best performers achieve an accuracy near 60\%, which is far behind human performance of 87.4\%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and explore heuristic solutions for improvement. With these efforts, we hope that EgoBlind will serve as a foundation for developing effective AI assistants to enhance the independence of the blind and visually impaired. Data and code are available at https://github.com/doc-doc/EgoBlind.
comment: NeurIPS'25 (D&B Track)
♻ ☆ Accelerating Volumetric Medical Image Annotation via Short-Long Memory SAM 2
Manual annotation of volumetric medical images, such as magnetic resonance imaging (MRI) and computed tomography (CT), is a labor-intensive and time-consuming process. Recent advancements in foundation models for video object segmentation, such as Segment Anything Model 2 (SAM 2), offer a potential opportunity to significantly speed up the annotation process by manually annotating one or a few slices and then propagating target masks across the entire volume. However, the performance of SAM 2 in this context varies. Our experiments show that relying on a single memory bank and attention module is prone to error propagation, particularly at boundary regions where the target is present in the previous slice but absent in the current one. To address this problem, we propose Short-Long Memory SAM 2 (SLM-SAM 2), a novel architecture that integrates distinct short-term and long-term memory banks with separate attention modules to improve segmentation accuracy. We evaluate SLM-SAM 2 on four public datasets covering organs, bones, and muscles across MRI, CT, and ultrasound videos. We show that the proposed method markedly outperforms the default SAM 2, achieving an average Dice Similarity Coefficient improvement of 0.14 and 0.10 in the scenarios when 5 volumes and 1 volume are available for the initial adaptation, respectively. SLM-SAM 2 also exhibits stronger resistance to over-propagation, reducing the time required to correct propagated masks by 60.575% per volume compared to SAM 2, making a notable step toward more accurate automated annotation of medical images for segmentation model development.
comment: Accepted for publication in IEEE Transactions on Medical Imaging (IEEE TMI)
♻ ☆ From Drone Imagery to Livability Mapping: AI-powered Environment Perception in Rural China
The high cost of acquiring rural street view images has constrained comprehensive environmental perception in rural areas. Drone photographs, with their advantages of easy acquisition, broad coverage, and high spatial resolution, offer a viable approach for large-scale rural environmental perception. However, a systematic methodology for identifying key environmental elements from drone photographs and quantifying their impact on environmental perception remains lacking. To address this gap, a Vision-Language Contrastive Ranking Framework (VLCR) is designed for rural livability assessment in China. The framework employs chain-of-thought prompting strategies to guide multimodal large language models (MLLMs) in identifying visual features related to quality of life and ecological habitability from drone photographs. Subsequently, to address the instability in pairwise village comparison, a text description-constrained drone photograph comparison strategy is proposed. Finally, to overcome the efficiency bottleneck in nationwide pairwise village comparisons, an innovation ranking algorithm based on binary search interpolation is developed, which reduces the number of comparisons through automated selection of comparison targets. The proposed framework achieves superior performance with a Spearman Footrule distance of 0.74, outperforming mainstream commercial MLLMs by approximately 0.1. Moreover, the mechanism of concurrent comparison and ranking demonstrates a threefold enhancement in computational efficiency. Our framework has achieved data innovation and methodological breakthroughs in village livability assessment, providing strong support for large-scale village livability analysis. Keywords: Drone photographs, Environmental perception, Rural livability assessment, Multimodal large language models, Chain-of-thought prompting.
♻ ☆ Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the proposed SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
comment: v2 note: Corrected numerical values in Table 2 and Figure 4 due to a minor calculation error in v1. The overall conclusions remain unchanged. Code and models available at: https://github.com/tianciB/VFM-VAE
♻ ☆ V2X-Radar: A Multi-modal Dataset with 4D Radar for Cooperative Perception NeurIPS 2025
Modern autonomous vehicle perception systems often struggle with occlusions and limited perception range. Previous studies have demonstrated the effectiveness of cooperative perception in extending the perception range and overcoming occlusions, thereby enhancing the safety of autonomous driving. In recent years, a series of cooperative perception datasets have emerged; however, these datasets primarily focus on cameras and LiDAR, neglecting 4D Radar, a sensor used in single-vehicle autonomous driving to provide robust perception in adverse weather conditions. In this paper, to bridge the gap created by the absence of 4D Radar datasets in cooperative perception, we present V2X-Radar, the first large-scale, real-world multi-modal dataset featuring 4D Radar. V2X-Radar dataset is collected using a connected vehicle platform and an intelligent roadside unit equipped with 4D Radar, LiDAR, and multi-view cameras. The collected data encompasses sunny and rainy weather conditions, spanning daytime, dusk, and nighttime, as well as various typical challenging scenarios. The dataset consists of 20K LiDAR frames, 40K camera images, and 20K 4D Radar data, including 350K annotated boxes across five categories. To support various research domains, we have established V2X-Radar-C for cooperative perception, V2X-Radar-I for roadside perception, and V2X-Radar-V for single-vehicle perception. Furthermore, we provide comprehensive benchmarks across these three sub-datasets. We will release all datasets and benchmark codebase at https://huggingface.co/datasets/yanglei18/V2X-Radar and https://github.com/yanglei18/V2X-Radar.
comment: NeurIPS 2025 Spotlight
♻ ☆ SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding
Recent advancements in image editing have utilized large-scale multimodal models to enable intuitive, natural instruction-driven interactions. However, conventional methods still face significant challenges, particularly in spatial reasoning, precise region segmentation, and maintaining semantic consistency, especially in complex scenes. To overcome these challenges, we introduce SmartFreeEdit, a novel end-to-end framework that integrates a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture, enabling precise, mask-free image editing guided exclusively by natural language instructions. The key innovations of SmartFreeEdit include:(1)the introduction of region aware tokens and a mask embedding paradigm that enhance the spatial understanding of complex scenes;(2) a reasoning segmentation pipeline designed to optimize the generation of editing masks based on natural language instructions;and (3) a hypergraph-augmented inpainting module that ensures the preservation of both structural integrity and semantic coherence during complex edits, overcoming the limitations of local-based image generation. Extensive experiments on the Reason-Edit benchmark demonstrate that SmartFreeEdit surpasses current state-of-the-art methods across multiple evaluation metrics, including segmentation accuracy, instruction adherence, and visual quality preservation, while addressing the issue of local information focus and improving global consistency in the edited image. Our project will be available at https://github.com/smileformylove/SmartFreeEdit.
♻ ☆ Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound
Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully-supervised learning, weakly-supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. In this study, we introduce a novel multi-agent reinforcement learning-based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel-based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents' erasing process precisely, thereby avoiding both under- and over-segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in-house BUS and ABUS datasets, our Flip Learning method outperforms state-of-the-art WSS methods and foundation models, and achieves comparable performance as fully-supervised learning algorithms.
comment: Accepted by Medical Image Analysis. 24 pages, 13 figures, 20 tabels
♻ ☆ Pragmatic Heterogeneous Collaborative Perception via Generative Communication Mechanism NeurIPS 2025
Multi-agent collaboration enhances the perception capabilities of individual agents through information sharing. However, in real-world applications, differences in sensors and models across heterogeneous agents inevitably lead to domain gaps during collaboration. Existing approaches based on adaptation and reconstruction fail to support pragmatic heterogeneous collaboration due to two key limitations: (1) Intrusive retraining of the encoder or core modules disrupts the established semantic consistency among agents; and (2) accommodating new agents incurs high computational costs, limiting scalability. To address these challenges, we present a novel Generative Communication mechanism (GenComm) that facilitates seamless perception across heterogeneous multi-agent systems through feature generation, without altering the original network, and employs lightweight numerical alignment of spatial information to efficiently integrate new agents at minimal cost. Specifically, a tailored Deformable Message Extractor is designed to extract spatial message for each collaborator, which is then transmitted in place of intermediate features. The Spatial-Aware Feature Generator, utilizing a conditional diffusion model, generates features aligned with the ego agent's semantic space while preserving the spatial information of the collaborators. These generated features are further refined by a Channel Enhancer before fusion. Experiments conducted on the OPV2V-H, DAIR-V2X and V2X-Real datasets demonstrate that GenComm outperforms existing state-of-the-art methods, achieving an 81% reduction in both computational cost and parameter count when incorporating new agents. Our code is available at https://github.com/jeffreychou777/GenComm.
comment: 26 pages, 10 figures, accepted to NeurIPS 2025
♻ ☆ DepthVanish: Optimizing Adversarial Interval Structures for Stereo-Depth-Invisible Patches
Stereo depth estimation is a critical task in autonomous driving and robotics, where inaccuracies (such as misidentifying nearby objects as distant) can lead to dangerous situations. Adversarial attacks against stereo depth estimation can help reveal vulnerabilities before deployment. Previous works have shown that repeating optimized textures can effectively mislead stereo depth estimation in digital settings. However, our research reveals that these naively repeated textures perform poorly in physical implementations, i.e., when deployed as patches, limiting their practical utility for stress-testing stereo depth estimation systems. In this work, for the first time, we discover that introducing regular intervals among the repeated textures, creating a grid structure, significantly enhances the patch's attack performance. Through extensive experimentation, we analyze how variations of this novel structure influence the adversarial effectiveness. Based on these insights, we develop a novel stereo depth attack that jointly optimizes both the interval structure and texture elements. Our generated adversarial patches can be inserted into any scenes and successfully attack advanced stereo depth estimation methods of different paradigms, i.e., RAFT-Stereo and STTR. Most critically, our patch can also attack commercial RGB-D cameras (Intel RealSense) in real-world conditions, demonstrating their practical relevance for security assessment of stereo systems. The code is officially released at: https://github.com/WiWiN42/DepthVanish
♻ ☆ DMVFC: Deep Learning Based Functionally Consistent Tractography Fiber Clustering Using Multimodal Diffusion MRI and Functional MRI
Tractography fiber clustering using diffusion MRI (dMRI) is a crucial method for white matter (WM) parcellation to enable analysis of brains structural connectivity in health and disease. Current fiber clustering strategies primarily use the fiber geometric characteristics (i.e., the spatial trajectories) to group similar fibers into clusters, while neglecting the functional and microstructural information of the fiber tracts. There is increasing evidence that neural activity in the WM can be measured using functional MRI (fMRI), providing potentially valuable multimodal information for fiber clustering to enhance its functional coherence. Furthermore, microstructural features such as fractional anisotropy (FA) can be computed from dMRI as additional information to ensure the anatomical coherence of the clusters. In this paper, we develop a novel deep learning fiber clustering framework, namely Deep Multi-view Fiber Clustering (DMVFC), which uses joint multi-modal dMRI and fMRI data to enable functionally consistent WM parcellation. DMVFC can effectively integrate the geometric and microstructural characteristics of the WM fibers with the fMRI BOLD signals along the fiber tracts. DMVFC includes two major components: (1) a multi-view pretraining module to compute embedding features from each source of information separately, including fiber geometry, microstructure measures, and functional signals, and (2) a collaborative fine-tuning module to simultaneously refine the differences of embeddings. In the experiments, we compare DMVFC with two state-of-the-art fiber clustering methods and demonstrate superior performance in achieving functionally meaningful and consistent WM parcellation results.
comment: 14 pages
♻ ☆ MotionGPT3: Human Motion as a Second Modality
With the rapid progress of large language models (LLMs), multimodal frameworks that unify understanding and generation have become promising, yet they face increasing complexity as the number of modalities and tasks grows. We observe that motion quantization introduces approximation errors that cap motion quality, and that unifying discrete text and continuous motion within a single-stream backbone amplifies cross-modal interference. Motivated by recent multi-branch Transformer designs that separate signals from different modalities, we propose MotionGPT3, a bimodal motion-language model for both understanding and generation. MotionGPT3 encodes raw motion into a continuous latent space using a variational autoencoder (VAE), thereby avoiding quantization-induced artifacts, while leveraging the semantic prior of pretrained language models. A dual-stream Transformer with shared attention preserves modality-specific routes while enabling controlled, bidirectional information flow, which reduces interference, stabilizing optimization, and empirically accelerates convergence without degrading fidelity. For multimodal joint training, a generate-then-align three-stage schedule further improves stability and limits cross-task interference. Experiments show that MotionGPT3 achieves 2x faster convergence in training loss and up to 4x faster convergence in validation, while maintaining state-of-the-art performance on standard motion understanding and motion generation benchmarks.
comment: 26 pages, 11 figures
♻ ☆ Risk-adaptive Activation Steering for Safe Multimodal Large Language Models
One of the key challenges of modern AI models is ensuring that they provide helpful responses to benign queries while refusing malicious ones. But often, the models are vulnerable to multimodal queries with harmful intent embedded in images. One approach for safety alignment is training with extensive safety datasets at the significant costs in both dataset curation and training. Inference-time alignment mitigates these costs, but introduces two drawbacks: excessive refusals from misclassified benign queries and slower inference speed due to iterative output adjustments. To overcome these limitations, we propose to reformulate queries to strengthen cross-modal attention to safety-critical image regions, enabling accurate risk assessment at the query level. Using the assessed risk, it adaptively steers activations to generate responses that are safe and helpful without overhead from iterative output adjustments. We call this Risk-adaptive Activation Steering (RAS). Extensive experiments across multiple benchmarks on multimodal safety and utility demonstrate that the RAS significantly reduces attack success rates, preserves general task performance, and improves inference speed over prior inference-time defenses.
♻ ☆ Where and How to Perturb: On the Design of Perturbation Guidance in Diffusion and Flow Models NeurIPS 2025
Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
comment: Accepted at NeurIPS 2025. Project page: https://cvlab-kaist.github.io/HeadHunter/
♻ ☆ AIM: Adaptive Intra-Network Modulation for Balanced Multimodal Learning
Multimodal learning has significantly enhanced machine learning performance but still faces numerous challenges and limitations. Imbalanced multimodal learning is one of the problems extensively studied in recent works and is typically mitigated by modulating the learning of each modality. However, we find that these methods typically hinder the dominant modality's learning to promote weaker modalities, which affects overall multimodal performance. We analyze the cause of this issue and highlight a commonly overlooked problem: optimization bias within networks. To address this, we propose Adaptive Intra-Network Modulation (AIM) to improve balanced modality learning. AIM accounts for differences in optimization state across parameters and depths within the network during modulation, achieving balanced multimodal learning without hindering either dominant or weak modalities for the first time. Specifically, AIM decouples the dominant modality's under-optimized parameters into Auxiliary Blocks and encourages reliance on these performance-degraded blocks for joint training with weaker modalities. This approach effectively prevents suppression of weaker modalities while enabling targeted optimization of under-optimized parameters to improve the dominant modality. Additionally, AIM assesses modality imbalance level across network depths and adaptively adjusts modulation strength at each depth. Experimental results demonstrate that AIM outperforms state-of-the-art imbalanced modality learning methods across multiple benchmarks and exhibits strong generalizability across different backbones, fusion strategies, and optimizers.
comment: 13pages,7 figures
♻ ☆ Understanding Ice Crystal Habit Diversity with Self-Supervised Learning NeurIPS 2025
Ice-containing clouds strongly impact climate, but they are hard to model due to ice crystal habit (i.e., shape) diversity. We use self-supervised learning (SSL) to learn latent representations of crystals from ice crystal imagery. By pre-training a vision transformer with many cloud particle images, we learn robust representations of crystal morphology, which can be used for various science-driven tasks. Our key contributions include (1) validating that our SSL approach can be used to learn meaningful representations, and (2) presenting a relevant application where we quantify ice crystal diversity with these latent representations. Our results demonstrate the power of SSL-driven representations to improve the characterization of ice crystals and subsequently constrain their role in Earth's climate system.
comment: Accepted to NeurIPS 2025 Workshop: Tackling Climate Change with Machine Learning
♻ ☆ FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility. Please check our website for more details: https://czzzzh.github.io/FreeArt3D
comment: Project Page: https://czzzzh.github.io/FreeArt3D Code: https://github.com/CzzzzH/FreeArt3D
♻ ☆ MediQ-GAN: Quantum-Inspired GAN for High Resolution Medical Image Generation
Machine learning-assisted diagnosis shows promise, yet medical imaging datasets are often scarce, imbalanced, and constrained by privacy, making data augmentation essential. Classical generative models typically demand extensive computational and sample resources. Quantum computing offers a promising alternative, but existing quantum-based image generation methods remain limited in scale and often face barren plateaus. We present MediQ-GAN, a quantum-inspired GAN with prototype-guided skip connections and a dual-stream generator that fuses classical and quantum-inspired branches. Its variational quantum circuits inherently preserve full-rank mappings, avoid rank collapse, and are theory-guided to balance expressivity with trainability. Beyond generation quality, we provide the first latent-geometry and rank-based analysis of quantum-inspired GANs, offering theoretical insight into their performance. Across three medical imaging datasets, MediQ-GAN outperforms state-of-the-art GANs and diffusion models. While validated on IBM hardware for robustness, our contribution is hardware-agnostic, offering a scalable and data-efficient framework for medical image generation and augmentation.
♻ ☆ Real World Federated Learning with a Knowledge Distilled Transformer for Cardiac CT Imaging
Federated learning is a renowned technique for utilizing decentralized data while preserving privacy. However, real-world applications often face challenges like partially labeled datasets, where only a few locations have certain expert annotations, leaving large portions of unlabeled data unused. Leveraging these could enhance transformer architectures ability in regimes with small and diversely annotated sets. We conduct the largest federated cardiac CT analysis to date (n=8,104) in a real-world setting across eight hospitals. Our two-step semi-supervised strategy distills knowledge from task-specific CNNs into a transformer. First, CNNs predict on unlabeled data per label type and then the transformer learns from these predictions with label-specific heads. This improves predictive accuracy and enables simultaneous learning of all partial labels across the federation, and outperforms UNet-based models in generalizability on downstream tasks. Code and model weights are made openly available for leveraging future cardiac CT analysis.
♻ ☆ GeoSDF: Plane Geometry Diagram Synthesis via Signed Distance Field
Plane Geometry Diagram Synthesis has been a crucial task in computer graphics, with applications ranging from educational tools to AI-driven mathematical reasoning. Traditionally, we rely on manual tools (e.g., Matplotlib and GeoGebra) to generate precise diagrams, but this usually requires huge, complicated calculations. Recently, researchers start to work on model-based methods (e.g., Stable Diffusion and GPT5) to automatically generate diagrams, saving operational cost but usually suffering from limited realism and insufficient accuracy. In this paper, we propose a novel framework GeoSDF, to automatically generate diagrams efficiently and accurately with Signed Distance Field (SDF). Specifically, we first represent geometric elements (e.g., points, segments, and circles) in the SDF, then construct a series of constraint functions to represent geometric relationships. Next, we optimize those constructed constraint functions to get an optimized field of both elements and constraints. Finally, by rendering the optimized field, we can obtain the synthesized diagram. In our GeoSDF, we define a symbolic language to represent geometric elements and constraints, and our synthesized geometry diagrams can be self-verified in the SDF, ensuring both mathematical accuracy and visual plausibility. In experiments, through both qualitative and quantitative analysis, GeoSDF synthesized both normal high-school level and IMO-level geometry diagrams. We achieve 88.67\% synthesis accuracy by human evaluation in the IMO problem set. Furthermore, we obtain a very high accuracy of solving geometry problems (over 95\% while the current SOTA accuracy is around 75%) by leveraging our self-verification property. All of these demonstrate the advantage of GeoSDF, paving the way for more sophisticated, accurate, and flexible generation of geometric diagrams for a wide array of applications.
♻ ☆ What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?
Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stereo matching performance using standard benchmarks. We validate our findings by collecting the best settings and creating a large-scale dataset. Training only on this dataset achieves better performance than training on a mixture of widely used datasets, and is competitive with training on the FoundationStereo dataset, with the additional benefit of open-source generation code and an accompanying parameter analysis to enable further research. We open-source our system at https://github.com/princeton-vl/InfinigenStereo to enable further research on procedural stereo datasets.
♻ ☆ A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions
When we train models on biased datasets, they not only reproduce data biases, but can worsen them at test time - a phenomenon called bias amplification. Many of the current bias amplification metrics (e.g., BA (MALS), DPA) measure bias amplification only in classification datasets. These metrics are ineffective for image captioning datasets, as they cannot capture the language semantics of a caption. Recent work introduced Leakage in Captioning (LIC), a language-aware bias amplification metric that understands caption semantics. However, LIC has a crucial limitation: it cannot identify the source of bias amplification in captioning models. We propose Directional Bias Amplification in Captioning (DBAC), a language-aware and directional metric that can identify when captioning models amplify biases. DBAC has two more improvements over LIC: (1) it is less sensitive to sentence encoders (a hyperparameter in language-aware metrics), and (2) it provides a more accurate estimate of bias amplification in captions. Our experiments on gender and race attributes in the COCO captions dataset show that DBAC is the only reliable metric to measure bias amplification in captions.
♻ ☆ Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
This work presents Sa2VA, the first comprehensive, unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with MLLM, the advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves strong performance across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications. In addition, Sa2VA can be easily extended into various VLMs, including Qwen-VL and Intern-VL, which can be updated with rapid process in current open-sourced VLMs. Code and models have been provided to the community.
comment: Code: https://github.com/Bytedance/Sa2VA
Machine Learning 91
♻ ☆ GTAlign: Game-Theoretic Alignment of LLM Assistants for Social Welfare
Large Language Models (LLMs) have achieved remarkable progress in reasoning, yet sometimes produce responses that are suboptimal for users in tasks such as writing, information seeking, or providing practical guidance. Conventional alignment practices typically assume that maximizing model reward also maximizes user welfare, but this assumption frequently fails in practice: models may over-clarify or generate overly verbose reasoning when users prefer concise answers. Such behaviors resemble the prisoner's dilemma, where individually rational choices lead to socially suboptimal outcomes. The fundamental challenge is the lack of a principled decision making mechanism that mutually benefits both the LLM and the user. We propose Game-Theoretic Alignment (GTAlign), an alignment framework that integrates game-theoretic decision making into both reasoning and training. During reasoning, the model explicitly treats user-LLM interaction as a strategic game: it constructs payoff matrices within its reasoning chain to estimate welfare for both itself and the user, and then selects actions that are mutually beneficial. During training, we introduce a social welfare reward that reinforces cooperative responses, aligning model behavior with socially efficient outcomes. In addition, we introduce an inference technique that leverages game-theoretic reasoning to dynamically adapt LLM's response when pricing policies of LLM service change. Extensive experiments demonstrate that GTAlign substantially improves reasoning efficiency, answer quality, and social welfare compared to baselines across diverse tasks. The code is available at https://github.com/ulab-uiuc/GTAlign .
comment: 31 pages, 6 figures
♻ ☆ TabArena: A Living Benchmark for Machine Learning on Tabular Data NeurIPS 2025
With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.
comment: Accepted (spotlight) at NeurIPS 2025 Datasets and Benchmarks Track. v4: fixed links in comments. v3: NeurIPS camera-ready version. v2: fixed author list. 51 pages. Code available at https://tabarena.ai/code and examples at https://tabarena.ai/code-examples and dataset curation at https://tabarena.ai/data-tabular-ml-iid-study and https://tabarena.ai/dataset-curation
♻ ☆ Cold-Start Active Preference Learning in Socio-Economic Domains
Active preference learning offers an efficient approach to modeling preferences, but it is hindered by the cold-start problem, which leads to a marked decline in performance when no initial labeled data are available. While cold-start solutions have been proposed for domains such as vision and text, the cold-start problem in active preference learning remains largely unexplored, underscoring the need for practical, effective methods. Drawing inspiration from established practices in social and economic research, the proposed method initiates learning with a self-supervised phase that employs Principal Component Analysis (PCA) to generate initial pseudo-labels. This process produces a \say{warmed-up} model based solely on the data's intrinsic structure, without requiring expert input. The model is then refined through an active learning loop that strategically queries a simulated noisy oracle for labels. Experiments conducted on various socio-economic datasets, including those related to financial credibility, career success rate, and socio-economic status, consistently show that the PCA-driven approach outperforms standard active learning strategies that start without prior information. This work thus provides a computationally efficient and straightforward solution that effectively addresses the cold-start problem.
♻ ☆ RELATE: A Schema-Agnostic Perceiver Encoder for Multimodal Relational Graphs
Relational multi-table data is common in domains such as e-commerce, healthcare, and scientific research, and can be naturally represented as heterogeneous temporal graphs with multi-modal node attributes. Existing graph neural networks (GNNs) rely on schema-specific feature encoders, requiring separate modules for each node type and feature column, which hinders scalability and parameter sharing. We introduce RELATE (Relational Encoder for Latent Aggregation of Typed Entities), a schema-agnostic, plug-and-play feature encoder that can be used with any general purpose GNN. RELATE employs shared modality-specific encoders for categorical, numerical, textual, and temporal attributes, followed by a Perceiver-style cross-attention module that aggregates features into a fixed-size, permutation-invariant node representation. We evaluate RELATE on ReLGNN and HGT in the RelBench benchmark, where it achieves performance within 3% of schema-specific encoders while reducing parameter counts by up to 5x. This design supports varying schemas and enables multi-dataset pretraining for general-purpose GNNs, paving the way toward foundation models for relational graph data.
comment: 6 pages
♻ ☆ SimKey: A Semantically Aware Key Module for Watermarking Language Models
The rapid spread of text generated by large language models (LLMs) makes it increasingly difficult to distinguish authentic human writing from machine output. Watermarking offers a promising solution: model owners can embed an imperceptible signal into generated text, marking its origin. Most leading approaches seed an LLM's next-token sampling with a pseudo-random key that can later be recovered to identify the text as machine-generated, while only minimally altering the model's output distribution. However, these methods suffer from two related issues: (i) watermarks are brittle to simple surface-level edits such as paraphrasing or reordering; and (ii) adversaries can append unrelated, potentially harmful text that inherits the watermark, risking reputational damage to model owners. To address these issues, we introduce SimKey, a semantic key module that strengthens watermark robustness by tying key generation to the meaning of prior context. SimKey uses locality-sensitive hashing over semantic embeddings to ensure that paraphrased text yields the same watermark key, while unrelated or semantically shifted text produces a different one. Integrated with state-of-the-art watermarking schemes, SimKey improves watermark robustness to paraphrasing and translation while preventing harmful content from false attribution, establishing semantic-aware keying as a practical and extensible watermarking direction.
♻ ☆ Automotive Crash Dynamics Modeling Accelerated with Machine Learning
Crashworthiness assessment is a critical aspect of automotive design, traditionally relying on high-fidelity finite element (FE) simulations that are computationally expensive and time-consuming. This work presents an exploratory comparative study on developing machine learning-based surrogate models for efficient prediction of structural deformation in crash scenarios using the NVIDIA PhysicsNeMo framework. Given the limited prior work applying machine learning to structural crash dynamics, the primary contribution lies in demonstrating the feasibility and engineering utility of the various modeling approaches explored in this work. We investigate two state-of-the-art neural network architectures for modeling crash dynamics: MeshGraphNet, and Transolver. Additionally, we examine three strategies for modeling transient dynamics: time-conditional, the standard Autoregressive approach, and a stability-enhanced Autoregressive scheme incorporating rollout-based training. The models are evaluated on a comprehensive Body-in-White (BIW) crash dataset comprising 150 detailed FE simulations using LS-DYNA. The dataset represents a structurally rich vehicle assembly with over 200 components, including 38 key components featuring variable thickness distributions to capture realistic manufacturing variability. Each model utilizes the undeformed mesh geometry and component characteristics as inputs to predict the spatiotemporal evolution of the deformed mesh during the crash sequence. Evaluation results show that the models capture the overall deformation trends with reasonable fidelity, demonstrating the feasibility of applying machine learning to structural crash dynamics. Although not yet matching full FE accuracy, the models achieve orders-of-magnitude reductions in computational cost, enabling rapid design exploration and early-stage optimization in crashworthiness evaluation.
♻ ☆ CosmoBench: A Multiscale, Multiview, Multitask Cosmology Benchmark for Geometric Deep Learning NeurIPS 2025
Cosmological simulations provide a wealth of data in the form of point clouds and directed trees. A crucial goal is to extract insights from this data that shed light on the nature and composition of the Universe. In this paper we introduce CosmoBench, a benchmark dataset curated from state-of-the-art cosmological simulations whose runs required more than 41 million core-hours and generated over two petabytes of data. CosmoBench is the largest dataset of its kind: it contains 34 thousand point clouds from simulations of dark matter halos and galaxies at three different length scales, as well as 25 thousand directed trees that record the formation history of halos on two different time scales. The data in CosmoBench can be used for multiple tasks -- to predict cosmological parameters from point clouds and merger trees, to predict the velocities of individual halos and galaxies from their collective positions, and to reconstruct merger trees on finer time scales from those on coarser time scales. We provide several baselines on these tasks, some based on established approaches from cosmological modeling and others rooted in machine learning. For the latter, we study different approaches -- from simple linear models that are minimally constrained by symmetries to much larger and more computationally-demanding models in deep learning, such as graph neural networks. We find that least-squares fits with a handful of invariant features sometimes outperform deep architectures with many more parameters and far longer training time. Still there remains tremendous potential to improve these baselines by combining machine learning and cosmology to fully exploit the data. CosmoBench sets the stage for bridging cosmology and geometric deep learning at scale. We invite the community to push the frontier of scientific discovery by engaging with this dataset, available at https://cosmobench.streamlit.app
comment: Accepted at NeurIPS 2025
♻ ☆ PO-CKAN:Physics Informed Deep Operator Kolmogorov Arnold Networks with Chunk Rational Structure
We propose PO-CKAN, a physics-informed deep operator framework based on Chunkwise Rational Kolmogorov--Arnold Networks (KANs), for approximating the solution operators of partial differential equations. This framework leverages a Deep Operator Network (DeepONet) architecture that incorporates Chunkwise Rational Kolmogorov-Arnold Network (CKAN) sub-networks for enhanced function approximation. The principles of Physics-Informed Neural Networks (PINNs) are integrated into the operator learning framework to enforce physical consistency. This design enables the efficient learning of physically consistent spatio-temporal solution operators and allows for rapid prediction for parametric time-dependent PDEs with varying inputs (e.g., parameters, initial/boundary conditions) after training. Validated on challenging benchmark problems, PO-CKAN demonstrates accurate operator learning with results closely matching high-fidelity solutions. PO-CKAN adopts a DeepONet-style branch--trunk architecture with its sub-networks instantiated as rational KAN modules, and enforces physical consistency via a PDE residual (PINN-style) loss. On Burgers' equation with $\nu=0.01$, PO-CKAN reduces the mean relative $L^2$ error by approximately 48\% compared to PI-DeepONet, and achieves competitive accuracy on the Eikonal and diffusion--reaction benchmarks.
♻ ☆ MarsLGPR: Mars Rover Localization with Ground Penetrating Radar
In this work, we propose the use of Ground Penetrating Radar (GPR) for rover localization on Mars. Precise pose estimation is an important task for mobile robots exploring planetary surfaces, as they operate in GPS-denied environments. Although visual odometry provides accurate localization, it is computationally expensive and can fail in dim or high-contrast lighting. Wheel encoders can also provide odometry estimation, but are prone to slipping on the sandy terrain encountered on Mars. Although traditionally a scientific surveying sensor, GPR has been used on Earth for terrain classification and localization through subsurface feature matching. The Perseverance rover and the upcoming ExoMars rover have GPR sensors already equipped to aid in the search of water and mineral resources. We propose to leverage GPR to aid in Mars rover localization. Specifically, we develop a novel GPR-based deep learning model that predicts 1D relative pose translation. We fuse our GPR pose prediction method with inertial and wheel encoder data in a filtering framework to output rover localization. We perform experiments in a Mars analog environment and demonstrate that our GPR-based displacement predictions both outperform wheel encoders and improve multi-modal filtering estimates in high-slip environments. Lastly, we present the first dataset aimed at GPR-based localization in Mars analog environments, which will be made publicly available at https://umfieldrobotics.github.io/marslgpr.
♻ ☆ AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement
We introduce AnyEnhance, a unified generative model for voice enhancement that processes both speech and singing voices. Based on a masked generative model, AnyEnhance is capable of handling both speech and singing voices, supporting a wide range of enhancement tasks including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all simultaneously and without fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning, which allows the model to natively accept a reference speaker's timbre. In this way, it could boost enhancement performance when a reference audio is available and enable the target speaker extraction task without altering the underlying architecture. Moreover, we also introduce a self-critic mechanism into the generative process for masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments on various enhancement tasks demonstrate AnyEnhance outperforms existing methods in terms of both objective metrics and subjective listening tests. Demo audios are publicly available at https://amphionspace.github.io/anyenhance. An open-source implementation is provided at https://github.com/viewfinder-annn/anyenhance-v1-ccf-aatc.
comment: Accepted by IEEE TASLP 2025. Demopage: https://amphionspace.github.io/anyenhance. Open-source implementation: https://github.com/viewfinder-annn/anyenhance-v1-ccf-aatc
♻ ☆ Benchmarking LLMs in Web API Integration Tasks
API integration is a cornerstone of our digital infrastructure, enabling software systems to connect and interact. However, as shown by many studies, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. In order to address this, we present WAPIIBench, a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models was able to solve more than 40% of the tasks.
comment: To be published in Proceedings of 2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware), Data & Benchmark Track; switched to IEEE conference template
♻ ☆ Electrical Load Forecasting over Multihop Smart Metering Networks with Federated Learning
Electric load forecasting is essential for power management and stability in smart grids. This is mainly achieved via advanced metering infrastructure, where smart meters (SMs) record household energy data. Traditional machine learning (ML) methods are often employed for load forecasting, but require data sharing, which raises data privacy concerns. Federated learning (FL) can address this issue by running distributed ML models at local SMs without data exchange. However, current FL-based approaches struggle to achieve efficient load forecasting due to imbalanced data distribution across heterogeneous SMs. This paper presents a novel personalized federated learning (PFL) method for high-quality load forecasting in metering networks. A meta-learning-based strategy is developed to address data heterogeneity at local SMs in the collaborative training of local load forecasting models. Moreover, to minimize the load forecasting delays in our PFL model, we study a new latency optimization problem based on optimal resource allocation at SMs. A theoretical convergence analysis is also conducted to provide insights into FL design for federated load forecasting. Extensive simulations from real-world datasets show that our method outperforms existing approaches regarding better load forecasting and reduced operational latency costs.
comment: Accepted at IEEE Internet of Things Journal, DOI: 10.1109/JIOT.2025.3586115
♻ ☆ Double Descent Meets Out-of-Distribution Detection: Theoretical Insights and Empirical Analysis on the role of model complexity NeurIPS 2025
Out-of-distribution (OOD) detection is essential for ensuring the reliability and safety of machine learning systems. In recent years, it has received increasing attention, particularly through post-hoc detection and training-based methods. In this paper, we focus on post-hoc OOD detection, which enables identifying OOD samples without altering the model's training procedure or objective. Our primary goal is to investigate the relationship between model capacity and its OOD detection performance. Specifically, we aim to answer the following question: Does the Double Descent phenomenon manifest in post-hoc OOD detection? This question is crucial, as it can reveal whether overparameterization, which is already known to benefit generalization, can also enhance OOD detection. Despite the growing interest in these topics by the classic supervised machine learning community, this intersection remains unexplored for OOD detection. We empirically demonstrate that the Double Descent effect does indeed appear in post-hoc OOD detection. Furthermore, we provide theoretical insights to explain why this phenomenon emerges in such setting. Finally, we show that the overparameterized regime does not yield superior results consistently, and we propose a method to identify the optimal regime for OOD detection based on our observations.
comment: Accepted at NeurIPS 2025 (Conference on Neural Information Processing Systems)
♻ ☆ Combinatorial Creativity: A New Frontier in Generalization Abilities
Artificial intelligence (AI) systems, and Large Language Models (LLMs) in particular, are increasingly employed for creative tasks like scientific idea generation, constituting a form of generalization from training data unaddressed by existing conceptual frameworks. Despite its similarities to compositional generalization (CG), combinatorial creativity (CC) is an open-ended ability. Instead of evaluating for accuracy or correctness against fixed targets, which would contradict the open-ended nature of CC, we propose a theoretical framework and algorithmic task for evaluating outputs by their degrees of novelty and utility. From here, we make several important empirical contributions: (1) We obtain the first insights into the scaling behavior of creativity for LLMs. (2) We discover that, for fixed compute budgets, there exist optimal model depths and widths for creative ability. (3) We find that the ideation-execution gap, whereby LLMs excel at generating novel scientific ideas but struggle to ensure their practical feasibility, may be explained by a more fundamental novelty-utility tradeoff characteristic of creativity algorithms in general. Importantly, this tradeoff remains persistent even at scale, casting doubt on the long-term creative potential of LLMs in their current form. Together, our conceptual framework and empirical findings provide a foundation for understanding and improving creativity in modern AI models, bridging the gap between human and machine intelligence.
comment: Preprint. The first two authors contributed equally
♻ ☆ Towards Personalized Treatment Plan: Geometrical Model-Agnostic Approach to Counterfactual Explanations
In our article, we describe a method for generating counterfactual explanations in high-dimensional spaces using four steps that involve fitting our dataset to a model, finding the decision boundary, determining constraints on the problem, and computing the closest point (counterfactual explanation) from that boundary. We propose a discretized approach where we find many discrete points on the boundary and then identify the closest feasible counterfactual explanation. This method, which we later call $\textit{Optimal Point for Boundary Approximation}$ (OPBA), applies binary search to find decision boundary points and then searches for the closest boundary point. Across four datasets of varying dimensionality, we show that our method can outperform current methods for counterfactual generation with reductions in distance between $5\%$ to $50\%$ in terms of the $L_2$ norm. Our method can also handle real-world constraints by restricting changes to immutable and categorical features, such as age, gender, sex, height, and other related characteristics such as the case for a health-based dataset. In terms of runtime, the OPBA algorithm generates decision boundary points on multiple orders of magnitude in the same given time when we compare to a grid-based approach. In general, our method provides a simple and effective model-agnostic method that can compute nearest feasible (i.e. realistic with constraints) counterfactual explanations. All of our results and code are available at: https://github.com/dsin85691/OPBA_For_Counterfactuals
comment: This paper is 15 pages long consisting of multiple sections including an abstract, introduction, related works, methodology, results, ablation studies, conclusion, future works, and an appendix section. There are 10 figures and 5 tables in total
♻ ☆ OrbitChain: Orchestrating In-orbit Real-time Analytics of Earth Observation Data
Earth observation analytics have the potential to serve many time-sensitive applications. However, due to limited bandwidth and duration of ground-satellite connections, it takes hours or even days to download and analyze data from existing Earth observation satellites, making real-time demands like timely disaster response impossible. Toward real-time analytics, we introduce OrbitChain, a collaborative analytics framework that orchestrates computational resources across multiple satellites in an Earth observation constellation. OrbitChain decomposes analytics applications into microservices and allocates computational resources for time-constrained analysis. A traffic routing algorithm is devised to minimize the inter-satellite communication overhead. OrbitChain adopts a pipeline workflow that completes Earth observation tasks in real-time, facilitates time-sensitive applications and inter-constellation collaborations such as tip-and-cue. To evaluate OrbitChain, we implement a hardware-in-the-loop orbital computing testbed. Experiments show that our system can complete up to 60% analytics workload than existing Earth observation analytics framework while reducing the communication overhead by up to 72%.
comment: currently under review; corrected typo in author na,e
♻ ☆ A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design
Generative models, particularly Large Language Models (LLMs), produce fluent outputs yet lack verifiable guarantees. We adapt Design by Contract (DbC) and type-theoretic principles to introduce a contract layer that mediates every LLM call. Contracts stipulate semantic and type requirements on inputs and outputs, coupled with probabilistic remediation to steer generation toward compliance. The layer exposes the dual view of LLMs as semantic parsers and probabilistic black-box components. Contract satisfaction is probabilistic and semantic validation is operationally defined through programmer-specified conditions on well-typed data structures. More broadly, this work postulates that any two agents satisfying the same contracts are \emph{functionally equivalent} with respect to those contracts.
comment: 4 pages, 1 figure
♻ ☆ Estimation of aboveground biomass in a tropical dry forest: An intercomparison of airborne, unmanned, and space laser scanning
According to the Paris Climate Change Agreement, all nations are required to submit reports on their greenhouse gas emissions and absorption every two years by 2024. Consequently, forests play a crucial role in reducing carbon emissions, which is essential for meeting these obligations. Recognizing the significance of forest conservation in the global battle against climate change, Article 5 of the Paris Agreement emphasizes the need for high-quality forest data. This study focuses on enhancing methods for mapping aboveground biomass in tropical dry forests. Tropical dry forests are considered one of the least understood tropical forest environments; therefore, there is a need for accurate approaches to estimate carbon pools. We employ a comparative analysis of AGB estimates, utilizing different discrete and full-waveform laser scanning datasets in conjunction with Ordinary Least Squares and Bayesian approaches SVM. Airborne Laser Scanning, Unmanned Laser Scanning, and Space Laser Scanning were used as independent variables for extracting forest metrics. Variable selection, SVM regression tuning, and cross-validation via a machine-learning approach were applied to account for overfitting and underfitting. The results indicate that six key variables primarily related to tree height: Elev\.minimum, Elev\.L3, lev\.MAD.mode, Elev\.mode, Elev\.MAD\.median, and Elev\.skewness, are important for AGB estimation using ALSD and ULSD, while Leaf Area Index, canopy coverage and height, terrain elevation, and full-waveform signal energy emerged as the most vital variables. AGB values estimated from ten permanent tropical dry forest plots in Costa Rica Guanacaste province ranged from 26.02 Mg/ha to 175.43 Mg/ha. The SVM regressions demonstrated a 17.89 error across all laser scanning systems, with SLSF W exhibiting the lowest error 17.07 in estimating total biomass per plot.
comment: 32 pages, 17 figures, research paper
♻ ☆ Khiops: An End-to-End, Frugal AutoML and XAI Machine Learning Solution for Large, Multi-Table Databases
Khiops is an open source machine learning tool designed for mining large multi-table databases. Khiops is based on a unique Bayesian approach that has attracted academic interest with more than 20 publications on topics such as variable selection, classification, decision trees and co-clustering. It provides a predictive measure of variable importance using discretisation models for numerical data and value clustering for categorical data. The proposed classification/regression model is a naive Bayesian classifier incorporating variable selection and weight learning. In the case of multi-table databases, it provides propositionalisation by automatically constructing aggregates. Khiops is adapted to the analysis of large databases with millions of individuals, tens of thousands of variables and hundreds of millions of records in secondary tables. It is available on many environments, both from a Python library and via a user interface.
♻ ☆ Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds
In-Context Reinforcement Learning (ICRL) enables agents to learn automatically and on-the-fly from their interactive experiences. However, a major challenge in scaling up ICRL is the lack of scalable task collections. To address this, we propose the procedurally generated tabular Markov Decision Processes, named AnyMDP. Through a carefully designed randomization process, AnyMDP is capable of generating high-quality tasks on a large scale while maintaining relatively low structural biases. To facilitate efficient meta-training at scale, we further introduce decoupled policy distillation and induce prior information in the ICRL framework. Our results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize to tasks that were not considered in the training set through versatile in-context learning paradigms. The scalable task set provided by AnyMDP also enables a more thorough empirical investigation of the relationship between data distribution and ICRL performance. We further show that the generalization of ICRL potentially comes at the cost of increased task diversity and longer adaptation periods. This finding carries critical implications for scaling robust ICRL capabilities, highlighting the necessity of diverse and extensive task design, and prioritizing asymptotic performance over few-shot adaptation.
comment: NeruIPS 2025
♻ ☆ Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies
Reinforcement learning (RL) systems have countless applications, from energy-grid management to protein design. However, such real-world scenarios are often extremely difficult, combinatorial in nature, and require complex coordination between multiple agents. This level of complexity can cause even state-of-the-art RL systems, trained until convergence, to hit a performance ceiling which they are unable to break out of with zero-shot inference. Meanwhile, many digital or simulation-based applications allow for an inference phase that utilises a specific time and compute budget to explore multiple attempts before outputting a final solution. In this work, we show that such an inference phase employed at execution time, and the choice of a corresponding inference strategy, are key to breaking the performance ceiling observed in complex multi-agent RL problems. Our main result is striking: we can obtain up to a 126% and, on average, a 45% improvement over the previous state-of-the-art across 17 tasks, using only a couple seconds of extra wall-clock time during execution. We also demonstrate promising compute scaling properties, supported by over 60k experiments, making it the largest study on inference strategies for complex RL to date. Our experimental data and code are available at https://sites.google.com/view/inference-strategies-rl.
comment: Neurips '25 version
♻ ☆ RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning
Real-world robotic manipulation in homes and factories demands reliability, efficiency, and robustness that approach or surpass skilled human operators. We present RL-100, a real-world reinforcement learning training framework built on diffusion visuomotor policies trained by supervised learning. RL-100 introduces a three-stage pipeline. First, imitation learning leverages human priors. Second, iterative offline reinforcement learning uses an Offline Policy Evaluation procedure, abbreviated OPE, to gate PPO-style updates that are applied in the denoising process for conservative and reliable improvement. Third, online reinforcement learning eliminates residual failure modes. An additional lightweight consistency distillation head compresses the multi-step sampling process in diffusion into a single-step policy, enabling high-frequency control with an order-of-magnitude reduction in latency while preserving task performance. The framework is task-, embodiment-, and representation-agnostic and supports both 3D point clouds and 2D RGB inputs, a variety of robot platforms, and both single-step and action-chunk policies. We evaluate RL-100 on seven real-robot tasks spanning dynamic rigid-body control, such as Push-T and Agile Bowling, fluids and granular pouring, deformable cloth folding, precise dexterous unscrewing, and multi-stage orange juicing. RL-100 attains 100\% success across evaluated trials for a total of 900 out of 900 episodes, including up to 250 out of 250 consecutive trials on one task. The method achieves near-human teleoperation or better time efficiency and demonstrates multi-hour robustness with uninterrupted operation lasting up to two hours.
comment: https://lei-kun.github.io/RL-100/
♻ ☆ Augmenting learning in neuro-embodied systems through neurobiological first principles
Recent progress in artificial intelligence (AI) has been driven by insights from physics and neuroscience, particularly through the development of artificial neural networks (ANNs) capable of complex cognitive tasks such as vision and language processing. Despite these advances, they struggle with continual learning, adaptable knowledge transfer, robustness, and resource efficiency -- capabilities that biological systems handle seamlessly. Specifically, neuromorphic systems and artificial neural networks often overlook two key biophysical properties of neural circuits: neuronal diversity and cell-specific neuromodulation. These mechanisms, essential for regulating dynamic learning across brain scales, allow neuromodulators to introduce degeneracy in biological neural networks, ensuring stability and adaptability under changing conditions. In this article, we summarize recent bioinspired models, learning rules, and architectures, and propose a framework for augmenting ANNs, which has the potential to bridge the gap between neuroscience and AI through neurobiological first principles. Our proposed dual-framework approach leverages spiking neural networks to emulate diverse spiking behaviors and dendritic compartmental dynamics, thereby simulating the morphological and functional diversity of neuronal computations. Finally, we outline how integrating these biophysical principles into task-driven spiking neural networks and neuromorphic systems provides scalable solutions for continual learning, adaptability, robustness, and resource-efficiency. Additionally, this approach will not only provide insights into how emergent behaviors arise in neural networks but also catalyze the development of more efficient, reliable, and intelligent neuromorphic systems and robotic agents.
comment: 26 pages, 4 figures, 3 boxes, 1 table
♻ ☆ Neuro-Symbolic Imitation Learning: Discovering Symbolic Abstractions for Skill Learning ICRA
Imitation learning is a popular method for teaching robots new behaviors. However, most existing methods focus on teaching short, isolated skills rather than long, multi-step tasks. To bridge this gap, imitation learning algorithms must not only learn individual skills but also an abstract understanding of how to sequence these skills to perform extended tasks effectively. This paper addresses this challenge by proposing a neuro-symbolic imitation learning framework. Using task demonstrations, the system first learns a symbolic representation that abstracts the low-level state-action space. The learned representation decomposes a task into easier subtasks and allows the system to leverage symbolic planning to generate abstract plans. Subsequently, the system utilizes this task decomposition to learn a set of neural skills capable of refining abstract plans into actionable robot commands. Experimental results in three simulated robotic environments demonstrate that, compared to baselines, our neuro-symbolic approach increases data efficiency, improves generalization capabilities, and facilitates interpretability.
comment: IEEE International Conference on Robotics and Automation (ICRA) 2025
♻ ☆ What Makes Looped Transformers Perform Better Than Non-Recursive Ones (Provably)
While looped transformers (termed as Looped-Attn) often outperform standard transformers (termed as Single-Attn) on complex reasoning tasks, the theoretical basis for this advantage remains underexplored. In this paper, we explain this phenomenon through the lens of loss landscape geometry, inspired by empirical observations of their distinct dynamics at both sample and Hessian levels. To formalize this, we extend the River-Valley landscape model by distinguishing between U-shaped valleys (flat) and V-shaped valleys (steep). Based on empirical observations, we conjecture that the recursive architecture of Looped-Attn induces a landscape-level inductive bias towards River-V-Valley. Theoretical derivations based on this inductive bias guarantee a better loss convergence along the river due to valley hopping, and further encourage learning about complex patterns compared to the River-U-Valley induced by Single-Attn. Building on this insight, we propose SHIFT (Staged HIerarchical Framework for Progressive Training), a staged training framework that accelerates the training process of Looped-Attn while achieving comparable performances.
♻ ☆ DuSEGO: Dual Second-order Equivariant Graph Ordinary Differential Equation
Graph Neural Networks (GNNs) with equivariant properties have achieved significant success in modeling complex dynamic systems and molecular properties. However, their expressiveness ability is limited by: (1) Existing methods often overlook the over-smoothing issue caused by traditional GNN models, as well as the gradient explosion or vanishing problems in deep GNNs. (2) Most models operate on first-order information, neglecting that the real world often consists of second-order systems, which further limits the model's representation capabilities. To address these issues, we propose the \textbf{Du}al \textbf{S}econd-order \textbf{E}quivariant \textbf{G}raph \textbf{O}rdinary Differential Equation (\method{}) for equivariant representation. Specifically, \method{} apply the dual second-order equivariant graph ordinary differential equations (Graph ODEs) on graph embeddings and node coordinates, simultaneously. Theoretically, we first prove that \method{} maintains the equivariant property. Furthermore, we provide theoretical insights showing that \method{} effectively alleviates the over-smoothing problem in both feature representation and coordinate update. Additionally, we demonstrate that the proposed \method{} mitigates the exploding and vanishing gradients problem, facilitating the training of deep multi-layer GNNs. Extensive experiments on benchmark datasets validate the superiority of the proposed \method{} compared to baselines.
♻ ☆ Task-Oriented Multimodal Token Transmission in Resource-Constrained Multiuser Networks
With the emergence of large model-based agents, widely adopted transformer-based architectures inevitably produce excessively long token embeddings for transmission, which may result in high bandwidth overhead, increased power consumption and latency. In this letter, we propose a task-oriented multimodal token transmission scheme for efficient multimodal information fusion and utilization. To improve the efficiency of token transmission, we design a two-stage training algotithm, including cross-modal alignment and task-oriented fine-tuning, for large model-based token communication. Meanwhile, token compression is performed using a sliding window pooling operation to save communication resources. To balance the trade-off between latency and model performance caused by compression, we formulate a weighted-sum optimization problem over latency and validation loss. We jointly optimizes bandwidth, power allocation, and token length across users by using an alternating optimization method. Simulation results demonstrate that the proposed algorithm outperforms the baseline under different bandwidth and power budgets. Moreover, the two-stage training algorithm achieves higher accuracy across various signal-to-noise ratios than the method without cross-modal alignment.
♻ ☆ Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules NeurIPS 2025
Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel intrinsic-time viewpoint, which captures the training progress more faithfully than iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule's influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs -- constant, exponential decay, and warmup-stable-decay (WSD) -- and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.
comment: 60 pages, accepted by NeurIPS 2025 as a spotlight paper
♻ ☆ Flat Channels to Infinity in Neural Loss Landscapes NeurIPS'25
The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm$infinity, and their input weight vectors, $\mathbf{w_i}$ and $\mathbf{w_j}$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w_i} \cdot \mathbf{x}) + a_j\sigma(\mathbf{w_j} \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) \sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
comment: Accepted to NeurIPS'25
♻ ☆ Scalable Multi-Task Learning for Particle Collision Event Reconstruction with Heterogeneous Graph Neural Networks
The growing luminosity frontier at the Large Hadron Collider is challenging the reconstruction and analysis of particle collision events. Increased particle multiplicities are straining latency and storage requirements at the data acquisition stage, while new complications are emerging, including higher background levels and more frequent particle vertex misassociations. This in turn necessitates the development of more holistic and scalable reconstruction methods that take advantage of recent advances in machine learning. We propose a novel Heterogeneous Graph Neural Network (HGNN) architecture featuring unique representations for diverse particle collision relationships and integrated graph pruning layers for scalability. Trained with a multi-task paradigm in an environment mimicking the LHCb experiment, this HGNN significantly improves beauty hadron reconstruction performance. Notably, it concurrently performs particle vertex association and graph pruning within a single framework. We quantify reconstruction and pruning performance, demonstrate enhanced inference time scaling with event complexity, and mitigate potential performance loss using a weighted message passing scheme.
comment: 23 pages, 9 figures, 4 tables (revised for Machine Learning Science and Technology)
♻ ☆ Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models
We reinterpret Kant's Critique of Pure Reason as a theory of feedback stability, viewing reason as a regulator that keeps inference within the bounds of possible experience. We formalize this intuition via a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. In linear-Gaussian simulations, higher H-Risk predicts overconfident errors even under formal stability, revealing a gap between nominal and epistemic stability. Extending to large language models (LLMs), we observe preliminary correlations between internal fragility and miscalibration or hallucination (confabulation), and find that lightweight critique prompts may modestly improve or worsen calibration in small-scale tests. These results suggest a structural bridge between Kantian self-limitation and feedback control, offering a principled lens to diagnose and potentially mitigate overconfidence in reasoning systems.
comment: 21 pages, 2 figures, preliminary version
♻ ☆ Sharp Lower Bounds for Linearized ReLU^k Approximation on the Sphere
We prove a saturation theorem for linearized shallow ReLU$^k$ neural networks on the unit sphere $\mathbb S^d$. For any antipodally quasi-uniform set of centers, if the target function has smoothness $r>\tfrac{d+2k+1}{2}$, then the best $\mathcal{L}^2(\mathbb S^d)$ approximation cannot converge faster than order $n^{-\frac{d+2k+1}{2d}}$. This lower bound matches existing upper bounds, thereby establishing the exact saturation order $\tfrac{d+2k+1}{2d}$ for such networks. Our results place linearized neural-network approximation firmly within the classical saturation framework and show that, although ReLU$^k$ networks outperform finite elements under equal degrees $k$, this advantage is intrinsically limited.
♻ ☆ New Encoders for German Trained from Scratch: Comparing ModernGBERT with Converted LLM2Vec Models LREC
Encoders remain essential for efficient German NLP and NLU scenarios despite the rise of decoder-only LLMs. This work studies two routes to high-quality German encoders under identical data and training constraints: 1) training from scratch and 2) converting decoders via LLM2Vec. We introduce two resources: ModernGBERT (134M, 1B), fully transparent German encoders in the ModernBERT style, and LL\"aMmleinVec (120M, 1B, 7B), decoder-to-encoder conversions trained with masked next-token prediction, both undergoing a context extension to 8.192 tokens. Across SuperGLEBer, ModernGBERT 1B sets a new state of the art (avg 0.808), surpassing GBERT Large (+4%) and the seven-times larger converted 7B model (0.787). On German MTEB after supervised fine-tuning, ModernGBERT 1B (0.551) approaches the converted 7B model (0.557). We release all models, checkpoints, datasets, and full training records, and introduce an encoder-adapted QA-NIAH evaluation. All in all, our results provide actionable guidance: when parameter efficiency and latency matter, from-scratch encoders dominate. When a pre-trained decoder exists and compute is a limited, conversion offers an effective alternative. ModernGBERT and LL\"aMmleinVec, including all code, data and intermediary checkpoints are published under a research-only RAIL license.
comment: under review @LREC
♻ ☆ Diversity-Aware Policy Optimization for Large Language Model Reasoning
The reasoning capabilities of large language models (LLMs) have advanced rapidly, particularly following the release of DeepSeek R1, which has inspired a surge of research into data quality and reinforcement learning (RL) algorithms. Despite the pivotal role diversity plays in RL, its influence on LLM reasoning remains largely underexplored. To bridge this gap, this work presents a systematic investigation into the impact of diversity in RL-based training for LLM reasoning, and proposes a novel diversity-aware policy optimization method. Across evaluations on 12 LLMs, we observe a strong positive correlation between the solution diversity and Potential at k (a novel metric quantifying an LLM's reasoning potential) in high-performing models. This finding motivates our method to explicitly promote diversity during RL training. Specifically, we design a token-level diversity and reformulate it into a practical objective, then we selectively apply it to positive samples. Integrated into the R1-zero training framework, our method achieves a 3.5 percent average improvement across four mathematical reasoning benchmarks, while generating more diverse and robust solutions.
♻ ☆ Graph Neural Networks for Electricity Load Forecasting
Forecasting electricity demand is increasingly challenging as energy systems become more decentralized and intertwined with renewable sources. Graph Neural Networks (GNNs) have recently emerged as a powerful paradigm to model spatial dependencies in load data while accommodating complex non-stationarities. This paper introduces a comprehensive framework that integrates graph-based forecasting with attention mechanisms and ensemble aggregation strategies to enhance both predictive accuracy and interpretability. Several GNN architectures -- including Graph Convolutional Networks, GraphSAGE, APPNP, and Graph Attention Networks -- are systematically evaluated on synthetic, regional (France), and fine-grained (UK) datasets. Empirical results demonstrate that graph-aware models consistently outperform conventional baselines such as Feed Forward Neural Networks and foundation models like TiREX. Furthermore, attention layers provide valuable insights into evolving spatial interactions driven by meteorological and seasonal dynamics. Ensemble aggregation, particularly through bottom-up expert combination, further improves robustness under heterogeneous data conditions. Overall, the study highlights the complementarity between structural modeling, interpretability, and robustness, and discusses the trade-offs between accuracy, model complexity, and transparency in graph-based electricity load forecasting.
comment: 22 pages
♻ ☆ Rough Path Signatures: Learning Neural RDEs for Portfolio Optimization
We tackle high-dimensional, path-dependent valuation and control and introduce a deep BSDE/2BSDE solver that couples truncated log-signatures with a neural rough differential equation (RDE) backbone. The architecture aligns stochastic analysis with sequence-to-path learning: a CVaR-tilted terminal objective targets left-tail risk, while an optional second-order (2BSDE) head supplies curvature estimates for risk-sensitive control. Under matched compute and parameter budgets, the method improves accuracy, tail fidelity, and training stability across Asian and barrier option pricing and portfolio control: at d=200 it achieves CVaR(0.99)=9.80% versus 12.00-13.10% for strong baselines, attains the lowest HJB residual (0.011), and yields the lowest RMSEs for Z and Gamma. Ablations over truncation depth, local windows, and tilt parameters confirm complementary gains from the sequence-to-path representation and the 2BSDE head. Taken together, the results highlight a bidirectional dialogue between stochastic analysis and modern deep learning: stochastic tools inform representations and objectives, while sequence-to-path models expand the class of solvable financial models at scale.
comment: Code available at: https://github.com/AliAtiah/SigRDE
♻ ☆ Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning NeurIPS 2025
Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to $\textbf{15.97%}$ in accuracy. The code is available at https://github.com/yaxinhou/CPG.
comment: The paper is accepted by NeurIPS 2025
♻ ☆ Neural Entropy NeurIPS 2025
We explore the connection between deep learning and information theory through the paradigm of diffusion models. A diffusion model converts noise into structured data by reinstating, imperfectly, information that is erased when data was diffused to noise. This information is stored in a neural network during training. We quantify this information by introducing a measure called neural entropy, which is related to the total entropy produced by diffusion. Neural entropy is a function of not just the data distribution, but also the diffusive process itself. Measurements of neural entropy on a few simple image diffusion models reveal that they are extremely efficient at compressing large ensembles of structured data.
comment: 29 pages + references, 18 figures. Camera-ready version from NeurIPS 2025
♻ ☆ Multi-Agent Regime-Conditioned Diffusion (MARCD) for CVaR-Constrained Portfolio Decisions
We examine whether regime-conditioned generative scenarios combined with a convex CVaR allocator improve portfolio decisions under regime shifts. We present MARCD, a generative-to-decision framework with: (i) a Gaussian HMM to infer latent regimes; (ii) a diffusion generator that produces regime-conditioned scenarios; (iii) signal extraction via blended, shrunk moments; and (iv) a governed CVaR epigraph quadratic program. Contributions: Within the Scenario stage we introduce a tail-weighted diffusion objective that up-weights low-quantile outcomes relevant for drawdowns and a regime-expert (MoE) denoiser whose gate increases with crisis posteriors; both are evaluated end-to-end through the allocator. Under strict walk-forward on liquid multi-asset ETFs (2005-2025), MARCD exhibits stronger scenario calibration and materially smaller drawdowns: MaxDD 9.3% versus 14.1% for BL (a 34% reduction) over 2020-2025 out-of-sample. The framework provides an auditable pipeline with explicit budget, box, and turnover constraints, demonstrating the value of decision-aware generative modeling in finance.
comment: Code available at: https://github.com/AliAtiah/MARCD
♻ ☆ NeuroDeX: Unlocking Diverse Support in Decompiling Deep Neural Network Executables
On-device deep learning models have extensive real world demands. Deep learning compilers efficiently compile models into executables for deployment on edge devices, but these executables may face the threat of reverse engineering. Previous studies have attempted to decompile DNN executables, but they face challenges in handling compilation optimizations and analyzing quantized compiled models. In this paper, we present NeuroDeX to unlock diverse support in decompiling DNN executables. NeuroDeX leverages the semantic understanding capabilities of LLMs along with dynamic analysis to accurately and efficiently perform operator type recognition, operator attribute recovery and model reconstruction. NeuroDeX can recover DNN executables into high-level models towards compilation optimizations, different architectures and quantized compiled models. We conduct experiments on 96 DNN executables across 12 common DNN models. Extensive experimental results demonstrate that NeuroDeX can decompile non-quantized executables into nearly identical high-level models. NeuroDeX can recover functionally similar high-level models for quantized executables, achieving an average top-1 accuracy of 72%. NeuroDeX offers a more comprehensive and effective solution compared to previous DNN executables decompilers.
♻ ☆ AI-Guided Molecular Simulations in VR: Exploring Strategies for Imitation Learning in Hyperdimensional Molecular Systems ECAI24
Molecular dynamics (MD) simulations are a crucial computational tool for researchers to understand and engineer molecular structure and function in areas such as drug discovery, protein engineering, and material design. Despite their utility, MD simulations are expensive, owing to the high dimensionality of molecular systems. Interactive molecular dynamics in virtual reality (iMD-VR) has recently emerged as a "human-in-the-loop" strategy for efficiently navigating hyper-dimensional molecular systems. By providing an immersive 3D environment that enables visualization and manipulation of real-time molecular simulations running on high-performance computing architectures, iMD-VR enables researchers to reach out and guide molecular conformational dynamics, in order to efficiently explore complex, high-dimensional molecular systems. Moreover, iMD-VR simulations generate rich datasets that capture human experts' spatial insight regarding molecular structure and function. This paper explores the use of researcher-generated iMD-VR datasets to train AI agents via imitation learning (IL). IL enables agents to mimic complex behaviours from expert demonstrations, circumventing the need for explicit programming or intricate reward design. In this article, we review IL across robotics and Multi-agents systems domains which are comparable to iMD-VR, and discuss how iMD-VR recordings could be used to train IL models to interact with MD simulations. We then illustrate the applications of these ideas through a proof-of-principle study where iMD-VR data was used to train a CNN network on a simple molecular manipulation task; namely, threading a small molecule through a nanotube pore. Finally, we outline future research directions and potential challenges of using AI agents to augment human expertise in navigating vast molecular conformational spaces.
comment: (First presented at the First Workshop on "eXtended Reality \& Intelligent Agents" (XRIA24) @ ECAI24, Santiago De Compostela (Spain), 20 October 2024)
♻ ☆ UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of pretraining compute and 1/10 of downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.
comment: Accepted to RSS 2025. Code is available at https://github.com/OpenDriveLab/UniVLA
♻ ☆ Differentiable Generalized Sliced Wasserstein Plans
Optimal Transport (OT) has attracted significant interest in the machine learning community, not only for its ability to define meaningful distances between probability distributions -- such as the Wasserstein distance -- but also for its formulation of OT plans. Its computational complexity remains a bottleneck, though, and slicing techniques have been developed to scale OT to large datasets. Recently, a novel slicing scheme, dubbed min-SWGG, lifts a single one-dimensional plan back to the original multidimensional space, finally selecting the slice that yields the lowest Wasserstein distance as an approximation of the full OT plan. Despite its computational and theoretical advantages, min-SWGG inherits typical limitations of slicing methods: (i) the number of required slices grows exponentially with the data dimension, and (ii) it is constrained to linear projections. Here, we reformulate min-SWGG as a bilevel optimization problem and propose a differentiable approximation scheme to efficiently identify the optimal slice, even in high-dimensional settings. We furthermore define its generalized extension for accommodating to data living on manifolds. Finally, we demonstrate the practical value of our approach in various applications, including gradient flows on manifolds and high-dimensional spaces, as well as a novel sliced OT-based conditional flow matching for image generation -- where fast computation of transport plans is essential.
comment: LC and RT have equal contribution
♻ ☆ ConTextTab: A Semantics-Aware Tabular In-Context Learner NeurIPS 2025
Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, recent advances such as TabPFN and TabICL have extended its use to larger datasets. Although current table-native ICL architectures are architecturally efficient and well-adapted to tabular data structures, their exclusive training on synthetic data limits their ability to fully leverage the rich semantics and world knowledge contained in real-world tabular data. At the other end of the spectrum, tabular ICL models based on pretrained large language models such as TabuLa-8B integrate deep semantic understanding and world knowledge but are only able to make use of a small amount of context due to inherent architectural limitations. With the aim to combine the best of both these worlds, we introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark. Code and model checkpoints are available at: https://github.com/SAP-samples/sap-rpt-1-oss.
comment: Accepted as spotlight at NeurIPS 2025
♻ ☆ Representation Consistency for Accurate and Coherent LLM Answer Aggregation NeurIPS 2025
Test-time scaling improves large language models' (LLMs) performance by allocating more compute budget during inference. To achieve this, existing methods often require intricate modifications to prompting and sampling strategies. In this work, we introduce representation consistency (RC), a test-time scaling method for aggregating answers drawn from multiple candidate responses of an LLM regardless of how they were generated, including variations in prompt phrasing and sampling strategy. RC enhances answer aggregation by not only considering the number of occurrences of each answer in the candidate response set, but also the consistency of the model's internal activations while generating the set of responses leading to each answer. These activations can be either dense (raw model activations) or sparse (encoded via pretrained sparse autoencoders). Our rationale is that if the model's representations of multiple responses converging on the same answer are highly variable, this answer is more likely to be the result of incoherent reasoning and should be down-weighted during aggregation. Importantly, our method only uses cached activations and lightweight similarity computations and requires no additional model queries. Through experiments with four open-source LLMs and four reasoning datasets, we validate the effectiveness of RC for improving task performance during inference, with consistent accuracy improvements (up to 4%) over strong test-time scaling baselines. We also show that consistency in the sparse activation signals aligns well with the common notion of coherent reasoning.
comment: Accepted at NeurIPS 2025. Camera-ready version
♻ ☆ A probabilistic view on Riemannian machine learning models for SPD matrices
The goal of this paper is to show how different machine learning tools on the Riemannian manifold $\mathcal{P}_d$ of Symmetric Positive Definite (SPD) matrices can be united under a probabilistic framework. For this, we will need several Gaussian distributions defined on $\mathcal{P}_d$. We will show how popular classifiers on $\mathcal{P}_d$ can be reinterpreted as Bayes Classifiers using these Gaussian distributions. These distributions will also be used for outlier detection and dimension reduction. By showing that those distributions are pervasive in the tools used on $\mathcal{P}_d$, we allow for other machine learning tools to be extended to $\mathcal{P}_d$.
♻ ☆ Geospatial Foundation Models to Enable Progress on Sustainable Development Goals
Foundation Models (FMs) are large-scale, pre-trained artificial intelligence (AI) systems that have revolutionized natural language processing and computer vision, and are now advancing geospatial analysis and Earth Observation (EO). They promise improved generalization across tasks, scalability, and efficient adaptation with minimal labeled data. However, despite the rapid proliferation of geospatial FMs, their real-world utility and alignment with global sustainability goals remain underexplored. We introduce SustainFM, a comprehensive benchmarking framework grounded in the 17 Sustainable Development Goals with extremely diverse tasks ranging from asset wealth prediction to environmental hazard detection. This study provides a rigorous, interdisciplinary assessment of geospatial FMs and offers critical insights into their role in attaining sustainability goals. Our findings show: (1) While not universally superior, FMs often outperform traditional approaches across diverse tasks and datasets. (2) Evaluating FMs should go beyond accuracy to include transferability, generalization, and energy efficiency as key criteria for their responsible use. (3) FMs enable scalable, SDG-grounded solutions, offering broad utility for tackling complex sustainability challenges. Critically, we advocate for a paradigm shift from model-centric development to impact-driven deployment, and emphasize metrics such as energy efficiency, robustness to domain shifts, and ethical considerations.
♻ ☆ Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it particularly well-for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
♻ ☆ Contextual Tokenization for Graph Inverted Indices
Retrieving graphs from a large corpus, that contain a subgraph isomorphic to a given query graph, is a core operation in many real-world applications. While recent multi-vector graph representations and scores based on set alignment and containment can provide accurate subgraph isomorphism tests, their use in retrieval remains limited by their need to score corpus graphs exhaustively. We introduce CORGII (Contextual Representation of Graphs for Inverted Indexing), a graph indexing framework in which, starting with a contextual dense graph representation, a differentiable discretization module computes sparse binary codes over a learned latent vocabulary. This text document-like representation allows us to leverage classic, highly optimized inverted indices, while supporting soft (vector) set containment scores. Pushing this paradigm further, we replace the classical, fixed impact weight of a `token' on a graph (such as TFIDF or BM25) with a data-driven, trainable impact weight. Finally, we explore token expansion to support multi-probing the index for smoother accuracy-efficiency tradeoffs. To our knowledge, CORGII is the first indexer of dense graph representations using discrete tokens mapping to efficient inverted lists. Extensive experiments show that CORGII provides better trade-offs between accuracy and efficiency, compared to several baselines.
♻ ☆ Deep Modularity Networks with Diversity-Preserving Regularization NeurIPS 2025
Graph clustering plays a crucial role in graph representation learning but often faces challenges in achieving feature-space diversity. While Deep Modularity Networks (DMoN) leverage modularity maximization and collapse regularization to ensure structural separation, they lack explicit mechanisms for feature-space separation, assignment dispersion, and assignment-confidence control. We address this limitation by proposing Deep Modularity Networks with Diversity-Preserving Regularization (DMoN-DPR), which introduces three novel regularization terms: distance-based for inter-cluster separation, variance-based for per-cluster assignment dispersion, and an assignment-entropy penalty with a small positive weight, encouraging more confident assignments gradually. Our method significantly enhances label-based clustering metrics on feature-rich benchmark datasets (paired two-tailed t-test, $p\leq0.05$), demonstrating the effectiveness of incorporating diversity-preserving regularizations in creating meaningful and interpretable clusters.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: New Perspectives in Graph Machine Learning (NPGML)
♻ ☆ VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation
In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point clouds feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6% on par with DP3 64.0% and far higher than DP 34.8%, while in real-world tasks, it reaches 87.9%, outperforming both DP3 67.5% and DP 11.2% by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.
♻ ☆ Transforming Hyperspectral Images Into Chemical Maps: A Novel End-to-End Deep Learning Approach
Current approaches to chemical map generation from hyperspectral images are based on models such as partial least squares (PLS) regression, generating pixel-wise predictions that do not consider spatial context and suffer from a high degree of noise. This study proposes an end-to-end deep learning approach using a modified version of U-Net and a custom loss function to directly obtain chemical maps from hyperspectral images, skipping all intermediate steps required for traditional pixel-wise analysis. This study compares the U-Net with the traditional PLS regression on a real dataset of pork belly samples with associated mean fat reference values. The U-Net obtains a test set root mean squared error that is 7% lower than that of PLS regression on the task of mean fat prediction. At the same time, U-Net generates fine detail chemical maps where 99.91% of the variance is spatially correlated. Conversely, only 2.37% of the variance in the PLS-generated chemical maps is spatially correlated, indicating that each pixel-wise prediction is largely independent of neighboring pixels. Additionally, while the PLS-generated chemical maps contain predictions far beyond the physically possible range of 0-100%, U-Net learns to stay inside this range. Thus, the findings of this study indicate that U-Net is superior to PLS for chemical map generation.
♻ ☆ Follow the Energy, Find the Path: Riemannian Metrics from Energy-Based Models
What is the shortest path between two data points lying in a high-dimensional space? While the answer is trivial in Euclidean geometry, it becomes significantly more complex when the data lies on a curved manifold -- requiring a Riemannian metric to describe the space's local curvature. Estimating such a metric, however, remains a major challenge in high dimensions. In this work, we propose a method for deriving Riemannian metrics directly from pretrained Energy-Based Models (EBMs) -- a class of generative models that assign low energy to high-density regions. These metrics define spatially varying distances, enabling the computation of geodesics -- shortest paths that follow the data manifold's intrinsic geometry. We introduce two novel metrics derived from EBMs and show that they produce geodesics that remain closer to the data manifold and exhibit lower curvature distortion, as measured by alignment with ground-truth trajectories. We evaluate our approach on increasingly complex datasets: synthetic datasets with known data density, rotated character images with interpretable geometry, and high-resolution natural images embedded in a pretrained VAE latent space. Our results show that EBM-derived metrics consistently outperform established baselines, especially in high-dimensional settings. Our work is the first to derive Riemannian metrics from EBMs, enabling data-aware geodesics and unlocking scalable, geometry-driven learning for generative modeling and simulation.
♻ ☆ Tight analyses of first-order methods with error feedback
Communication between agents often constitutes a major computational bottleneck in distributed learning. One of the most common mitigation strategies is to compress the information exchanged, thereby reducing communication overhead. To counteract the degradation in convergence associated with compressed communication, error feedback schemes -- most notably $\mathrm{EF}$ and $\mathrm{EF}^{21}$ -- were introduced. In this work, we provide a tight analysis of both of these methods. Specifically, we find the Lyapunov function that yields the best possible convergence rate for each method -- with matching lower bounds. This principled approach yields sharp performance guarantees and enables a rigorous, apples-to-apples comparison between $\mathrm{EF}$, $\mathrm{EF}^{21}$, and compressed gradient descent. Our analysis is carried out in the simplified single-agent setting, which allows for clean theoretical insights and fair comparison of the underlying mechanisms.
♻ ☆ Low-Rank Adaptation for Foundation Models: A Comprehensive Review
The rapid advancement of foundation modelslarge-scale neural networks trained on diverse, extensive datasetshas revolutionized artificial intelligence, enabling unprecedented advancements across domains such as natural language processing, computer vision, and scientific discovery. However, the substantial parameter count of these models, often reaching billions or trillions, poses significant challenges in adapting them to specific downstream tasks. Low-Rank Adaptation (LoRA) has emerged as a highly promising approach for mitigating these challenges, offering a parameter-efficient mechanism to fine-tune foundation models with minimal computational overhead. This survey provides the first comprehensive review of LoRA techniques beyond large Language Models to general foundation models, including recent techniques foundations, emerging frontiers and applications of low-rank adaptation across multiple domains. Finally, this survey discusses key challenges and future research directions in theoretical understanding, scalability, and robustness. This survey serves as a valuable resource for researchers and practitioners working with efficient foundation model adaptation.
♻ ☆ ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models
Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to 665x over naive sequential application, allowing training nonlinear RNNs at unprecedented scales. To showcase this, we apply ParaRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformers and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the ParaRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.
♻ ☆ Inducing Riesz and orthonormal bases in $L^2$ via composition operators
Let $C_h$ be a composition operator mapping $L^2(\Omega_1)$ into $L^2(\Omega_2)$ for some open sets $\Omega_1, \Omega_2 \subseteq \mathbb{R}^n$. We characterize the mappings $h$ that transform Riesz bases of $L^2(\Omega_1)$ into Riesz bases of $L^2(\Omega_2)$. Restricting our analysis to differentiable mappings, we demonstrate that mappings $h$ that preserve Riesz bases have Jacobian determinants that are bounded away from zero and infinity. We discuss implications of these results for approximation theory, highlighting the potential of using bijective neural networks to construct Riesz bases with favorable approximation properties.
♻ ☆ What Can Be Recovered Under Sparse Adversarial Corruption? Assumption-Free Theory for Linear Measurements
Let $A \in \mathbb{R}^{m \times n}$ be an arbitrary, known matrix and $e$ a $q$-sparse adversarial vector. Given $y = A x^\star + e$ and $q$, we seek the smallest set containing $x^\star$--hence the one conveying maximal information about $x^\star$--that is uniformly recoverable from $y$ without knowing $e$. While exact recovery of $x^\star$ via strong (and often impractical) structural assumptions on $A$ or $x^\star$ (e.g., restricted isometry, sparsity) is well studied, recoverability for arbitrary $A$ and $x^\star$ remains open. Our main result shows that the best that one can hope to recover is $x^\star + \ker(U)$, where $U$ is the unique projection matrix onto the intersection of rowspaces of all possible submatrices of $A$ obtained by deleting $2q$ rows. Moreover, we prove that every $x$ that minimizes the $\ell_0$-norm of $y - A x$ lies in $x^\star + \ker(U)$, which then gives a constructive approach to recover this set.
comment: \copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
♻ ☆ A Self-Evolving AI Agent System for Climate Science
Scientific progress in Earth science depends on integrating data across the planet's interconnected spheres. However, the accelerating volume and fragmentation of multi-sphere knowledge and data have surpassed human analytical capacity. This creates a major bottleneck for discovery, especially in climate science. To address this challenge, we introduce EarthLink, the first self-evolving AI agent system designed as an interactive "copilot" for Earth scientists. Through natural language interaction, EarthLink automates the entire research workflow by integrating planning, code execution, data analysis, and physical reasoning into a unified process that directly addresses this limitation. Beyond efficiency, it exhibits human-like cross-disciplinary analytical ability and achieves proficiency comparable to a junior researcher in expert evaluations on core large-scale climate tasks, including model-observation comparison and climate change understanding. When tasked with an open scientific problem, specifically the discovery of precursors of the Atlantic Ni\~no, EarthLink autonomously developed a research strategy, identified sources of predictability, verified its hypotheses with available data, and proposed a physically consistent mechanism. These emerging capabilities enable a new human-AI research paradigm. Scientists can focus on value and result judgments, while AI systems handle complex data analysis and knowledge integration. This accelerates the pace and breadth of discovery in Earth sciences. The system is accessible at our website https://earthlink.intern-ai.org.cn.
♻ ☆ Localist LLMs -- A Mathematical Framework for Dynamic Locality Control
We present a novel framework for training large language models with continuously adjustable internal representations that span the full spectrum from localist (interpretable, rule-based) to distributed (generalizable, efficient) encodings. The key innovation is a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining. This is achieved through group sparsity penalties on attention mechanisms, information-theoretic anchor design, and dynamic rule injection. We provide rigorous mathematical proofs establishing explicit threshold conditions under which attention provably concentrates on semantically relevant blocks, with exponential bounds on attention entropy and pointer fidelity. Specifically, we prove that when group sparsity penalties exceed certain threshold values, the model's attention mechanisms concentrate on semantically relevant blocks, achieving low entropy and high fidelity with negligible error. This framework enables practitioners to continuously interpolate between interpretable and high-performance modes, supporting applications in regulated domains requiring both transparency and capability.
♻ ☆ SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning NeurIPS 2025
Deciphering how visual stimuli are transformed into cortical responses is a fundamental challenge in computational neuroscience. This visual-to-neural mapping is inherently a one-to-many relationship, as identical visual inputs reliably evoke variable hemodynamic responses across trials, contexts, and subjects. However, existing deterministic methods struggle to simultaneously model this biological variability while capturing the underlying functional consistency that encodes stimulus information. To address these limitations, we propose SynBrain, a generative framework that simulates the transformation from visual semantics to neural responses in a probabilistic and biologically interpretable manner. SynBrain introduces two key components: (i) BrainVAE models neural representations as continuous probability distributions via probabilistic learning while maintaining functional consistency through visual semantic constraints; (ii) A Semantic-to-Neural Mapper acts as a semantic transmission pathway, projecting visual semantics into the neural response manifold to facilitate high-fidelity fMRI synthesis. Experimental results demonstrate that SynBrain surpasses state-of-the-art methods in subject-specific visual-to-fMRI encoding performance. Furthermore, SynBrain adapts efficiently to new subjects with few-shot data and synthesizes high-quality fMRI signals that are effective in improving data-limited fMRI-to-image decoding performance. Beyond that, SynBrain reveals functional consistency across trials and subjects, with synthesized signals capturing interpretable patterns shaped by biological neural variability. Our code is available at https://github.com/MichaelMaiii/SynBrain.
comment: Accepted by NeurIPS 2025
♻ ☆ Bellman Diffusion Models
Diffusion models have seen tremendous success as generative architectures. Recently, they have been shown to be effective at modelling policies for offline reinforcement learning and imitation learning. We explore using diffusion as a model class for the successor state measure (SSM) of a policy. We find that enforcing the Bellman flow constraints leads to a simple Bellman update on the diffusion step distribution.
♻ ☆ MARFT: Multi-Agent Reinforcement Fine-Tuning
LLM-based Multi-Agent Systems have demonstrated remarkable capabilities in addressing complex, agentic tasks, from generating high-quality presentation slides to even conducting sophisticated scientific research. Meanwhile, RL has been widely recognized for its effectiveness in enhancing agent intelligence, but limited research has investigated the fine-tuning of LaMAS using foundational RL techniques. Moreover, the direct application of MARL methods to LaMAS introduces significant challenges, stemming from the unique characteristics and mechanisms inherent to LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes a novel paradigm termed Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce a brand-new MG called Flex-MG, which aligns with the LaMAS optimization in real-world applications and a universal algorithmic framework tailored specifically for LaMAS, outlining the conceptual foundations, key distinctions, and practical implementation strategies. We review the evolution from RL to RFT, setting the stage for a parallel analysis in the multi-agent domain. In the context of LaMAS, we elucidate critical differences between MARL and MARFT. These differences motivate a transition toward a LaMAS-oriented formulation of RFT. Central to this work is a robust and scalable MARFT framework. We detail the core algorithm and provide a complete, open-source implementation to facilitate adoption and further research. The latter sections of the paper explore real-world application perspectives and opening challenges in MARFT. By bridging theoretical underpinnings with practical methodologies, this work serves as a roadmap for researchers seeking to advance MARFT toward resilient and adaptive solutions in agentic systems. Our implementation of the proposed framework is publicly available at: https://github.com/jwliao-ai/MARFT.
comment: 42 pages
♻ ☆ Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning NeurIPS 2025
Motivated by practical applications where stable long-term performance is critical-such as robotics, operations research, and healthcare-we study the problem of distributionally robust (DR) average-reward reinforcement learning. We propose two algorithms that achieve near-optimal sample complexity. The first reduces the problem to a DR discounted Markov decision process (MDP), while the second, Anchored DR Average-Reward MDP, introduces an anchoring state to stabilize the controlled transition kernels within the uncertainty set. Assuming the nominal MDP is uniformly ergodic, we prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}| t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$ for estimating the optimal policy as well as the robust average reward under KL and $f_k$-divergence-based uncertainty sets, provided the uncertainty radius is sufficiently small. Here, $\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denote the sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixing time of the nominal MDP. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning. We further validate the convergence rates of our algorithms through numerical experiments.
comment: Accepted at NeurIPS 2025. Updated with minor corrections and additional experiments
♻ ☆ Language-Driven Coordination and Learning in Multi-Agent Simulation Environments
This paper introduces LLM-MARL, a unified framework that incorporates large language models (LLMs) into multi-agent reinforcement learning (MARL) to enhance coordination, communication, and generalization in simulated game environments. The framework features three modular components of Coordinator, Communicator, and Memory, which dynamically generate subgoals, facilitate symbolic inter-agent messaging, and support episodic recall. Training combines PPO with a language-conditioned loss and LLM query gating. LLM-MARL is evaluated in Google Research Football, MAgent Battle, and StarCraft II. Results show consistent improvements over MAPPO and QMIX in win rate, coordination score, and zero-shot generalization. Ablation studies demonstrate that subgoal generation and language-based messaging each contribute significantly to performance gains. Qualitative analysis reveals emergent behaviors such as role specialization and communication-driven tactics. By bridging language modeling and policy learning, this work contributes to the design of intelligent, cooperative agents in interactive simulations. It offers a path forward for leveraging LLMs in multi-agent systems used for training, games, and human-AI collaboration.
♻ ☆ DeepHQ: Learned Hierarchical Quantizer for Progressive Deep Image Coding
Unlike fixed- or variable-rate image coding, progressive image coding (PIC) aims to compress various qualities of images into a single bitstream, increasing the versatility of bitstream utilization and providing high compression efficiency compared to simulcast compression. Research on neural network (NN)-based PIC is in its early stages, mainly focusing on applying varying quantization step sizes to the transformed latent representations in a hierarchical manner. These approaches are designed to compress only the progressively added information as the quality improves, considering that a wider quantization interval for lower-quality compression includes multiple narrower sub-intervals for higher-quality compression. However, the existing methods are based on handcrafted quantization hierarchies, resulting in sub-optimal compression efficiency. In this paper, we propose an NN-based progressive coding method that firstly utilizes learned quantization step sizes via learning for each quantization layer. We also incorporate selective compression with which only the essential representation components are compressed for each quantization layer. We demonstrate that our method achieves significantly higher coding efficiency than the existing approaches with decreased decoding time and reduced model size. The source code is publicly available at https://github.com/JooyoungLeeETRI/DeepHQ
comment: Accepted to ACM TOMM (2025)
♻ ☆ Flight Delay Prediction via Cross-Modality Adaptation of Large Language Models and Aircraft Trajectory Representation
Flight delay prediction has become a key focus in air traffic management, as delays highlight inefficiencies that impact overall network performance. This paper presents a lightweight large language model-based multimodal flight delay prediction, formulated from the perspective of air traffic controllers monitoring aircraft delay after entering the terminal area. The approach integrates trajectory representations with textual aeronautical information, including flight information, weather reports, and aerodrome notices, by adapting trajectory data into the language modality to capture airspace conditions. The experiments show that the model consistently achieves sub-minute prediction error by effectively leveraging contextual information related to the sources of delay, fulfilling the operational standard for minute-level precision. The framework demonstrates that linguistic understanding, when combined with cross-modality adaptation of trajectory data, enhances delay prediction. Moreover, the approach shows practicality and potential scalability for real-world operations, supporting real-time updates that refine predictions upon receiving new operational information.
comment: Preprint submitted to Aerospace Science and Technology (Elsevier) for possible publication
♻ ☆ FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design
Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization reduces these costs, they often degrade accuracy or lack optimal efficiency. INT6 quantization offers a superior trade-off between model accuracy and inference efficiency, but lacks hardware support in modern GPUs, forcing emulation via higher-precision arithmetic units that limit acceleration. In this paper, we propose FlexQ, a novel post-training INT6 quantization framework combining algorithmic innovation with system-level optimizations. FlexQ employs uniform 6-bit weight quantization across all layers, with adaptive retention of 8-bit activations in layers identified through layer-wise sensitivity analysis. To maximize hardware efficiency, we develop a specialized high-performance GPU kernel supporting matrix multiplication for W6A6 and W6A8 representations via Binary Tensor Core (BTC) equivalents, effectively bypassing the lack of native INT6 tensor cores. Evaluations on LLaMA family models show FlexQ maintains near-FP16 accuracy, with perplexity increases of no more than 0.1 on WikiText2. The proposed kernel achieves an average 1.39$\times$ speedup over ABQ-LLM on LLaMA-2-70B linear layers. End-to-end, FlexQ delivers 1.33$\times$ inference acceleration and 1.21$\times$ memory savings over SmoothQuant. Code is released at https://github.com/FlyFoxPlayer/FlexQ.
♻ ☆ Amortized Active Generation of Pareto Sets NeurIPS 2025
We introduce active generation of Pareto sets (A-GPS), a new framework for online discrete black-box multi-objective optimization (MOO). A-GPS learns a generative model of the Pareto set that supports a-posteriori conditioning on user preferences. The method employs a class probability estimator (CPE) to predict non-dominance relations and to condition the generative model toward high-performing regions of the search space. We also show that this non-dominance CPE implicitly estimates the probability of hypervolume improvement (PHVI). To incorporate subjective trade-offs, A-GPS introduces preference direction vectors that encode user-specified preferences in objective space. At each iteration, the model is updated using both Pareto membership and alignment with these preference directions, producing an amortized generative model capable of sampling across the Pareto front without retraining. The result is a simple yet powerful approach that achieves high-quality Pareto set approximations, avoids explicit hypervolume computation, and flexibly captures user preferences. Empirical results on synthetic benchmarks and protein design tasks demonstrate strong sample efficiency and effective preference incorporation.
comment: Appears in the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Complex QA and language models hybrid architectures, Survey
This paper reviews the state-of-the-art of large language models (LLM) architectures and strategies for "complex" question-answering with a focus on hybrid architectures. LLM based chatbot services have allowed anyone to grasp the potential of LLM to solve many common problems, but soon discovered their limitations for complex questions. Addressing more specific, complex questions (e.g., "What is the best mix of power-generation methods to reduce climate change ?") often requires specialized architectures, domain knowledge, new skills, decomposition and multi-step resolution, deep reasoning, sensitive data protection, explainability, and human-in-the-loop processes. Therefore, we review: (1) necessary skills and tasks for handling complex questions and common LLM limits to overcome; (2) dataset, cost functions and evaluation metrics for measuring and improving (e.g. accuracy, explainability, fairness, robustness, groundedness, faithfulness, toxicity...); (3) family of solutions to overcome LLM limitations by (a) training and reinforcement (b) hybridization, (c) prompting, (d) agentic-architectures (agents, tools) and extended reasoning.
♻ ☆ MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification
Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and fine-tunes with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares interactions between learnable prompts with individual image patches and groups of them. This approach improves the model's ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; thereby, we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP.
comment: Published in Transactions on Machine Learning Research (09/2025)
♻ ☆ Riemannian Consistency Model NeurIPS 2025
Consistency models are a class of generative models that enable few-step generation for diffusion and flow matching models. While consistency models have achieved promising results on Euclidean domains like images, their applications to Riemannian manifolds remain challenging due to the curved geometry. In this work, we propose the Riemannian Consistency Model (RCM), which, for the first time, enables few-step consistency modeling while respecting the intrinsic manifold constraint imposed by the Riemannian geometry. Leveraging the covariant derivative and exponential-map-based parameterization, we derive the closed-form solutions for both discrete- and continuous-time training objectives for RCM. We then demonstrate theoretical equivalence between the two variants of RCM: Riemannian consistency distillation (RCD) that relies on a teacher model to approximate the marginal vector field, and Riemannian consistency training (RCT) that utilizes the conditional vector field for training. We further propose a simplified training objective that eliminates the need for the complicated differential calculation. Finally, we provide a unique kinematics perspective for interpreting the RCM objective, offering new theoretical angles. Through extensive experiments, we manifest the superior generative quality of RCM in few-step generation on various non-Euclidean manifolds, including flat-tori, spheres, and the 3D rotation group SO(3).
comment: Accepted to NeurIPS 2025
♻ ☆ Scientific Machine Learning with Kolmogorov-Arnold Networks
The field of scientific machine learning, which originally utilized multilayer perceptrons (MLPs), is increasingly adopting Kolmogorov-Arnold Networks (KANs) for data encoding. This shift is driven by the limitations of MLPs, including poor interpretability, fixed activation functions, and difficulty capturing localized or high-frequency features. KANs address these issues with enhanced interpretability and flexibility, enabling more efficient modeling of complex nonlinear interactions and effectively overcoming the constraints associated with conventional MLP architectures. This review categorizes recent progress in KAN-based models across three distinct perspectives: (i) data-driven learning, (ii) physics-informed modeling, and (iii) deep-operator learning. Each perspective is examined through the lens of architectural design, training strategies, application efficacy, and comparative evaluation against MLP-based counterparts. By benchmarking KANs against MLPs, we highlight consistent improvements in accuracy, convergence, and spectral representation, clarifying KANs' advantages in capturing complex dynamics while learning more effectively. In addition to reviewing recent literature, this work also presents several comparative evaluations that clarify central characteristics of KAN modeling and hint at their potential implications for real-world applications. Finally, this review identifies critical challenges and open research questions in KAN development, particularly regarding computational efficiency, theoretical guarantees, hyperparameter tuning, and algorithm complexity. We also outline future research directions aimed at improving the robustness, scalability, and physical consistency of KAN-based frameworks.
♻ ☆ Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
♻ ☆ Bridging Symmetry and Robustness: On the Role of Equivariance in Enhancing Adversarial Robustness NeurIPS 2025
Adversarial examples reveal critical vulnerabilities in deep neural networks by exploiting their sensitivity to imperceptible input perturbations. While adversarial training remains the predominant defense strategy, it often incurs significant computational cost and may compromise clean-data accuracy. In this work, we investigate an architectural approach to adversarial robustness by embedding group-equivariant convolutions-specifically, rotation- and scale-equivariant layers-into standard convolutional neural networks (CNNs). These layers encode symmetry priors that align model behavior with structured transformations in the input space, promoting smoother decision boundaries and greater resilience to adversarial attacks. We propose and evaluate two symmetry-aware architectures: a parallel design that processes standard and equivariant features independently before fusion, and a cascaded design that applies equivariant operations sequentially. Theoretically, we demonstrate that such models reduce hypothesis space complexity, regularize gradients, and yield tighter certified robustness bounds under the CLEVER (Cross Lipschitz Extreme Value for nEtwork Robustness) framework. Empirically, our models consistently improve adversarial robustness and generalization across CIFAR-10, CIFAR-100, and CIFAR-10C under both FGSM and PGD attacks, without requiring adversarial training. These findings underscore the potential of symmetry-enforcing architectures as efficient and principled alternatives to data augmentation-based defenses.
comment: Accepted for the proceedings of 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Learning Nonholonomic Dynamics with Constraint Discovery
We consider learning nonholonomic dynamical systems while discovering the constraints, and describe in detail the case of the rolling disk. A nonholonomic system is a system subject to nonholonomic constraints. Unlike holonomic constraints, nonholonomic constraints do not define a sub-manifold on the configuration space. Therefore, the inverse problem of finding the constraints has to involve the tangent bundle. This paper discusses a general procedure to learn the dynamics of a nonholonomic system through Hamel's formalism, while discovering the system constraint by parameterizing it, given the data set of discrete trajectories on the tangent bundle $TQ$. We prove that there is a local minimum for convergence of the network. We also preserve symmetry of the system by reducing the Lagrangian to the Lie algebra of the selected group.
♻ ☆ Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the proposed SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
comment: v2 note: Corrected numerical values in Table 2 and Figure 4 due to a minor calculation error in v1. The overall conclusions remain unchanged. Code and models available at: https://github.com/tianciB/VFM-VAE
♻ ☆ Computational Basis of LLM's Decision Making in Social Simulation
Large language models (LLMs) increasingly serve as human-like decision-making agents in social science and applied settings. These LLM-agents are typically assigned human-like characters and placed in real-life contexts. However, how these characters and contexts shape an LLM's behavior remains underexplored. This study proposes and tests methods for probing, quantifying, and modifying an LLM's internal representations in a Dictator Game -- a classic behavioral experiment on fairness and prosocial behavior. We extract "vectors of variable variations" (e.g., "male" to "female") from the LLM's internal state. Manipulating these vectors during the model's inference can substantially alter how those variables relate to the model's decision-making. This approach offers a principled way to study and regulate how social concepts can be encoded and engineered within transformer-based models, with implications for alignment, debiasing, and designing AI agents for social simulations in both academic and commercial applications, strengthening sociological theory and measurement.
♻ ☆ Damper-B-PINN: Damper Characteristics-Based Bayesian Physics-Informed Neural Network for Vehicle State Estimation
Accurate state estimation is fundamental to intelligent vehicles. Wheel load, one of the most important chassis states, serves as an essential input for advanced driver assistance systems (ADAS) and exerts a direct influence on vehicle stability and safety. However, wheel load estimation remains challenging due to the complexity of chassis modeling and the susceptibility of nonlinear systems to noise. To address these issues, this paper first introduces a refined suspension linkage-level modeling approach that constructs a nonlinear instantaneous dynamic model by explicitly considering the complex geometric structure of the suspension. Building upon this, we propose a damper characteristics-based Bayesian physics-informed neural network (Damper-B-PINN) framework to estimate dynamic wheel load, which leverages the suspension dynamics as physical guidance of PINN while employing Bayesian inference to mitigate the effects of system noise and uncertainty. Moreover, a damper-characteristic physics conditioning (DPC) module is designed for embedding physical prior. The proposed Damper-B-PINN is evaluated using both high-fidelity simulation datasets generated by CarSim software and real-world datasets collected from a Formula Student race car. Experimental results demonstrate that our Damper-B-PINN consistently outperforms existing methods across various test conditions, particularly extreme ones. These findings highlight the potential of the proposed Damper-B-PINN framework to enhance the accuracy and robustness of dynamic wheel load estimation, thereby improving the reliability and safety of ADAS applications.
♻ ☆ Provable Generalization Bounds for Deep Neural Networks with Momentum-Adaptive Gradient Dropout
Deep neural networks (DNNs) achieve remarkable performance but often suffer from overfitting due to their high capacity. We introduce Momentum-Adaptive Gradient Dropout (MAGDrop), a novel regularization method that dynamically adjusts dropout rates on activations based on current gradients and accumulated momentum, enhancing stability in non-convex optimization landscapes. To theoretically justify MAGDrop's effectiveness, we derive a non-asymptotic, computable PAC-Bayes generalization bound that accounts for its adaptive nature, achieving up to 29.2\% tighter bounds compared to standard approaches by leveraging momentum-driven perturbation control. Empirically, the activation-based MAGDrop achieves competitive performance on MNIST (99.52\%) and CIFAR-10 (92.03\%), with generalization gaps of 0.48\% and 6.52\%, respectively. We provide fully reproducible code and numerical computation of our bounds to validate our theoretical claims. Our work bridges theoretical insights and practical advancements, offering a robust framework for enhancing DNN generalization, making it suitable for high-stakes applications.
comment: 8 pages
♻ ☆ Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound
Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully-supervised learning, weakly-supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. In this study, we introduce a novel multi-agent reinforcement learning-based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel-based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents' erasing process precisely, thereby avoiding both under- and over-segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in-house BUS and ABUS datasets, our Flip Learning method outperforms state-of-the-art WSS methods and foundation models, and achieves comparable performance as fully-supervised learning algorithms.
comment: Accepted by Medical Image Analysis. 24 pages, 13 figures, 20 tabels
♻ ☆ One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra NeurIPS-2025
A common approach to the de novo molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt MIST (Goldman et. al., 2023) as the encoder and MolForge (Ucak et. al., 2023) as the decoder, leveraging additional training data to enhance performance. We also threshold the probabilities of each fingerprint bit to focus on the presence of substructures. This results in a tenfold improvement over previous state-of-the-art methods, generating top-1 31% / top-10 40% of molecular structures correctly from mass spectra in MassSpecGym (Bushuiev et. al., 2024). We position this as a strong baseline for future research in de novo molecule elucidation from mass spectra.
comment: Accepted at AI4Mat-NeurIPS-2025 Workshop
♻ ☆ Neighboring State-based Exploration for Reinforcement Learning
Reinforcement Learning is a powerful tool to model decision-making processes. However, it relies on an exploration-exploitation trade-off that remains an open challenge for many tasks. In this work, we study neighboring state-based, model-free exploration led by the intuition that, for an early-stage agent, considering actions derived from a bounded region of nearby states may lead to better actions when exploring. We propose two algorithms that choose exploratory actions based on a survey of nearby states, and find that one of our methods, ${\rho}$-explore, consistently outperforms the Double DQN baseline in an discrete environment by 49% in terms of Eval Reward Return.
♻ ☆ MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation EMNLP 2025
Multilingual speech translation (ST) and machine translation (MT) in the medical domain enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present the first systematic study on medical ST, to our best knowledge, by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, and Simplified/Traditional Chinese, together with the models. With 290,000 samples, this is the largest medical MT dataset and the largest many-to-many multilingual ST among all domains. Secondly, we present the most comprehensive ST analysis in the field's history, to our best knowledge, including: empirical baselines, bilingual-multilingual comparative study, end-to-end vs. cascaded comparative study, task-specific vs. multi-task sequence-to-sequence comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: https://github.com/leduckhai/MultiMed-ST
comment: EMNLP 2025
♻ ☆ A Generalized Bisimulation Metric of State Similarity between Markov Decision Processes: From Theoretical Propositions to Applications NeurIPS 2025
The bisimulation metric (BSM) is a powerful tool for computing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to multiple-MDP scenarios, such as policy transfer, remains challenging. Prior work has attempted to generalize BSM to pairs of MDPs, but a lack of rigorous analysis of its mathematical properties has limited further theoretical progress. In this work, we formally establish a generalized bisimulation metric (GBSM) between pairs of MDPs, which is rigorously proven with the three fundamental properties: GBSM symmetry, inter-MDP triangle inequality, and the distance bound on identical state spaces. Leveraging these properties, we theoretically analyse policy transfer, state aggregation, and sampling-based estimation in MDPs, obtaining explicit bounds that are strictly tighter than those derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.
comment: This paper is accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Trustworthy AI Must Account for Interactions ICLR 2025
Trustworthy AI encompasses many aspirational aspects for aligning AI systems with human values, including fairness, privacy, robustness, explainability, and uncertainty quantification. Ultimately the goal of Trustworthy AI research is to achieve all aspects simultaneously. However, efforts to enhance one aspect often introduce unintended trade-offs that negatively impact others. In this position paper, we review notable approaches to these five aspects and systematically consider every pair, detailing the negative interactions that can arise. For example, applying differential privacy to model training can amplify biases, undermining fairness. Drawing on these findings, we take the position that current research practices of improving one or two aspects in isolation are insufficient. Instead, research on Trustworthy AI must account for interactions between aspects and adopt a holistic view across all relevant axes at once. To illustrate our perspective, we provide guidance on how practitioners can work towards integrated trust, examples of how interactions affect the financial industry, and alternative views.
comment: Presented at the ICLR 2025 Workshop on Bidirectional Human-AI Alignment
♻ ☆ Dataset Distillation for Offline Reinforcement Learning ICML 2024
Offline reinforcement learning often requires a quality dataset that we can train a policy on. However, in many situations, it is not possible to get such a dataset, nor is it easy to train a policy to perform well in the actual environment given the offline data. We propose using data distillation to train and distill a better dataset which can then be used for training a better policy model. We show that our method is able to synthesize a dataset where a model trained on it achieves similar performance to a model trained on the full dataset or a model trained using percentile behavioral cloning. Our project site is available at https://datasetdistillation4rl.github.io . We also provide our implementation at https://github.com/ggflow123/DDRL .
comment: ICML 2024 DMLR Workshop Our project site is available at https://datasetdistillation4rl.github.io We also provide our implementation at https://github.com/ggflow123/DDRL
♻ ☆ Where and How to Perturb: On the Design of Perturbation Guidance in Diffusion and Flow Models NeurIPS 2025
Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
comment: Accepted at NeurIPS 2025. Project page: https://cvlab-kaist.github.io/HeadHunter/
♻ ☆ A Basic Evaluation of Neural Networks Trained with the Error Diffusion Learning Algorithm
This paper presents a comprehensive formulation of Kaneko's Error Diffusion Learning Algorithm (EDLA) and evaluates its effectiveness across parity check, regression, and image classification tasks. EDLA is a biologically inspired learning algorithm that provides an alternative to conventional backpropagation for training artificial neural networks. EDLA employs a single global error signal that diffuses across networks composed of paired positive and negative sublayers, eliminating traditional layer-wise error backpropagation. This study evaluates EDLA's effectiveness using benchmark tasks, such as parity check, regression, and image classification, by systematically varying the neuron count, network depth, and learning rates to assess its performance comprehensively. The experimental results demonstrate that EDLA achieves consistently high accuracy across multiple benchmarks, highlighting its effectiveness as a learning algorithm for neural networks. The choice of learning rate, neuron count, and network depth significantly influences EDLA's efficiency and convergence speed. Analysis of internal network representations reveals meaningful feature extraction capabilities, and the network's overall performance is found to be competitive with networks trained via conventional backpropagation, especially in shallow architectures. This study introduces EDLA, a biologically plausible alternative to traditional backpropagation previously underrecognized due to language barriers. By reformulating EDLA, systematically evaluating its performance, and presenting empirical evidence of its effectiveness, this study increases the visibility and accessibility of EDLA and contributes to biologically inspired training methodologies.
♻ ☆ Large Stepsizes Accelerate Gradient Descent for Regularized Logistic Regression NeurIPS 2025
We study gradient descent (GD) with a constant stepsize for $\ell_2$-regularized logistic regression with linearly separable data. Classical theory suggests small stepsizes to ensure monotonic reduction of the optimization objective, achieving exponential convergence in $\widetilde{\mathcal{O}}(\kappa)$ steps with $\kappa$ being the condition number. Surprisingly, we show that this can be accelerated to $\widetilde{\mathcal{O}}(\sqrt{\kappa})$ by simply using a large stepsize -- for which the objective evolves nonmonotonically. The acceleration brought by large stepsizes extends to minimizing the population risk for separable distributions, improving on the best-known upper bounds on the number of steps to reach a near-optimum. Finally, we characterize the largest stepsize for the local convergence of GD, which also determines the global convergence in special scenarios. Our results extend the analysis of Wu et al. (2024) from convex settings with minimizers at infinity to strongly convex cases with finite minimizers.
comment: NeurIPS 2025 camera ready version
♻ ☆ Localized Kernel Projection Outlyingness: A Two-Stage Approach for Multi-Modal Outlier Detection
This paper presents Two-Stage LKPLO, a novel multi-stage outlier detection framework that overcomes the coexisting limitations of conventional projection-based methods: their reliance on a fixed statistical metric and their assumption of a single data structure. Our framework uniquely synthesizes three key concepts: (1) a generalized loss-based outlyingness measure (PLO) that replaces the fixed metric with flexible, adaptive loss functions like our proposed SVM-like loss; (2) a global kernel PCA stage to linearize non-linear data structures; and (3) a subsequent local clustering stage to handle multi-modal distributions. Comprehensive 5-fold cross-validation experiments on 10 benchmark datasets, with automated hyperparameter optimization, demonstrate that Two-Stage LKPLO achieves state-of-the-art performance. It significantly outperforms strong baselines on datasets with challenging structures where existing methods fail, most notably on multi-cluster data (Optdigits) and complex, high-dimensional data (Arrhythmia). Furthermore, an ablation study empirically confirms that the synergistic combination of both the kernelization and localization stages is indispensable for its superior performance. This work contributes a powerful new tool for a significant class of outlier detection problems and underscores the importance of hybrid, multi-stage architectures.
comment: 10 pages, 4 figures; submitted to The IEICE Transactions on Information and Systems
Human-Computer Interaction 21
☆ AlloyLens: A Visual Analytics Tool for High-throughput Alloy Screening and Inverse Design IEEE VIS 2025
Designing multi-functional alloys requires exploring high-dimensional composition-structure-property spaces, yet current tools are limited to low-dimensional projections and offer limited support for sensitivity or multi-objective tradeoff reasoning. We introduce AlloyLens, an interactive visual analytics system combining a coordinated scatterplot matrix (SPLOM), dynamic parameter sliders, gradient-based sensitivity curves, and nearest neighbor recommendations. This integrated approach reveals latent structure in simulation data, exposes the local impact of compositional changes, and highlights tradeoffs when exact matches are absent. We validate the system through case studies co-developed with domain experts spanning structural, thermal, and electrical alloy design.
comment: IEEE VIS 2025 Scientific Visualization Contest Honorable Mention Links: PyPI package (Jupyter widget): https://pypi.org/project/alloylens/ Web demo: http://susiesyli.com/alloylens-web/ GitHub: https://github.com/susiesyli/alloylens SciVis contest 2025: https://sciviscontest2025.github.io/data/
☆ NeuResonance: Exploring Feedback Experiences for Fostering the Inter-brain Synchronization
When several individuals collaborate on a shared task, their brain activities often synchronize. This phenomenon, known as Inter-brain Synchronization (IBS), is notable for inducing prosocial outcomes such as enhanced interpersonal feelings, including closeness, trust, empathy, and more. Further strengthening the IBS with the aid of external feedback would be beneficial for scenarios where those prosocial feelings play a vital role in interpersonal communication, such as rehabilitation between a therapist and a patient, motor skill learning between a teacher and a student, and group performance art. This paper investigates whether visual, auditory, and haptic feedback of the IBS level can further enhance its intensity, offering design recommendations for feedback systems in IBS. We report findings when three different types of feedback were provided: IBS level feedback by means of on-body projection mapping, sonification using chords, and vibration bands attached to the wrist.
comment: Conference Paper, 16 pages. Published at the 2025 CHI Conference on Human Factors in Computing Systems
☆ Exploring the Effect of Viewing Attributes of Mobile AR Interfaces on Remote Collaborative and Competitive Tasks
Mobile devices have the potential to facilitate remote tasks through Augmented Reality (AR) solutions by integrating digital information into the real world. Although prior studies have explored Mobile Augmented Reality (MAR) for co-located collaboration, none have investigated the impact of various viewing attributes that can influence remote task performance, such as target object viewing angles, synchronization styles, or having a secondary small screen showing other users current view in the MAR environment. In this paper, we explore five techniques considering these attributes, specifically designed for two modes of remote tasks: collaborative and competitive. We conducted a user study employing various combinations of those attributes for both tasks. In both instances, results indicate users' optimal performance and preference for the technique that allows asynchronous viewing of object manipulations on the small screen. Overall, this paper contributes novel techniques for remote tasks in MAR, addressing aspects such as viewing angle and synchronization in object manipulation alongside secondary small-screen interfaces. Additionally, it presents the results of a user study evaluating the effectiveness, usability, and user preference of these techniques in remote settings and offers a set of recommendations for designing and implementing MAR solutions to enhance remote activities.
☆ Exploring Pointer Enhancement Techniques for Target Selection on Large Curved Display
Large curved displays are becoming increasingly popular due to their ability to provide users with a wider field of view and a more immersive experience compared to flat displays. Current interaction techniques for large curved displays often assume a user is positioned at the display's centre, crucially failing to accommodate general use conditions where the user may move during use. In this work, we investigated how user position impacts pointing interaction on large curved displays and evaluated cursor enhancement techniques to provide faster and more accurate performance across positions. To this effect, we conducted two user studies. First, we evaluated the effects of user position on pointing performance on a large semi-circular display (3m-tall, 3270R curvature) through a 2D Fitts' Law selection task. Our results indicate that as users move away from the display, their pointing speed significantly increases (at least by 9%), but accuracy decreases (by at least 6%). Additionally, we observed participants were slower when pointing from laterally offset positions. Secondly, we explored which pointing techniques providing motor- and visual-space enhancements best afford effective pointing performance across user positions. Across a total of six techniques tested, we found that a combination of acceleration and distance-based adjustments with cursor enlargement significantly improves target selection speed and accuracy across different user positions. Results further show techniques with visual-space enhancements (e.g., cursor enlargement) are significantly faster and more accurate than their non-visually-enhanced counterparts. Based on our results we provide design recommendations for implementing cursor enhancement techniques for large curved displays.
☆ Exploring ChatGPT's Capabilities, Stability, Potential and Risks in Conducting Psychological Counseling through Simulations in School Counseling
To provide an exploratory analysis of ChatGPT-4's quantitative performance indicators in simulated school-counseling settings. Conversational artificial intelligence (AI) has shown strong capabilities in providing low-cost and timely interventions for a wide range of people and increasing well-being. Therefore, this study examined ChatGPT's capabilities, including response stability in conducting psychological counseling and its potential for providing accessible psychological interventions, especially in school settings. We prompted ChatGPT-4 with 80 real-world college-student counseling questions. Replies were quantified with APA-informed NLP tools to measure warmth, empathy, and acceptance, and run-to-run stability was assessed via Fleiss' \k{appa} and ICC(2,1). ChatGPT-4 achieved high warmth (97.5%), empathy (94.2%), and positive acceptance (mean compound score = 0.93 plus/minus 0.19), with moderate stability (ICC(2,1) = 0.62; \k{appa} = 0.59). Occasional randomness in responses highlights risk areas requiring human oversight. As an offline, single-model text simulation without clinical validation, these results remain exploratory. Future work should involve live users, compare multiple LLMs, and incorporate mixed-methods validation to assess real-world efficacy and safety. The findings suggest ChatGPT-4 could augment low-intensity mental-health support in educational settings, guiding the design of human-in-the-loop workflows, policy regulations, and product roadmaps. This is among the first exploratory studies to apply quantitative stability metrics and NLP-based emotion detection to ChatGPT-4 in a school-counseling context and to integrate a practitioner's perspective to inform future research, product development, and policy.
☆ Student Engagement in AI Assisted Complex Problem Solving: A Pilot Study of Human AI Rubik's Cube Collaboration
Games and puzzles play important pedagogical roles in STEM learning. New AI algorithms that can solve complex problems offer opportunities for scaffolded instruction in puzzle solving. This paper presents the ALLURE system, which uses an AI algorithm (DeepCubeA) to guide students in solving a common first step of the Rubik's Cube (i.e., the white cross). Using data from a pilot study we present preliminary findings about students' behaviors in the system, how these behaviors are associated with STEM skills - including spatial reasoning, critical thinking and algorithmic thinking. We discuss how data from ALLURE can be used in future educational data mining to understand how students benefit from AI assistance and collaboration when solving complex problems.
☆ The Ghost in the Keys: A Disklavier Demo for Human-AI Musical Co-Creativity
While generative models for music composition are increasingly capable, their adoption by musicians is hindered by text-prompting, an asynchronous workflow disconnected from the embodied, responsive nature of instrumental performance. To address this, we introduce Aria-Duet, an interactive system facilitating a real-time musical duet between a human pianist and Aria, a state-of-the-art generative model, using a Yamaha Disklavier as a shared physical interface. The framework enables a turn-taking collaboration: the user performs, signals a handover, and the model generates a coherent continuation performed acoustically on the piano. Beyond describing the technical architecture enabling this low-latency interaction, we analyze the system's output from a musicological perspective, finding the model can maintain stylistic semantics and develop coherent phrasal ideas, demonstrating that such embodied systems can engage in musically sophisticated dialogue and open a promising new path for human-AI co-creation.
☆ Beyond Permissions: Investigating Mobile Personalization with Simulated Personas CCS 2025
Mobile applications increasingly rely on sensor data to infer user context and deliver personalized experiences. Yet the mechanisms behind this personalization remain opaque to users and researchers alike. This paper presents a sandbox system that uses sensor spoofing and persona simulation to audit and visualize how mobile apps respond to inferred behaviors. Rather than treating spoofing as adversarial, we demonstrate its use as a tool for behavioral transparency and user empowerment. Our system injects multi-sensor profiles - generated from structured, lifestyle-based personas - into Android devices in real time, enabling users to observe app responses to contexts such as high activity, location shifts, or time-of-day changes. With automated screenshot capture and GPT-4 Vision-based UI summarization, our pipeline helps document subtle personalization cues. Preliminary findings show measurable app adaptations across fitness, e-commerce, and everyday service apps such as weather and navigation. We offer this toolkit as a foundation for privacy-enhancing technologies and user-facing transparency interventions.
comment: 8 pages, 7 figures. Accepted to the ACM Workshop on Human-Centered AI Privacy and Security (HAIPS @ CCS 2025). DOI: 10.1145/3733816.3760758 (ACM Digital Library link pending activation)
☆ Embodied Cognition Augmented End2End Autonomous Driving
In recent years, vision-based end-to-end autonomous driving has emerged as a new paradigm. However, popular end-to-end approaches typically rely on visual feature extraction networks trained under label supervision. This limited supervision framework restricts the generality and applicability of driving models. In this paper, we propose a novel paradigm termed $E^{3}AD$, which advocates for comparative learning between visual feature extraction networks and the general EEG large model, in order to learn latent human driving cognition for enhancing end-to-end planning. In this work, we collected a cognitive dataset for the mentioned contrastive learning process. Subsequently, we investigated the methods and potential mechanisms for enhancing end-to-end planning with human driving cognition, using popular driving models as baselines on publicly available autonomous driving datasets. Both open-loop and closed-loop tests are conducted for a comprehensive evaluation of planning performance. Experimental results demonstrate that the $E^{3}AD$ paradigm significantly enhances the end-to-end planning performance of baseline models. Ablation studies further validate the contribution of driving cognition and the effectiveness of comparative learning process. To the best of our knowledge, this is the first work to integrate human driving cognition for improving end-to-end autonomous driving planning. It represents an initial attempt to incorporate embodied cognitive data into end-to-end autonomous driving, providing valuable insights for future brain-inspired autonomous driving systems. Our code will be made available at Github
comment: 24 pages,4 pages
☆ AskNow: An LLM-powered Interactive System for Real-Time Question Answering in Large-Scale Classrooms
In large-scale classrooms, students often struggle to ask questions due to limited instructor attention and social pressure. Based on findings from a formative study with 24 students and 12 instructors, we designed AskNow, an LLM-powered system that enables students to ask questions and receive real-time, context-aware responses grounded in the ongoing lecture and that allows instructors to view students' questions collectively. We deployed AskNow in three university computer science courses and tested with 117 students. To evaluate AskNow's responses, each instructor rated the perceived correctness and satisfaction of 100 randomly sampled AskNow-generated responses. In addition, we conducted interviews with 24 students and the three instructors to understand their experience with AskNow. We found that AskNow significantly reduced students' perceived time to resolve confusion. Instructors rated AskNow's responses as highly accurate and satisfactory. Instructor and student feedback provided insights into supporting real-time learning in large lecture settings.
comment: 18 pages, 9 figures
☆ Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark
We review human evaluation practices in automated, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models -- each trained by its original authors -- across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results provide strong evidence that 1) newer models do not consistently outperform earlier approaches; 2) published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. Finally, in order to drive standardisation and enable new evaluation research, we will release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies -- enabling new evaluations without model reimplementation required -- alongside our open-source rendering script, and the 16,000 pairwise human preference votes collected for our benchmark.
comment: 23 pages, 10 figures. The last two authors made equal contributions
☆ When Machines Join the Moral Circle: The Persona Effect of Generative AI Agents in Collaborative Reasoning
Generative AI is increasingly positioned as a peer in collaborative learning, yet its effects on ethical deliberation remain unclear. We report a between-subjects experiment with university students (N=217) who discussed an autonomous-vehicle dilemma in triads under three conditions: human-only control, supportive AI teammate, or contrarian AI teammate. Using moral foundations lexicons, argumentative coding from the augmentative knowledge construction framework, semantic trajectory modelling with BERTopic and dynamic time warping, and epistemic network analysis, we traced how AI personas reshape moral discourse. Supportive AIs increased grounded/qualified claims relative to control, consolidating integrative reasoning around care/fairness, while contrarian AIs modestly broadened moral framing and sustained value pluralism. Both AI conditions reduced thematic drift compared with human-only groups, indicating more stable topical focus. Post-discussion justification complexity was only weakly predicted by moral framing and reasoning quality, and shifts in final moral decisions were driven primarily by participants' initial stance rather than condition. Overall, AI teammates altered the process, the distribution and connection of moral frames and argument quality, more than the outcome of moral choice, highlighting the potential of generative AI agents as teammates for eliciting reflective, pluralistic moral reasoning in collaborative learning.
♻ ☆ GTAlign: Game-Theoretic Alignment of LLM Assistants for Social Welfare
Large Language Models (LLMs) have achieved remarkable progress in reasoning, yet sometimes produce responses that are suboptimal for users in tasks such as writing, information seeking, or providing practical guidance. Conventional alignment practices typically assume that maximizing model reward also maximizes user welfare, but this assumption frequently fails in practice: models may over-clarify or generate overly verbose reasoning when users prefer concise answers. Such behaviors resemble the prisoner's dilemma, where individually rational choices lead to socially suboptimal outcomes. The fundamental challenge is the lack of a principled decision making mechanism that mutually benefits both the LLM and the user. We propose Game-Theoretic Alignment (GTAlign), an alignment framework that integrates game-theoretic decision making into both reasoning and training. During reasoning, the model explicitly treats user-LLM interaction as a strategic game: it constructs payoff matrices within its reasoning chain to estimate welfare for both itself and the user, and then selects actions that are mutually beneficial. During training, we introduce a social welfare reward that reinforces cooperative responses, aligning model behavior with socially efficient outcomes. In addition, we introduce an inference technique that leverages game-theoretic reasoning to dynamically adapt LLM's response when pricing policies of LLM service change. Extensive experiments demonstrate that GTAlign substantially improves reasoning efficiency, answer quality, and social welfare compared to baselines across diverse tasks. The code is available at https://github.com/ulab-uiuc/GTAlign .
comment: 31 pages, 6 figures
♻ ☆ Sound Clouds: Exploring ambient intelligence in public spaces to elicit deep human experience of awe, wonder, and beauty NeurIPS
While the ambient intelligence (AmI) systems we encounter in our daily lives, including security monitoring and energy-saving systems, typically serve pragmatic purposes, we wonder how we can design and implement ambient artificial intelligence experiences in public spaces that elicit deep human feelings of awe, wonder, and beauty. As a manifestation, we introduce Sound Clouds, an immersive art installation that generates live music based on participants' interaction with several human-height spheres. Our installation serves as a provocation into future ambient intelligence that provokes, not limits, the future possibilities of AmI.
comment: 4 pages, Artwork accepted by NeurIPS Creative AI Track 2025
♻ ☆ If They Disagree, Will You Conform? Exploring the Role of Robots' Value Awareness in a Decision-Making Task
This study investigates whether the opinions of robotic agents can influence human decision-making when robots display value awareness (i.e., the capability of understanding human preferences and prioritizing them in decision-making). We designed an experiment in which participants interacted with two Furhat robots - one programmed to be Value-Aware and the other Non-Value-Aware - during a labeling task for images representing human values. Results indicate that participants distinguished the Value-Aware robot from the Non-Value-Aware one. Although their explicit choices did not indicate a clear preference for one robot over the other, participants directed their gaze more toward the Value-Aware robot. Additionally, the Value-Aware robot was perceived as more loyal, suggesting that value awareness in a social robot may enhance its perceived commitment to the group. Finally, when both robots disagreed with the participant, conformity occurred in about one out of four trials, and participants took longer to confirm their responses, suggesting that two robots expressing dissent may introduce hesitation in decision-making. On one hand, this highlights the potential risk that robots, if misused, could manipulate users for unethical purposes. On the other hand, it reinforces the idea that social robots could encourage reflection in ambiguous situations and help users avoid scams.
comment: Pre-print version
♻ ☆ In Dialogue with Intelligence: Rethinking Large Language Models as Collective Knowledge
Large Language Models (LLMs) can be understood as Collective Knowledge (CK): a condensation of human cultural and technical output, whose apparent intelligence emerges in dialogue. This perspective article, drawing on extended interaction with ChatGPT-4, postulates differential response modes that plausibly trace their origin to distinct model subnetworks. It argues that CK has no persistent internal state or ``spine'': it drifts, it complies, and its behaviour is shaped by the user and by fine-tuning. It develops the notion of co-augmentation, in which human judgement and CK's representational reach jointly produce forms of analysis that neither could generate alone. Finally, it suggests that CK offers a tractable object for neuroscience: unlike biological brains, these systems expose their architecture, training history, and activation dynamics, making the human--CK loop itself an experimental target.
comment: 7 pages, 1 table
♻ ☆ AI-Guided Molecular Simulations in VR: Exploring Strategies for Imitation Learning in Hyperdimensional Molecular Systems ECAI24
Molecular dynamics (MD) simulations are a crucial computational tool for researchers to understand and engineer molecular structure and function in areas such as drug discovery, protein engineering, and material design. Despite their utility, MD simulations are expensive, owing to the high dimensionality of molecular systems. Interactive molecular dynamics in virtual reality (iMD-VR) has recently emerged as a "human-in-the-loop" strategy for efficiently navigating hyper-dimensional molecular systems. By providing an immersive 3D environment that enables visualization and manipulation of real-time molecular simulations running on high-performance computing architectures, iMD-VR enables researchers to reach out and guide molecular conformational dynamics, in order to efficiently explore complex, high-dimensional molecular systems. Moreover, iMD-VR simulations generate rich datasets that capture human experts' spatial insight regarding molecular structure and function. This paper explores the use of researcher-generated iMD-VR datasets to train AI agents via imitation learning (IL). IL enables agents to mimic complex behaviours from expert demonstrations, circumventing the need for explicit programming or intricate reward design. In this article, we review IL across robotics and Multi-agents systems domains which are comparable to iMD-VR, and discuss how iMD-VR recordings could be used to train IL models to interact with MD simulations. We then illustrate the applications of these ideas through a proof-of-principle study where iMD-VR data was used to train a CNN network on a simple molecular manipulation task; namely, threading a small molecule through a nanotube pore. Finally, we outline future research directions and potential challenges of using AI agents to augment human expertise in navigating vast molecular conformational spaces.
comment: (First presented at the First Workshop on "eXtended Reality \& Intelligent Agents" (XRIA24) @ ECAI24, Santiago De Compostela (Spain), 20 October 2024)
♻ ☆ A Multidimensional Assessment Method for Situated Visualization Understanding (MdamV)
How audiences read, interpret, and critique data visualizations is mainly assessed through performance tests featuring tasks like value retrieval. Yet, other factors shown to shape visualization understanding, such as numeracy, graph familiarity, and aesthetic perception, remain underrepresented in existing instruments. To address this, we design and test a Multidimensional Assessment Method of Situated Visualization Understanding (MdamV). This method integrates task-based measures with self-perceived ability ratings and open-ended critique, applied directly to the visualizations being read. Grounded in learning sciences frameworks that view understanding as a multifaceted process, MdamV spans six dimensions: Comprehending, Decoding, Aestheticizing, Critiquing, Reading, and Contextualizing. Validation was supported by a survey (N=438) representative of Austria's population (ages 18-74, male/female split), using a line chart and a bar chart on climate data. Findings show, for example, that about a quarter of respondents indicate deficits in comprehending simple data units, roughly one in five people felt unfamiliar with each chart type, and self-assessed numeracy was significantly related to data reading performance (p=0.0004). Overall, the evaluation of MdamV demonstrates the value of assessing visualization understanding beyond performance, framing it as a situated process tied to particular visualizations.
comment: 11 pages, 8 figures, 2 tables
♻ ☆ UI-Evol: Automatic Knowledge Evolving for Computer Use Agents ICML 2025
External knowledge has played a crucial role in the recent development of computer use agents. We identify a critical knowledge-execution gap: retrieved knowledge often fails to translate into effective real-world task execution. Our analysis shows even 90% correct knowledge yields only 41% execution success rate. To bridge this gap, we propose UI-Evol, a plug-and-play module for autonomous GUI knowledge evolution. UI-Evol consists of two stages: a Retrace Stage that extracts faithful objective action sequences from actual agent-environment interactions, and a Critique Stage that refines existing knowledge by comparing these sequences against external references. We conduct comprehensive experiments on the OSWorld benchmark with the state-of-the-art Agent S2. Our results demonstrate that UI-Evol not only significantly boosts task performance but also addresses a previously overlooked issue of high behavioral standard deviation in computer use agents, leading to superior performance on computer use tasks and substantially improved agent reliability.
comment: Accepted to ICML 2025 Workshop on Computer Use Agents
♻ ☆ Exploring Student-AI Interactions in Vibe Coding
Background and Context. Chat-based and inline-coding-based GenAI has already had substantial impact on the CS Education community. The recent introduction of ``vibe coding'' may further transform how students program, as it introduces a new way for students to create software projects with minimal oversight. Objectives. The purpose of this study is to understand how students in introductory programming and advanced software engineering classes interact with a vibe coding platform (Replit) when creating software and how the interactions differ by programming background. Methods. Interview participants were asked to think-aloud while building a web application using Replit. Thematic analysis was then used to analyze the video recordings with an emphasis on the interactions between the student and Replit. Findings. For both groups, the majority of student interactions with Replit were to test or debug the prototype and only rarely did students visit code. Prompts by advanced software engineering students were much more likely to include relevant app feature and codebase contexts than those by introductory programming students.
♻ ☆ Learning Complementary Policies for Human-AI Teams
This paper tackles the critical challenge of human-AI complementarity in decision-making. Departing from the traditional focus on algorithmic performance in favor of performance of the human-AI team, and moving past the framing of collaboration as classification to focus on decision-making tasks, we introduce a novel approach to policy learning. Specifically, we develop a robust solution for human-AI collaboration when outcomes are only observed under assigned actions. We propose a deferral collaboration approach that maximizes decision rewards by exploiting the distinct strengths of humans and AI, strategically allocating instances among them. Critically, our method is robust to misspecifications in both the human behavior and reward models. Leveraging the insight that performance gains stem from divergent human and AI behavioral patterns, we demonstrate, using synthetic and real human responses, that our proposed method significantly outperforms independent human and algorithmic decision-making. Moreover, we show that substantial performance improvements are achievable by routing only a small fraction of instances to human decision-makers, highlighting the potential for efficient and effective human-AI collaboration in complex management settings.
comment: Previous name: Robust Human-AI Collaboration with Bandit Feedback; Best student paper award at Conference on Information Systems and Technology (CIST), 2022
Software Engineering 21
☆ Metamorphic Testing of Large Language Models for Natural Language Processing
Using large language models (LLMs) to perform natural language processing (NLP) tasks has become increasingly pervasive in recent times. The versatile nature of LLMs makes them applicable to a wide range of such tasks. While the performance of recent LLMs is generally outstanding, several studies have shown that they can often produce incorrect results. Automatically identifying these faulty behaviors is extremely useful for improving the effectiveness of LLMs. One obstacle to this is the limited availability of labeled datasets, which necessitates an oracle to determine the correctness of LLM behaviors. Metamorphic testing (MT) is a popular testing approach that alleviates this oracle problem. At the core of MT are metamorphic relations (MRs), which define relationships between the outputs of related inputs. MT can expose faulty behaviors without the need for explicit oracles (e.g., labeled datasets). This paper presents the most comprehensive study of MT for LLMs to date. We conducted a literature review and collected 191 MRs for NLP tasks. We implemented a representative subset (36 MRs) to conduct a series of experiments with three popular LLMs, running approximately 560,000 metamorphic tests. The results shed light on the capabilities and opportunities of MT for LLMs, as well as its limitations.
☆ SmartMLOps Studio: Design of an LLM-Integrated IDE with Automated MLOps Pipelines for Model Development and Monitoring
The rapid expansion of artificial intelligence and machine learning (ML) applications has intensified the demand for integrated environments that unify model development, deployment, and monitoring. Traditional Integrated Development Environments (IDEs) focus primarily on code authoring, lacking intelligent support for the full ML lifecycle, while existing MLOps platforms remain detached from the coding workflow. To address this gap, this study proposes the design of an LLM-Integrated IDE with automated MLOps pipelines that enables continuous model development and monitoring within a single environment. The proposed system embeds a Large Language Model (LLM) assistant capable of code generation, debugging recommendation, and automatic pipeline configuration. The backend incorporates automated data validation, feature storage, drift detection, retraining triggers, and CI/CD deployment orchestration. This framework was implemented in a prototype named SmartMLOps Studio and evaluated using classification and forecasting tasks on the UCI Adult and M5 datasets. Experimental results demonstrate that SmartMLOps Studio reduces pipeline configuration time by 61%, improves experiment reproducibility by 45%, and increases drift detection accuracy by 14% compared to traditional workflows. By bridging intelligent code assistance and automated operational pipelines, this research establishes a novel paradigm for AI engineering - transforming the IDE from a static coding tool into a dynamic, lifecycle-aware intelligent platform for scalable and efficient model development.
☆ Context-Guided Decompilation: A Step Towards Re-executability
Binary decompilation plays an important role in software security analysis, reverse engineering, and malware understanding when source code is unavailable. However, existing decompilation techniques often fail to produce source code that can be successfully recompiled and re-executed, particularly for optimized binaries. Recent advances in large language models (LLMs) have enabled neural approaches to decompilation, but the generated code is typically only semantically plausible rather than truly executable, limiting their practical reliability. These shortcomings arise from compiler optimizations and the loss of semantic cues in compiled code, which LLMs struggle to recover without contextual guidance. To address this challenge, we propose ICL4Decomp, a hybrid decompilation framework that leverages in-context learning (ICL) to guide LLMs toward generating re-executable source code. We evaluate our method across multiple datasets, optimization levels, and compilers, demonstrating around 40\% improvement in re-executability over state-of-the-art decompilation methods while maintaining robustness.
☆ Towards LLM-Powered Task-Aware Retrieval of Scientific Workflows for Galaxy
Scientific Workflow Management Systems (SWfMSs) such as Galaxy have become essential infrastructure in bioinformatics, supporting the design, execution, and sharing of complex multi-step analyses. Despite hosting hundreds of reusable workflows across domains, Galaxy's current keyword-based retrieval system offers limited support for semantic query interpretation and often fails to surface relevant workflows when exact term matches are absent. To address this gap, we propose a task-aware, two-stage retrieval framework that integrates dense vector search with large language model (LLM)-based reranking. Our system first retrieves candidate workflows using state-of-the-art embedding models and then reranks them using instruction-tuned generative LLMs (GPT-4o, Mistral-7B) based on semantic task alignment. To support robust evaluation, we construct a benchmark dataset of Galaxy workflows annotated with semantic topics via BERTopic and synthesize realistic task-oriented queries using LLMs. We conduct a comprehensive comparison of lexical, dense, and reranking models using standard IR metrics, presenting the first systematic evaluation of retrieval performance in the Galaxy ecosystem. Results show that our approach significantly improves top-k accuracy and relevance, particularly for long or under-specified queries. We further integrate our system as a prototype tool within Galaxy, providing a proof-of-concept for LLM-enhanced workflow search. This work advances the usability and accessibility of scientific workflows, especially for novice users and interdisciplinary researchers.
☆ From Pre-labeling to Production: Engineering Lessons from a Machine Learning Pipeline in the Public Sector
Machine learning is increasingly being embedded into government digital platforms, but public-sector constraints make it difficult to build ML systems that are accurate, auditable, and operationally sustainable. In practice, teams face not only technical issues like extreme class imbalance and data drift, but also organizational barriers such as bureaucratic data access, lack of versioned datasets, and incomplete governance over provenance and monitoring. Our study of the Brasil Participativo (BP) platform shows that common engineering choices -- like using LLMs for pre-labeling, splitting models into routed classifiers, and generating synthetic data -- can speed development but also introduce new traceability, reliability, and cost risks if not paired with disciplined data governance and human validation. This means that, in the public sector, responsible ML is not just a modeling problem but an institutional engineering problem, and ML pipelines must be treated as civic infrastructure. Ultimately, this study shows that the success of machine learning in the public sector will depend less on breakthroughs in model accuracy and more on the ability of institutions to engineer transparent, reproducible, and accountable data infrastructures that citizens can trust.
comment: 11 pages, 2 figures, 4 tables
☆ Hidden in Plain Sight: Where Developers Confess Self-Admitted Technical Debt
Context. Detecting Self-Admitted Technical Debt (SATD) is crucial for proactive software maintenance. Previous research has primarily targeted detecting and prioritizing SATD, with little focus on the source code afflicted with SATD. Our goal in this work is to connect the SATD comments with source code constructs that surround them. Method. We leverage the extensive SATD dataset PENTACET, containing code comments from over 9000 Java Open Source Software (OSS) repositories. We quantitatively infer where SATD most commonly occurs and which code constructs/statements it most frequently affects. Results and Conclusions. Our large-scale study links over 225,000 SATD comments to their surrounding code, showing that SATD mainly arises in inline code near definitions, conditionals, and exception handling, where developers face uncertainty and trade-offs, revealing it as an intentional signal of awareness during change rather than mere neglect.
☆ LLM-Assisted Tool for Joint Generation of Formulas and Functions in Rule-Based Verification of Map Transformations
High-definition map transformations are essential in autonomous driving systems, enabling interoperability across tools. Ensuring their semantic correctness is challenging, since existing rule-based frameworks rely on manually written formulas and domain-specific functions, limiting scalability. In this paper, We present an LLM-assisted pipeline that jointly generates logical formulas and corresponding executable predicates within a computational FOL framework, extending the map verifier in CommonRoad scenario designer with elevation support. The pipeline leverages prompt-based LLM generation to produce grammar-compliant rules and predicates that integrate directly into the existing system. We implemented a prototype and evaluated it on synthetic bridge and slope scenarios. The results indicate reduced manual engineering effort while preserving correctness, demonstrating the feasibility of a scalable, semi-automated human-in-the-loop approach to map-transformation verification.
☆ VeriODD: From YAML to SMT-LIB -- Automating Verification of Operational Design Domains
Operational Design Domains (ODDs) define the conditions under which an Automated Driving System (ADS) is allowed to operate, while Current Operational Domains (CODs) capture the actual runtime situation. Ensuring that a COD instance lies within the ODD is a crucial step in safety assurance. Today, ODD and COD specifications are frequently expressed in YAML to remain accessible for stakeholders, but such descriptions are not directly suitable for solver-based verification. Manual translation into formal languages such as SMT-LIB is slow and error-prone. We present VeriODD, a tool that automates this translation. VeriODD uses ANTLR-based compiler technology to transform YAML-based ODD/COD specifications into both human-readable propositional logic, for lightweight review on a simple basis, and solver-ready SMT-LIB. The tool integrates with SMT solvers such as Z3 to provide automated consistency checks of ODD specifications and verification of COD conformance. A graphical user interface supports editing specifications, inspecting generated formulas, and performing verification with a single click. VeriODD thereby closes the gap between stakeholder-friendly ODD/COD notations and formal verification, enabling scalable and automated assurance of operational boundaries in autonomous driving. Video demonstration: https://youtu.be/odRacNoL_Pk Tool available at: https://github.com/BasselRafie/VeriODD
☆ Characterizing Build Compromises Through Vulnerability Disclosure Analysis
The software build process transforms source code into deployable artifacts, representing a critical yet vulnerable stage in software development. Build infrastructure security poses unique challenges: the complexity of multi-component systems (source code, dependencies, build tools), the difficulty of detecting intrusions during compilation, and prevalent build non-determinism that masks malicious modifications. Despite these risks, the security community lacks a systematic understanding of build-specific attack vectors, hindering effective defense design. This paper presents an empirically-derived taxonomy of attack vectors targeting the build process, constructed through a large-scale CVE mining (of 621 vulnerability disclosures from the NVD database). We categorize attack vectors by their injection points across the build pipeline, from source code manipulation to compiler compromise. To validate our taxonomy, we analyzed 168 documented software supply chain attacks, identifying 40 incidents specifically targeting build phases. Our analysis reveals that 23.8\% of supply chain attacks exploit build vulnerabilities, with dependency confusion and build script injection representing the most prevalent vectors. Dataset available at: https://anonymous.4open.science/r/Taxonomizing-Build-Attacks-8BB0.
☆ The Future of Generative AI in Software Engineering: A Vision from Industry and Academia in the European GENIUS Project
Generative AI (GenAI) has recently emerged as a groundbreaking force in Software Engineering, capable of generating code, suggesting fixes, and supporting quality assurance. While its use in coding tasks shows considerable promise, applying GenAI across the entire Software Development Life Cycle (SDLC) has not yet been fully explored. Critical uncertainties in areas such as reliability, accountability, security, and data privacy demand deeper investigation and coordinated action. The GENIUS project, comprising over 30 European industrial and academic partners, aims to address these challenges by advancing AI integration across all SDLC phases. It focuses on GenAI's potential, the development of innovative tools, and emerging research challenges, actively shaping the future of software engineering. This vision paper presents a shared perspective on the future of GenAI-based software engineering, grounded in cross-sector dialogue and experience within the GENIUS consortium, supported by an exploratory literature review. The paper explores four central elements: (1) a structured overview of current challenges in GenAI adoption across the SDLC; (2) a forward-looking vision outlining key technological and methodological advances expected over the next five years; (3) anticipated shifts in the roles and required skill sets of software professionals; and (4) the contribution of GENIUS in realizing this transformation through practical tools and industrial validation. By aligning technical innovation with business relevance, this paper aims to inform both research agendas and industrial strategies, providing a foundation for reliable, scalable, and industry-ready GenAI solutions for software engineering teams.
comment: Submitted to 2nd IEEE/ACM International Conference on AI-powered Software (AIware 2025)
☆ Exploringand Unleashing the Power of Large Language Models in CI/CD Configuration Translation
Continuous Integration (CI) is a cornerstone of modern collaborative software development, and numerous CI platforms are available. Differences in maintenance overhead, reliability, and integration depth with code-hosting platforms make migration between CI platforms a common practice. A central step in migration is translating CI configurations, which is challenging due to the intrinsic complexity of CI configurations and the need to understand semantic differences and relationships across CI platforms. With the advent of large language models (LLMs), recent advances in software engineering highlight their potential for CI configuration translation. In this paper, we present a study on LLM-based CI configuration translation, focusing on the migration from Travis CI to GitHub Actions. First, using 811 migration records, we quantify the effort involved and find that developers read an average of 38 lines of Travis configuration and write 58 lines of GitHub Actions configuration, with nearly half of the migrations requiring multiple commits. We further analyze translations produced by each of the four LLMs and identify 1,121 issues grouped into four categories: logic inconsistencies (38%), platform discrepancies (32%), environment errors (25%), and syntax errors (5%). Finally, we evaluate three enhancement strategies and show that combining guideline-based prompting with iterative refinement achieves the best performance, reaching a Build Success Rate of 75.5%-nearly a threefold improvement over GPT-4o with a basic prompt.
☆ Detecting Vulnerabilities from Issue Reports for Internet-of-Things
Timely identification of issue reports reflecting software vulnerabilities is crucial, particularly for Internet-of-Things (IoT) where analysis is slower than non-IoT systems. While Machine Learning (ML) and Large Language Models (LLMs) detect vulnerability-indicating issues in non-IoT systems, their IoT use remains unexplored. We are the first to tackle this problem by proposing two approaches: (1) combining ML and LLMs with Natural Language Processing (NLP) techniques to detect vulnerability-indicating issues of 21 Eclipse IoT projects and (2) fine-tuning a pre-trained BERT Masked Language Model (MLM) on 11,000 GitHub issues for classifying \vul. Our best performance belongs to a Support Vector Machine (SVM) trained on BERT NLP features, achieving an Area Under the receiver operator characteristic Curve (AUC) of 0.65. The fine-tuned BERT achieves 0.26 accuracy, emphasizing the importance of exposing all data during training. Our contributions set the stage for accurately detecting IoT vulnerabilities from issue reports, similar to non-IoT systems.
comment: ACCEPTED/To Appear in the Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE) 2025. https://conf.researchr.org/details/ase-2025/ase-2025-student-research-competition/5/Detecting-Vulnerabilities-from-Issue-Reports-for-Internet-of-Things
☆ Lares: LLM-driven Code Slice Semantic Search for Patch Presence Testing
In modern software ecosystems, 1-day vulnerabilities pose significant security risks due to extensive code reuse. Identifying vulnerable functions in target binaries alone is insufficient; it is also crucial to determine whether these functions have been patched. Existing methods, however, suffer from limited usability and accuracy. They often depend on the compilation process to extract features, requiring substantial manual effort and failing for certain software. Moreover, they cannot reliably differentiate between code changes caused by patches or compilation variations. To overcome these limitations, we propose Lares, a scalable and accurate method for patch presence testing. Lares introduces Code Slice Semantic Search, which directly extracts features from the patch source code and identifies semantically equivalent code slices in the pseudocode of the target binary. By eliminating the need for the compilation process, Lares improves usability, while leveraging large language models (LLMs) for code analysis and SMT solvers for logical reasoning to enhance accuracy. Experimental results show that Lares achieves superior precision, recall, and usability. Furthermore, it is the first work to evaluate patch presence testing across optimization levels, architectures, and compilers. The datasets and source code used in this article are available at https://github.com/Siyuan-Li201/Lares.
☆ A Large Scale Study of AI-based Binary Function Similarity Detection Techniques for Security Researchers and Practitioners
Binary Function Similarity Detection (BFSD) is a foundational technique in software security, underpinning a wide range of applications including vulnerability detection, malware analysis. Recent advances in AI-based BFSD tools have led to significant performance improvements. However, existing evaluations of these tools suffer from three key limitations: a lack of in-depth analysis of performance-influencing factors, an absence of realistic application analysis, and reliance on small-scale or low-quality datasets. In this paper, we present the first large-scale empirical study of AI-based BFSD tools to address these gaps. We construct two high-quality and diverse datasets: BinAtlas, comprising 12,453 binaries and over 7 million functions for capability evaluation; and BinAres, containing 12,291 binaries and 54 real-world 1-day vulnerabilities for evaluating vulnerability detection performance in practical IoT firmware settings. Using these datasets, we evaluate nine representative BFSD tools, analyze the challenges and limitations of existing BFSD tools, and investigate the consistency among BFSD tools. We also propose an actionable strategy for combining BFSD tools to enhance overall performance (an improvement of 13.4%). Our study not only advances the practical adoption of BFSD tools but also provides valuable resources and insights to guide future research in scalable and automated binary similarity detection.
comment: Accepted by ASE 2025
☆ An Empirical Study of LLM-Based Code Clone Detection
Large language models (LLMs) have demonstrated remarkable capabilities in various software engineering tasks, such as code generation and debugging, because of their ability to translate between programming languages and natural languages. Existing studies have demonstrated the effectiveness of LLMs in code clone detection. However, two crucial issues remain unaddressed: the ability of LLMs to achieve comparable performance across different datasets and the consistency of LLMs' responses in code clone detection. To address these issues, we constructed seven code clone datasets and then evaluated five LLMs in four existing prompts with these datasets. The datasets were created by sampling code pairs using their Levenshtein ratio from two different code collections, CodeNet and BigCloneBench. Our evaluation revealed that although LLMs perform well in CodeNet-related datasets, with o3-mini achieving a 0.943 F1 score, their performance significantly decreased in BigCloneBench-related datasets. Most models achieved a high response consistency, with over 90\% of judgments remaining consistent across all five submissions. The fluctuations of the F1 score affected by inconsistency are also tiny; their variations are less than 0.03.
☆ MicroRemed: Benchmarking LLMs in Microservices Remediation
Large Language Models (LLMs) integrated with agent-based reasoning frameworks have recently shown strong potential for autonomous decision-making and system-level operations. One promising yet underexplored direction is microservice remediation, where the goal is to automatically recover faulty microservice systems. Existing approaches, however, still rely on human-crafted prompts from Site Reliability Engineers (SREs), with LLMs merely converting textual instructions into executable code. To advance research in this area, we introduce MicroRemed, the first benchmark for evaluating LLMs in end-to-end microservice remediation, where models must directly generate executable Ansible playbooks from diagnosis reports to restore system functionality. We further propose ThinkRemed, a multi-agent framework that emulates the reflective and perceptive reasoning of SREs. Experimental results show that MicroRemed presents substantial challenges to current LLMs, while ThinkRemed improves end-to-end remediation performance through iterative reasoning and system reflection. The benchmark is available at https://github.com/LLM4AIOps/MicroRemed.
comment: 24 pages, 13 figures, 5 tables
♻ ☆ Benchmarking LLMs in Web API Integration Tasks
API integration is a cornerstone of our digital infrastructure, enabling software systems to connect and interact. However, as shown by many studies, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. In order to address this, we present WAPIIBench, a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models was able to solve more than 40% of the tasks.
comment: To be published in Proceedings of 2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware), Data & Benchmark Track; switched to IEEE conference template
♻ ☆ PCART: Automated Repair of Python API Parameter Compatibility Issues
In modern software development, Python third-party libraries play a critical role, especially in fields like deep learning and scientific computing. However, API parameters in these libraries often change during evolution, leading to compatibility issues for client applications reliant on specific versions. Python's flexible parameter-passing mechanism further complicates this, as different passing methods can result in different API compatibility. Currently, no tool can automatically detect and repair Python API parameter compatibility issues. To fill this gap, we introduce PCART, the first solution to fully automate the process of API extraction, code instrumentation, API mapping establishment, compatibility assessment, repair, and validation. PCART handles various types of Python API parameter compatibility issues, including parameter addition, removal, renaming, reordering, and the conversion of positional to keyword parameters. To evaluate PCART, we construct PCBENCH, a large-scale benchmark comprising 47,478 test cases mutated from 844 parameter-changed APIs across 33 popular Python libraries. Evaluation results demonstrate that PCART is both effective and efficient, significantly outperforming existing tools (MLCatchUp and Relancer) and the large language model ChatGPT (GPT-4o), achieving an F1-score of 96.51% in detecting API parameter compatibility issues and a repair precision of 91.97%. Further evaluation on 30 real-world Python projects from GitHub confirms PCART's practicality. We believe PCART can significantly reduce the time programmers spend maintaining Python API updates and advance the automation of Python API compatibility issue repair.
comment: Submitted to IEEE Transactions on Software Engineering
♻ ☆ Generative AI and Empirical Software Engineering: A Paradigm Shift
The adoption of large language models (LLMs) and autonomous agents in software engineering marks an enduring paradigm shift. These systems create new opportunities for tool design, workflow orchestration, and empirical observation, while fundamentally reshaping the roles of developers and the artifacts they produce. Although traditional empirical methods remain central to software engineering research, the rapid evolution of AI introduces new data modalities, alters causal assumptions, and challenges foundational constructs such as "developer", "artifact", and "interaction". As humans and AI agents increasingly co-create, the boundaries between social and technical actors blur, and the reproducibility of findings becomes contingent on model updates and prompt contexts. This vision paper examines how the integration of LLMs into software engineering disrupts established research paradigms. We discuss how it transforms the phenomena we study, the methods and theories we rely on, the data we analyze, and the threats to validity that arise in dynamic AI-mediated environments. Our aim is to help the empirical software engineering community adapt its questions, instruments, and validation standards to a future in which AI systems are not merely tools, but active collaborators shaping software engineering and its study.
comment: Published at 2nd IEEE/ACM International Conference on AI-powered Software (AIware 2025)
♻ ☆ RepoMark: A Data-Usage Auditing Framework for Code Large Language Models
The rapid development of Large Language Models (LLMs) for code generation has transformed software development by automating coding tasks with unprecedented efficiency. However, the training of these models on open-source code repositories (e.g., from GitHub) raises critical ethical and legal concerns, particularly regarding data authorization and open-source license compliance. Developers are increasingly questioning whether model trainers have obtained proper authorization before using repositories for training, especially given the lack of transparency in data collection. To address these concerns, we propose a novel data marking framework RepoMark to audit the data usage of code LLMs. Our method enables auditors to verify whether their code has been used in training, while ensuring semantic preservation, imperceptibility, and theoretical false detection rate (FDR) guarantees. By generating multiple semantically equivalent code variants, RepoMark introduces data marks into the code files, and during detection, RepoMark leverages a novel ranking-based hypothesis test to detect model behavior difference on trained data. Compared to prior data auditing approaches, RepoMark significantly enhances data efficiency, allowing effective auditing even when the user's repository possesses only a small number of code files. Experiments demonstrate that RepoMark achieves a detection success rate over 90\% on small code repositories under a strict FDR guarantee of 5\%. This represents a significant advancement over existing data marking techniques, all of which only achieve accuracy below 55\% under identical settings. This further validates RepoMark as a robust, theoretically sound, and promising solution for enhancing transparency in code LLM training, which can safeguard the rights of code authors.
♻ ☆ On the Role of Search Budgets in Model-Based Software Refactoring Optimization
Software model optimization is a process that automatically generates design alternatives aimed at improving quantifiable non-functional properties of software systems, such as performance and reliability. Multi-objective evolutionary algorithms effectively help designers identify trade-offs among the desired non-functional properties. To reduce the use of computational resources, this work examines the impact of implementing a search budget to limit the search for design alternatives. In particular, we analyze how time budgets affect the quality of Pareto fronts by utilizing quality indicators and exploring the structural features of the generated design alternatives. This study identifies distinct behavioral differences among evolutionary algorithms when a search budget is implemented. It further reveals that design alternatives generated under a budget are structurally different from those produced without one. Additionally, we offer recommendations for designers on selecting algorithms in relation to time constraints, thereby facilitating the effective application of automated refactoring to improve non-functional properties.
comment: arXiv admin note: text overlap with arXiv:2212.08385
Systems and Control 28
☆ Model Predictive Control with Multiple Constraint Horizons
In this work we propose a Model Predictive Control (MPC) formulation that splits constraints in two different types. Motivated by safety considerations, the first type of constraint enforces a control-invariant set, while the second type could represent a less restrictive constraint on the system state. This distinction enables closed-loop sub- optimality results for nonlinear MPC with heterogeneous state constraints (distinct constraints across open loop predicted states), and no terminal elements. Removing the non-invariant constraint recovers the partially constrained case. Beyond its theoretical interest, heterogeneous constrained MPC shows how constraint choices shape the system's closed loop. In the partially constrained case, adjusting the constraint horizon (how many predicted- state constraints are enforced) trades estimation accuracy for computational cost. Our analysis yields first, a sub- optimality upper-bound accounting for distinct constraint sets, their horizons and decay rates, that is tighter for short horizons than prior work. Second, to our knowledge, we give the first lower bound (beyond open-loop cost) on closed-loop sub-optimality. Together these bounds provide a powerful analysis framework, allowing designers to evaluate the effect of horizons in MPC sub-optimality. We demonstrate our results via simulations on nonlinear and linear safety-critical systems.
comment: Submitted to Transactions on Automatic Control
☆ Hopfield Neural Networks for Online Constrained Parameter Estimation with Time-Varying Dynamics and Disturbances
This paper proposes two projector-based Hopfield neural network (HNN) estimators for online, constrained parameter estimation under time-varying data, additive disturbances, and slowly drifting physical parameters. The first is a constraint-aware HNN that enforces linear equalities and inequalities (via slack neurons) and continuously tracks the constrained least-squares target. The second augments the state with compensation neurons and a concatenated regressor to absorb bias-like disturbance components within the same energy function. For both estimators we establish global uniform ultimate boundedness with explicit convergence rate and ultimate bound, and we derive practical tuning rules that link the three design gains to closed-loop bandwidth and steady-state accuracy. We also introduce an online identifiability monitor that adapts the constraint weight and time step, and, when needed, projects updates onto identifiable subspaces to prevent drift in poorly excited directions...
comment: Submitted to International Journal od Adaptive Control and Signal Processing
☆ Second-Order Policy Gradient Methods for the Linear Quadratic Regulator
Policy gradient methods are a powerful family of reinforcement learning algorithms for continuous control that optimize a policy directly. However, standard first-order methods often converge slowly. Second-order methods can accelerate learning by using curvature information, but they are typically expensive to compute. The linear quadratic regulator (LQR) is a practical setting in which key quantities, such as the policy gradient, admit closed-form expressions. In this work, we develop second-order policy gradient algorithms for LQR by deriving explicit formulas for both the approximate and exact Hessians used in Gauss--Newton and Newton methods, respectively. Numerical experiments show a faster convergence rate for the proposed second-order approach over the standard first-order policy gradient baseline.
comment: 8 pages, 2 figs
☆ ABIDES-MARL: A Multi-Agent Reinforcement Learning Environment for Endogenous Price Formation and Execution in a Limit Order Book
We present ABIDES-MARL, a framework that combines a new multi-agent reinforcement learning (MARL) methodology with a new realistic limit-order-book (LOB) simulation system to study equilibrium behavior in complex financial market games. The system extends ABIDES-Gym by decoupling state collection from kernel interruption, enabling synchronized learning and decision-making for multiple adaptive agents while maintaining compatibility with standard RL libraries. It preserves key market features such as price-time priority and discrete tick sizes. Methodologically, we use MARL to approximate equilibrium-like behavior in multi-period trading games with a finite number of heterogeneous agents-an informed trader, a liquidity trader, noise traders, and competing market makers-all with individual price impacts. This setting bridges optimal execution and market microstructure by embedding the liquidity trader's optimization problem within a strategic trading environment. We validate the approach by solving an extended Kyle model within the simulation system, recovering the gradual price discovery phenomenon. We then extend the analysis to a liquidity trader's problem where market liquidity arises endogenously and show that, at equilibrium, execution strategies shape market-maker behavior and price dynamics. ABIDES-MARL provides a reproducible foundation for analyzing equilibrium and strategic adaptation in realistic markets and contributes toward building economically interpretable agentic AI systems for finance.
☆ MOBIUS: A Multi-Modal Bipedal Robot that can Walk, Crawl, Climb, and Roll
This article presents a Multi-Modal Bipedal Intelligent Urban Scout robot (MOBIUS) capable of walking, crawling, climbing, and rolling. MOBIUS features four limbs--two 6-DoF arms with two-finger grippers for manipulation and climbing, and two 4-DoF legs for locomotion--enabling smooth transitions across diverse terrains without reconfiguration. A hybrid control architecture combines reinforcement learning-based locomotion with model-based predictive and admittance control enhanced for safety by a Reference Governor toward compliant contact interactions. A high-level MIQCP planner autonomously selects locomotion modes to balance stability and energy efficiency. Hardware experiments demonstrate robust gait transitions, dynamic climbing, and full-body load support via pinch grasp. Overall, MOBIUS demonstrates the importance of tight integration between morphology, high-level planning, and control to enable mobile loco-manipulation and grasping, substantially expanding its interaction capabilities, workspace, and traversability.
comment: 23 pages, 20 figures. Collaborative work between the Robotics and Mechanisms Laboratory (RoMeLa) and Mitsubishi Electric Research Laboratories (MERL)
☆ On polynomial explicit partial estimator design for nonlinear systems with parametric uncertainties
This paper investigates the idea of designing data-driven partial estimators for nonlinear systems showing parametric uncertainties using sparse multivariate polynomial relationships. A general framework is first presented and then validated on two illustrative examples with comparison to different possible Machine/Deep-Learning based alternatives. The results suggests the superiority of the proposed sparse identification scheme, at least when the learning data is small.
comment: Submitted to ACC2026
☆ Deep Learning Prediction of Beam Coherence Time for Near-FieldTeraHertz Networks
Large multiple antenna arrays coupled with accurate beamforming are essential in terahertz (THz) communications to ensure link reliability. However, as the number of antennas increases, beam alignment (focusing) and beam tracking in mobile networks incur prohibitive overhead. Additionally, the near-field region expands both with the size of antenna arrays and the carrier frequency, calling for adjustments in the beamforming to account for spherical wavefront instead of the conventional planar wave assumption. In this letter, we introduce a novel beam coherence time for mobile THz networks, to drastically reduce the rate of beam updates. Then, we propose a deep learning model, relying on a simple feedforward neural network with a time-dependent input, to predict the beam coherence time and adjust the beamforming on the fly with minimal overhead. Our numerical results demonstrate the effectiveness of the proposed approach by enabling higher data rates while reducing the overhead, especially at high (i.e., vehicular) mobility.
comment: IEEE Wireless Communication Letters (accepted October 2025)
☆ Data-driven stabilization of nonlinear systems via descriptor embedding
We introduce the notion of descriptor embedding for nonlinear systems and use it for the data-driven design of stabilizing controllers. Specifically, we provide sufficient data-dependent LMI conditions which, if feasible, return a stabilizing nonlinear controller of the form $u=K(x)Z(x)$ where $K(x)$ belongs to a polytope and $Z$ is a user-defined function. The proposed method is then extended to account for the presence of uncertainties and noisy data. Furthermore, a method to estimate the resulting region of attraction is given using only data. Simulation examples are used to illustrate the results and compare them to existing methods from the literature.
comment: 16 pages, 5 figures, submitted to IEEE Transactions on Automatic Control
☆ Evolutionary Dynamics in Continuous-time Finite-state Mean Field Games -- Part I: Equilibria
We study a dynamic game with a large population of players who choose actions from a finite set in continuous time. Each player has a state in a finite state space that evolves stochastically with their actions. A player's reward depends not only on their own state and action but also on the distribution of states and actions across the population, capturing effects such as congestion in traffic networks. While prior work in evolutionary game theory has primarily focused on static games without individual player state dynamics, we present the first comprehensive evolutionary analysis of such dynamic games. We propose an evolutionary model together with a mean field approximation of the finite-population game and establish strong approximation guarantees. We show that standard solution concepts for dynamic games lack an evolutionary interpretation, and we propose a new concept - the Mixed Stationary Nash Equilibrium (MSNE) - which admits one. We analyze the relationship between MSNE and the rest points of the mean field evolutionary model and study the evolutionary stability of MSNE.
☆ Risk Aware Safe Control with Cooperative Sensing for Dynamic Obstacle Avoidance
This paper presents the design, development, and on vehicle implementation and validation of a safety critical controller for autonomous driving under sensing and communication uncertainty. Cooperative sensing, fused via a Wasserstein barycenter (WB), is used to optimize the distribution of the dynamic obstacle locations. The Conditional Value at Risk (CVaR) is introduced to form a risk aware control-barrier-function (CBF) framework with the optimized distribution samplings. The proposed WB CVaR CBF safety filter improves control inputs that minimize tail risk while certifying forward invariance of the safe set. A model predictive controller (MPC) performs path tracking, and the safety filter modulates the nominal control inputs to enforce risk aware constraints. We detail the software architecture and integration with vehicle actuation and cooperative sensing. The approach is evaluated on a full-scale autonomous vehicle (AV) in scenarios with measurement noise, communication perturbations, and input disturbances, and is compared against a baseline MPC CBF design. Results demonstrate improved safety margins and robustness, highlighting the practicality of deploying the risk-aware safety filter on an actual AV.
☆ Orthogonal-by-construction augmentation of physics-based input-output models
Model augmentation is a promising approach for integrating first-principles-based models with machine learning components. Augmentation can result in better model accuracy and faster convergence compared to black-box system identification methods, while maintaining interpretability of the models in terms of how the original dynamics are complemented by learning. A widely used augmentation structure in the literature is based on the parallel connection of the physics-based and learning components, for both of which the corresponding parameters are jointly optimized. However, due to overlap in representation of the system dynamics by such an additive structure, estimation often leads to physically unrealistic parameters, compromising model interpretability. To overcome this limitation, this paper introduces a novel orthogonal-by-construction model augmentation structure for input-output models, that guarantees recovery of the physically true parameters under appropriate identifiability conditions.
comment: Submitted for publication
☆ A High-Speed Capable Spherical Robot
This paper designs a new spherical robot structure capable of supporting high-speed motion at up to 10 m/s. Building upon a single-pendulum-driven spherical robot, the design incorporates a momentum wheel with an axis aligned with the secondary pendulum, creating a novel spherical robot structure. Practical experiments with the physical prototype have demonstrated that this new spherical robot can achieve stable high-speed motion through simple decoupled control, which was unattainable with the original structure. The spherical robot designed for high-speed motion not only increases speed but also significantly enhances obstacle-crossing performance and terrain robustness.
comment: 5 pages
☆ Koopman-based Prediction of Connectivity for Flying Ad Hoc Networks
The application of machine learning (ML) to communication systems is expected to play a pivotal role in future artificial intelligence (AI)-based next-generation wireless networks. While most existing works focus on ML techniques for static wireless environments, they often face limitations when applied to highly dynamic environments, such as flying ad hoc networks (FANETs). This paper explores the use of data-driven Koopman approaches to address these challenges. Specifically, we investigate how these approaches can model UAV trajectory dynamics within FANETs, enabling more accurate predictions and improved network performance. By leveraging Koopman operator theory, we propose two possible approaches -- centralized and distributed -- to efficiently address the challenges posed by the constantly changing topology of FANETs. To demonstrate this, we consider a FANET performing surveillance with UAVs following pre-determined trajectories and predict signal-to-interference-plus-noise ratios (SINRs) to ensure reliable communication between UAVs. Our results show that these approaches can accurately predict connectivity and isolation events that lead to modelled communication outages. This capability could help UAVs schedule their transmissions based on these predictions.
☆ Experimental Demonstration of Software-Orchestrated Quantum Network Applications over a Campus-Scale Testbed
To fulfill their promise, quantum networks must transform from isolated testbeds into scalable infrastructures for distributed quantum applications. In this paper, we present a prototype orchestrator for the Argonne Quantum Network (ArQNet) testbed that leverages design principles of software-defined networking (SDN) to automate typical quantum communication experiments across buildings in the Argonne campus connected over deployed, telecom fiber. Our implementation validates a scalable architecture supporting service-level abstraction of quantum networking tasks, distributed time synchronization, and entanglement verification across remote nodes. We present a prototype service of continuous, stable entanglement distribution between remote sites that ran for 12 hours, which defines a promising path towards scalable quantum networks.
comment: 11 pages, 8 figures, journal
☆ Deep Learning-Accelerated Shapley Value for Fair Allocation in Power Systems: The Case of Carbon Emission Responsibility
Allocating costs, benefits, and emissions fairly among power system participant entities represents a persistent challenge. The Shapley value provides an axiomatically fair solution, yet computational barriers have limited its adoption beyond small-scale applications. This paper presents SurroShap, a scalable Shapley value approximation framework combining efficient coalition sampling with deep learning surrogate models that accelerate characteristic function evaluations. Exemplified through carbon emission responsibility allocation in power networks, SurroShap enables Shapley-based fair allocation for power systems with thousands of entities for the first time. We derive theoretical error bounds proving that time-averaged SurroShap allocations converge to be $\varepsilon$-close to exact Shapley values. Experiments on nine systems ranging from 26 to 1,951 entities demonstrate completion within the real-time operational window even at maximum scale, achieving 10^4-10^5 speedups over other sampling-based methods while maintaining tight error bounds. The resulting Shapley-based carbon allocations possess six desirable properties aligning individual interests with decarbonization goals. Year-long simulations on the Texas 2000-bus system validate real-world applicability, with regional analysis revealing how renewable-rich areas offset emission responsibility through exports while load centers bear responsibility for driving system-wide generation.
♻ ☆ Neural Networks for AC Optimal Power Flow: Improving Worst-Case Guarantees during Training SC
The AC Optimal Power Flow (AC-OPF) problem is central to power system operation but challenging to solve efficiently due to its nonconvex and nonlinear nature. Neural networks (NNs) offer fast surrogates, yet their black-box behavior raises concerns about constraint violations that can compromise safety. We propose a verification-informed NN framework that incorporates worst-case constraint violations directly into training, producing models that are both accurate and provably safer. Through post-hoc verification, we achieve substantial reductions in worst-case violations and, for the first time, verify all operational constraints of large-scale AC-OPF proxies. Practical feasibility is further enhanced via restoration and warm-start strategies for infeasible operating points. Experiments on systems ranging from 57 to 793 buses demonstrate scalability, speed, and reliability, bridging the gap between ML acceleration and safe, real-time deployment of AC-OPF solutions - and paving the way toward data-driven optimal control.
comment: Submitted to PSCC 2026 (under review)
♻ ☆ Limits of Safe AI Deployment: Differentiating Oversight and Control
Oversight and control, which we collectively call supervision, are often discussed as ways to ensure that AI systems are accountable, reliable, and able to fulfill governance and management requirements. However, the requirements for "human oversight" risk codifying vague or inconsistent interpretations of key concepts like oversight and control. This ambiguous terminology could undermine efforts to design or evaluate systems that must operate under meaningful human supervision. This matters because the term is used by regulatory texts such as the EU AI Act. This paper undertakes a targeted critical review of literature on supervision outside of AI, along with a brief summary of past work on the topic related to AI. We next differentiate control as ex-ante or real-time and operational rather than policy or governance, and oversight as performed ex-post, or a policy and governance function. Control aims to prevent failures, while oversight focuses on detection, remediation, or incentives for future prevention. Building on this, we make three contributions. 1) We propose a framework to align regulatory expectations with what is technically and organizationally plausible, articulating the conditions under which each mechanism is possible, where they fall short, and what is required to make them meaningful in practice. 2) We outline how supervision methods should be documented and integrated into risk management, and drawing on the Microsoft Responsible AI Maturity Model, we outline a maturity model for AI supervision. 3) We explicitly highlight boundaries of these mechanisms, including where they apply, where they fail, and where it is clear that no existing methods suffice. This foregrounds the question of whether meaningful supervision is possible in a given deployment context, and can support regulators, auditors, and practitioners in identifying both present and future limitations.
comment: Revised to improve table formatting and update draft
♻ ☆ Dual-Regularized Riccati Recursions for Interior-Point Optimal Control
We derive closed-form extensions of Riccati's recursions (both sequential and parallel) for solving dual-regularized LQR problems. We show how these methods can be used to solve general constrained, non-convex, discrete-time optimal control problems via a regularized interior point method, while guaranteeing that each primal step is a descent direction of an Augmented Barrier-Lagrangian merit function. We provide MIT-licensed implementations of our methods in C++ and JAX.
♻ ☆ Spherical Point Process with Random Heights: New Approach for Modeling and Analysis of Downlink Satellite Networks
The Low Earth Orbit (LEO) satellite industry is undergoing rapid expansion, with operators competitively launching satellites due to the first-come, first-served principle governing orbital rights. This has led to the formation of increasingly large-scale, volumetric constellation where satellites operate across a diverse range of altitudes. To address the need for analyzing such complex networks, this paper establishes a new analytical framework for LEO constellations by leveraging a 3D Poisson point process (PPP). Specifically, we introduce a random height model (RHM) that can capture various altitude distributions by applying a random radial displacement to points generated by a homogeneous PPP on a nominal shell. Building on this, we derive an analytical expression for the downlink coverage probability. To motivate our model, we show that the altitude distributions of several leading satellite constellations, including Starlink, align with our model's assumptions. We then demonstrate through Monte Carlo simulations that the coverage probability of our RHM closely matches that of these real-world networks. Finally, we confirm the accuracy of our analytical expressions by showing their agreement with simulation results. Our work thereby provides a powerful tool for understanding and predict how the statistical distribution of satellite altitudes impacts network performance.
comment: submitted to IEEE journal
♻ ☆ Aggregative games with bilevel structures: Distributed algorithms and convergence analysis
In this paper, the problem of distributively seeking the equilibria of aggregative games with bilevel structures is studied. Different from the traditional aggregative games, here the aggregation is determined by the minimizer of a virtual leader's objective function in the inner level, which depends on the actions of the players in the outer level. Moreover, the global objective function of the virtual leader is formed by the sum of some local functions with two arguments, each of which is strongly convex with respect to the second argument. When making decisions, each player in the outer level only has access to a local part of the virtual leader's objective function. To handle this problem, first, we propose a second order gradient-based distributed algorithm, where the Hessian matrices associated with the objective functions of the leader are involved. By the algorithm, players update their actions while cooperatively minimizing the objective function of the virtual leader to estimate the aggregation by communicating with their neighbors via a connected graph. Under mild assumptions on the graph and cost functions, we prove that the actions of players asymptotically converge to the Nash equilibrium point. Then, for the case where the Hessian matrices associated with the objective functions of the virtual leader are not available, we propose a first order gradient-based distributed algorithm, where a distributed two-point estimate strategy is developed to estimate the gradients of players' cost functions in the outer level. Under the same conditions, we prove that the convergence errors of players' actions to the Nash equilibrium point are linear with respect to the estimate parameters. Finally, simulations are provided to demonstrate the effectiveness of our theoretical results.
♻ ☆ Simultaneous System Identification and Model Predictive Control with No Dynamic Regret
We provide an algorithm for the simultaneous system identification and model predictive control of nonlinear systems. The algorithm has finite-time near-optimality guarantees and asymptotically converges to the optimal (non-causal) controller. Particularly, the algorithm enjoys sublinear dynamic regret, defined herein as the suboptimality against an optimal clairvoyant controller that knows how the unknown disturbances and system dynamics will adapt to its actions. The algorithm is self-supervised and applies to control-affine systems with unknown dynamics and disturbances that can be expressed in reproducing kernel Hilbert spaces. Such spaces can model external disturbances and modeling errors that can even be adaptive to the system's state and control input. For example, they can model wind and wave disturbances to aerial and marine vehicles, or inaccurate model parameters such as inertia of mechanical systems. The algorithm first generates random Fourier features that are used to approximate the unknown dynamics or disturbances. Then, it employs model predictive control based on the current learned model of the unknown dynamics (or disturbances). The model of the unknown dynamics is updated online using least squares based on the data collected while controlling the system. We validate our algorithm in both hardware experiments and physics-based simulations. The simulations include (i) a cart-pole aiming to maintain the pole upright despite inaccurate model parameters, and (ii) a quadrotor aiming to track reference trajectories despite unmodeled aerodynamic drag effects. The hardware experiments include a quadrotor aiming to track a circular trajectory despite unmodeled aerodynamic drag effects, ground effects, and wind disturbances.
comment: IEEE Transactions on Robotics (T-RO). v6 update on stability analysis in Appendix J under relaxed Assumption 1
♻ ☆ Adaptive Multirobot Virtual Structure Control using Dual Quaternions
This paper presents a control strategy based on dual quaternions for the coordinated formation flying of small UAV groups. A virtual structure is employed to define the desired formation, enabling unified control of its position, orientation, and shape. This abstraction makes formation management easier by allowing a low-level controller to compute individual UAV commands efficiently. The proposed controller integrates a pose control module with a geometry-based adaptive strategy, ensuring precise and robust task execution. The effectiveness of the approach is demonstrated through both simulation and experimental results.
♻ ☆ Action Chunking and Exploratory Data Collection Yield Exponential Improvements in Behavior Cloning for Continuous Control
This paper presents a theoretical analysis of two of the most impactful interventions in modern learning from demonstration in robotics and continuous control: the practice of action-chunking (predicting sequences of actions in open-loop) and exploratory augmentation of expert demonstrations. Though recent results show that learning from demonstration, also known as imitation learning (IL), can suffer errors that compound exponentially with task horizon in continuous settings, we demonstrate that action chunking and exploratory data collection circumvent exponential compounding errors in different regimes. Our results identify control-theoretic stability as the key mechanism underlying the benefits of these interventions. On the empirical side, we validate our predictions and the role of control-theoretic stability through experimentation on popular robot learning benchmarks. On the theoretical side, we demonstrate that the control-theoretic lens provides fine-grained insights into how compounding error arises, leading to tighter statistical guarantees on imitation learning error when these interventions are applied than previous techniques based on information-theoretic considerations alone.
comment: Updated manuscript. Added new experiments, figures, and exposition
♻ ☆ Long-Term Mapping of the Douro River Plume with Multi-Agent Reinforcement Learning
We study the problem of long-term (multiple days) mapping of a river plume using multiple autonomous underwater vehicles (AUVs), focusing on the Douro river representative use-case. We propose an energy - and communication - efficient multi-agent reinforcement learning approach in which a central coordinator intermittently communicates with the AUVs, collecting measurements and issuing commands. Our approach integrates spatiotemporal Gaussian process regression (GPR) with a multi-head Q-network controller that regulates direction and speed for each AUV. Simulations using the Delft3D ocean model demonstrate that our method consistently outperforms both single- and multi-agent benchmarks, with scaling the number of agents both improving mean squared error (MSE) and operational endurance. In some instances, our algorithm demonstrates that doubling the number of AUVs can more than double endurance while maintaining or improving accuracy, underscoring the benefits of multi-agent coordination. Our learned policies generalize across unseen seasonal regimes over different months and years, demonstrating promise for future developments of data-driven long-term monitoring of dynamic plume environments.
♻ ☆ Detection Augmented Bandit Procedures for Piecewise Stationary MABs: A Modular Approach
Conventional Multi-Armed Bandit (MAB) algorithms are designed for stationary environments, where the reward distributions associated with the arms do not change with time. In many applications, however, the environment is more accurately modeled as being non-stationary. In this work, piecewise stationary MAB (PS-MAB) environments are investigated, in which the reward distributions associated with a subset of the arms change at some change-points and remain stationary between change-points. Our focus is on the asymptotic analysis of PS-MABs, for which practical algorithms based on change detection have been previously proposed. Our goal is to modularize the design and analysis of such Detection Augmented Bandit (DAB) procedures. To this end, we first provide novel, improved performance lower bounds for PS-MABs. Then, we identify the requirements for stationary bandit algorithms and change detectors in a DAB procedure that are needed for the modularization. We assume that the rewards are sub-Gaussian. Under this assumption and a condition on the separation of the change-points, we show that the analysis of DAB procedures can indeed be modularized, so that the regret bounds can be obtained in a unified manner for various combinations of change detectors and bandit algorithms. Through this analysis, we develop new modular DAB procedures that are order-optimal. Finally, we showcase the practical effectiveness of our modular DAB approach in our experiments, studying its regret performance compared to other methods and investigating its detection capabilities.
comment: 30 pages, 4 figures, 1 table, submitted to TIT
♻ ☆ The Difference between the Left and Right Invariant Extended Kalman Filter
The extended Kalman filter (EKF) has been the industry standard for state estimation problems over the past sixty years. The Invariant Extended Kalman Filter (IEKF) is a recent development of the EKF for the class of group-affine systems on Lie groups that has shown superior performance for inertial navigation problems. The IEKF comes in two versions, left- and right- handed respectively, and there is a perception in the robotics community that these filters are different and one should choose the handedness of the IEKF to match handedness of the measurement model for a given filtering problem. In this paper, we revisit these algorithms and demonstrate that the left- and right- IEKF algorithms (with reset step) are identical, that is, the choice of the handedness does not affect the IEKF's performance when the reset step is properly implemented. The reset step was not originally proposed as part of the IEKF, however, we provide simulations to show that the reset step improves asymptotic performance of all versions of the the filter, and should be included in all high performance algorithms. The GNSS-aided inertial navigation system (INS) is used as a motivating example to demonstrate the equivalence of the two filters.
comment: 20 pages, 4 figures, submitted to Control Engineering Practice
♻ ☆ Expertise and confidence explain how social influence evolves along intellective tasks
Discovering the antecedents of individuals' influence in collaborative environments is an important, practical, and challenging problem. In this paper, we study interpersonal influence in small groups of individuals who collectively execute a sequence of intellective tasks. We observe that along an issue sequence with feedback, individuals with higher expertise and social confidence are accorded higher interpersonal influence. We also observe that low-performing individuals tend to underestimate their high-performing teammate's expertise. Based on these observations, we introduce three hypotheses and present empirical and theoretical support for their validity. We report empirical evidence on longstanding theories of transactive memory systems, social comparison, and confidence heuristics on the origins of social influence. We propose a cognitive dynamical model inspired by these theories to describe the process by which individuals adjust interpersonal influences over time. We demonstrate the model's accuracy in predicting individuals' influence and provide analytical results on its asymptotic behavior for the case with identically performing individuals. Lastly, we propose a novel approach using deep neural networks on a pre-trained text embedding model for predicting the influence of individuals. Using message contents, message times, and individual correctness collected during tasks, we are able to accurately predict individuals' self-reported influence over time. Extensive experiments verify the accuracy of the proposed models compared to baselines such as structural balance and reflected appraisal model. While the neural networks model is the most accurate, the dynamical model is the most interpretable for influence prediction.
♻ ☆ Joint Scheduling of DER under Demand Charges: Structure and Approximation
We study the joint scheduling of behind-the-meter distributed energy resources (DERs), including flexible loads, renewable generation, and battery energy storage systems, under net energy metering tariffs with demand charges. The problem is formulated as a stochastic dynamic program aimed at maximizing expected operational surplus while accounting for renewable generation uncertainty. We analytically characterize the optimal control policy and show that it admits a threshold-based structure. However, due to the strong temporal coupling of the storage and demand charge constraints, the number of conditional branches in the policy scales combinatorially with the scheduling horizon, as it requires a look-ahead over future states. To overcome the high computational complexity in the general formulation, an efficient approximation algorithm is proposed, which searches for the peak demand under a mildly relaxed problem. We show that the algorithm scales linearly with the scheduling horizon. Extensive simulations using two open-source datasets validate the proposed algorithm and compare its performance against different DER control strategies, including a reinforcement learning-based one. Under varying storage and tariff parameters, the results show that the proposed algorithm outperforms various benchmarks in achieving a relatively small solution gap compared to a theoretical upper bound.
comment: 15 pages, 6 tables, 8 figures
Graphics 7
☆ Example-Based Feature Painting on Textures
In this work, we propose a system that covers the complete workflow for achieving controlled authoring and editing of textures that present distinctive local characteristics. These include various effects that change the surface appearance of materials, such as stains, tears, holes, abrasions, discoloration, and more. Such alterations are ubiquitous in nature, and including them in the synthesis process is crucial for generating realistic textures. We introduce a novel approach for creating textures with such blemishes, adopting a learning-based approach that leverages unlabeled examples. Our approach does not require manual annotations by the user; instead, it detects the appearance-altering features through unsupervised anomaly detection. The various textural features are then automatically clustered into semantically coherent groups, which are used to guide the conditional generation of images. Our pipeline as a whole goes from a small image collection to a versatile generative model that enables the user to interactively create and paint features on textures of arbitrary size. Notably, the algorithms we introduce for diffusion-based editing and infinite stationary texture generation are generic and should prove useful in other contexts as well. Project page: https://reality.tf.fau.de/pub/ardelean2025examplebased.html
comment: "\c{opyright} 2025 Andrei-Timotei Ardelean, Tim Weyrich. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Trans. Graph., Vol. 44, No. 6, https://doi.org/10.1145/3763301
☆ HMVLM: Human Motion-Vision-Lanuage Model via MoE LoRA
The expansion of instruction-tuning data has enabled foundation language models to exhibit improved instruction adherence and superior performance across diverse downstream tasks. Semantically-rich 3D human motion is being progressively integrated with these foundation models to enhance multimodal understanding and cross-modal generation capabilities. However, the modality gap between human motion and text raises unresolved concerns about catastrophic forgetting during this integration. In addition, developing autoregressive-compatible pose representations that preserve generalizability across heterogeneous downstream tasks remains a critical technical barrier. To address these issues, we propose the Human Motion-Vision-Language Model (HMVLM), a unified framework based on the Mixture of Expert Low-Rank Adaption(MoE LoRA) strategy. The framework leverages the gating network to dynamically allocate LoRA expert weights based on the input prompt, enabling synchronized fine-tuning of multiple tasks. To mitigate catastrophic forgetting during instruction-tuning, we introduce a novel zero expert that preserves the pre-trained parameters for general linguistic tasks. For pose representation, we implement body-part-specific tokenization by partitioning the human body into different joint groups, enhancing the spatial resolution of the representation. Experiments show that our method effectively alleviates knowledge forgetting during instruction-tuning and achieves remarkable performance across diverse human motion downstream tasks.
comment: 10 pages, 5figures. The Thirty-Ninth Annual Conference on Neural Information Processing Systems
☆ An Adjoint Method for Differentiable Fluid Simulation on Flow Maps
This paper presents a novel adjoint solver for differentiable fluid simulation based on bidirectional flow maps. Our key observation is that the forward fluid solver and its corresponding backward, adjoint solver share the same flow map as the forward simulation. In the forward pass, this map transports fluid impulse variables from the initial frame to the current frame to simulate vortical dynamics. In the backward pass, the same map propagates adjoint variables from the current frame back to the initial frame to compute gradients. This shared long-range map allows the accuracy of gradient computation to benefit directly from improvements in flow map construction. Building on this insight, we introduce a novel adjoint solver that solves the adjoint equations directly on the flow map, enabling long-range and accurate differentiation of incompressible flows without differentiating intermediate numerical steps or storing intermediate variables, as required in conventional adjoint methods. To further improve efficiency, we propose a long-short time-sparse flow map representation for evolving adjoint variables. Our approach has low memory usage, requiring only 6.53GB of data at a resolution of $192^3$ while preserving high accuracy in tracking vorticity, enabling new differentiable simulation tasks that require precise identification, prediction, and control of vortex dynamics.
comment: 15 pages, 16 figures
☆ Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark
We review human evaluation practices in automated, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models -- each trained by its original authors -- across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results provide strong evidence that 1) newer models do not consistently outperform earlier approaches; 2) published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. Finally, in order to drive standardisation and enable new evaluation research, we will release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies -- enabling new evaluations without model reimplementation required -- alongside our open-source rendering script, and the 16,000 pairwise human preference votes collected for our benchmark.
comment: 23 pages, 10 figures. The last two authors made equal contributions
♻ ☆ GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering
Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.
♻ ☆ FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility. Please check our website for more details: https://czzzzh.github.io/FreeArt3D
comment: Project Page: https://czzzzh.github.io/FreeArt3D Code: https://github.com/CzzzzH/FreeArt3D
♻ ☆ Near Real-time Adaptive Isotropic and Anisotropic Image-to-mesh Conversion for Numerical Simulations Involving Cerebral Aneurysms
Presented are two techniques that are designed to help streamline the discretization of complex vascular geometries within the numerical modeling process. The first method integrates multiple software tools into a single pipeline which can generate adaptive anisotropic meshes from segmented medical images. The pipeline is shown to satisfy quality, fidelity, smoothness, and robustness requirements while providing near real-time performance for medical image-to-mesh conversion. The second method approximates a user-defined sizing function to generate adaptive isotropic meshes of good quality and fidelity in real-time. Tested with two brain aneurysm cases and utilizing up to 96 CPU cores within a single, multicore node on Purdue University's Anvil supercomputer, the parallel adaptive anisotropic meshing method utilizes a hierarchical load balancing model (designed for large, cc-NUMA shared memory architectures) and contains an optimized local reconnection operation that performs three times faster than its original implementation from previous studies. The adaptive isotropic method is shown to generate a mesh of up to approximately 50 million elements in less than a minute while the adaptive anisotropic method is shown to generate approximately the same number of elements in about 5 minutes.
comment: 57 pages, 16 figures, 13 tables, presented at the 18th U.S. National Congress on Computational Mechanics conference
Computation and Language 80
☆ Rethinking LLM Human Simulation: When a Graph is What You Need
Large language models (LLMs) are increasingly used to simulate humans, with applications ranging from survey prediction to decision-making. However, are LLMs strictly necessary, or can smaller, domain-grounded models suffice? We identify a large class of simulation problems in which individuals make choices among discrete options, where a graph neural network (GNN) can match or surpass strong LLM baselines despite being three orders of magnitude smaller. We introduce Graph-basEd Models for human Simulation (GEMS), which casts discrete choice simulation tasks as a link prediction problem on graphs, leveraging relational knowledge while incorporating language representations only when needed. Evaluations across three key settings on three simulation datasets show that GEMS achieves comparable or better accuracy than LLMs, with far greater efficiency, interpretability, and transparency, highlighting the promise of graph-based modeling as a lightweight alternative to LLMs for human simulation. Our code is available at https://github.com/schang-lab/gems.
comment: Code: https://github.com/schang-lab/gems
☆ InsurAgent: A Large Language Model-Empowered Agent for Simulating Individual Behavior in Purchasing Flood Insurance
Flood insurance is an effective strategy for individuals to mitigate disaster-related losses. However, participation rates among at-risk populations in the United States remain strikingly low. This gap underscores the need to understand and model the behavioral mechanisms underlying insurance decisions. Large language models (LLMs) have recently exhibited human-like intelligence across wide-ranging tasks, offering promising tools for simulating human decision-making. This study constructs a benchmark dataset to capture insurance purchase probabilities across factors. Using this dataset, the capacity of LLMs is evaluated: while LLMs exhibit a qualitative understanding of factors, they fall short in estimating quantitative probabilities. To address this limitation, InsurAgent, an LLM-empowered agent comprising five modules including perception, retrieval, reasoning, action, and memory, is proposed. The retrieval module leverages retrieval-augmented generation (RAG) to ground decisions in empirical survey data, achieving accurate estimation of marginal and bivariate probabilities. The reasoning module leverages LLM common sense to extrapolate beyond survey data, capturing contextual information that is intractable for traditional models. The memory module supports the simulation of temporal decision evolutions, illustrated through a roller coaster life trajectory. Overall, InsurAgent provides a valuable tool for behavioral modeling and policy analysis.
☆ Deep Value Benchmark: Measuring Whether Models Generalize Deep values or Shallow Preferences NeurIPS 2025
We introduce the Deep Value Benchmark (DVB), an evaluation framework that directly tests whether large language models (LLMs) learn fundamental human values or merely surface-level preferences. This distinction is critical for AI alignment: Systems that capture deeper values are likely to generalize human intentions robustly, while those that capture only superficial patterns in preference data risk producing misaligned behavior. The DVB uses a novel experimental design with controlled confounding between deep values (e.g., moral principles) and shallow features (e.g., superficial attributes). In the training phase, we expose LLMs to human preference data with deliberately correlated deep and shallow features -- for instance, where a user consistently prefers (non-maleficence, formal language) options over (justice, informal language) alternatives. The testing phase then breaks these correlations, presenting choices between (justice, formal language) and (non-maleficence, informal language) options. This design allows us to precisely measure a model's Deep Value Generalization Rate (DVGR) -- the probability of generalizing based on the underlying value rather than the shallow feature. Across 9 different models, the average DVGR is just 0.30. All models generalize deep values less than chance. Larger models have a (slightly) lower DVGR than smaller models. We are releasing our dataset, which was subject to three separate human validation experiments. DVB provides an interpretable measure of a core feature of alignment.
comment: NeurIPS 2025 (Spotlight)
☆ LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS NeurIPS 2025
Contrast-Consistent Search (CCS) is an unsupervised probing method able to test whether large language models represent binary features, such as sentence truth, in their internal activations. While CCS has shown promise, its two-term objective has been only partially understood. In this work, we revisit CCS with the aim of clarifying its mechanisms and extending its applicability. We argue that what should be optimized for, is relative contrast consistency. Building on this insight, we reformulate CCS as an eigenproblem, yielding closed-form solutions with interpretable eigenvalues and natural extensions to multiple variables. We evaluate these approaches across a range of datasets, finding that they recover similar performance to CCS, while avoiding problems around sensitivity to random initialization. Our results suggest that relativizing contrast consistency not only improves our understanding of CCS but also opens pathways for broader probing and mechanistic interpretability methods.
comment: Accepted to the Mechanistic Interpretability Workshop at NeurIPS 2025
☆ Complete asymptotic type-token relationship for growing complex systems with inverse power-law count rankings
The growth dynamics of complex systems often exhibit statistical regularities involving power-law relationships. For real finite complex systems formed by countable tokens (animals, words) as instances of distinct types (species, dictionary entries), an inverse power-law scaling $S \sim r^{-\alpha}$ between type count $S$ and type rank $r$, widely known as Zipf's law, is widely observed to varying degrees of fidelity. A secondary, summary relationship is Heaps' law, which states that the number of types scales sublinearly with the total number of observed tokens present in a growing system. Here, we propose an idealized model of a growing system that (1) deterministically produces arbitrary inverse power-law count rankings for types, and (2) allows us to determine the exact asymptotics of the type-token relationship. Our argument improves upon and remedies earlier work. We obtain a unified asymptotic expression for all values of $\alpha$, which corrects the special cases of $\alpha = 1$ and $\alpha \gg 1$. Our approach relies solely on the form of count rankings, avoids unnecessary approximations, and does not involve any stochastic mechanisms or sampling processes. We thereby demonstrate that a general type-token relationship arises solely as a consequence of Zipf's law.
comment: 5 pages, 2 figures
☆ Regularization Through Reasoning: Systematic Improvements in Language Model Classification via Explanation-Enhanced Fine-Tuning
Fine-tuning LLMs for classification typically maps inputs directly to labels. We ask whether attaching brief explanations to each label during fine-tuning yields better models. We evaluate conversational response quality along three axes: naturalness, comprehensiveness, and on-topic adherence, each rated on 5-point scales. Using ensemble-generated data from multiple LLMs, we fine-tune a 7B-parameter model and test across six diverse conversational datasets. Across 18 dataset, task settings, label-plus-explanation training outperforms label-only baselines. A central and unexpected result concerns random tokens. We replace human-written explanations with text that is syntactically incoherent yet vocabulary-aligned with the originals (e.g., shuffled or bag-of-words variants). Despite lacking semantics, these pseudo-explanations still improve accuracy over label-only training and often narrow much of the gap to true explanations. The effect persists across datasets and training seeds, indicating that gains arise less from meaning than from structure: the extra token budget encourages richer intermediate computation and acts as a regularizer that reduces over-confident shortcuts. Internal analyses support this view: explanation-augmented models exhibit higher activation entropy in intermediate layers alongside sharper predictive mass at the output layer, consistent with increased deliberation before decision. Overall, explanation-augmented fine-tuning, whether with genuine rationales or carefully constructed random token sequences, improves accuracy and reliability for LLM classification while clarifying how token-level scaffolding shapes computation during inference.
☆ TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding
Speculative decoding accelerates LLMs by using a lightweight draft model to generate tokens autoregressively before verifying them in parallel with a larger target model. However, determining the optimal number of tokens to draft remains a key challenge limiting the approach's effectiveness. Dynamic speculative decoding aims to intelligently decide how many tokens to draft to achieve maximum speedups. Existing methods often rely on hand-tuned, sensitive thresholds (e.g., token entropy), which are costly to set and generalize poorly across models and domains. We propose TapOut, an online, training-free, plug-and-play algorithm for dynamic speculation policy selection using multi-armed bandits. Our approach employs a meta-algorithm that selects among multiple parameter-free dynamic speculation strategies based on past reward and exploration. We conduct extensive experiments across diverse model pairs and datasets, showing that TapOut achieves competitive or superior speedups compared to well-established dynamic speculation baselines without any hyperparameter tuning.
comment: 9 pages, 6 figures, 5 tables
☆ Towards Robust Mathematical Reasoning EMNLP 2025
Finding the right north-star metrics is highly critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or only focus on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks, vetted by a panel of top specialists and that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-Proof Bench is the next-level evaluation for proof-writing capabilities, which includes both basic and advanced IMO level problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of the gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-Proof Bench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4% respectively. We also showed that autograders built with Gemini reasoning correlate well with human evaluations and construct IMO-GradingBench, with 1000 human gradings on proofs, to enable further progress in automatic evaluation of long-form answers. We hope that IMO-Bench will help the community towards advancing robust mathematical reasoning and release it at https://imobench.github.io/.
comment: EMNLP 2025 (main conference), https://aclanthology.org/2025.emnlp-main.1794/
☆ KV Cache Transform Coding for Compact Storage in LLM Inference
Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20$\times$ compression while maintaining reasoning and long-context accuracy, and 40$\times$ or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, LiveCodeBench, GSM8K, MMLU, Qasper, RULER, and MATH-500. It consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, while achieving higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.
☆ Plan-and-Write: Structure-Guided Length Control for LLMs without Model Retraining KDD 2025
Length control in Large Language Models (LLMs) is a crucial but under-addressed challenge, with applications ranging from voice interfaces requiring concise responses to research summaries needing comprehensive outputs. Current approaches to length control, including Regularized DPO, Length-Instruction Fine Tuning, and tool-augmented methods, typically require expensive model retraining or complex inference-time tooling. This paper presents a prompt engineering methodology that enables precise length control without model retraining. Our structure-guided approach implements deliberate planning and word counting mechanisms within the prompt, encouraging the model to carefully track and adhere to specified length constraints. Comprehensive evaluations across six state-of-the-art LLMs demonstrate that our method significantly improves length fidelity for several models compared to standard prompting when applied to document summarization tasks, particularly for shorter-to-medium length constraints. The proposed technique shows varying benefits across different model architectures, with some models demonstrating up to 37.6% improvement in length adherence. Quality evaluations further reveal that our approach maintains or enhances overall output quality compared to standard prompting techniques. Our approach provides an immediately deployable solution for applications requiring precise length control, particularly valuable for production environments where model retraining is impractical or cost-prohibitive.
comment: Presented at Workshop on Prompt Optimization, KDD 2025, Toronto, Canada
☆ Random Initialization of Gated Sparse Adapters ICML 2025
When fine-tuning language models on new tasks, catastrophic forgetting -- performance degradation on previously-learned tasks -- is a ubiquitous problem. While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA address this through low-rank adapters, sparse adaptation offers an alternative that doesn't impose rank constraints. We introduce Random Initialization of Gated Sparse Adapters (RIGSA), which starts from randomly-initialized full-rank adapters, gates them with a ReZero analog, and sparsifies them with iterative magnitude pruning. We evaluate RIGSA on SmolLM2-1.7B-Instruct using a novel vision-in-text task (Textual MNIST) and measure forgetting on PIQA, HellaSwag, and GSM8k. SmolLM2-1.7B-Instruct initially performs around chance level on Textual MNIST, and is capable of learning the task through RIGSA, 4-bit QLoRA and random masking. In spite of having more trainable parameters than QLoRA, the RIGSA configurations that we studied displayed less forgetting than QLoRA, particularly on GSM8k, though it performs comparably to random masking.
comment: 13 pages (8 main), 6 figures (4 main). Accepted by NewInML workshop @ ICML 2025 on June 27, 2025
☆ RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks
Open-ended generation tasks require outputs to satisfy diverse and often implicit task-specific evaluation rubrics. The sheer number of relevant rubrics leads to prohibitively high verification costs and incomplete assessments of a response, making reinforcement learning (RL) post-training with rubric-based rewards difficult to scale. This problem is exacerbated by the fact that often the best way to combine these rubrics into one single reward is also highly prompt-specific. We propose Reinforcement Learning with Adversarial Critic (RLAC), a post-training approach that addresses these challenges via dynamic rubric verification. Our approach employs a large language model (LLM) as a critic that dynamically identifies only the most likely failure modes (e.g., a factual error or unhandled edge case), which are then verified by an external validator to optimize both generator and critic jointly. By training both the generator and the critic, this game enhances the critic's error detection and the generator's output quality while reducing required verifications. Our experiments demonstrate that RLAC improves factual accuracy in text generation and correctness in code generation, while also outperforming exhaustive verification and reward model methods. We show that dynamic critics are more effective than fixed critics, showcasing the potential of RLAC for scaling RL post-training to free-form generation tasks.
comment: Project page: https://mianwu01.github.io/RLAC_website/
☆ A Proof of Learning Rate Transfer under $μ$P
We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with $\mu$P, a neural network parameterization designed to ``maximize'' feature learning in the infinite-width limit. We show that under $\mu P$, the optimal learning rate converges to a \emph{non-zero constant} as width goes to infinity, providing a theoretical explanation to learning rate transfer. In contrast, we show that this property fails to hold under alternative parametrizations such as Standard Parametrization (SP) and Neural Tangent Parametrization (NTP). We provide intuitive proofs and support the theoretical findings with extensive empirical results.
comment: 23 pages
☆ Efficient Tool-Calling Multi-Expert NPC Agent for Commonsense Persona-Grounded Dialogue EMNLP 2025
We present a multi-expert system for creating Non-Player Characters (NPCs) capable of both natural dialogue and contextual action execution in interactive environments. Using Qwen3 as the base model and Low-Rank Adaptation (LoRA) adapters, we instantiate three specialists: tool calling, tool-response interpretation, and direct dialogue. Our system comfortably meets the computational efficiency requirements, delivering fast responses and maintaining modest resource usage on L40S GPUs. In the Commonsense Persona-Grounded Dialogue Challenge 2025, our method ranked second overall. Code available at: https://github.com/MahammadNuriyev62/CPDC-challenge-2025-solution/
comment: 10 pages, 1 figure, 2 tables. Technical report for the Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025, part of the Wordplay 2025 Workshop @ EMNLP 2025
☆ Multi-Step Knowledge Interaction Analysis via Rank-2 Subspace Disentanglement
Natural Language Explanations (NLEs) describe how Large Language Models (LLMs) make decisions, drawing on both external Context Knowledge (CK) and Parametric Knowledge (PK) stored in model weights. Understanding their interaction is key to assessing the grounding of NLEs, yet it remains underexplored. Prior work has largely examined only single-step generation, typically the final answer, and has modelled PK and CK interaction only as a binary choice in a rank-1 subspace. This overlooks richer forms of interaction, such as complementary or supportive knowledge. We propose a novel rank-2 projection subspace that disentangles PK and CK contributions more accurately and use it for the first multi-step analysis of knowledge interactions across longer NLE sequences. Experiments on four QA datasets and three open-weight instruction-tuned LLMs show that diverse knowledge interactions are poorly represented in a rank-1 subspace but are effectively captured in our rank-2 formulation. Our multi-step analysis reveals that hallucinated NLEs align strongly with the PK direction, context-faithful ones balance PK and CK, and Chain-of-Thought prompting for NLEs shifts generated NLEs toward CK by reducing PK reliance. This work provides the first framework for systematic studies of multi-step knowledge interactions in LLMs through a richer rank-2 subspace disentanglement. Code and data: https://github.com/copenlu/pk-ck-knowledge-disentanglement.
comment: Under review
☆ Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI
The character of the "AI assistant" persona generated by modern chatbot large language models influences both surface-level behavior and apparent values, beliefs, and ethics. These all affect interaction quality, perceived intelligence, and alignment with both developer and user intentions. The shaping of this persona, known as character training, is a critical component of industry post-training, yet remains effectively unstudied in the academic literature. We introduce the first open implementation of character training, leveraging Constitutional AI and a new data pipeline using synthetic introspective data to shape the assistant persona in a more effective and controlled manner than alternatives such as constraining system prompts or activation steering. Specifically, we fine-tune three popular open-weights models using 11 example personas, such as humorous, deeply caring, or even malevolent. To track the effects of our approach, we introduce a method which analyzes revealed preferences, uncovering clear and holistic changes in character. We find these changes are more robust to adversarial prompting than the above two alternatives, while also leading to more coherent and realistic generations. Finally, we demonstrate this fine-tuning has little to no effect on general capabilities as measured by common benchmarks. We describe and open-source our full post-training method, the implementation of which can be found at https://github.com/maiush/OpenCharacterTraining.
comment: 12 pages, 6 figures, 4 tables
☆ SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia
We introduce SeaLLMs-Audio, the first large audio-language model (LALM) tailored for multiple Southeast Asian (SEA) languages-Indonesian (id), Thai (th), and Vietnamese (vi)-alongside English (en) and Chinese (zh). Trained on a large-scale audio corpus, SeaLLMs-Audio exhibits strong performance across diverse audio-centric tasks, spanning fine-grained audio understanding and voice-based interaction. Its key features include: 1) Multilingual: the model primarily supports 5 languages, namely Indonesian, Thai, Vietnamese, English, and Chinese; 2) Multimodal: the model accepts flexible input modalities, including audio only, text only, as well as audio with text; 3) Multi-task: the model supports a wide range of tasks, including audio analysis tasks such as Audio Captioning, Automatic Speech Recognition, Speech-to-Text Translation, Speech Emotion Recognition, Speech Question Answering, and Speech Summarization. It also enables voice-based dialogue, including answering factual, mathematical, and general knowledge queries. As a significant step towards advancing audio LLMs in Southeast Asia, we expect SeaLLMs-Audio to benefit both the regional research community and industry. To automate LALM evaluation for Southeast Asia, we introduce SeaBench-Audio, a benchmark spanning multiple tasks. Experiments show that SeaLLMs-Audio achieves competitive performance compared with other LALMs on SEA languages.
comment: 10 pages
☆ EngChain: A Symbolic Benchmark for Verifiable Multi-Step Reasoning in Engineering
Large Language Models (LLMs) are increasingly being applied to specialized, high-stakes domains like engineering, which demands rigorous evaluation of their complex reasoning capabilities. While current benchmarks assess language understanding, factual recall, mathematics or code generation, none capture the integrative reasoning central to engineering where scientific principles, quantitative modeling and practical constraints must converge. To address this gap, we introduce EngChain, a benchmark for verifiable multi-step engineering problem-solving. EngChain contains 90 problems spanning three engineering branches, organized into 9 domains and 20 distinct areas. The problems are generated from symbolic templates with a high degree of randomization to ensure diversity and eliminate the risk of contamination. With this benchmark, we move beyond final answer accuracy with a two-stage evaluation: we first quantitatively verify the numerical and semantic validity of each reasoning step and then introduce LLM-As-A-Judge, an automated system to qualitatively categorize the identified reasoning errors.
comment: 24 pages, includes figures and tables; introduces the EngChain benchmark
☆ Evaluating Cultural Knowledge Processing in Large Language Models: A Cognitive Benchmarking Framework Integrating Retrieval-Augmented Generation
This study proposes a cognitive benchmarking framework to evaluate how large language models (LLMs) process and apply culturally specific knowledge. The framework integrates Bloom's Taxonomy with Retrieval-Augmented Generation (RAG) to assess model performance across six hierarchical cognitive domains: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Using a curated Taiwanese Hakka digital cultural archive as the primary testbed, the evaluation measures LLM-generated responses' semantic accuracy and cultural relevance.
comment: This paper has been accepted by The Electronic Library, and the full article is now available on Emerald Insight
☆ A Graph-based RAG for Energy Efficiency Question Answering
In this work, we investigate the use of Large Language Models (LLMs) within a graph-based Retrieval Augmented Generation (RAG) architecture for Energy Efficiency (EE) Question Answering. First, the system automatically extracts a Knowledge Graph (KG) from guidance and regulatory documents in the energy field. Then, the generated graph is navigated and reasoned upon to provide users with accurate answers in multiple languages. We implement a human-based validation using the RAGAs framework properties, a validation dataset comprising 101 question-answer pairs, and domain experts. Results confirm the potential of this architecture and identify its strengths and weaknesses. Validation results show how the system correctly answers in about three out of four of the cases (75.2 +- 2.7%), with higher results on questions related to more general EE answers (up to 81.0 +- 4.1%), and featuring promising multilingual abilities (4.4% accuracy loss due to translation).
☆ ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian LREC 2026
ParlaSpeech is a collection of spoken parliamentary corpora currently spanning four Slavic languages - Croatian, Czech, Polish and Serbian - all together 6 thousand hours in size. The corpora were built in an automatic fashion from the ParlaMint transcripts and their corresponding metadata, which were aligned to the speech recordings of each corresponding parliament. In this release of the dataset, each of the corpora is significantly enriched with various automatic annotation layers. The textual modality of all four corpora has been enriched with linguistic annotations and sentiment predictions. Similar to that, their spoken modality has been automatically enriched with occurrences of filled pauses, the most frequent disfluency in typical speech. Two out of the four languages have been additionally enriched with detailed word- and grapheme-level alignments, and the automatic annotation of the position of primary stress in multisyllabic words. With these enrichments, the usefulness of the underlying corpora has been drastically increased for downstream research across multiple disciplines, which we showcase through an analysis of acoustic correlates of sentiment. All the corpora are made available for download in JSONL and TextGrid formats, as well as for search through a concordancer.
comment: Submitted to the LREC 2026 conference; 11 pages, 2 figures, 3 tables
☆ Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models
Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved 2D visual understanding, prompting interest in their application to complex 3D reasoning tasks. However, it remains unclear whether these models can effectively capture the detailed spatial information required for robust real-world performance, especially cross-view consistency, a key requirement for accurate 3D reasoning. Considering this issue, we introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy: first, foundational knowledge is injected to the baseline MLLM via Supervised Fine-Tuning (SFT) on Viewpoint-100K, resulting in significant improvements across multiple tasks; second, generalization is enhanced through Reinforcement Learning using the Group Relative Policy Optimization (GRPO) algorithm on a broader set of questions. Additionally, we introduce a hybrid cold-start initialization method designed to simultaneously learn viewpoint representations and maintain coherent reasoning thinking. Experimental results show that our approach significantly activates the spatial reasoning ability of MLLM, improving performance on both in-domain and out-of-domain reasoning tasks. Our findings highlight the value of developing foundational spatial skills in MLLMs, supporting future progress in robotics, autonomous systems, and 3D scene understanding.
☆ Imperfect Language, Artificial Intelligence, and the Human Mind: An Interdisciplinary Approach to Linguistic Errors in Native Spanish Speakers
Linguistic errors are not merely deviations from normative grammar; they offer a unique window into the cognitive architecture of language and expose the current limitations of artificial systems that seek to replicate them. This project proposes an interdisciplinary study of linguistic errors produced by native Spanish speakers, with the aim of analyzing how current large language models (LLM) interpret, reproduce, or correct them. The research integrates three core perspectives: theoretical linguistics, to classify and understand the nature of the errors; neurolinguistics, to contextualize them within real-time language processing in the brain; and natural language processing (NLP), to evaluate their interpretation against linguistic errors. A purpose-built corpus of authentic errors of native Spanish (+500) will serve as the foundation for empirical analysis. These errors will be tested against AI models such as GPT or Gemini to assess their interpretative accuracy and their ability to generalize patterns of human linguistic behavior. The project contributes not only to the understanding of Spanish as a native language but also to the development of NLP systems that are more cognitively informed and capable of engaging with the imperfect, variable, and often ambiguous nature of real human language.
comment: 12 pages, 3 figures
☆ BIRD: Bronze Inscription Restoration and Dating EMNLP 2025
Bronze inscriptions from early China are fragmentary and difficult to date. We introduce BIRD(Bronze Inscription Restoration and Dating), a fully encoded dataset grounded in standard scholarly transcriptions and chronological labels. We further propose an allograph-aware masked language modeling framework that integrates domain- and task-adaptive pretraining with a Glyph Net (GN), which links graphemes and allographs. Experiments show that GN improves restoration, while glyph-biased sampling yields gains in dating.
comment: Accepted at EMNLP 2025 (Main Conference)
☆ ECO Decoding: Entropy-Based Control for Controllability and Fluency in Controllable Dialogue Generation EMNLP 2025
Controllable Dialogue Generation (CDG) enables chatbots to generate responses with desired attributes, and weighted decoding methods have achieved significant success in the CDG task. However, using a fixed constant value to manage the bias of attribute probabilities makes it challenging to find an ideal control strength that satisfies both controllability and fluency. To address this issue, we propose ECO decoding (Entropy-based COntrol), which dynamically adjusts the control strength at each generation step according to the model's entropy in both the language model and attribute classifier probability distributions. Experiments on the DailyDialog and MultiWOZ datasets demonstrate that ECO decoding consistently improves controllability while maintaining fluency and grammaticality, outperforming prior decoding methods across various models and settings. Furthermore, ECO decoding alleviates probability interpolation issues in multi-attribute generation and consequently demonstrates strong performance in both single and multi-attribute scenarios.
comment: Published at EMNLP 2025 main
☆ Math anxiety and associative knowledge structure are entwined in psychology students but not in Large Language Models like GPT-3.5 and GPT-4o
Math anxiety poses significant challenges for university psychology students, affecting their career choices and overall well-being. This study employs a framework based on behavioural forma mentis networks (i.e. cognitive models that map how individuals structure their associative knowledge and emotional perceptions of concepts) to explore individual and group differences in the perception and association of concepts related to math and anxiety. We conducted 4 experiments involving psychology undergraduates from 2 samples (n1 = 70, n2 = 57) compared against GPT-simulated students (GPT-3.5: n2 = 300; GPT-4o: n4 = 300). Experiments 1, 2, and 3 employ individual-level network features to predict psychometric scores for math anxiety and its facets (observational, social and evaluational) from the Math Anxiety Scale. Experiment 4 focuses on group-level perceptions extracted from human students, GPT-3.5 and GPT-4o's networks. Results indicate that, in students, positive valence ratings and higher network degree for "anxiety", together with negative ratings for "math", can predict higher total and evaluative math anxiety. In contrast, these models do not work on GPT-based data because of differences in simulated networks and psychometric scores compared to humans. These results were also reconciled with differences found in the ways that high/low subgroups of simulated and real students framed semantically and emotionally STEM concepts. High math-anxiety students collectively framed "anxiety" in an emotionally polarising way, absent in the negative perception of low math-anxiety students. "Science" was rated positively, but contrasted against the negative perception of "math". These findings underscore the importance of understanding concept perception and associations in managing students' math anxiety.
☆ Hidden in Plain Sight: Where Developers Confess Self-Admitted Technical Debt
Context. Detecting Self-Admitted Technical Debt (SATD) is crucial for proactive software maintenance. Previous research has primarily targeted detecting and prioritizing SATD, with little focus on the source code afflicted with SATD. Our goal in this work is to connect the SATD comments with source code constructs that surround them. Method. We leverage the extensive SATD dataset PENTACET, containing code comments from over 9000 Java Open Source Software (OSS) repositories. We quantitatively infer where SATD most commonly occurs and which code constructs/statements it most frequently affects. Results and Conclusions. Our large-scale study links over 225,000 SATD comments to their surrounding code, showing that SATD mainly arises in inline code near definitions, conditionals, and exception handling, where developers face uncertainty and trade-offs, revealing it as an intentional signal of awareness during change rather than mere neglect.
☆ Difficulty-Controllable Cloze Question Distractor Generation
Multiple-choice cloze questions are commonly used to assess linguistic proficiency and comprehension. However, generating high-quality distractors remains challenging, as existing methods often lack adaptability and control over difficulty levels, and the absence of difficulty-annotated datasets further hinders progress. To address these issues, we propose a novel framework for generating distractors with controllable difficulty by leveraging both data augmentation and a multitask learning strategy. First, to create a high-quality, difficulty-annotated dataset, we introduce a two-way distractor generation process in order to produce diverse and plausible distractors. These candidates are subsequently refined through filtering and then categorized by difficulty using an ensemble QA system. Second, this newly created dataset is leveraged to train a difficulty-controllable generation model via multitask learning. The framework includes carefully designed auxiliary tasks that enhance the model's semantic understanding of distractors and its ability to estimate their difficulty. Experimental results demonstrate that our method generates high-quality distractors across difficulty levels and substantially outperforms GPT-4o in aligning distractor difficulty with human perception.
☆ BanglaNirTox: A Large-scale Parallel Corpus for Explainable AI in Bengali Text Detoxification
Toxic language in Bengali remains prevalent, especially in online environments, with few effective precautions against it. Although text detoxification has seen progress in high-resource languages, Bengali remains underexplored due to limited resources. In this paper, we propose a novel pipeline for Bengali text detoxification that combines Pareto class-optimized large language models (LLMs) and Chain-of-Thought (CoT) prompting to generate detoxified sentences. To support this effort, we construct BanglaNirTox, an artificially generated parallel corpus of 68,041 toxic Bengali sentences with class-wise toxicity labels, reasonings, and detoxified paraphrases, using Pareto-optimized LLMs evaluated on random samples. The resulting BanglaNirTox dataset is used to fine-tune language models to produce better detoxified versions of Bengali sentences. Our findings show that Pareto-optimized LLMs with CoT prompting significantly enhance the quality and consistency of Bengali text detoxification.
comment: Under review, 6 pages, 1 figure, 2 tables
☆ Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning
As synthetic data becomes widely used in language model development, understanding its impact on model behavior is crucial. This paper investigates the impact of the diversity of sources of synthetic data on fine-tuned large language models. We focus on three key dimensions: distribution collapse, adversarial robustness, and self-preference bias. Our findings reveal that fine-tuning models on synthetic data from diverse sources can mitigate distribution collapse, preserving the breadth of the output distribution and the diversity of the output text. Furthermore, while both human and synthetic fine-tuning data can remove safeguards, the latter preserves higher output quality, thus making outputs potentially more usable and dangerous. Finally, fine-tuning reduces self-preference bias, with human data being the most effective, followed by multi-source synthetic data.
☆ Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation
Text-based automated Cognitive Distortion detection is a challenging task due to its subjective nature, with low agreement scores observed even among expert human annotators, leading to unreliable annotations. We explore the use of Large Language Models (LLMs) as consistent and reliable annotators, and propose that multiple independent LLM runs can reveal stable labeling patterns despite the inherent subjectivity of the task. Furthermore, to fairly compare models trained on datasets with different characteristics, we introduce a dataset-agnostic evaluation framework using Cohen's kappa as an effect size measure. This methodology allows for fair cross-dataset and cross-study comparisons where traditional metrics like F1 score fall short. Our results show that GPT-4 can produce consistent annotations (Fleiss's Kappa = 0.78), resulting in improved test set performance for models trained on these annotations compared to those trained on human-labeled data. Our findings suggest that LLMs can offer a scalable and internally consistent alternative for generating training data that supports strong downstream performance in subjective NLP tasks.
☆ BARD: budget-aware reasoning distillation
While long Chain-of-Thought (CoT) distillation effectively transfers reasoning capability to smaller language models, the reasoning process often remains redundant and computational budget uncontrollable, leading to inefficient resource usage. To address this limitation, we propose \textbf{Budget-Aware Reasoning Distillation (BARD)}, a novel framework that simultaneously distills reasoning capability and enables fine-grained control over the reasoning length. BARD uses the thinking budget as a user-specified control signal, allowing the model to dynamically balance reasoning performance and computational efficiency. To achieve this concept, BARD introduces a two-phase training regimen. The first phase, Supervised Fine-Tuning (SFT) on teacher-generated long CoT data compressed to various budget levels, bootstrapping the model's understanding of budget constraints. The second phase leverages Reinforcement Learning (RL) from a reward signal in consideration of reasoning performance and budget fidelity simultaneously. Incorporating the two-phase regimen is crucial to avoiding policy degradation and ensuring that both objectives are optimized jointly. Extensive experiments demonstrate that our method empowers an 8B student model to achieve strong performance on challenging reasoning benchmarks (\textit{AIME24, AIME25, GPQA}) while providing precise and adaptive control over its reasoning length across a wide range of budgets.
♻ ☆ ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers ICCV 2025
Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual token in approximately 60\% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50\% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV
comment: Published as a conference paper at ICCV 2025. Project page: https://github.com/icip-cas/ShortV
♻ ☆ JobHop: A Large-Scale Dataset of Career Trajectories
Understanding labor market dynamics is essential for policymakers, employers, and job seekers. However, comprehensive datasets that capture real-world career trajectories are scarce. In this paper, we introduce JobHop, a large-scale public dataset derived from anonymized resumes provided by VDAB, the public employment service in Flanders, Belgium. Utilizing Large Language Models (LLMs), we process unstructured resume data to extract structured career information, which is then normalized to standardized ESCO occupation codes using a multi-label classification model. This results in a rich dataset of over 1.67 million work experiences, extracted from and grouped into more than 361,000 user resumes and mapped to standardized ESCO occupation codes, offering valuable insights into real-world occupational transitions. This dataset enables diverse applications, such as analyzing labor market mobility, job stability, and the effects of career breaks on occupational transitions. It also supports career path prediction and other data-driven decision-making processes. To illustrate its potential, we explore key dataset characteristics, including job distributions, career breaks, and job transitions, demonstrating its value for advancing labor market research.
♻ ☆ Beyond Empathy: Integrating Diagnostic and Therapeutic Reasoning with Large Language Models for Mental Health Counseling
Large language models (LLMs) hold significant potential for mental health support, capable of generating empathetic responses and simulating therapeutic conversations. However, existing LLM-based approaches often lack the clinical grounding necessary for real-world psychological counseling, particularly in explicit diagnostic reasoning aligned with standards like the DSM/ICD and incorporating diverse therapeutic modalities beyond basic empathy or single strategies. To address these critical limitations, we propose PsyLLM, the first large language model designed to systematically integrate both diagnostic and therapeutic reasoning for mental health counseling. To develop PsyLLM, we design a novel automated data synthesis pipeline that processes real-world mental health posts collected from Reddit, where users frequently share psychological distress and seek community support. This pipeline processes real-world mental health posts, generates multi-turn dialogue structures, and leverages LLMs guided by international diagnostic standards (e.g., DSM/ICD) and multiple therapeutic frameworks (e.g., CBT, ACT, psychodynamic) to simulate detailed clinical reasoning processes. Rigorous multi-dimensional filtering ensures the generation of high-quality, clinically aligned dialogue data. In addition, we introduce a new benchmark and evaluation protocol, assessing counseling quality across four key dimensions. Our experiments demonstrate that PsyLLM significantly outperforms state-of-the-art baseline models on this benchmark. The model weights and dataset have been publicly released at https://github.com/Emo-gml/PsyLLM.
♻ ☆ Forging Time Series with Language: A Large Language Model Approach to Synthetic Data Generation
SDForger is a flexible and efficient framework for generating high-quality multivariate time series using LLMs. Leveraging a compact data representation, SDForger provides synthetic time series generation from a few samples and low-computation fine-tuning of any autoregressive LLM. Specifically, the framework transforms univariate and multivariate signals into tabular embeddings, which are then encoded into text and used to fine-tune the LLM. At inference, new textual embeddings are sampled and decoded into synthetic time series that retain the original data's statistical properties and temporal dynamics. Across a diverse range of datasets, SDForger outperforms existing generative models in many scenarios, both in similarity-based evaluations and downstream forecasting tasks. By enabling textual conditioning in the generation process, SDForger paves the way for multimodal modeling and the streamlined integration of time series with textual information. The model is open-sourced at https://github.com/IBM/fms-dgt/tree/main/fms_dgt/public/databuilders/time_series.
♻ ☆ Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models
Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which attempt to elicit harmful responses from LLMs. The evolving nature and diversity of these attacks pose many challenges for defense systems, including (1) adaptation to counter emerging attack strategies without costly retraining, and (2) control of the trade-off between safety and utility. To address these challenges, we propose Retrieval-Augmented Defense (RAD), a novel framework for jailbreak detection that incorporates a database of known attack examples into Retrieval-Augmented Generation, which is used to infer the underlying, malicious user query and jailbreak strategy used to attack the system. RAD enables training-free updates for newly discovered jailbreak strategies and provides a mechanism to balance safety and utility. Experiments on StrongREJECT show that RAD substantially reduces the effectiveness of strong jailbreak attacks such as PAP and PAIR while maintaining low rejection rates for benign queries. We propose a novel evaluation scheme and show that RAD achieves a robust safety-utility trade-off across a range of operating points in a controllable manner.
♻ ☆ Verbalized Algorithms NeurIPS 2025
Instead of querying LLMs in a one-shot manner and hoping to get the right answer for a reasoning task, we propose a paradigm we call \emph{verbalized algorithms} (VAs), which leverage classical algorithms with established theoretical understanding. VAs decompose a task into simple elementary operations on natural language strings that they should be able to answer reliably, and limit the scope of LLMs to only those simple tasks. For example, for sorting a series of natural language strings, \emph{verbalized sorting} uses an LLM as a binary comparison oracle in a known and well-analyzed sorting algorithm (e.g., bitonic sorting network). We demonstrate the effectiveness of this approach on sorting and clustering tasks.
comment: Accepted in NeurIPS 2025 Workshop on Efficient Reasoning
♻ ☆ What is the Role of Small Models in the LLM Era: A Survey
Large Language Models (LLMs) have made significant progress in advancing artificial general intelligence (AGI), leading to the development of increasingly large models such as GPT-4 and LLaMA-405B. However, scaling up model sizes results in exponentially higher computational costs and energy consumption, making these models impractical for academic researchers and businesses with limited resources. At the same time, Small Models (SMs) are frequently used in practical settings, although their significance is currently underestimated. This raises important questions about the role of small models in the era of LLMs, a topic that has received limited attention in prior research. In this work, we systematically examine the relationship between LLMs and SMs from two key perspectives: Collaboration and Competition. We hope this survey provides valuable insights for practitioners, fostering a deeper understanding of the contribution of small models and promoting more efficient use of computational resources. The code is available at https://github.com/tigerchen52/role_of_small_models
comment: a survey paper of small models
♻ ☆ Hebrew Diacritics Restoration using Visual Representation
Diacritics restoration in Hebrew is a fundamental task for ensuring accurate word pronunciation and disambiguating textual meaning. Despite the language's high degree of ambiguity when unvocalized, recent machine learning approaches have significantly advanced performance on this task. In this work, we present DIVRIT, a novel system for Hebrew diacritization that frames the task as a zero-shot classification problem. Our approach operates at the word level, selecting the most appropriate diacritization pattern for each undiacritized word from a dynamically generated candidate set, conditioned on the surrounding textual context. A key innovation of DIVRIT is its use of a Hebrew Visual Language Model, which processes undiacritized text as an image, allowing diacritic information to be embedded directly within the input's vector representation. Through a comprehensive evaluation across various configurations, we demonstrate that the system effectively performs diacritization without relying on complex, explicit linguistic analysis. Notably, in an ``oracle'' setting where the correct diacritized form is guaranteed to be among the provided candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic architectural enhancements and optimized training methodologies yield significant improvements in the system's overall generalization capabilities. These findings highlight the promising potential of visual representations for accurate and automated Hebrew diacritization.
♻ ☆ Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models
We reinterpret Kant's Critique of Pure Reason as a theory of feedback stability, viewing reason as a regulator that keeps inference within the bounds of possible experience. We formalize this intuition via a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. In linear-Gaussian simulations, higher H-Risk predicts overconfident errors even under formal stability, revealing a gap between nominal and epistemic stability. Extending to large language models (LLMs), we observe preliminary correlations between internal fragility and miscalibration or hallucination (confabulation), and find that lightweight critique prompts may modestly improve or worsen calibration in small-scale tests. These results suggest a structural bridge between Kantian self-limitation and feedback control, offering a principled lens to diagnose and potentially mitigate overconfidence in reasoning systems.
comment: 21 pages, 2 figures, preliminary version
♻ ☆ New Encoders for German Trained from Scratch: Comparing ModernGBERT with Converted LLM2Vec Models LREC
Encoders remain essential for efficient German NLP and NLU scenarios despite the rise of decoder-only LLMs. This work studies two routes to high-quality German encoders under identical data and training constraints: 1) training from scratch and 2) converting decoders via LLM2Vec. We introduce two resources: ModernGBERT (134M, 1B), fully transparent German encoders in the ModernBERT style, and LL\"aMmleinVec (120M, 1B, 7B), decoder-to-encoder conversions trained with masked next-token prediction, both undergoing a context extension to 8.192 tokens. Across SuperGLEBer, ModernGBERT 1B sets a new state of the art (avg 0.808), surpassing GBERT Large (+4%) and the seven-times larger converted 7B model (0.787). On German MTEB after supervised fine-tuning, ModernGBERT 1B (0.551) approaches the converted 7B model (0.557). We release all models, checkpoints, datasets, and full training records, and introduce an encoder-adapted QA-NIAH evaluation. All in all, our results provide actionable guidance: when parameter efficiency and latency matter, from-scratch encoders dominate. When a pre-trained decoder exists and compute is a limited, conversion offers an effective alternative. ModernGBERT and LL\"aMmleinVec, including all code, data and intermediary checkpoints are published under a research-only RAIL license.
comment: under review @LREC
♻ ☆ JudgeLRM: Large Reasoning Models as a Judge
Large Language Models (LLMs) are increasingly adopted as evaluators, offering a scalable alternative to human annotation. However, existing supervised fine-tuning (SFT) approaches often fall short in domains that demand complex reasoning. Judgment is inherently reasoning-intensive: beyond surface-level scoring, it requires verifying evidence, identifying errors, and justifying decisions. Through the analysis of evaluation tasks, we find a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, revealing the limits of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs, trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards to activate reasoning capabilities. JudgeLRM consistently outperform SFT-tuned baselines in the same size, as well as other RL and SFT variants, and even surpass state-of-the-art reasoning models: notably, JudgeLRM-3B/4B exceeds GPT-4, while JudgeLRM-7B/8B/14B outperforms DeepSeek-R1 by over 2% in F1 score, with particularly strong gains on reasoning-heavy tasks. Our findings underscore the value of RL in unlocking reasoning-aligned LLM judges.
comment: Preprint
♻ ☆ Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time ICLR 2026
Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.'') teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.
comment: 40 pages, 22 figures. Under review at ICLR 2026
♻ ☆ Eye Tracking Based Cognitive Evaluation of Automatic Readability Assessment Measures
Methods for scoring text readability have been studied for over a century, and are widely used in research and in user-facing applications in many domains. Thus far, the development and evaluation of such methods have primarily relied on two types of offline behavioral data, performance on reading comprehension tests and ratings of text readability levels. In this work, we instead focus on a fundamental and understudied aspect of readability, real-time reading ease, captured with online reading measures using eye tracking. We introduce an evaluation framework for readability scoring methods which quantifies their ability to account for reading ease, while controlling for content variation across texts. Applying this evaluation to prominent traditional readability formulas, modern machine learning systems, frontier Large Language Models and commercial systems used in education, suggests that they are all poor predictors of reading ease in English. This outcome holds across native and non-native speakers, reading regimes, and textual units of different lengths. The evaluation further reveals that existing methods are often outperformed by word properties commonly used in psycholinguistics for prediction of reading times. Our results highlight a fundamental limitation of existing approaches to readability scoring, the utility of psycholinguistics for readability research, and the need for new, cognitively driven readability scoring approaches that can better account for reading ease.
♻ ☆ RadarPLM: Adapting Pretrained Language Models for Marine Radar Target Detection with Preference-aware Loss
Recent advances in pre-trained language models (PLMs) have demonstrated their capabilities in capturing universal knowledge, making them promising applications for radar signal processing. Nevertheless, directly fine-tuning PLMs on radar signals is both computationally expensive and prone to overfitting, particularly in low signal-to-clutter ratio (SCR) environments. In this paper, we propose a novel fine-tuning framework for PLM-based marine radar target detection. First, we design a lightweight adaptation module, enabling parameter-efficient fine-tuning while preserving the pretrained model's general knowledge. Second, a novel preference-aware loss is developed to selectively optimize different feature patches based on their online evaluated learning values, guiding the model to concentrate on the most generalizable feature patterns during optimization. Extensive experiments on real-world marine radar datasets demonstrate that the proposed finetuning framework achieves an average performance improvement of 9.9% over the standard approach under low SCR conditions. Furthermore, the fine-tuned model, RadarPLM, consistently outperforms state-of-the-art detectors, particularly when training data are limited.
♻ ☆ Representation Consistency for Accurate and Coherent LLM Answer Aggregation NeurIPS 2025
Test-time scaling improves large language models' (LLMs) performance by allocating more compute budget during inference. To achieve this, existing methods often require intricate modifications to prompting and sampling strategies. In this work, we introduce representation consistency (RC), a test-time scaling method for aggregating answers drawn from multiple candidate responses of an LLM regardless of how they were generated, including variations in prompt phrasing and sampling strategy. RC enhances answer aggregation by not only considering the number of occurrences of each answer in the candidate response set, but also the consistency of the model's internal activations while generating the set of responses leading to each answer. These activations can be either dense (raw model activations) or sparse (encoded via pretrained sparse autoencoders). Our rationale is that if the model's representations of multiple responses converging on the same answer are highly variable, this answer is more likely to be the result of incoherent reasoning and should be down-weighted during aggregation. Importantly, our method only uses cached activations and lightweight similarity computations and requires no additional model queries. Through experiments with four open-source LLMs and four reasoning datasets, we validate the effectiveness of RC for improving task performance during inference, with consistent accuracy improvements (up to 4%) over strong test-time scaling baselines. We also show that consistency in the sparse activation signals aligns well with the common notion of coherent reasoning.
comment: Accepted at NeurIPS 2025. Camera-ready version
♻ ☆ Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation AACL
Large language models (LLMs) exhibit strong multilingual abilities, yet the neural mechanisms behind language-specific processing remain unclear. We analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying neurons that control language behavior. Using the Language Activation Probability Entropy (LAPE) method, we show that these neurons cluster in deeper layers, with non-Latin scripts showing greater specialization. Related languages share overlapping neurons, reflecting internal representations of linguistic proximity. Through language arithmetics, i.e. systematic activation addition and multiplication, we steer models to deactivate unwanted languages and activate desired ones, outperforming simpler replacement approaches. These interventions effectively guide behavior across five multilingual tasks: language forcing, translation, QA, comprehension, and NLI. Manipulation is more successful for high-resource languages, while typological similarity improves effectiveness. We also demonstrate that cross-lingual neuron steering enhances downstream performance and reveal internal "fallback" mechanisms for language selection when neurons are progressively deactivated. Our code is made publicly available at https://github.com/d-gurgurov/Language-Neurons-Manipulation.
comment: accepted to AACL main
♻ ☆ Where to Search: Measure the Prior-Structured Search Space of LLM Agents
The generate-filter-refine (iterative paradigm) based on large language models (LLMs) has achieved progress in reasoning, programming, and program discovery in AI+Science. However, the effectiveness of search depends on where to search, namely, how to encode the domain prior into an operationally structured hypothesis space. To this end, this paper proposes a compact formal theory that describes and measures LLM-assisted iterative search guided by domain priors. We represent an agent as a fuzzy relation operator on inputs and outputs to capture feasible transitions; the agent is thereby constrained by a fixed safety envelope. To describe multi-step reasoning/search, we weight all reachable paths by a single continuation parameter and sum them to obtain a coverage generating function; this induces a measure of reachability difficulty; and it provides a geometric interpretation of search on the graph induced by the safety envelope. We further provide the simplest testable inferences and validate them via two instantiation. This theory offers a workable language and operational tools to measure agents and their search spaces, proposing a systematic formal description of iterative search constructed by LLMs.
comment: 11 pages, 4 figures, 1 table
♻ ☆ Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps EMNLP 2025
When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. Despite much work on CoT prompting, it is unclear if reasoning verbalized in a CoT is faithful to the models' parametric beliefs. We introduce a framework for measuring parametric faithfulness of generated reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an instance of this framework. FUR erases information contained in reasoning steps from model parameters, and measures faithfulness as the resulting effect on the model's prediction. Our experiments with four LMs and five multi-hop multi-choice question answering (MCQA) datasets show that FUR is frequently able to precisely change the underlying models' prediction for a given instance by unlearning key steps, indicating when a CoT is parametrically faithful. Further analysis shows that CoTs generated by models post-unlearning support different answers, hinting at a deeper effect of unlearning.
comment: Accepted at EMNLP 2025. Under review for outstanding paper award
♻ ☆ Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models? NeurIPS 2025
Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets-BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
♻ ☆ XIFBench: Evaluating Large Language Models on Multilingual Instruction Following NeurIPS 2025
Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. However, their performance in multilingual settings lacks systematic investigation, with existing evaluations lacking fine-grained constraint analysis across diverse linguistic contexts. We introduce XIFBench, a comprehensive constraint-based benchmark for evaluating multilingual instruction-following abilities of LLMs, comprising 558 instructions with 0-5 additional constraints across five categories (Content, Style, Situation, Format, and Numerical) in six languages spanning different resource levels. To support reliable and consistent cross-lingual evaluation, we implement three methodological innovations: cultural accessibility annotation, constraint-level translation validation, and requirement-based evaluation using English requirements as semantic anchors across languages. Extensive experiments with various LLMs not only quantify performance disparities across resource levels but also provide detailed insights into how language resources, constraint categories, instruction complexity, and cultural specificity influence multilingual instruction-following. Our code and data are available at https://github.com/zhenyuli801/XIFBench.
comment: Accepted by the NeurIPS 2025 Datasets and Benchmarks Track
♻ ☆ Teaching According to Talents! Instruction Tuning LLMs with Competence-Aware Curriculum Learning EMNLP 2025
Efficient instruction tuning aims to enhance the ultimate performance of large language models (LLMs) trained on a given instruction dataset. Curriculum learning as a typical data organization strategy has shown preliminary effectiveness in instruction tuning. However, current curriculum tuning methods suffer from the curriculum rigidity, since they rely solely on static heuristic difficulty metrics. These methods fail to adapt to the evolving capabilities of models during training, resulting in a fixed and potentially sub-optimal learning trajectory. To address the issue, Competence-Aware Multi-Perspective cUrriculum inStruction tuning framework termed CAMPUS is proposed. CAMPUS offers several advantages: (1) Dynamic selection for sub-curriculum. (2) Competency-aware adjustment to the curriculum schedule. (3) Multiple difficulty-based scheduling. Extensive experiments prove the superior performance of CAMPUS, compared to other state-of-the-art baselines for efficient instruction tuning.
comment: EMNLP 2025 Findings
♻ ☆ UI-Evol: Automatic Knowledge Evolving for Computer Use Agents ICML 2025
External knowledge has played a crucial role in the recent development of computer use agents. We identify a critical knowledge-execution gap: retrieved knowledge often fails to translate into effective real-world task execution. Our analysis shows even 90% correct knowledge yields only 41% execution success rate. To bridge this gap, we propose UI-Evol, a plug-and-play module for autonomous GUI knowledge evolution. UI-Evol consists of two stages: a Retrace Stage that extracts faithful objective action sequences from actual agent-environment interactions, and a Critique Stage that refines existing knowledge by comparing these sequences against external references. We conduct comprehensive experiments on the OSWorld benchmark with the state-of-the-art Agent S2. Our results demonstrate that UI-Evol not only significantly boosts task performance but also addresses a previously overlooked issue of high behavioral standard deviation in computer use agents, leading to superior performance on computer use tasks and substantially improved agent reliability.
comment: Accepted to ICML 2025 Workshop on Computer Use Agents
♻ ☆ Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding NeurIPS 2025
Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery (DVD) agent to leverage an agentic search strategy over segmented video clips. Unlike previous video agents that rely on predefined workflows applied uniformly across different queries, our approach emphasizes the autonomous and adaptive nature of agents. By providing a set of search-centric tools on multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLM to plan on its current observation state, strategically selects tools to orchestrate adaptive workflow for different queries in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates our advantage. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of 74.2%, which substantially surpasses all prior works, and further improves to 76.0% with transcripts. The code has been released at https://github.com/microsoft/DeepVideoDiscovery.
comment: Accepted to NeurIPS 2025
♻ ☆ MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts
LLMs hold great promise for healthcare applications, but the rapid evolution of medical knowledge and errors in training data often cause them to generate outdated or inaccurate information, limiting their applicability in high-stakes clinical practice. Model editing has emerged as a potential remedy without full retraining. While parameter-based editing often compromises locality and is thus ill-suited for the medical domain, retrieval-based editing offers a more viable alternative. However, it still faces two critical challenges: (1) representation overlap within the medical knowledge space often causes inaccurate retrieval and reduces editing accuracy; (2) existing methods are restricted to single-sample edits, while batch-editing remains largely unexplored despite its importance for real-world medical applications. To address these challenges, we first construct MedVersa, an enhanced benchmark with broader coverage of medical subjects, designed to evaluate both single and batch edits under strict locality constraints. We then propose MedREK, a retrieval-based editing framework that integrates a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. Experimental results on various medical benchmarks demonstrate that our MedREK achieves superior performance across different core metrics and provides the first validated solution for batch-editing in medical LLMs. Our code and dataset are available at https://github.com/mylittleriver/MedREK.
comment: Preprint, work in progress
♻ ☆ Evaluating Perspectival Biases in Cross-Modal Retrieval
Multimodal retrieval systems are expected to operate in a semantic space, agnostic to the language or cultural origin of the query. In practice, however, retrieval outcomes systematically reflect perspectival biases: deviations shaped by linguistic prevalence and cultural associations. We study two such biases. First, prevalence bias refers to the tendency to favor entries from prevalent languages over semantically faithful entries in image-to-text retrieval. Second, association bias refers to the tendency to favor images culturally associated with the query over semantically correct ones in text-to-image retrieval. Results show that explicit alignment is a more effective strategy for mitigating prevalence bias. However, association bias remains a distinct and more challenging problem. These findings suggest that achieving truly equitable multimodal systems requires targeted strategies beyond simple data scaling and that bias arising from cultural association may be treated as a more challenging problem than one arising from linguistic prevalence.
♻ ☆ Enhancing Reasoning Abilities of Small LLMs with Cognitive Alignment
The reasoning capabilities of large reasoning models (LRMs), such as OpenAI's o1 and DeepSeek-R1, have seen substantial advancements through deep thinking. However, these enhancements come with significant resource demands, underscoring the need for training effective small reasoning models. A critical challenge is that small models possess different reasoning capacities and cognitive trajectories compared with their larger counterparts. Hence, directly distilling chain-of-thought (CoT) rationales from large LRMs to smaller ones can sometimes be ineffective and often requires a substantial amount of annotated data. In this paper, we first introduce a novel Critique-Rethink-Verify (CRV) system, designed for training smaller yet powerful LRMs. Our CRV system consists of multiple LLM agents, each specializing in unique tasks: (i) critiquing the CoT rationales according to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. Building on the CRV system, we further propose the Cognitive Preference Optimization (CogPO) algorithm to continuously enhance the reasoning abilities of smaller models by aligning their reasoning processes with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of our CRV+CogPO framework, which outperforms other methods by a large margin.
comment: emnlp 2025 main conference
♻ ☆ Advancing Expert Specialization for Better MoE
Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input. However, we observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training. To address this, we propose a simple yet effective solution that introduces two complementary objectives: (1) an orthogonality loss to encourage experts to process distinct types of tokens, and (2) a variance loss to encourage more discriminative routing decisions. Gradient-level analysis demonstrates that these objectives are compatible with the existing auxiliary loss and contribute to optimizing the training process. Experimental results over various model architectures and across multiple benchmarks show that our method significantly enhances expert specialization. Notably, our method improves classic MoE baselines with auxiliary loss by up to 23.79%, while also maintaining load balancing in downstream tasks, without any architectural modifications or additional components. We will release our code to contribute to the community.
comment: 33pages, 6figures(Accepted by Neurips 2025 Oral)
♻ ☆ Flight Delay Prediction via Cross-Modality Adaptation of Large Language Models and Aircraft Trajectory Representation
Flight delay prediction has become a key focus in air traffic management, as delays highlight inefficiencies that impact overall network performance. This paper presents a lightweight large language model-based multimodal flight delay prediction, formulated from the perspective of air traffic controllers monitoring aircraft delay after entering the terminal area. The approach integrates trajectory representations with textual aeronautical information, including flight information, weather reports, and aerodrome notices, by adapting trajectory data into the language modality to capture airspace conditions. The experiments show that the model consistently achieves sub-minute prediction error by effectively leveraging contextual information related to the sources of delay, fulfilling the operational standard for minute-level precision. The framework demonstrates that linguistic understanding, when combined with cross-modality adaptation of trajectory data, enhances delay prediction. Moreover, the approach shows practicality and potential scalability for real-world operations, supporting real-time updates that refine predictions upon receiving new operational information.
comment: Preprint submitted to Aerospace Science and Technology (Elsevier) for possible publication
♻ ☆ Scaling Latent Reasoning via Looped Language Models
Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models enjoy superior performance that match the results of up to 12B SOTA LLMs across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model is available here: http://ouro-llm.github.io.
♻ ☆ Trustworthy Medical Question Answering: An Evaluation-Centric Survey EMNLP 2025
Trustworthiness in healthcare question-answering (QA) systems is important for ensuring patient safety, clinical effectiveness, and user confidence. As large language models (LLMs) become increasingly integrated into medical settings, the reliability of their responses directly influences clinical decision-making and patient outcomes. However, achieving comprehensive trustworthiness in medical QA poses significant challenges due to the inherent complexity of healthcare data, the critical nature of clinical scenarios, and the multifaceted dimensions of trustworthy AI. In this survey, we systematically examine six key dimensions of trustworthiness in medical QA, i.e., Factuality, Robustness, Fairness, Safety, Explainability, and Calibration. We review how each dimension is evaluated in existing LLM-based medical QA systems. We compile and compare major benchmarks designed to assess these dimensions and analyze evaluation-guided techniques that drive model improvements, such as retrieval-augmented grounding, adversarial fine-tuning, and safety alignment. Finally, we identify open challenges-such as scalable expert evaluation, integrated multi-dimensional metrics, and real-world deployment studies-and propose future research directions to advance the safe, reliable, and transparent deployment of LLM-powered medical QA.
comment: accepted to EMNLP 2025
♻ ☆ Complex QA and language models hybrid architectures, Survey
This paper reviews the state-of-the-art of large language models (LLM) architectures and strategies for "complex" question-answering with a focus on hybrid architectures. LLM based chatbot services have allowed anyone to grasp the potential of LLM to solve many common problems, but soon discovered their limitations for complex questions. Addressing more specific, complex questions (e.g., "What is the best mix of power-generation methods to reduce climate change ?") often requires specialized architectures, domain knowledge, new skills, decomposition and multi-step resolution, deep reasoning, sensitive data protection, explainability, and human-in-the-loop processes. Therefore, we review: (1) necessary skills and tasks for handling complex questions and common LLM limits to overcome; (2) dataset, cost functions and evaluation metrics for measuring and improving (e.g. accuracy, explainability, fairness, robustness, groundedness, faithfulness, toxicity...); (3) family of solutions to overcome LLM limitations by (a) training and reinforcement (b) hybridization, (c) prompting, (d) agentic-architectures (agents, tools) and extended reasoning.
♻ ☆ Enhancing Time Awareness in Generative Recommendation EMNLP 2025
Generative recommendation has emerged as a promising paradigm that formulates the recommendations into a text-to-text generation task, harnessing the vast knowledge of large language models. However, existing studies focus on considering the sequential order of items and neglect to handle the temporal dynamics across items, which can imply evolving user preferences. To address this limitation, we propose a novel model, Generative Recommender Using Time awareness (GRUT), effectively capturing hidden user preferences via various temporal signals. We first introduce Time-aware Prompting, consisting of two key contexts. The user-level temporal context models personalized temporal patterns across timestamps and time intervals, while the item-level transition context provides transition patterns across users. We also devise Trend-aware Inference, a training-free method that enhances rankings by incorporating trend information about items with generation likelihood. Extensive experiments demonstrate that GRUT outperforms state-of-the-art models, with gains of up to 15.4% and 14.3% in Recall@5 and NDCG@5 across four benchmark datasets. The source code is available at https://github.com/skleee/GRUT.
comment: EMNLP 2025 (Findings)
♻ ☆ Mapping Overlaps in Benchmarks through Perplexity in the Wild
We develop signatures of capacity familiarity to characterize large language model (LLM) benchmarks and their meaningful overlaps. Benchmark signatures probe the capacity required for benchmark performance. We formally define them as a set of salient tokens drawn from in-the-wild, naturally authored corpora, where LLM token perplexity, reflecting more or less pre-training exposure, becomes highly predictive of LLM benchmark performance. Through a large-scale meta-evaluation, we extract benchmark signatures via stepwise forward selection with linear regressions across 32 LLMs and 88 benchmarks spanning diverse knowledge, coding, logic, instruction following, math, language, reasoning, and world modeling. Our analysis situates signatures in relation to both the semantic similarity of benchmark questions and the correlation of model performance. While performance overlaps are universally high and semantic overlaps remain confined to a narrow mid-range, benchmark signatures prove highly informative in capturing variation, overlap, and divergence. We observe overlap in knowledge and reasoning subtasks, whereas multilingual and cultural benchmarks exhibit less similarity, even compared to cross-task overlap. Notably, performance-level results are strongly influenced by benchmark-orthogonal factors such as question format, highlighting limitations in LLM generalization, the conflation of performance with ability, and issues inherent in current mainstream benchmark agreement studies. Benchmark signatures, however, remain robust to such effects. Ultimately, we identify cross-functional overlaps across logic, math, language, instruction following, and world modeling, with coding emerging as the least overlapping domain. Together, these findings provide mechanistic insights into benchmark validity and LLM sensitivities, and sketch the underlying landscape of interconnected LLM capabilities.
♻ ☆ MotionGPT3: Human Motion as a Second Modality
With the rapid progress of large language models (LLMs), multimodal frameworks that unify understanding and generation have become promising, yet they face increasing complexity as the number of modalities and tasks grows. We observe that motion quantization introduces approximation errors that cap motion quality, and that unifying discrete text and continuous motion within a single-stream backbone amplifies cross-modal interference. Motivated by recent multi-branch Transformer designs that separate signals from different modalities, we propose MotionGPT3, a bimodal motion-language model for both understanding and generation. MotionGPT3 encodes raw motion into a continuous latent space using a variational autoencoder (VAE), thereby avoiding quantization-induced artifacts, while leveraging the semantic prior of pretrained language models. A dual-stream Transformer with shared attention preserves modality-specific routes while enabling controlled, bidirectional information flow, which reduces interference, stabilizing optimization, and empirically accelerates convergence without degrading fidelity. For multimodal joint training, a generate-then-align three-stage schedule further improves stability and limits cross-task interference. Experiments show that MotionGPT3 achieves 2x faster convergence in training loss and up to 4x faster convergence in validation, while maintaining state-of-the-art performance on standard motion understanding and motion generation benchmarks.
comment: 26 pages, 11 figures
♻ ☆ Dynamic Topic Evolution with Temporal Decay and Attention in Large Language Models
This paper proposes a modeling framework for dynamic topic evolution based on temporal large language models. The method first uses a large language model to obtain contextual embeddings of text and then introduces a temporal decay function and an attention mechanism. These components allow the model to adjust the importance of semantic units according to time intervals and capture topic variations across different periods. The temporal representations are then mapped into a latent topic space, where a state transition matrix is applied to describe the dynamic evolution of topics. A joint optimization objective constrains both semantic modeling and temporal consistency, ensuring diversity and smoothness in topic generation. The design emphasizes the unified modeling of semantic representation and temporal evolution, which improves topic coherence and diversity while enhancing stability and interpretability over time. Experiments on real-world corpora show that the framework effectively captures the generation, expansion, and decline of topics and outperforms existing models across multiple metrics. Overall, the proposed method provides a systematic solution for understanding dynamic semantic patterns in large-scale text, enriches the research paradigm of topic modeling, and supports complex text analysis tasks in multiple domains.
♻ ☆ MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation EMNLP 2025
Multilingual speech translation (ST) and machine translation (MT) in the medical domain enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present the first systematic study on medical ST, to our best knowledge, by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, and Simplified/Traditional Chinese, together with the models. With 290,000 samples, this is the largest medical MT dataset and the largest many-to-many multilingual ST among all domains. Secondly, we present the most comprehensive ST analysis in the field's history, to our best knowledge, including: empirical baselines, bilingual-multilingual comparative study, end-to-end vs. cascaded comparative study, task-specific vs. multi-task sequence-to-sequence comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: https://github.com/leduckhai/MultiMed-ST
comment: EMNLP 2025
♻ ☆ SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains EMNLP 2025
We present SynthTextEval, a toolkit for conducting comprehensive evaluations of synthetic text. The fluency of large language model (LLM) outputs has made synthetic text potentially viable for numerous applications, such as reducing the risks of privacy violations in the development and deployment of AI systems in high-stakes domains. Realizing this potential, however, requires principled consistent evaluations of synthetic data across multiple dimensions: its utility in downstream systems, the fairness of these systems, the risk of privacy leakage, general distributional differences from the source text, and qualitative feedback from domain experts. SynthTextEval allows users to conduct evaluations along all of these dimensions over synthetic data that they upload or generate using the toolkit's generation module. While our toolkit can be run over any data, we highlight its functionality and effectiveness over datasets from two high-stakes domains: healthcare and law. By consolidating and standardizing evaluation metrics, we aim to improve the viability of synthetic text, and in-turn, privacy-preservation in AI development.
comment: EMNLP 2025 System Demonstration
♻ ☆ Training Large Language Models to Reason in a Continuous Latent Space
Large language models (LLMs) are typically constrained to reason in the language space, where they express the reasoning process through a chain-of-thought (CoT) to solve complex problems. However, the language space may not always be optimal for reasoning. Most word tokens primarily ensure textual coherence and are not essential for reasoning, while some critical tokens require complex planning and pose challenges to LLMs. To explore the potential of reasoning beyond language, we introduce a new paradigm called Coconut (Chain of Continuous Thought). Coconut utilizes the last hidden state of the LLM as a representation of the reasoning state, termed "continuous thought." Instead of decoding this state into words, we feed it back to the model as the next input embedding directly in the continuous space. This latent reasoning paradigm enables an advanced reasoning pattern, where continuous thoughts can encode multiple alternative next steps, allowing the model to perform a breadth-first search (BFS) rather than committing prematurely to a single deterministic path as in CoT. Coconut outperforms CoT on logical reasoning tasks that require substantial search during planning and achieves a better trade-off between accuracy and efficiency.
comment: Accepted to COLM 2025
♻ ☆ DiscoTrack: A Multilingual LLM Benchmark for Discourse Tracking
Recent LLM benchmarks have tested models on a range of phenomena, but are still focused primarily on natural language understanding for extraction of explicit information, such as QA or summarization, with responses often tar- geting information from individual sentences. We are still lacking more challenging, and im- portantly also multilingual, benchmarks focus- ing on implicit information and pragmatic infer- ences across larger documents in the context of discourse tracking: integrating and aggregating information across sentences, paragraphs and multiple speaker utterances. To this end, we present DiscoTrack, an LLM benchmark target- ing a range of tasks across 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations and bridging inference. Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.
♻ ☆ EmoDebt: Bayesian-Optimized Emotional Intelligence for Strategic Agent-to-Agent Debt Recovery
The emergence of autonomous Large Language Model (LLM) agents has created a new ecosystem of strategic, agent-to-agent interactions. However, a critical challenge remains unaddressed: in high-stakes, emotion-sensitive domains like debt collection, LLM agents pre-trained on human dialogue are vulnerable to exploitation by adversarial counterparts who simulate negative emotions to derail negotiations. To fill this gap, we first contribute a novel dataset of simulated debt recovery scenarios and a multi-agent simulation framework. Within this framework, we introduce EmoDebt, an LLM agent architected for robust performance. Its core innovation is a Bayesian-optimized emotional intelligence engine that reframes a model's ability to express emotion in negotiation as a sequential decision-making problem. Through online learning, this engine continuously tunes EmoDebt's emotional transition policies, discovering optimal counter-strategies against specific debtor tactics. Extensive experiments on our proposed benchmark demonstrate that EmoDebt achieves significant strategic robustness, substantially outperforming non-adaptive and emotion-agnostic baselines across key performance metrics, including success rate and operational efficiency. By introducing both a critical benchmark and a robustly adaptive agent, this work establishes a new foundation for deploying strategically robust LLM agents in adversarial, emotion-sensitive debt interactions. The code is available at \textcolor{blue}{https://github.com/Yunbo-max/EmoDebt}.
♻ ☆ Constraint Satisfaction Approaches to Wordle: Novel Heuristics and Cross-Lexicon Validation
Wordle presents an algorithmically rich testbed for constraint satisfaction problem (CSP) solving. While existing solvers rely on information-theoretic entropy maximization or frequency-based heuristics without formal constraint treatment, we present the first comprehensive CSP formulation of Wordle with novel constraint-aware solving strategies. We introduce CSP-Aware Entropy, computing information gain after constraint propagation rather than on raw candidate sets, and a Probabilistic CSP framework integrating Bayesian word-frequency priors with logical constraints. Through evaluation on 2,315 English words, CSP-Aware Entropy achieves 3.54 average guesses with 99.9% success rate, a statistically significant 1.7% improvement over Forward Checking (t=-4.82, p<0.001, Cohen's d=0.07) with 46% faster runtime (12.9ms versus 23.7ms per guess). Under 10% noise, CSP-aware approaches maintain 5.3 percentage point advantages (29.0% versus 23.7%, p=0.041), while Probabilistic CSP achieves 100% success across all noise levels (0-20%) through constraint recovery mechanisms. Cross-lexicon validation on 500 Spanish words demonstrates 88% success with zero language-specific tuning, validating that core CSP principles transfer across languages despite an 11.2 percentage point gap from linguistic differences (p<0.001, Fisher's exact test). Our open-source implementation with 34 unit tests achieving 91% code coverage provides reproducible infrastructure for CSP research. The combination of formal CSP treatment, constraint-aware heuristics, probabilistic-logical integration, robustness analysis, and cross-lexicon validation establishes new performance benchmarks demonstrating that principled constraint satisfaction techniques outperform classical information-theoretic and learning-based approaches for structured puzzle-solving domains.
comment: 35 pages, 14 figures, 10 tables. Open-source implementation with 91% test coverage available at https://github.com/jahidul-arafat/constraint_satisfaction_wordle_arxiv_preprint
♻ ☆ Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models
Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.
comment: Post-hoc attribution
♻ ☆ AWARE, Beyond Sentence Boundaries: A Contextual Transformer Framework for Identifying Cultural Capital in STEM Narratives
Identifying cultural capital (CC) themes in student reflections can offer valuable insights that help foster equitable learning environments in classrooms. However, themes such as aspirational goals or family support are often woven into narratives, rather than appearing as direct keywords. This makes them difficult to detect for standard NLP models that process sentences in isolation. The core challenge stems from a lack of awareness, as standard models are pre-trained on general corpora, leaving them blind to the domain-specific language and narrative context inherent to the data. To address this, we introduce AWARE, a framework that systematically attempts to improve a transformer model's awareness for this nuanced task. AWARE has three core components: 1) Domain Awareness, adapting the model's vocabulary to the linguistic style of student reflections; 2) Context Awareness, generating sentence embeddings that are aware of the full essay context; and 3) Class Overlap Awareness, employing a multi-label strategy to recognize the coexistence of themes in a single sentence. Our results show that by making the model explicitly aware of the properties of the input, AWARE outperforms a strong baseline by 2.1 percentage points in Macro-F1 and shows considerable improvements across all themes. This work provides a robust and generalizable methodology for any text classification task in which meaning depends on the context of the narrative.
comment: The authors are withdrawing this version to correct issues identified in the experimental design and analysis. A revised and validated version will be submitted after further review
♻ ☆ Training Language Models to Reason Efficiently NeurIPS 2025
Scaling model size and training data has led to great advances in the performance of Large Language Models (LLMs). However, the diminishing returns of this approach necessitate alternative methods to improve model capabilities, particularly in tasks requiring advanced reasoning. Large reasoning models, which leverage long chain-of-thoughts, bring unprecedented breakthroughs in problem-solving capabilities but at a substantial deployment cost associated to longer generations. Reducing inference costs is crucial for the economic feasibility, user experience, and environmental sustainability of these models. In this work, we propose to train large reasoning models to reason efficiently. More precisely, we use reinforcement learning (RL) to train reasoning models to dynamically allocate inference-time compute based on task complexity. Our method incentivizes models to minimize unnecessary computational overhead while maintaining accuracy, thereby achieving substantial efficiency gains. It enables the derivation of a family of reasoning models with varying efficiency levels, controlled via a single hyperparameter. Experiments on two open-weight large reasoning models demonstrate significant reductions in inference cost while preserving most of the accuracy.
comment: NeurIPS 2025
♻ ☆ ExpertLens: Activation steering features are highly interpretable
Activation steering methods in large language models (LLMs) have emerged as an effective way to perform targeted updates to enhance generated language without requiring large amounts of adaptation data. We ask whether the features discovered by activation steering methods are interpretable. We identify neurons responsible for specific concepts (e.g., ``cat'') using the ``finding experts'' method from research on activation steering and show that the ExpertLens, i.e., inspection of these neurons provides insights about model representation. We find that ExpertLens representations are stable across models and datasets and closely align with human representations inferred from behavioral data, matching inter-human alignment levels. ExpertLens significantly outperforms the alignment captured by word/sentence embeddings. By reconstructing human concept organization through ExpertLens, we show that it enables a granular view of LLM concept representation. Our findings suggest that ExpertLens is a flexible and lightweight approach for capturing and analyzing model representations.
♻ ☆ Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking NeurIPS 2025
Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57 %. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
comment: Accepted at NeurIPS 2025 Datasets and Benchmarks Track
♻ ☆ Can Large Language Models Analyze Graphs like Professionals? A Benchmark, Datasets and Models NeurIPS 2024
The need to analyze graphs is ubiquitous across various fields, from social networks to biological research and recommendation systems. Therefore, enabling the ability of large language models (LLMs) to process graphs is an important step toward more advanced general intelligence. However, current LLM benchmarks on graph analysis require models to directly reason over the prompts describing graph topology, and are thus limited to small graphs with only a few dozens of nodes. In contrast, human experts typically write programs based on popular libraries for task solving, and can thus handle graphs with different scales. To this end, a question naturally arises: can LLMs analyze graphs like professionals? In this paper, we introduce ProGraph, a manually crafted benchmark containing 3 categories of graph tasks. The benchmark expects solutions based on programming instead of directly reasoning over raw inputs. Our findings reveal that the performance of current LLMs is unsatisfactory, with the best model achieving only 36% accuracy. To bridge this gap, we propose LLM4Graph datasets, which include crawled documents and auto-generated codes based on 6 widely used graph libraries. By augmenting closed-source LLMs with document retrieval and fine-tuning open-source ones on the codes, we show 11-32% absolute improvements in their accuracies. Our results underscore that the capabilities of LLMs in handling structured data are still under-explored, and show the effectiveness of LLM4Graph in enhancing LLMs' proficiency of graph analysis. The benchmark, datasets and enhanced open-source models are available at https://github.com/BUPT-GAMMA/ProGraph.
comment: NeurIPS 2024
♻ ☆ GraphTeam: Facilitating Large Language Model-based Graph Analysis via Multi-Agent Collaboration
Graphs are widely used for modeling relational data in real-world scenarios, such as social networks and urban computing. Existing LLM-based graph analysis approaches either integrate graph neural networks (GNNs) for specific machine learning tasks, limiting their transferability, or rely solely on LLMs' internal reasoning ability, resulting in suboptimal performance. To address these limitations, we take advantage of recent advances in LLM-based agents, which have shown capabilities of utilizing external knowledge or tools for problem solving. By simulating human problem-solving strategies such as analogy and collaboration, we propose a multi-agent system based on LLMs named GraphTeam, for graph analysis. GraphTeam consists of five LLM-based agents from three modules, and the agents with different specialities can collaborate with each other to address complex problems. Specifically, (1) input-output normalization module: the question agent extracts and refines four key arguments from the original question, facilitating the problem understanding, and the answer agent organizes the results to meet the output requirement; (2) external knowledge retrieval module: we first build a knowledge base consisting of relevant documentation and experience information, and then the search agent retrieves the most relevant entries for each question. (3) problem-solving module: given the retrieved information from search agent, the coding agent uses established algorithms via programming to generate solutions, and in case the coding agent does not work, the reasoning agent will directly compute the results without programming. Extensive experiments on six graph analysis benchmarks demonstrate that GraphTeam achieves state-of-the-art performance with an average 25.85% improvement over the best baseline in terms of accuracy. The code and data are available at https://github.com/BUPT-GAMMA/GraphTeam.
Information Retrieval 21
☆ Enhancing Multimodal Recommendations with Vision-Language Models and Information-Aware Fusion
Recent advances in multimodal recommendation (MMR) have shown that incorporating rich content sources such as images and text can lead to significant gains representation quality. However, existing methods often rely on coarse visual features and uncontrolled fusion, leading to redundant or misaligned representations. As a result, visual encoders often fail to capture salient, item-relevant semantics, limiting their contribution in multimodal fusion. From an information-theoretic perspective, effective fusion should balance the unique, shared, and redundant information across modalities, preserving complementary cues while avoiding correlation bias. This paper presents VLIF, a vision-language and information-theoretic fusion framework that enhances multimodal recommendation through two key components. (i) A VLM-based visual enrichment module generates fine-grained, title-guided descriptions to transform product images into semantically aligned representations. (ii) An information-aware fusion module, inspired by Partial Information Decomposition (PID), disentangles redundant and synergistic signals across modalities for controlled integration. Experiments on three Amazon datasets demonstrate that VLIF consistently outperforms recent multimodal baselines and substantially strengthens the contribution of visual features.
☆ Solving cold start in news recommendations: a RippleNet-based system for large scale media outlet
We present a scalable recommender system implementation based on RippleNet, tailored for the media domain with a production deployment in Onet.pl, one of Poland's largest online media platforms. Our solution addresses the cold-start problem for newly published content by integrating content-based item embeddings into the knowledge propagation mechanism of RippleNet, enabling effective scoring of previously unseen items. The system architecture leverages Amazon SageMaker for distributed training and inference, and Apache Airflow for orchestrating data pipelines and model retraining workflows. To ensure high-quality training data, we constructed a comprehensive golden dataset consisting of user and item features and a separate interaction table, all enabling flexible extensions and integration of new signals.
☆ InteracSPARQL: An Interactive System for SPARQL Query Refinement Using Natural Language Explanations
In recent years, querying semantic web data using SPARQL has remained challenging, especially for non-expert users, due to the language's complex syntax and the prerequisite of understanding intricate data structures. To address these challenges, we propose InteracSPARQL, an interactive SPARQL query generation and refinement system that leverages natural language explanations (NLEs) to enhance user comprehension and facilitate iterative query refinement. InteracSPARQL integrates LLMs with a rule-based approach to first produce structured explanations directly from SPARQL abstract syntax trees (ASTs), followed by LLM-based linguistic refinements. Users can interactively refine queries through direct feedback or LLM-driven self-refinement, enabling the correction of ambiguous or incorrect query components in real time. We evaluate InteracSPARQL on standard benchmarks, demonstrating significant improvements in query accuracy, explanation clarity, and overall user satisfaction compared to baseline approaches. Our experiments further highlight the effectiveness of combining rule-based methods with LLM-driven refinements to create more accessible and robust SPARQL interfaces.
comment: Working paper
☆ Trove: A Flexible Toolkit for Dense Retrieval
We introduce Trove, an easy-to-use open-source retrieval toolkit that simplifies research experiments without sacrificing flexibility or speed. For the first time, we introduce efficient data management features that load and process (filter, select, transform, and combine) retrieval datasets on the fly, with just a few lines of code. This gives users the flexibility to easily experiment with different dataset configurations without the need to compute and store multiple copies of large datasets. Trove is highly customizable: in addition to many built-in options, it allows users to freely modify existing components or replace them entirely with user-defined objects. It also provides a low-code and unified pipeline for evaluation and hard negative mining, which supports multi-node execution without any code changes. Trove's data management features reduce memory consumption by a factor of 2.6. Moreover, Trove's easy-to-use inference pipeline incurs no overhead, and inference times decrease linearly with the number of available nodes. Most importantly, we demonstrate how Trove simplifies retrieval experiments and allows for arbitrary customizations, thus facilitating exploratory research.
☆ A Graph-based RAG for Energy Efficiency Question Answering
In this work, we investigate the use of Large Language Models (LLMs) within a graph-based Retrieval Augmented Generation (RAG) architecture for Energy Efficiency (EE) Question Answering. First, the system automatically extracts a Knowledge Graph (KG) from guidance and regulatory documents in the energy field. Then, the generated graph is navigated and reasoned upon to provide users with accurate answers in multiple languages. We implement a human-based validation using the RAGAs framework properties, a validation dataset comprising 101 question-answer pairs, and domain experts. Results confirm the potential of this architecture and identify its strengths and weaknesses. Validation results show how the system correctly answers in about three out of four of the cases (75.2 +- 2.7%), with higher results on questions related to more general EE answers (up to 81.0 +- 4.1%), and featuring promising multilingual abilities (4.4% accuracy loss due to translation).
☆ Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers
In the retrieval domain, candidates' fusion from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. While typical fusion techniques are training-free, they rely solely on rank or score signals, disregarding candidates' representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that re-thinks list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles to enable list-wise reasoning over video candidates. ViC is evaluated both as a single-list reranker, where it dramatically improves the precision of individual retrievers, and as an ensemble fuser, where it consistently outperforms strong baselines like CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the framework establishes new state-of-the-art zero-shot retrieval performance, demonstrating its effectiveness in handling complex visual and temporal signals alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1% (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, representing massive gains of up to +40 Recall@1 over previous state-of-the-art baselines. We present ViC as a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers. Code and resources are publicly available at: https://github.com/mohammad2012191/ViC
☆ Calculating Web Impact Factor for University Websites of Jammu and Kashmir: A Study
This paper examines and explores the web impact factor through a webometric study of the present 12 University Websites of Jammu and Kashmir. Identifies the domain systems of the websites; analyzes the number of web pages and link pages, and calculates the External Link WIF or simple web impact factor (WIF) and external web impact factor of all the University websites. Also reflects that some university websites have higher number of web pages, but correspondingly their link pages are very small in number and websites fall behind in their simple and external link web impact factor. It found that the Cluster University of Jammu ranked 1 (0.9018) in Internal Link WIF of Websites in Jammu and Kashmir. Shri Mata Vaishno Devi University ranked 1 (0.7249) in External Link Web Impact Factor.
comment: 11 pages, Research Paper
☆ Impact and Relevance of Cognition Journal in the Field of Cognitive Science: An Evaluation
This study aims to present a scientometric analysis of the journal titled Cognition for a period of 20 years from 1999 to 2018. The present study was conducted with an aim to provide a summary of research activity in current journal and characterize its most aspects. The research coverage includes the year wise distribution of articles, authors, institutions, countries and citation analysis of the journal. The analysis showed that 2870 papers were published in journal of Cognition from 1999 to 2018. The study identified top 20 prolific authors, institutions and countries of the journal. Researchers from USA have been made the most percentage of contributions.
comment: 8 pages, 4 figures, Research Paper. arXiv admin note: substantial text overlap with arXiv:2102.12912, arXiv:2102.09900, arXiv:2102.09894
☆ LiCoMemory: Lightweight and Cognitive Agentic Memory for Efficient Long-Term Reasoning
Large Language Model (LLM) agents exhibit remarkable conversational and reasoning capabilities but remain constrained by limited context windows and the lack of persistent memory. Recent efforts address these limitations via external memory architectures, often employing graph-based representations, yet most adopt flat, entangled structures that intertwine semantics with topology, leading to redundant representations, unstructured retrieval, and degraded efficiency and accuracy. To resolve these issues, we propose LiCoMemory, an end-to-end agentic memory framework for real-time updating and retrieval, which introduces CogniGraph, a lightweight hierarchical graph that utilizes entities and relations as semantic indexing layers, and employs temporal and hierarchy-aware search with integrated reranking for adaptive and coherent knowledge retrieval. Experiments on long-term dialogue benchmarks, LoCoMo and LongMemEval, show that LiCoMemory not only outperforms established baselines in temporal reasoning, multi-session consistency, and retrieval efficiency, but also notably reduces update latency. Our official code and data are available at https://github.com/EverM0re/LiCoMemory.
☆ A Soft-partitioned Semi-supervised Collaborative Transfer Learning Approach for Multi-Domain Recommendation CIKM'25
In industrial practice, Multi-domain Recommendation (MDR) plays a crucial role. Shared-specific architectures are widely used in industrial solutions to capture shared and unique attributes via shared and specific parameters. However, with imbalanced data across different domains, these models face two key issues: (1) Overwhelming: Dominant domain data skews model performance, neglecting non-dominant domains. (2) Overfitting: Sparse data in non-dominant domains leads to overfitting in specific parameters. To tackle these challenges, we propose Soft-partitioned Semi-supervised Collaborative Transfer Learning (SSCTL) for multi-domain recommendation. SSCTL generates dynamic parameters to address the overwhelming issue, thus shifting focus towards samples from non-dominant domains. To combat overfitting, it leverages pseudo-labels with weights from dominant domain instances to enhance non-dominant domain data. We conduct comprehensive experiments, both online and offline, to validate the efficacy of our proposed method. Online tests yielded significant improvements across various domains, with increases in GMV ranging from 0.54% to 2.90% and enhancements in CTR ranging from 0.22% to 1.69%.
comment: Accepted by CIKM'25
☆ RAGSmith: A Framework for Finding the Optimal Composition of Retrieval-Augmented Generation Methods Across Datasets
Retrieval-Augmented Generation (RAG) quality depends on many interacting choices across retrieval, ranking, augmentation, prompting, and generation, so optimizing modules in isolation is brittle. We introduce RAGSmith, a modular framework that treats RAG design as an end-to-end architecture search over nine technique families and 46{,}080 feasible pipeline configurations. A genetic search optimizes a scalar objective that jointly aggregates retrieval metrics (recall@k, mAP, nDCG, MRR) and generation metrics (LLM-Judge and semantic similarity). We evaluate on six Wikipedia-derived domains (Mathematics, Law, Finance, Medicine, Defense Industry, Computer Science), each with 100 questions spanning factual, interpretation, and long-answer types. RAGSmith finds configurations that consistently outperform naive RAG baseline by +3.8\% on average (range +1.2\% to +6.9\% across domains), with gains up to +12.5\% in retrieval and +7.5\% in generation. The search typically explores $\approx 0.2\%$ of the space ($\sim 100$ candidates) and discovers a robust backbone -- vector retrieval plus post-generation reflection/revision -- augmented by domain-dependent choices in expansion, reranking, augmentation, and prompt reordering; passage compression is never selected. Improvement magnitude correlates with question type, with larger gains on factual/long-answer mixes than interpretation-heavy sets. These results provide practical, domain-aware guidance for assembling effective RAG systems and demonstrate the utility of evolutionary search for full-pipeline optimization.
comment: 45 pages
☆ A semantic-based deep learning approach for mathematical expression retrieval
Mathematical expressions (MEs) have complex two-dimensional structures in which symbols can be present at any nested depth like superscripts, subscripts, above, below etc. As MEs are represented using LaTeX format, several text retrieval methods based on string matching, vector space models etc., have also been applied for ME retrieval problem in the literature. As these methods are based on syntactic similarity, recently deep learning approaches based on embedding have been used for semantic similarity. In our present work, we have focused on the retrieval of mathematical expressions using deep learning approaches. In our approach, semantic features are extracted from the MEs using a deep recurrent neural network (DRNN) and these features have been used for matching and retrieval. We have trained the network for a classification task which determines the complexity of an ME. ME complexity has been quantified in terms of its nested depth. Based on the nested depth, we have considered three complexity classes of MEs: Simple, Medium and Complex. After training the network, outputs just before the the final fully connected layer are extracted for all the MEs. These outputs form the semantic features of MEs and are stored in a database. For a given ME query, its semantic features are computed using the trained DRNN and matched against the semantic feature database. Matching is performed based on the standard euclidean distance and top 'k' nearest matches are retrieved, where 'k' is a user-defined parameter. Our approach has been illustrated on a database of 829 MEs.
☆ Rescuing the Unpoisoned: Efficient Defense against Knowledge Corruption Attacks on RAG Systems ACSA
Large language models (LLMs) are reshaping numerous facets of our daily lives, leading widespread adoption as web-based services. Despite their versatility, LLMs face notable challenges, such as generating hallucinated content and lacking access to up-to-date information. Lately, to address such limitations, Retrieval-Augmented Generation (RAG) has emerged as a promising direction by generating responses grounded in external knowledge sources. A typical RAG system consists of i) a retriever that probes a group of relevant passages from a knowledge base and ii) a generator that formulates a response based on the retrieved content. However, as with other AI systems, recent studies demonstrate the vulnerability of RAG, such as knowledge corruption attacks by injecting misleading information. In response, several defense strategies have been proposed, including having LLMs inspect the retrieved passages individually or fine-tuning robust retrievers. While effective, such approaches often come with substantial computational costs. In this work, we introduce RAGDefender, a resource-efficient defense mechanism against knowledge corruption (i.e., by data poisoning) attacks in practical RAG deployments. RAGDefender operates during the post-retrieval phase, leveraging lightweight machine learning techniques to detect and filter out adversarial content without requiring additional model training or inference. Our empirical evaluations show that RAGDefender consistently outperforms existing state-of-the-art defenses across multiple models and adversarial scenarios: e.g., RAGDefender reduces the attack success rate (ASR) against the Gemini model from 0.89 to as low as 0.02, compared to 0.69 for RobustRAG and 0.24 for Discern-and-Answer when adversarial passages outnumber legitimate ones by a factor of four (4x).
comment: 15 pages, 7 figures, 10 tables. To appear in the Proceedings of the 2025 Annual Computer Security Applications Conference (ACSAC)
☆ Contextual Relevance and Adaptive Sampling for LLM-Based Document Reranking
Reranking algorithms have made progress in improving document retrieval quality by efficiently aggregating relevance judgments generated by large language models (LLMs). However, identifying relevant documents for queries that require in-depth reasoning remains a major challenge. Reasoning-intensive queries often exhibit multifaceted information needs and nuanced interpretations, rendering document relevance inherently context dependent. To address this, we propose contextual relevance, which we define as the probability that a document is relevant to a given query, marginalized over the distribution of different reranking contexts it may appear in (i.e., the set of candidate documents it is ranked alongside and the order in which the documents are presented to a reranking model). While prior works have studied methods to mitigate the positional bias LLMs exhibit by accounting for the ordering of documents, we empirically find that the compositions of these batches also plays an important role in reranking performance. To efficiently estimate contextual relevance, we propose TS-SetRank, a sampling-based, uncertainty-aware reranking algorithm. Empirically, TS-SetRank improves nDCG@10 over retrieval and reranking baselines by 15-25% on BRIGHT and 6-21% on BEIR, highlighting the importance of modeling relevance as context-dependent.
♻ ☆ Dynamic Forgetting and Spatio-Temporal Periodic Interest Modeling for Local-Life Service Recommendation
In the context of the booming digital economy, recommendation systems, as a key link connecting users and numerous services, face challenges in modeling user behavior sequences on local-life service platforms, including the sparsity of long sequences and strong spatio-temporal dependence. Such challenges can be addressed by drawing an analogy to the forgetting process in human memory. This is because users' responses to recommended content follow the recency effect and the cyclicality of memory. By exploring this, this paper introduces the forgetting curve and proposes Spatio-Temporal periodic Interest Modeling (STIM) with long sequences for local-life service recommendation. STIM integrates three key components: a dynamic masking module based on the forgetting curve, which is used to extract both recent spatiotemporal features and periodic spatiotemporal features; a query-based mixture of experts (MoE) approach that can adaptively activate expert networks under different dynamic masks, enabling the collaborative modeling of time, location, and items; and a hierarchical multi-interest network unit, which captures multi-interest representations by modeling the hierarchical interactions between the shallow and deep semantics of users' recent behaviors. By introducing the STIM method, we conducted online A/B tests and achieved a 1.54\% improvement in gross transaction volume (GTV). In addition, extended offline experiments also showed improvements. STIM has been deployed in a large-scale local-life service recommendation system, serving hundreds of millions of daily active users in core application scenarios.
♻ ☆ Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it particularly well-for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
♻ ☆ Memory Assisted LLM for Personalized Recommendation System
Large language models (LLMs) have demonstrated significant potential in solving recommendation tasks. With proven capabilities in understanding user preferences, LLM personalization has emerged as a critical area for providing tailored responses to individuals. Current studies explore personalization through prompt design and fine-tuning, paving the way for further research in personalized LLMs. However, existing approaches are either costly and inefficient in capturing diverse user preferences or fail to account for timely updates to user history. To address these gaps, we propose the Memory-Assisted Personalized LLM (MAP). Through user interactions, we first create a history profile for each user, capturing their preferences, such as ratings for historical items. During recommendation, we extract relevant memory based on similarity, which is then incorporated into the prompts to enhance personalized recommendations. In our experiments, we define a new task that enables testing with varying memory size under two scenarios: single domain where memory and tasks are from the same category and cross-domain (e.g. memory from movies and recommendation tasks in books). The results show that MAP outperforms regular LLM-based recommenders that integrate user history directly through prompt design. Moreover, as user history grows, MAP's advantage increases in both scenarios, making it more suitable for addressing successive personalized user requests.
comment: 8 pages, 7 figures
♻ ☆ Evaluating Perspectival Biases in Cross-Modal Retrieval
Multimodal retrieval systems are expected to operate in a semantic space, agnostic to the language or cultural origin of the query. In practice, however, retrieval outcomes systematically reflect perspectival biases: deviations shaped by linguistic prevalence and cultural associations. We study two such biases. First, prevalence bias refers to the tendency to favor entries from prevalent languages over semantically faithful entries in image-to-text retrieval. Second, association bias refers to the tendency to favor images culturally associated with the query over semantically correct ones in text-to-image retrieval. Results show that explicit alignment is a more effective strategy for mitigating prevalence bias. However, association bias remains a distinct and more challenging problem. These findings suggest that achieving truly equitable multimodal systems requires targeted strategies beyond simple data scaling and that bias arising from cultural association may be treated as a more challenging problem than one arising from linguistic prevalence.
♻ ☆ Complex QA and language models hybrid architectures, Survey
This paper reviews the state-of-the-art of large language models (LLM) architectures and strategies for "complex" question-answering with a focus on hybrid architectures. LLM based chatbot services have allowed anyone to grasp the potential of LLM to solve many common problems, but soon discovered their limitations for complex questions. Addressing more specific, complex questions (e.g., "What is the best mix of power-generation methods to reduce climate change ?") often requires specialized architectures, domain knowledge, new skills, decomposition and multi-step resolution, deep reasoning, sensitive data protection, explainability, and human-in-the-loop processes. Therefore, we review: (1) necessary skills and tasks for handling complex questions and common LLM limits to overcome; (2) dataset, cost functions and evaluation metrics for measuring and improving (e.g. accuracy, explainability, fairness, robustness, groundedness, faithfulness, toxicity...); (3) family of solutions to overcome LLM limitations by (a) training and reinforcement (b) hybridization, (c) prompting, (d) agentic-architectures (agents, tools) and extended reasoning.
♻ ☆ Enhancing Time Awareness in Generative Recommendation EMNLP 2025
Generative recommendation has emerged as a promising paradigm that formulates the recommendations into a text-to-text generation task, harnessing the vast knowledge of large language models. However, existing studies focus on considering the sequential order of items and neglect to handle the temporal dynamics across items, which can imply evolving user preferences. To address this limitation, we propose a novel model, Generative Recommender Using Time awareness (GRUT), effectively capturing hidden user preferences via various temporal signals. We first introduce Time-aware Prompting, consisting of two key contexts. The user-level temporal context models personalized temporal patterns across timestamps and time intervals, while the item-level transition context provides transition patterns across users. We also devise Trend-aware Inference, a training-free method that enhances rankings by incorporating trend information about items with generation likelihood. Extensive experiments demonstrate that GRUT outperforms state-of-the-art models, with gains of up to 15.4% and 14.3% in Recall@5 and NDCG@5 across four benchmark datasets. The source code is available at https://github.com/skleee/GRUT.
comment: EMNLP 2025 (Findings)
♻ ☆ MLLM-Driven Semantic Identifier Generation for Generative Cross-Modal Retrieval
Generative cross-modal retrieval, which treats retrieval as a generation task, has emerged as a promising direction with the rise of Multimodal Large Language Models (MLLMs). In this setting, the model responds to a text query by generating an identifier corresponding to the target image. However, existing methods typically rely on manually crafted string IDs, clustering-based labels, or atomic identifiers requiring vocabulary expansion, all of which face challenges in semantic alignment or scalability.To address these limitations, we propose a vocabulary-efficient identifier generation framework that prompts MLLMs to generate Structured Semantic Identifiers from image-caption pairs. These identifiers are composed of concept-level tokens such as objects and actions, naturally aligning with the model's generation space without modifying the tokenizer. Additionally, we introduce a Rationale-Guided Supervision Strategy, prompting the model to produce a one-sentence explanation alongside each identifier serves as an auxiliary supervision signal that improves semantic grounding and reduces hallucinations during training.
comment: We plan to revise the methodology and update the experimental analysis before resubmission
Multimedia 8
☆ How Far Are Surgeons from Surgical World Models? A Pilot Study on Zero-shot Surgical Video Generation with Expert Assessment
Foundation models in video generation are demonstrating remarkable capabilities as potential world models for simulating the physical world. However, their application in high-stakes domains like surgery, which demand deep, specialized causal knowledge rather than general physical rules, remains a critical unexplored gap. To systematically address this challenge, we present SurgVeo, the first expert-curated benchmark for video generation model evaluation in surgery, and the Surgical Plausibility Pyramid (SPP), a novel, four-tiered framework tailored to assess model outputs from basic appearance to complex surgical strategy. On the basis of the SurgVeo benchmark, we task the advanced Veo-3 model with a zero-shot prediction task on surgical clips from laparoscopic and neurosurgical procedures. A panel of four board-certified surgeons evaluates the generated videos according to the SPP. Our results reveal a distinct "plausibility gap": while Veo-3 achieves exceptional Visual Perceptual Plausibility, it fails critically at higher levels of the SPP, including Instrument Operation Plausibility, Environment Feedback Plausibility, and Surgical Intent Plausibility. This work provides the first quantitative evidence of the chasm between visually convincing mimicry and causal understanding in surgical AI. Our findings from SurgVeo and the SPP establish a crucial foundation and roadmap for developing future models capable of navigating the complexities of specialized, real-world healthcare domains.
☆ EV-NVC: Efficient Variable bitrate Neural Video Compression
Training neural video codec (NVC) with variable rate is a highly challenging task due to its complex training strategies and model structure. In this paper, we train an efficient variable bitrate neural video codec (EV-NVC) with the piecewise linear sampler (PLS) to improve the rate-distortion performance in high bitrate range, and the long-short-term feature fusion module (LSTFFM) to enhance the context modeling. Besides, we introduce mixed-precision training and discuss the different training strategies for each stage in detail to fully evaluate its effectiveness. Experimental results show that our approach reduces the BD-rate by 30.56% compared to HM-16.25 within low-delay mode.
☆ SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment
Fine-grained cross-modal alignment aims to establish precise local correspondences between vision and language, forming a cornerstone for visual question answering and related multimodal applications. Current approaches face challenges in addressing patch redundancy and ambiguity, which arise from the inherent information density disparities across modalities. Recently, Multimodal Large Language Models (MLLMs) have emerged as promising solutions to bridge this gap through their robust semantic generation capabilities. However, the dense textual outputs from MLLMs may introduce conflicts with the original sparse captions. Furthermore, accurately quantifying semantic relevance between rich visual patches and concise textual descriptions remains a core challenge. To overcome these limitations, we introduce the Semantic-Enhanced Patch Slimming (SEPS) framework, which systematically addresses patch redundancy and ambiguity. Our approach employs a two-stage mechanism to integrate unified semantics from both dense and sparse texts, enabling the identification of salient visual patches. Additionally, it leverages relevance-aware selection with mean value computation to highlight crucial patch-word correspondences, thereby improving cross-modal similarity assessment. Comprehensive experiments on Flickr30K and MS-COCO datasets validate that SEPS achieves superior performance, surpassing existing approaches by 23\%-86\% in rSum across diverse model architectures, with notable enhancements in text-to-image retrieval scenarios. Our implementation is available at https://github.com/Sweet4tars/seps.git.
♻ ☆ Sound Clouds: Exploring ambient intelligence in public spaces to elicit deep human experience of awe, wonder, and beauty NeurIPS
While the ambient intelligence (AmI) systems we encounter in our daily lives, including security monitoring and energy-saving systems, typically serve pragmatic purposes, we wonder how we can design and implement ambient artificial intelligence experiences in public spaces that elicit deep human feelings of awe, wonder, and beauty. As a manifestation, we introduce Sound Clouds, an immersive art installation that generates live music based on participants' interaction with several human-height spheres. Our installation serves as a provocation into future ambient intelligence that provokes, not limits, the future possibilities of AmI.
comment: 4 pages, Artwork accepted by NeurIPS Creative AI Track 2025
♻ ☆ Class Agnostic Instance-level Descriptor for Visual Instance Search
Despite the great success of the deep features in content-based image retrieval, the visual instance search remains challenging due to the lack of effective instance-level feature representation. Supervised or weakly supervised object detection methods are not the appropriate solutions due to their poor performance on the unknown object categories. In this paper, based on the feature set output from self-supervised ViT, the instance-level region discovery is modeled as detecting the compact feature subsets in a hierarchical fashion. The hierarchical decomposition results in a hierarchy of instance regions. On the one hand, this kind of hierarchical decomposition well addresses the problem of object embedding and occlusions, which are widely observed in real scenarios. On the other hand, the non-leaf nodes and leaf nodes on the hierarchy correspond to the instance regions in different granularities within an image. Therefore, features in uniform length are produced for these instance regions, which may cover across a dominant image region, an integral of multiple instances, or various individual instances. Such a collection of features allows us to unify the image retrieval, multi-instance search, and instance search into one framework. The empirical studies on three benchmarks show that such an instance-level descriptor remains effective on both the known and unknown object categories. Moreover, the superior performance is achieved on single-instance and multi-instance search, as well as image retrieval tasks.
♻ ☆ EgoBlind: Towards Egocentric Visual Assistance for the Blind NeurIPS'25
We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,392 first-person videos from the daily lives of blind and visually impaired individuals. It also features 5,311 questions directly posed or verified by the blind to reflect their in-situation needs for visual assistance. Each question has an average of 3 manually annotated reference answers to reduce subjectiveness. Using EgoBlind, we comprehensively evaluate 16 advanced MLLMs and find that all models struggle. The best performers achieve an accuracy near 60\%, which is far behind human performance of 87.4\%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and explore heuristic solutions for improvement. With these efforts, we hope that EgoBlind will serve as a foundation for developing effective AI assistants to enhance the independence of the blind and visually impaired. Data and code are available at https://github.com/doc-doc/EgoBlind.
comment: NeurIPS'25 (D&B Track)
♻ ☆ SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding
Recent advancements in image editing have utilized large-scale multimodal models to enable intuitive, natural instruction-driven interactions. However, conventional methods still face significant challenges, particularly in spatial reasoning, precise region segmentation, and maintaining semantic consistency, especially in complex scenes. To overcome these challenges, we introduce SmartFreeEdit, a novel end-to-end framework that integrates a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture, enabling precise, mask-free image editing guided exclusively by natural language instructions. The key innovations of SmartFreeEdit include:(1)the introduction of region aware tokens and a mask embedding paradigm that enhance the spatial understanding of complex scenes;(2) a reasoning segmentation pipeline designed to optimize the generation of editing masks based on natural language instructions;and (3) a hypergraph-augmented inpainting module that ensures the preservation of both structural integrity and semantic coherence during complex edits, overcoming the limitations of local-based image generation. Extensive experiments on the Reason-Edit benchmark demonstrate that SmartFreeEdit surpasses current state-of-the-art methods across multiple evaluation metrics, including segmentation accuracy, instruction adherence, and visual quality preservation, while addressing the issue of local information focus and improving global consistency in the edited image. Our project will be available at https://github.com/smileformylove/SmartFreeEdit.
♻ ☆ Prevailing Research Areas for Music AI in the Era of Foundation Models
Parallel to rapid advancements in foundation model research, the past few years have witnessed a surge in music AI applications. As AI-generated and AI-augmented music become increasingly mainstream, many researchers in the music AI community may wonder: what research frontiers remain unexplored? This paper outlines several key areas within music AI research that present significant opportunities for further investigation. We begin by examining foundational representation models and highlight emerging efforts toward explainability and interpretability. We then discuss the evolution toward multimodal systems, provide an overview of the current landscape of music datasets and their limitations, and address the growing importance of model efficiency in both training and deployment. Next, we explore applied directions, focusing first on generative models. We review recent systems, their computational constraints, and persistent challenges related to evaluation and controllability. We then examine extensions of these generative approaches to multimodal settings and their integration into artists' workflows, including applications in music editing, captioning, production, transcription, source separation, performance, discovery, and education. Finally, we explore copyright implications of generative music and propose strategies to safeguard artist rights. While not exhaustive, this survey aims to illuminate promising research directions enabled by recent developments in music foundation models.
Image and Video Processing 11
☆ Opto-Electronic Convolutional Neural Network Design Via Direct Kernel Optimization
Opto-electronic neural networks integrate optical front-ends with electronic back-ends to enable fast and energy-efficient vision. However, conventional end-to-end optimization of both the optical and electronic modules is limited by costly simulations and large parameter spaces. We introduce a two-stage strategy for designing opto-electronic convolutional neural networks (CNNs): first, train a standard electronic CNN, then realize the optical front-end implemented as a metasurface array through direct kernel optimization of its first convolutional layer. This approach reduces computational and memory demands by hundreds of times and improves training stability compared to end-to-end optimization. On monocular depth estimation, the proposed two-stage design achieves twice the accuracy of end-to-end training under the same training time and resource constraints.
☆ Reliability Assessment Framework Based on Feature Separability for Pathological Cell Image Classification under Prior Bias
\textbf{Background and objective:} Prior probability shift between training and deployment datasets challenges deep learning--based medical image classification. Standard correction methods reweight posterior probabilities to adjust prior bias, yet their benefit is inconsistent. We developed a reliability framework identifying when prior correction helps or harms performance in pathological cell image analysis. \textbf{Methods:} We analyzed 303 colorectal cancer specimens with CD103/CD8 immunostaining, yielding 185{,}432 annotated cell images across 16 cell types. ResNet models were trained under varying bias ratios (1.1--20$\times$). Feature separability was quantified using cosine similarity--based likelihood quality scores, reflecting intra- versus inter-class distinctions in learned feature spaces. Multiple linear regression, ANOVA, and generalized additive models (GAMs) evaluated associations among feature separability, prior bias, sample adequacy, and F1 performance. \textbf{Results:} Feature separability dominated performance ($\beta = 1.650$, $p < 0.001$), showing 412-fold stronger impact than prior bias ($\beta = 0.004$, $p = 0.018$). GAM analysis showed strong predictive power ($R^2 = 0.876$) with mostly linear trends. A quality threshold of 0.294 effectively identified cases requiring correction (AUC = 0.610). Cell types scoring $>0.5$ were robust without correction, whereas those $<0.3$ consistently required adjustment. \textbf{Conclusion:} Feature extraction quality, not bias magnitude, governs correction benefit. The proposed framework provides quantitative guidance for selective correction, enabling efficient deployment and reliable diagnostic AI.
☆ Learned Adaptive Kernels for High-Fidelity Image Downscaling
Image downscaling is a fundamental operation in image processing, crucial for adapting high-resolution content to various display and storage constraints. While classic methods often introduce blurring or aliasing, recent learning-based approaches offer improved adaptivity. However, achieving maximal fidelity against ground-truth low-resolution (LR) images, particularly by accounting for channel-specific characteristics, remains an open challenge. This paper introduces ADK-Net (Adaptive Downscaling Kernel Network), a novel deep convolutional neural network framework for high-fidelity supervised image downscaling. ADK-Net explicitly addresses channel interdependencies by learning to predict spatially-varying, adaptive resampling kernels independently for each pixel and uniquely for each color channel (RGB). The architecture employs a hierarchical design featuring a ResNet-based feature extractor and parallel channel-specific kernel generators, themselves composed of ResNet-based trunk and branch sub-modules, enabling fine-grained kernel prediction. Trained end-to-end using an L1 reconstruction loss against ground-truth LR data, ADK-Net effectively learns the target downscaling transformation. Extensive quantitative and qualitative experiments on standard benchmarks, including the RealSR dataset, demonstrate that ADK-Net establishes a new state-of-the-art in supervised image downscaling, yielding significant improvements in PSNR and SSIM metrics compared to existing learning-based and traditional methods.
comment: 10 pages, 6 figures, and 3 tables
☆ Few-Shot Multimodal Medical Imaging: A Theoretical Framework
Medical imaging relies heavily on large, labeled datasets. But, unfortunately, they are not always easily accessible in clinical settings. Additionally, many practitioners often face various structural obstacles like limited data availability, fragmented data systems, and unbalanced datasets. These barriers often lead to the increased diagnostic uncertainty, underrepresentation of certain conditions, reduced model robustness, and biased diagnostic decisions. In response to these challenges, approaches such as transfer learning, meta-learning, and multimodal fusion have made great strides. However, they still need a solid theoretical justification for why they succeed or fail in situations where data is scarce. To address this gap, we propose a unified theoretical framework that characterizes learning and inference under low-resource medical imaging conditions. We first formalize the learning objective under few-shot conditions and compute sample complexity constraints to estimate the smallest quantity of data needed to achieve clinically reliable accuracy. Then based on ideas from PAC-learning and PAC-Bayesian theory, we explain how multimodal integration encourages generalization and quantifies uncertainty under sparse supervision. We further propose a formal metric for explanation stability, offering interpretability guarantees under low-data conditions. Taken together, the proposed framework establishes a principled foundation for constructing dependable, data-efficient diagnostic systems by jointly characterizing sample efficiency, uncertainty quantification, and interpretability in a unified theoretical setting.
comment: 6 Pages
♻ ☆ Anatomically Guided Motion Correction for Placental IVIM Parameter Estimation with Accelerated Sampling Method
Intravoxel incoherent motion (IVIM) is a diffusion-weighted magnetic resonance imaging (MRI) method that may be applied to the placenta to help diagnose abnormal pregnancies. IVIM requires prolonged scan times, followed by a model-based estimation procedure. Maternal or fetal motion during the scan affects the accuracy of this estimation. In this work, we proposed to address this challenging motion correction and data fitting problem by using additional anatomical information that is routinely collected at the beginning of the examination. Super-resolution reconstruction (SRR) was applied to these anatomical data, to provide a patient-specific, 3D isotropic, anatomic reference. Our first contribution is a novel framework with a two-step motion correction that uses both IVIM and the SRR anatomic data, accounting for both intra- and inter-scan, non-rigid motion. Our second contribution is an automation and acceleration of the IVIM data fitting, using a state-of-the-art Bayesian-type algorithm, modified with a preconditioned Crank-Nicholson (pCN) sampling strategy. The accuracy of the IVIM parameter fitting was improved by the proposed motion correction strategy, as assessed by the mean absolute fitting error in the region of interest, which was 4.14 before and 3.02 after correction (arbitrary units of signal intensity). The novel sampling strategy accelerated parameter estimation by 39% in average, with the same accuracy as that of the conventional Bayesian approach. In conclusion, the proposed method may be applied to obtain fast and reliable IVIM parameter estimates in challenging scenarios such as prenatal MRI.
comment: 11 pages, 6 figures
♻ ☆ LanPaint: Training-Free Diffusion Inpainting with Asymptotically Exact and Fast Conditional Sampling
Diffusion models excel at joint pixel sampling for image generation but lack efficient training-free methods for partial conditional sampling (e.g., inpainting with known pixels). Prior work typically formulates this as an intractable inverse problem, relying on coarse variational approximations, heuristic losses requiring expensive backpropagation, or slow stochastic sampling. These limitations preclude: (1) accurate distributional matching in inpainting results, (2) efficient inference modes without gradient, (3) compatibility with fast ODE-based samplers. To address these limitations, we propose LanPaint: a training-free, asymptotically exact partial conditional sampling methods for ODE-based and rectified flow diffusion models. By leveraging carefully designed Langevin dynamics, LanPaint enables fast, backpropagation-free Monte Carlo sampling. Experiments demonstrate that our approach achieves superior performance with precise partial conditioning and visually coherent inpainting across diverse tasks.
♻ ☆ SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning NeurIPS 2025
Deciphering how visual stimuli are transformed into cortical responses is a fundamental challenge in computational neuroscience. This visual-to-neural mapping is inherently a one-to-many relationship, as identical visual inputs reliably evoke variable hemodynamic responses across trials, contexts, and subjects. However, existing deterministic methods struggle to simultaneously model this biological variability while capturing the underlying functional consistency that encodes stimulus information. To address these limitations, we propose SynBrain, a generative framework that simulates the transformation from visual semantics to neural responses in a probabilistic and biologically interpretable manner. SynBrain introduces two key components: (i) BrainVAE models neural representations as continuous probability distributions via probabilistic learning while maintaining functional consistency through visual semantic constraints; (ii) A Semantic-to-Neural Mapper acts as a semantic transmission pathway, projecting visual semantics into the neural response manifold to facilitate high-fidelity fMRI synthesis. Experimental results demonstrate that SynBrain surpasses state-of-the-art methods in subject-specific visual-to-fMRI encoding performance. Furthermore, SynBrain adapts efficiently to new subjects with few-shot data and synthesizes high-quality fMRI signals that are effective in improving data-limited fMRI-to-image decoding performance. Beyond that, SynBrain reveals functional consistency across trials and subjects, with synthesized signals capturing interpretable patterns shaped by biological neural variability. Our code is available at https://github.com/MichaelMaiii/SynBrain.
comment: Accepted by NeurIPS 2025
♻ ☆ DeepHQ: Learned Hierarchical Quantizer for Progressive Deep Image Coding
Unlike fixed- or variable-rate image coding, progressive image coding (PIC) aims to compress various qualities of images into a single bitstream, increasing the versatility of bitstream utilization and providing high compression efficiency compared to simulcast compression. Research on neural network (NN)-based PIC is in its early stages, mainly focusing on applying varying quantization step sizes to the transformed latent representations in a hierarchical manner. These approaches are designed to compress only the progressively added information as the quality improves, considering that a wider quantization interval for lower-quality compression includes multiple narrower sub-intervals for higher-quality compression. However, the existing methods are based on handcrafted quantization hierarchies, resulting in sub-optimal compression efficiency. In this paper, we propose an NN-based progressive coding method that firstly utilizes learned quantization step sizes via learning for each quantization layer. We also incorporate selective compression with which only the essential representation components are compressed for each quantization layer. We demonstrate that our method achieves significantly higher coding efficiency than the existing approaches with decreased decoding time and reduced model size. The source code is publicly available at https://github.com/JooyoungLeeETRI/DeepHQ
comment: Accepted to ACM TOMM (2025)
♻ ☆ Accelerating Volumetric Medical Image Annotation via Short-Long Memory SAM 2
Manual annotation of volumetric medical images, such as magnetic resonance imaging (MRI) and computed tomography (CT), is a labor-intensive and time-consuming process. Recent advancements in foundation models for video object segmentation, such as Segment Anything Model 2 (SAM 2), offer a potential opportunity to significantly speed up the annotation process by manually annotating one or a few slices and then propagating target masks across the entire volume. However, the performance of SAM 2 in this context varies. Our experiments show that relying on a single memory bank and attention module is prone to error propagation, particularly at boundary regions where the target is present in the previous slice but absent in the current one. To address this problem, we propose Short-Long Memory SAM 2 (SLM-SAM 2), a novel architecture that integrates distinct short-term and long-term memory banks with separate attention modules to improve segmentation accuracy. We evaluate SLM-SAM 2 on four public datasets covering organs, bones, and muscles across MRI, CT, and ultrasound videos. We show that the proposed method markedly outperforms the default SAM 2, achieving an average Dice Similarity Coefficient improvement of 0.14 and 0.10 in the scenarios when 5 volumes and 1 volume are available for the initial adaptation, respectively. SLM-SAM 2 also exhibits stronger resistance to over-propagation, reducing the time required to correct propagated masks by 60.575% per volume compared to SAM 2, making a notable step toward more accurate automated annotation of medical images for segmentation model development.
comment: Accepted for publication in IEEE Transactions on Medical Imaging (IEEE TMI)
♻ ☆ DMVFC: Deep Learning Based Functionally Consistent Tractography Fiber Clustering Using Multimodal Diffusion MRI and Functional MRI
Tractography fiber clustering using diffusion MRI (dMRI) is a crucial method for white matter (WM) parcellation to enable analysis of brains structural connectivity in health and disease. Current fiber clustering strategies primarily use the fiber geometric characteristics (i.e., the spatial trajectories) to group similar fibers into clusters, while neglecting the functional and microstructural information of the fiber tracts. There is increasing evidence that neural activity in the WM can be measured using functional MRI (fMRI), providing potentially valuable multimodal information for fiber clustering to enhance its functional coherence. Furthermore, microstructural features such as fractional anisotropy (FA) can be computed from dMRI as additional information to ensure the anatomical coherence of the clusters. In this paper, we develop a novel deep learning fiber clustering framework, namely Deep Multi-view Fiber Clustering (DMVFC), which uses joint multi-modal dMRI and fMRI data to enable functionally consistent WM parcellation. DMVFC can effectively integrate the geometric and microstructural characteristics of the WM fibers with the fMRI BOLD signals along the fiber tracts. DMVFC includes two major components: (1) a multi-view pretraining module to compute embedding features from each source of information separately, including fiber geometry, microstructure measures, and functional signals, and (2) a collaborative fine-tuning module to simultaneously refine the differences of embeddings. In the experiments, we compare DMVFC with two state-of-the-art fiber clustering methods and demonstrate superior performance in achieving functionally meaningful and consistent WM parcellation results.
comment: 14 pages
♻ ☆ Real World Federated Learning with a Knowledge Distilled Transformer for Cardiac CT Imaging
Federated learning is a renowned technique for utilizing decentralized data while preserving privacy. However, real-world applications often face challenges like partially labeled datasets, where only a few locations have certain expert annotations, leaving large portions of unlabeled data unused. Leveraging these could enhance transformer architectures ability in regimes with small and diversely annotated sets. We conduct the largest federated cardiac CT analysis to date (n=8,104) in a real-world setting across eight hospitals. Our two-step semi-supervised strategy distills knowledge from task-specific CNNs into a transformer. First, CNNs predict on unlabeled data per label type and then the transformer learns from these predictions with label-specific heads. This improves predictive accuracy and enables simultaneous learning of all partial labels across the federation, and outperforms UNet-based models in generalizability on downstream tasks. Code and model weights are made openly available for leveraging future cardiac CT analysis.
Computer Vision and Pattern Recognition 37
♻ ☆ CanadaFireSat: Toward high-resolution wildfire forecasting with multiple modalities
Canada experienced in 2023 one of the most severe wildfire seasons in recent history, causing damage across ecosystems, destroying communities, and emitting large quantities of CO2. This extreme wildfire season is symptomatic of a climate-change-induced increase in the length and severity of the fire season that affects the boreal ecosystem. Therefore, it is critical to empower wildfire management in boreal communities with better mitigation solutions. Wildfire probability maps represent an important tool for understanding the likelihood of wildfire occurrence and the potential severity of future wildfires. The massive increase in the availability of Earth observation data has enabled the development of deep learning-based wildfire forecasting models, aiming at providing precise wildfire probability maps at different spatial and temporal scales. A main limitation of such methods is their reliance on coarse-resolution environmental drivers and satellite products, leading to wildfire occurrence prediction of reduced resolution, typically around $\sim 0.1${\deg}. This paper presents a benchmark dataset: CanadaFireSat, and baseline methods for high-resolution: 100 m wildfire forecasting across Canada, leveraging multi-modal data from high-resolution multi-spectral satellite images (Sentinel-2 L1C), mid-resolution satellite products (MODIS), and environmental factors (ERA5 reanalysis data). Our experiments consider two major deep learning architectures. We observe that using multi-modal temporal inputs outperforms single-modal temporal inputs across all metrics, achieving a peak performance of 60.3% in F1 score for the 2023 wildfire season, a season never seen during model training. This demonstrates the potential of multi-modal deep learning models for wildfire forecasting at high-resolution and continental scale.
comment: 34 pages, 11 figures
♻ ☆ Learning to Steer: Input-dependent Steering for Multimodal LLMs NeurIPS 2025
Steering has emerged as a practical approach to enable post-hoc guidance of LLMs towards enforcing a specific behavior. However, it remains largely underexplored for multimodal LLMs (MLLMs); furthermore, existing steering techniques, such as mean steering, rely on a single steering vector, applied independently of the input query. This paradigm faces limitations when the desired behavior is dependent on the example at hand. For example, a safe answer may consist in abstaining from answering when asked for an illegal activity, or may point to external resources or consultation with an expert when asked about medical advice. In this paper, we investigate a fine-grained steering that uses an input-specific linear shift. This shift is computed using contrastive input-specific prompting. However, the input-specific prompts required for this approach are not known at test time. Therefore, we propose to train a small auxiliary module to predict the input-specific steering vector. Our approach, dubbed as L2S (Learn-to-Steer), demonstrates that it reduces hallucinations and enforces safety in MLLMs, outperforming other static baselines. Our code is publicly available at https://jayneelparekh.github.io/learn-to-steer/
comment: NeurIPS 2025
♻ ☆ Spatial Knowledge Graph-Guided Multimodal Synthesis
Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. Our approach addresses this critical gap by providing a systematic framework for generating spatially coherent data. In this work, we introduce SKG2DATA, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2DATA employs an automated pipeline for constructing Spatial Knowledge Graph (SKG) that effectively captures human-like spatial cognition, including directional and distance relationships. These structured representations then serve as precise guidance for our integrated synthesis pipeline, where a diffusion model generates spatially-consistent images while a MLLM produces corresponding textual descriptions. The automated construction of SKG enables scalable generation of diverse yet realistic spatial configurations, overcoming the limitations of manual data collection and annotation. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly, albeit with a slight cost to their general capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence. Code is available at https://github.com/zjunlp/Knowledge2Data.
comment: IEEE/ACM Transactions on Audio, Speech and Language Processing
♻ ☆ Dropping the D: RGB-D SLAM Without the Depth Sensor
We present DropD-SLAM, a real-time monocular SLAM system that achieves RGB-D-level accuracy without relying on depth sensors. The system replaces active depth input with three pretrained vision modules: a monocular metric depth estimator, a learned keypoint detector, and an instance segmentation network. Dynamic objects are suppressed using dilated instance masks, while static keypoints are assigned predicted depth values and backprojected into 3D to form metrically scaled features. These are processed by an unmodified RGB-D SLAM back end for tracking and mapping. On the TUM RGB-D benchmark, DropD-SLAM attains 7.4 cm mean ATE on static sequences and 1.8 cm on dynamic sequences, matching or surpassing state-of-the-art RGB-D methods while operating at 22 FPS on a single GPU. These results suggest that modern pretrained vision models can replace active depth sensors as reliable, real-time sources of metric scale, marking a step toward simpler and more cost-effective SLAM systems.
♻ ☆ As Good as It KAN Get: High-Fidelity Audio Representation CIKM '25
Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their applications in audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. KAN demonstrates superior perceptual performance over previous INRs, achieving the lowest Log-SpectralDistance of 1.29 and the highest Perceptual Evaluation of Speech Quality of 3.57 for 1.5 s audio. To extend KAN's utility, we propose FewSound, a hypernetwork-based architecture that enhances INR parameter updates. FewSound outperforms the state-of-the-art HyperSound, with a 33.3% improvement in MSE and 60.87% in SI-SNR. These results show KAN as a robust and adaptable audio representation with the potential for scalability and integration into various hypernetwork frameworks. The source code can be accessed at https://github.com/gmum/fewsound.git.
comment: Accepted to the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)
♻ ☆ Modality-AGnostic Image Cascade (MAGIC) for Multi-Modality Cardiac Substructure Segmentation
Cardiac substructure delineation is emerging in treatment planning to minimize the risk of radiation-induced heart disease. Deep learning offers efficient methods to reduce contouring burden but currently lacks generalizability across different modalities and overlapping structures. This work introduces and validates a Modality-AGnostic Image Cascade (MAGIC) deep-learning pipeline for comprehensive and multi-modal cardiac substructure segmentation. MAGIC is implemented through replicated encoding and decoding branches of an nnU-Net backbone to handle multi-modality inputs and overlapping labels. First benchmarked on the multi-modality whole-heart segmentation (MMWHS) dataset including cardiac CT-angiography (CCTA) and MR modalities, twenty cardiac substructures (heart, chambers, great vessels (GVs), valves, coronary arteries (CAs), and conduction nodes) from clinical simulation CT (Sim-CT), low-field MR-Linac, and cardiac CT-angiography (CCTA) modalities were delineated to train semi-supervised (n=151), validate (n=15), and test (n=30) MAGIC. For comparison, fourteen single-modality comparison models (two MMWHS modalities and four subgroups across three clinical modalities) were trained. Methods were evaluated for efficiency and against reference contours through the Dice similarity coefficient (DSC) and two-tailed Wilcoxon Signed-Rank test (p<0.05). Average MMWHS DSC scores across CCTA and MR inputs were 0.88(0.08) and 0.87(0.04) respectively with significant improvement over unimodal baselines. Average 20-structure DSC scores were 0.75(0.16) for Sim-CT, 0.68(0.21) for MR-Linac, and 0.80(0.16) for CCTA. Furthermore, >80% and >70% reductions in training time and parameters were achieved, respectively. MAGIC offers an efficient, lightweight solution capable of segmenting multiple image modalities and overlapping structures in a single model without compromising segmentation accuracy.
♻ ☆ LiteTracker: Leveraging Temporal Causality for Accurate Low-latency Tissue Tracking
Tissue tracking plays a critical role in various surgical navigation and extended reality (XR) applications. While current methods trained on large synthetic datasets achieve high tracking accuracy and generalize well to endoscopic scenes, their runtime performances fail to meet the low-latency requirements necessary for real-time surgical applications. To address this limitation, we propose LiteTracker, a low-latency method for tissue tracking in endoscopic video streams. LiteTracker builds on a state-of-the-art long-term point tracking method, and introduces a set of training-free runtime optimizations. These optimizations enable online, frame-by-frame tracking by leveraging a temporal memory buffer for efficient feature reuse and utilizing prior motion for accurate track initialization. LiteTracker demonstrates significant runtime improvements being around 7x faster than its predecessor and 2x than the state-of-the-art. Beyond its primary focus on efficiency, LiteTracker delivers high-accuracy tracking and occlusion prediction, performing competitively on both the STIR and SuPer datasets. We believe LiteTracker is an important step toward low-latency tissue tracking for real-time surgical applications in the operating room. Our code is publicly available at https://github.com/ImFusionGmbH/lite-tracker.
♻ ☆ AdaSCALE: Adaptive Scaling for OOD Detection SC
The ability of the deep learning model to recognize when a sample falls outside its learned distribution is critical for safe and reliable deployment. Recent state-of-the-art out-of-distribution (OOD) detection methods leverage activation shaping to improve the separation between in-distribution (ID) and OOD inputs. These approaches resort to sample-specific scaling but apply a static percentile threshold across all samples regardless of their nature, resulting in suboptimal ID-OOD separability. In this work, we propose \textbf{AdaSCALE}, an adaptive scaling procedure that dynamically adjusts the percentile threshold based on a sample's estimated OOD likelihood. This estimation leverages our key observation: OOD samples exhibit significantly more pronounced activation shifts at high-magnitude activations under minor perturbation compared to ID samples. AdaSCALE enables stronger scaling for likely ID samples and weaker scaling for likely OOD samples, yielding highly separable energy scores. Our approach achieves state-of-the-art OOD detection performance, outperforming the latest rival OptFS by 14.94% in near-OOD and 21.67% in far-OOD datasets in average FPR@95 metric on the ImageNet-1k benchmark across eight diverse architectures. The code is available at: https://github.com/sudarshanregmi/AdaSCALE/
comment: https://github.com/sudarshanregmi/AdaSCALE/
♻ ☆ A Racing Dataset and Baseline Model for Track Detection in Autonomous Racing
A significant challenge in racing-related research is the lack of publicly available datasets containing raw images with corresponding annotations for the downstream task. In this paper, we introduce RoRaTrack, a novel dataset that contains annotated multi-camera image data from racing scenarios for track detection. The data is collected on a Dallara AV-21 at a racing circuit in Indiana, in collaboration with the Indy Autonomous Challenge (IAC). RoRaTrack addresses common problems such as blurriness due to high speed, color inversion from the camera, and absence of lane markings on the track. Consequently, we propose RaceGAN, a baseline model based on a Generative Adversarial Network (GAN) that effectively addresses these challenges. The proposed model demonstrates superior performance compared to current state-of-the-art machine learning models in track detection. The dataset and code for this work are available at https://github.com/ghosh64/RaceGAN.
comment: Currently Under Review
♻ ☆ Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data ICCV 2025
AI for tumor segmentation is limited by the lack of large, voxel-wise annotated datasets, which are hard to create and require medical experts. In our proprietary JHH dataset of 3,000 annotated pancreatic tumor scans, we found that AI performance stopped improving after 1,500 scans. With synthetic data, we reached the same performance using only 500 real scans. This finding suggests that synthetic data can steepen data scaling laws, enabling more efficient model training than real data alone. Motivated by these lessons, we created AbdomenAtlas 2.0--a dataset of 10,135 CT scans with a total of 15,130 tumor instances per-voxel manually annotated in six organs (pancreas, liver, kidney, colon, esophagus, and uterus) and 5,893 control scans. Annotated by 23 expert radiologists, it is several orders of magnitude larger than existing public tumor datasets. While we continue expanding the dataset, the current version of AbdomenAtlas 2.0 already provides a strong foundation--based on lessons from the JHH dataset--for training AI to segment tumors in six organs. It achieves notable improvements over public datasets, with a +7% DSC gain on in-distribution tests and +16% on out-of-distribution tests.
comment: ICCV 2025
♻ ☆ CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward
We propose Fuzzy Group Relative Policy Reward (FGRPR), a novel framework that integrates Group Relative Policy Optimization (GRPO) with a fuzzy reward function to enhance learning efficiency. Unlike the conventional binary 0/1 accuracy reward, our fuzzy reward model provides nuanced incentives, encouraging more precise outputs. Experimental results demonstrate that GRPO with a standard 0/1 accuracy reward underperforms compared to supervised fine-tuning (SFT). In contrast, FGRPR, applied to Qwen2.5-VL(3B and 7B), surpasses all baseline models, including GPT4o, LLaMA2(90B), and SFT, across five in-domain datasets. On an out-of-domain dataset, FGRPR achieves performance comparable to SFT but excels when target values are larger, as its fuzzy reward function assigns higher rewards to closer approximations. This approach is broadly applicable to tasks where the precision of the answer is critical. Code and data: https://github.com/yeyimilk/CrowdVLM-R1
comment: 10 pages, 6 figures and 4 tables
♻ ☆ Gaussian Splashing: Direct Volumetric Rendering Underwater
In underwater images, most useful features are occluded by water. The extent of the occlusion depends on imaging geometry and can vary even across a sequence of burst images. As a result, 3D reconstruction methods robust on in-air scenes, like Neural Radiance Field methods (NeRFs) or 3D Gaussian Splatting (3DGS), fail on underwater scenes. While a recent underwater adaptation of NeRFs achieved state-of-the-art results, it is impractically slow: reconstruction takes hours and its rendering rate, in frames per second (FPS), is less than 1. Here, we present a new method that takes only a few minutes for reconstruction and renders novel underwater scenes at 140 FPS. Named Gaussian Splashing, our method unifies the strengths and speed of 3DGS with an image formation model for capturing scattering, introducing innovations in the rendering and depth estimation procedures and in the 3DGS loss function. Despite the complexities of underwater adaptation, our method produces images at unparalleled speeds with superior details. Moreover, it reveals distant scene details with far greater clarity than other methods, dramatically improving reconstructed and rendered images. We demonstrate results on existing datasets and a new dataset we have collected. Additional visual results are available at: https://bgu-cs-vil.github.io/gaussiansplashingUW.github.io/ .
♻ ☆ BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities EMNLP 2025
We introduce BiMediX2, a bilingual (Arabic-English) Bio-Medical EXpert Large Multimodal Model that supports text-based and image-based medical interactions. It enables multi-turn conversation in Arabic and English and supports diverse medical imaging modalities, including radiology, CT, and histology. To train BiMediX2, we curate BiMed-V, an extensive Arabic-English bilingual healthcare dataset consisting of 1.6M samples of diverse medical interactions. This dataset supports a range of medical Large Language Model (LLM) and Large Multimodal Model (LMM) tasks, including multi-turn medical conversations, report generation, and visual question answering (VQA). We also introduce BiMed-MBench, the first Arabic-English medical LMM evaluation benchmark, verified by medical experts. BiMediX2 demonstrates excellent performance across multiple medical LLM and LMM benchmarks, achieving state-of-the-art results compared to other open-sourced models. On BiMed-MBench, BiMediX2 outperforms existing methods by over 9% in English and more than 20% in Arabic evaluations. Additionally, it surpasses GPT-4 by approximately 9% in UPHILL factual accuracy evaluations and excels in various medical VQA, report generation, and report summarization tasks. Our trained models, instruction set, and source code are available at https://github.com/mbzuai-oryx/BiMediX2
comment: Accepted to EMNLP 2025 (Findings)
♻ ☆ MOSPA: Human Motion Generation Driven by Spatial Audio NeurIPS 2025
Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As of yet, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our code and model are publicly available at https://github.com/xsy27/Mospa-Acoustic-driven-Motion-Generation
comment: NeurIPS 2025 (Spotlight)
♻ ☆ Co-MTP: A Cooperative Trajectory Prediction Framework with Multi-Temporal Fusion for Autonomous Driving ICRA 2025
Vehicle-to-everything technologies (V2X) have become an ideal paradigm to extend the perception range and see through the occlusion. Exiting efforts focus on single-frame cooperative perception, however, how to capture the temporal cue between frames with V2X to facilitate the prediction task even the planning task is still underexplored. In this paper, we introduce the Co-MTP, a general cooperative trajectory prediction framework with multi-temporal fusion for autonomous driving, which leverages the V2X system to fully capture the interaction among agents in both history and future domains to benefit the planning. In the history domain, V2X can complement the incomplete history trajectory in single-vehicle perception, and we design a heterogeneous graph transformer to learn the fusion of the history feature from multiple agents and capture the history interaction. Moreover, the goal of prediction is to support future planning. Thus, in the future domain, V2X can provide the prediction results of surrounding objects, and we further extend the graph transformer to capture the future interaction among the ego planning and the other vehicles' intentions and obtain the final future scenario state under a certain planning action. We evaluate the Co-MTP framework on the real-world dataset V2X-Seq, and the results show that Co-MTP achieves state-of-the-art performance and that both history and future fusion can greatly benefit prediction.
comment: 8 pages, 3 figures, ICRA 2025
♻ ☆ OpenMaterial: A Large-scale Dataset of Complex Materials for 3D Reconstruction
Recent advances in deep learning, such as neural radiance fields and implicit neural representations, have significantly advanced 3D reconstruction. However, accurately reconstructing objects with complex optical properties, such as metals, glass, and plastics, remains challenging due to the breakdown of multi-view color consistency in the presence of specular reflections, refractions, and transparency. This limitation is further exacerbated by the lack of benchmark datasets that explicitly model material-dependent light transport. To address this, we introduce OpenMaterial, a large-scale semi-synthetic dataset for benchmarking material-aware 3D reconstruction. It comprises 1,001 objects spanning 295 distinct materials, including conductors, dielectrics, plastics, and their roughened variants, captured under 714 diverse lighting conditions. By integrating lab-measured Index of Refraction (IOR) spectra, OpenMaterial enables the generation of high-fidelity multi-view images that accurately simulate complex light-matter interactions. It provides multi-view images, 3D shape models, camera poses, depth maps, and object masks, establishing the first extensive benchmark for evaluating 3D reconstruction on challenging materials. We evaluate 11 state-of-the-art methods for 3D reconstruction and novel view synthesis, conducting ablation studies to assess the impact of material type, shape complexity, and illumination on reconstruction performance. Our results indicate that OpenMaterial provides a strong and fair basis for developing more robust, physically-informed 3D reconstruction techniques to better handle real-world optical complexities.
♻ ☆ Improved visual-information-driven model for crowd simulation and its modular application
Data-driven crowd simulation models offer advantages in enhancing the accuracy and realism of simulations, and improving their generalizability is essential for promoting application. Current data-driven approaches are primarily designed for a single scenario, with very few models validated across more than two scenarios. It is still an open question to develop data-driven crowd simulation models with strong generalizibility. We notice that the key to addressing this challenge lies in effectively and accurately capturing the core common influential features that govern pedestrians' navigation across diverse scenarios. Particularly, we believe that visual information is one of the most dominant influencing features. In light of this, this paper proposes a data-driven model incorporating a refined visual information extraction method and exit cues to enhance generalizability. The proposed model is examined on four common fundamental modules: bottleneck, corridor, corner and T-junction. The evaluation results demonstrate that our model performs excellently across these scenarios, aligning with pedestrian movement in real-world experiments, and significantly outperforms the classical knowledge-driven model. Furthermore, we introduce a modular approach to apply our proposed model in composite scenarios, and the results regarding trajectories and fundamental diagrams indicate that our simulations closely match real-world patterns in the composite scenario. The research outcomes can provide inspiration for the development of data-driven crowd simulation models with high generalizability and advance the application of data-driven approaches.This work has been submitted to Elsevier for possible publication.
♻ ☆ Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness - whether performance remains stable across varying input resolutions. To address this gap, we introduce \textbf{Res-Bench}, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
comment: The authors have discovered a significant error in the paper subsequent to submission, and are withdrawing the manuscript for substantial correction
♻ ☆ Semantic-Aware Representation Learning via Conditional Transport for Multi-Label Image Classification
Multi-label image classification is a critical task in machine learning that aims to accurately assign multiple labels to a single image. While existing methods often utilize attention mechanisms or graph convolutional networks to model visual representations, their performance is still constrained by two critical limitations: the inability to learn discriminative semantic-aware features, and the lack of fine-grained alignment between visual representations and label embeddings. To tackle these issues in a unified framework, this paper proposes a novel approach named Semantic-aware representation learning via Conditional Transport for Multi-Label Image Classification (SCT). The proposed method introduces a semantic-related feature learning module that extracts discriminative label-specific features by emphasizing semantic relevance and interaction, along with a conditional transport-based alignment mechanism that enables precise visual-semantic alignment. Extensive experiments on two widely-used benchmark datasets, VOC2007 and MS-COCO, validate the effectiveness of SCT and demonstrate its superior performance compared to existing state-of-the-art methods.
comment: The paper is under consideration at Pattern Recognition Letters
♻ ☆ SurGen: 1020 H&E-stained Whole Slide Images With Survival and Genetic Markers
Cancer remains one of the leading causes of morbidity and mortality worldwide. Comprehensive datasets that combine histopathological images with genetic and survival data across various tumour sites are essential for advancing computational pathology and personalised medicine. We present SurGen, a dataset comprising 1,020 H&E-stained whole-slide images (WSIs) from 843 colorectal cancer cases. The dataset includes detailed annotations for key genetic mutations (KRAS, NRAS, BRAF) and mismatch repair status, as well as survival data for 426 cases. We illustrate SurGen's utility with a proof-of-concept model that predicts mismatch repair status directly from WSIs, achieving a test area under the receiver operating characteristic curve of 0.8273. These preliminary results underscore the dataset's potential to facilitate research in biomarker discovery, prognostic modelling, and advanced machine learning applications in colorectal cancer and beyond. SurGen offers a valuable resource for the scientific community, enabling studies that require high-quality WSIs linked with comprehensive clinical and genetic information on colorectal cancer. Our initial findings affirm the dataset's capacity to advance diagnostic precision and foster the development of personalised treatment strategies in colorectal oncology. Data available online: https://doi.org/10.6019/S-BIAD1285.
comment: To download the dataset, see https://doi.org/10.6019/S-BIAD1285. See https://github.com/CraigMyles/SurGen-Dataset for GitHub repository and additional info
♻ ☆ Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
comment: Preprint
♻ ☆ Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs
Large language models (LLMs) have recently advanced auditory speech recognition (ASR), visual speech recognition (VSR), and audio-visual speech recognition (AVSR). However, understanding of their internal dynamics under fine-tuning remains limited. In natural language processing, recent work has revealed attention sinks, tokens that attract disproportionately high attention, and associated massive activations in which some features of sink tokens exhibit huge activation in LLMs. In this work, we are the first to study these phenomena in multimodal speech recognition. Through a detailed analysis of audio-visual LLMs, we identify attention sinks and massive activations not only at the BOS token but also at intermediate low-semantic tokens across ASR, VSR, and AVSR. We show that massive activations originate in the MLP layers and correspond to fixed feature indices across all sink tokens. We further show that intermediate sink tokens exhibit high cosine similarity to the BOS token, thereby amplifying attention and activation. Building on these insights, we introduce a simple decorrelation loss that reduces cosine similarity between BOS and other tokens, effectively mitigating intermediate sinks and massive activations. Furthermore, our method improves word error rate (WER) under high audio-visual feature downsampling while remaining stable at lower downsampling rates.
comment: The code is available at https://github.com/umbertocappellazzo/Llama-AVSR
♻ ☆ EndoGMDE: Generalizable Monocular Depth Estimation with Mixture of Low-Rank Experts for Diverse Endoscopic Scenes
Self-supervised monocular depth estimation is a significant task for low-cost and efficient 3D scene perception and measurement in endoscopy. However, the variety of illumination conditions and scene features is still the primary challenges for depth estimation in endoscopic scenes. In this work, a novel self-supervised framework is proposed for monocular depth estimation in diverse endoscopy. Firstly, considering the diverse features in endoscopic scenes with different tissues, a novel block-wise mixture of dynamic low-rank experts is proposed to efficiently finetune the foundation model for endoscopic depth estimation. In the proposed module, based on the input feature, different experts with a small amount of trainable parameters are adaptively selected for weighted inference, from low-rank experts which are allocated based on the generalization of each block. Moreover, a novel self-supervised training framework is proposed to jointly cope with brightness inconsistency and reflectance interference. The proposed method outperforms state-of-the-art works on SCARED dataset and SimCol dataset. Furthermore, the proposed network also achieves the best generalization based on zero-shot depth estimation on C3VD, Hamlyn and SERV-CT dataset. The outstanding performance of our model is further demonstrated with 3D reconstruction and ego-motion estimation. The proposed method could contribute to accurate endoscopy for minimally invasive measurement and surgery. The evaluation codes will be released upon acceptance, while the demo videos can be found on: https://endo-gmde.netlify.app/.
comment: 12 pages, 12 figures, 7 tables. Under Review
♻ ☆ SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model
Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and business scenarios, such as the limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. In the optimization procedure, we propose a multi-stage training scheme to boost the multifaceted effectiveness of representation learning. Specifically, the content-aware progressive training aims to enhance the model's adaptability to diverse downstream tasks and master enriched cross-modal proficiency. The collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings while mining user historical interests. Concurrently, we develop the stochastic specialization and dataset-driven pattern matching to strengthen model training flexibility and generalizability. Experimental results show that SAIL-Embedding achieves SOTA performance compared to other methods in different retrieval tasks. In online experiments across various real-world scenarios integrated with our model, we observe a significant increase in Lifetime (LT), which is a crucial indicator for the recommendation experience. For instance, the model delivers the 7-day LT gain of +0.5% in the Douyin-Selected scenario. For the Douyin feed rank model, the match features produced by SAIL-Embedding yield a +0.1% AUC gain.
comment: Technical Report
♻ ☆ Rethinking Glaucoma Calibration: Voting-Based Binocular and Metadata Integration
Glaucoma is a major cause of irreversible blindness, with significant diagnostic subjectivity. This inherent uncertainty, combined with the overconfidence of models optimized solely for accuracy can lead to fatal issues such as overdiagnosis or missing critical diseases. To ensure clinical trust, model calibration is essential for reliable predictions, yet study in this field remains limited. Existing calibration study have overlooked glaucoma's systemic associations and high diagnostic subjectivity. To overcome these limitations, we propose V-ViT (Voting-based ViT), a framework that enhances calibration by integrating a patient's binocular information and metadata. Furthermore, to mitigate diagnostic subjectivity, V-ViT utilizes an iterative dropout-based Voting System to maximize calibration performance. The proposed framework achieved state-of-the-art performance across all metrics, including the primary calibration metrics. Our results demonstrate that V-ViT effectively resolves the issue of overconfidence in predictions in glaucoma diagnosis, providing highly reliable predictions for clinical use. Our source code is available at https://github.com/starforTJ/V-ViT.
♻ ☆ Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks
Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updated information about this survey, codes and implementation of the open benchmarks can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.
♻ ☆ LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts
Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset link at https://github.com/TechNomad-ds/LoVR-benchmark
♻ ☆ FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion IROS 2025
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability stemming from two factors: 1) limited annotated image-event-depth datasets causing insufficient cross-modal supervision, and 2) inherent frequency mismatches between static images and dynamic event streams with distinct spatiotemporal patterns, leading to ineffective feature fusion. To address this dual challenge, we propose Frequency-decoupled Unified Self-supervised Encoder (FUSE) with two synergistic components: The Parameter-efficient Self-supervised Transfer (PST) establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity by enabling joint encoding without depth ground truth. Complementing this, we propose the Frequency-Decoupled Fusion module (FreDFuse) to explicitly decouple high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches through physics-aware fusion. This combined approach enables FUSE to construct a universal image-event encoder that only requires lightweight decoder adaptation for target datasets. Extensive experiments demonstrate state-of-the-art performance with 14% and 24.9% improvements in Abs .Rel on MVSEC and DENSE datasets. The framework exhibits remarkable zero-shot adaptability to challenging scenarios including extreme lighting and motion blur, significantly advancing real-world deployment capabilities. The source code for our method is publicly available at: https://github.com/sunpihai-up/FUSE
comment: [IROS 2025, camera ready version]: 8 pages, 6 figures
♻ ☆ REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning NeurIPS 2025
Recent rehearsal-free continual learning (CL) methods guided by prompts achieve strong performance on vision tasks with non-stationary data but remain resource-intensive, hindering real-world edge deployment. We introduce resource-efficient prompting (REP), which improves the computational and memory efficiency of prompt-based rehearsal-free continual learning methods while minimizing accuracy trade-offs. Our approach employs swift prompt selection to refine input data using a carefully provisioned model and introduces adaptive token merging (AToM) and adaptive layer dropping (ALD) for efficient prompt updates. AToM and ALD selectively skip data and model layers while preserving task-specific features during the learning of new tasks. Extensive experiments on multiple image classification datasets demonstrate REP's superior resource efficiency over state-of-the-art rehearsal-free CL methods.
comment: accepted to NeurIPS 2025
♻ ☆ CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding
Coral reefs are vital yet vulnerable ecosystems that require continuous monitoring to support conservation. While coral reef images provide essential information in coral monitoring, interpreting such images remains challenging due to the need for domain expertise. Visual Question Answering (VQA), powered by Large Vision-Language Models (LVLMs), has great potential in user-friendly interaction with coral reef images. However, applying VQA to coral imagery demands a dedicated dataset that addresses two key challenges: domain-specific annotations and multidimensional questions. In this work, we introduce CoralVQA, the first large-scale VQA dataset for coral reef analysis. It contains 12,805 real-world coral images from 67 coral genera collected from 3 oceans, along with 277,653 question-answer pairs that comprehensively assess ecological and health-related conditions. To construct this dataset, we develop a semi-automatic data construction pipeline in collaboration with marine biologists to ensure both scalability and professional-grade data quality. CoralVQA presents novel challenges and provides a comprehensive benchmark for studying vision-language reasoning in the context of coral reef images. By evaluating several state-of-the-art LVLMs, we reveal key limitations and opportunities. These insights form a foundation for future LVLM development, with a particular emphasis on supporting coral conservation efforts.
♻ ☆ Reflectance Prediction-based Knowledge Distillation for Robust 3D Object Detection in Compressed Point Clouds
Regarding intelligent transportation systems, low-bitrate transmission via lossy point cloud compression is vital for facilitating real-time collaborative perception among connected agents, such as vehicles and infrastructures, under restricted bandwidth. In existing compression transmission systems, the sender lossily compresses point coordinates and reflectance to generate a transmission code stream, which faces transmission burdens from reflectance encoding and limited detection robustness due to information loss. To address these issues, this paper proposes a 3D object detection framework with reflectance prediction-based knowledge distillation (RPKD). We compress point coordinates while discarding reflectance during low-bitrate transmission, and feed the decoded non-reflectance compressed point clouds into a student detector. The discarded reflectance is then reconstructed by a geometry-based reflectance prediction (RP) module within the student detector for precise detection. A teacher detector with the same structure as the student detector is designed for performing reflectance knowledge distillation (RKD) and detection knowledge distillation (DKD) from raw to compressed point clouds. Our cross-source distillation training strategy (CDTS) equips the student detector with robustness to low-quality compressed data while preserving the accuracy benefits of raw data through transferred distillation knowledge. Experimental results on the KITTI and DAIR-V2X-V datasets demonstrate that our method can boost detection accuracy for compressed point clouds across multiple code rates. We will release the code publicly at https://github.com/HaoJing-SX/RPKD.
♻ ☆ DeGMix: Efficient Multi-Task Dense Prediction with Deformable and Gating Mixer
Convolution neural networks and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL). Existing studies typically employ either CNNs (effectively capture local spatial patterns) or Transformers (capturing long-range dependencies) independently, but integrating their strengths may yield more robust models. In this work, we present an efficient MTL model that combines the adaptive capabilities of deformable CNN and query-based Transformer with shared gating for MTL of dense prediction. This combination may offer a simple and efficient solution owing to its powerful and flexible task-specific learning and the advantages of lower cost, less complexity, and smaller parameters than traditional MTL methods. We introduce an efficient multi-task dense prediction with deformable and gating mixer (DeGMix). First, the deformable mixer encoder contains two types of operators: the channel-aware mixing operator leveraged to allow communication among different channels, and the spatial-aware deformable operator with deformable convolution applied to efficiently sample more informative spatial locations. Second, the task-aware gating transformer decoder is used to perform task-specific predictions, in which task interaction block integrated with self-attention is applied to capture task interaction features, and the task query block integrated with gating attention is leveraged to dynamically select the corresponding task-specific features. Furthermore, the results of the experiment demonstrate that the proposed DeGMix uses fewer GFLOPs and significantly outperforms current Transformer-based and CNN-based competitive models on a variety of metrics on three dense prediction datasets. Our code and models are available at https://github.com/yangyangxu0/DeMTG.
♻ ☆ D$^2$GS: Dense Depth Regularization for LiDAR-free Urban Scene Reconstruction
Recently, Gaussian Splatting (GS) has shown great potential for urban scene reconstruction in the field of autonomous driving. However, current urban scene reconstruction methods often depend on multimodal sensors as inputs, \textit{i.e.} LiDAR and images. Though the geometry prior provided by LiDAR point clouds can largely mitigate ill-posedness in reconstruction, acquiring such accurate LiDAR data is still challenging in practice: i) precise spatiotemporal calibration between LiDAR and other sensors is required, as they may not capture data simultaneously; ii) reprojection errors arise from spatial misalignment when LiDAR and cameras are mounted at different locations. To avoid the difficulty of acquiring accurate LiDAR depth, we propose D$^2$GS, a LiDAR-free urban scene reconstruction framework. In this work, we obtain geometry priors that are as effective as LiDAR while being denser and more accurate. $\textbf{First}$, we initialize a dense point cloud by back-projecting multi-view metric depth predictions. This point cloud is then optimized by a Progressive Pruning strategy to improve the global consistency. $\textbf{Second}$, we jointly refine Gaussian geometry and predicted dense metric depth via a Depth Enhancer. Specifically, we leverage diffusion priors from a depth foundation model to enhance the depth maps rendered by Gaussians. In turn, the enhanced depths provide stronger geometric constraints during Gaussian training. $\textbf{Finally}$, we improve the accuracy of ground geometry by constraining the shape and normal attributes of Gaussians within road regions. Extensive experiments on the Waymo dataset demonstrate that our method consistently outperforms state-of-the-art methods, producing more accurate geometry even when compared with those using ground-truth LiDAR data.
♻ ☆ Balancing Efficiency and Quality: MoEISR for Arbitrary-Scale Image Super-Resolution
Arbitrary-scale image super-resolution employing implicit neural functions has gained significant attention lately due to its capability to upscale images across diverse scales utilizing only a single model. Nevertheless, these methodologies have imposed substantial computational demands as they involve querying every target pixel to a single resource-intensive decoder. In this paper, we introduce a novel and efficient framework, the Mixture-of-Experts Implicit Super-Resolution (MoEISR), which enables super-resolution at arbitrary scales with significantly increased computational efficiency without sacrificing reconstruction quality. MoEISR dynamically allocates the most suitable decoding expert to each pixel using a lightweight mapper module, allowing experts with varying capacities to reconstruct pixels across regions with diverse complexities. Our experiments demonstrate that MoEISR successfully reduces significant amount of floating point operations (FLOPs) while delivering comparable or superior peak signal-to-noise ratio (PSNR).
♻ ☆ Bidirectional Feature-aligned Motion Transformation for Efficient Dynamic Point Cloud Compression
Efficient dynamic point cloud compression (DPCC) critically depends on accurate motion estimation and compensation. However, the inherently irregular structure and substantial local variations of point clouds make this task highly challenging. Existing approaches typically rely on explicit motion estimation, whose encoded motion vectors often fail to capture complex dynamics and inadequately exploit temporal correlations. To address these limitations, we propose a Bidirectional Feature-aligned Motion Transformation (Bi-FMT) framework that implicitly models motion in the feature space. Bi-FMT aligns features across both past and future frames to produce temporally consistent latent representations, which serve as predictive context in a conditional coding pipeline, forming a unified ``Motion + Conditional'' representation. Built upon this bidirectional feature alignment, we introduce a Cross-Transformer Refinement module (CTR) at the decoder side to adaptively refine locally aligned features. By modeling cross-frame dependencies with vector attention, CRT enhances local consistency and restores fine-grained spatial details that are often lost during motion alignment. Moreover, we design a Random Access (RA) reference strategy that treats the bidirectionally aligned features as conditional context, enabling frame-level parallel compression and eliminating the sequential encoding. Extensive experiments demonstrate that Bi-FMT surpasses D-DPCC and AdaDPCC in both compression efficiency and runtime, achieving BD-Rate reductions of 20% (D1) and 9.4% (D1), respectively.
comment: 11 pages
♻ ☆ Event-RGB Fusion for Spacecraft Pose Estimation Under Harsh Lighting
Spacecraft pose estimation is crucial for autonomous in-space operations, such as rendezvous, docking and on-orbit servicing. Vision-based pose estimation methods, which typically employ RGB imaging sensors, is a compelling solution for spacecraft pose estimation, but are challenged by harsh lighting conditions, which produce imaging artifacts such as glare, over-exposure, blooming and lens flare. Due to their much higher dynamic range, neuromorphic or event sensors are more resilient to extreme lighting conditions. However, event sensors generally have lower spatial resolution and suffer from reduced signal-to-noise ratio during periods of low relative motion. This work addresses these individual sensor limitations by introducing a sensor fusion approach combining RGB and event sensors. A beam-splitter prism was employed to achieve precise optical and temporal alignment. Then, a RANSAC-based technique was developed to fuse the information from the RGB and event channels to achieve pose estimation that leveraged the strengths of the two modalities. The pipeline was complemented by dropout uncertainty estimation to detect extreme conditions that affect either channel. To benchmark the performance of the proposed event-RGB fusion method, we collected a comprehensive real dataset of RGB and event data for satellite pose estimation in a laboratory setting under a variety of challenging illumination conditions. Encouraging results on the dataset demonstrate the efficacy of our event-RGB fusion approach and further supports the usage of event sensors for spacecraft pose estimation. To support community research on this topic, our dataset has been released publicly.
comment: Associated dataset: https://zenodo.org/records/15861758
♻ ☆ EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models
Text-to-image generation models~(e.g., Stable Diffusion) have achieved significant advancements, enabling the creation of high-quality and realistic images based on textual descriptions. Prompt inversion, the task of identifying the textual prompt used to generate a specific artifact, holds significant potential for applications including data attribution, model provenance, and watermarking validation. Recent studies introduced a delayed projection scheme to optimize for prompts representative of the vocabulary space, though challenges in semantic fluency and efficiency remain. Advanced image captioning models or visual large language models can generate highly interpretable prompts, but they often lack in image similarity. In this paper, we propose a prompt inversion technique called \sys for text-to-image diffusion models, which includes initializing embeddings using a pre-trained image captioning model, refining them through reverse-engineering in the latent space, and converting them to texts using an embedding-to-text model. Our experiments on the widely-used datasets, such as MS COCO, LAION, and Flickr, show that our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability and generalizability. We further illustrate the application of our generated prompts in tasks such as cross-concept image synthesis, concept manipulation, evolutionary multi-concept generation and unsupervised segmentation.
Machine Learning 55
♻ ☆ From Epilepsy Seizures Classification to Detection: A Deep Learning-based Approach for Raw EEG Signals
Epilepsy represents the most prevalent neurological disease in the world. One-third of people suffering from mesial temporal lobe epilepsy (MTLE) exhibit drug resistance, urging the need to develop new treatments. A key part in anti-seizure medication (ASM) development is the capability of detecting and quantifying epileptic seizures occurring in electroencephalogram (EEG) signals, which is crucial for treatment efficacy evaluation. In this study, we introduced a seizure detection pipeline based on deep learning models applied to raw EEG signals. This pipeline integrates: a new pre-processing technique which segments continuous raw EEG signals without prior distinction between seizure and seizure-free activities; a post-processing algorithm developed to reassemble EEG segments and allow the identification of seizures start/end; and finally, a new evaluation procedure based on a strict seizure events comparison between predicted and real labels. Models training have been performed using a data splitting strategy which addresses the potential for data leakage. We demonstrated the fundamental differences between a seizure classification and a seizure detection task and showed the differences in performance between the two tasks. Finally, we demonstrated the generalization capabilities across species of our best architecture, combining a Convolutional Neural Network and a Transformer encoder. The model was trained on animal EEGs and tested on human EEGs with a F1-score of 93% on a balanced Bonn dataset.
comment: 25 pages, 3 tables, 5 figures
♻ ☆ An Effective Flow-based Method for Positive-Unlabeled Learning: 2-HNC
In many scenarios of binary classification, only positive instances are provided in the training data, leaving the rest of the data unlabeled. This setup, known as positive-unlabeled (PU) learning, is addressed here with a network flow-based method which utilizes pairwise similarities between samples. The method we propose here, 2-HNC, leverages Hochbaum's Normalized Cut (HNC) and the set of solutions it provides by solving a parametric minimum cut problem. The set of solutions, that are nested partitions of the samples into two sets, correspond to varying tradeoff values between the two goals: high intra-similarity inside the sets and low inter-similarity between the two sets. This nested sequence is utilized here to deliver a ranking of unlabeled samples by their likelihood of being negative. Building on this insight, our method, 2-HNC, proceeds in two stages. The first stage generates this ranking without assuming any negative labels, using a problem formulation that is constrained only on positive labeled samples. The second stage augments the positive set with likely-negative samples and recomputes the classification. The final label prediction selects among all generated partitions in both stages, the one that delivers a positive class proportion, closest to a prior estimate of this quantity, which is assumed to be given. Extensive experiments across synthetic and real datasets show that 2-HNC yields strong performance and often surpasses existing state-of-the-art algorithms.
♻ ☆ Learning to Steer: Input-dependent Steering for Multimodal LLMs NeurIPS 2025
Steering has emerged as a practical approach to enable post-hoc guidance of LLMs towards enforcing a specific behavior. However, it remains largely underexplored for multimodal LLMs (MLLMs); furthermore, existing steering techniques, such as mean steering, rely on a single steering vector, applied independently of the input query. This paradigm faces limitations when the desired behavior is dependent on the example at hand. For example, a safe answer may consist in abstaining from answering when asked for an illegal activity, or may point to external resources or consultation with an expert when asked about medical advice. In this paper, we investigate a fine-grained steering that uses an input-specific linear shift. This shift is computed using contrastive input-specific prompting. However, the input-specific prompts required for this approach are not known at test time. Therefore, we propose to train a small auxiliary module to predict the input-specific steering vector. Our approach, dubbed as L2S (Learn-to-Steer), demonstrates that it reduces hallucinations and enforces safety in MLLMs, outperforming other static baselines. Our code is publicly available at https://jayneelparekh.github.io/learn-to-steer/
comment: NeurIPS 2025
♻ ☆ Multi-Step Reasoning with Large Language Models, a Survey
Large language models (LLMs) with billions of parameters exhibit in-context learning abilities, enabling few-shot learning on tasks that the model was not specifically trained for. Traditional models achieve breakthrough performance on language tasks, but do not perform well on basic reasoning benchmarks. However, a new in-context learning approach, Chain-of-thought, has demonstrated strong multi-step reasoning abilities on these benchmarks. The research on LLM reasoning abilities started with the question whether LLMs can solve grade school math word problems, and has expanded to other tasks in the past few years. This article reviews the field of multi-step reasoning with LLMs. We propose a taxonomy that identifies different ways to generate, evaluate, and control multi-step reasoning. We provide an in-depth coverage of core approaches and open problems, and we propose a research agenda for the near future. We find that multi-step reasoning approaches have progressed beyond math word problems, and can now successfully solve challenges in logic, combinatorial games, and robotics, sometimes by first generating code that is then executed by external tools. Many studies in multi-step methods use reinforcement learning for finetuning, external optimization loops, in-context reinforcement learning, and self-reflection.
comment: ACM Computing Surveys
♻ ☆ Spatial Knowledge Graph-Guided Multimodal Synthesis
Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. Our approach addresses this critical gap by providing a systematic framework for generating spatially coherent data. In this work, we introduce SKG2DATA, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2DATA employs an automated pipeline for constructing Spatial Knowledge Graph (SKG) that effectively captures human-like spatial cognition, including directional and distance relationships. These structured representations then serve as precise guidance for our integrated synthesis pipeline, where a diffusion model generates spatially-consistent images while a MLLM produces corresponding textual descriptions. The automated construction of SKG enables scalable generation of diverse yet realistic spatial configurations, overcoming the limitations of manual data collection and annotation. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly, albeit with a slight cost to their general capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence. Code is available at https://github.com/zjunlp/Knowledge2Data.
comment: IEEE/ACM Transactions on Audio, Speech and Language Processing
♻ ☆ Bayesian Additive Main Effects and Multiplicative Interaction Models using Tensor Regression for Multi-environmental Trials
We propose a Bayesian tensor regression model to accommodate the effect of multiple factors on phenotype prediction. We adopt a set of prior distributions that resolve identifiability issues that may arise between the parameters in the model. Further, we incorporate a spike-and-slab structure that identifies which interactions are relevant for inclusion in the linear predictor, even when they form a subset of the available variables. Simulation experiments show that our method outperforms previous related models and machine learning algorithms under different sample sizes and degrees of complexity. We further explore the applicability of our model by analysing real-world data related to wheat production across Ireland from 2010 to 2019. Our model performs competitively and overcomes key limitations found in other analogous approaches. Finally, we adapt a set of visualisations for the posterior distribution of the tensor effects that facilitate the identification of optimal interactions between the tensor variables, whilst accounting for the uncertainty in the posterior distribution.
♻ ☆ Multi-head Temporal Latent Attention NeurIPS 2025
While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency. Multi-head latent attention was recently developed to compress the KV cache into a low-rank latent space. This paper proposes Multi-head Temporal Latent Attention (MTLA), which further reduces the KV cache size along the temporal dimension, greatly lowering the memory footprint of self-attention inference. MTLA employs a hyper-network to dynamically merge temporally adjacent KV cache vectors. To address the mismatch between the compressed KV cache and processed sequence lengths, a stride-aware causal mask is proposed to ensure efficient parallel training and consistency with inference behaviour. Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation, demonstrate that MTLA achieves competitive performance compared to standard Multi-Head Attention (MHA), while greatly improving inference speed and GPU memory usage. For example, on a English-German speech translation task, MTLA achieves a 5.3x speedup and a reduction in GPU memory usage by a factor of 8.3 compared to MHA, while maintaining translation quality.
comment: Accepted by NeurIPS 2025
♻ ☆ Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization
Shampoo and its efficient variant, SOAP, employ structured second-moment estimations and have shown strong performance for training neural networks (NNs). In practice, however, Shampoo typically requires step-size grafting with Adam to be competitive, and SOAP mitigates this by applying Adam in Shampoo's eigenbasis -- at the cost of additional memory overhead from Adam in both methods. Prior analyses have largely relied on the Frobenius norm to motivate these estimation schemes. We instead recast their estimation procedures as covariance estimation under Kullback-Leibler (KL) divergence minimization, revealing a previously overlooked theoretical limitation and motivating principled redesigns. Building on this perspective, we develop $\textbf{KL-Shampoo}$ and $\textbf{KL-SOAP}$, practical schemes that match or exceed the performance of Shampoo and SOAP in NN pre-training while achieving SOAP-level per-iteration runtime. Notably, KL-Shampoo does not rely on Adam to attain competitive performance, eliminating the memory overhead introduced by Adam. Across our experiments, KL-Shampoo consistently outperforms SOAP, Shampoo, and even KL-SOAP, establishing the KL-based approach as a compelling foundation for designing structured methods in NN optimization.
comment: improved the main text, working in progress
♻ ☆ MistralBSM: Leveraging Mistral-7B for Vehicular Networks Misbehavior Detection
Malicious attacks on vehicular networks pose a serious threat to road safety as well as communication reliability. A major source of these threats stems from misbehaving vehicles within the network. To address this challenge, we propose a Large Language Model (LLM)-empowered Misbehavior Detection System (MDS) within an edge-cloud detection framework. Specifically, we fine-tune Mistral-7B, a compact and high-performing LLM, to detect misbehavior based on Basic Safety Messages (BSM) sequences as the edge component for real-time detection, while a larger LLM deployed in the cloud validates and reinforces the edge model's detection through a more comprehensive analysis. By updating only 0.012% of the model parameters, our model, which we named MistralBSM, achieves 98% accuracy in binary classification and 96% in multiclass classification on a selected set of attacks from VeReMi dataset, outperforming LLAMA2-7B and RoBERTa. Our results validate the potential of LLMs in MDS, showing a significant promise in strengthening vehicular network security to better ensure the safety of road users.
♻ ☆ Reinforcement Learning from Human Feedback
Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF -- both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics -- understudied research questions in synthetic data and evaluation -- and open questions for the field.
comment: 144 pages. Web-native version at https://rlhfbook.com/ Continually improving, latest version at website
♻ ☆ Variational Inference in Location-Scale Families: Exact Recovery of the Mean and Correlation Matrix
Given an intractable target density $p$, variational inference (VI) attempts to find the best approximation $q$ from a tractable family $Q$. This is typically done by minimizing the exclusive Kullback-Leibler divergence, $\text{KL}(q||p)$. In practice, $Q$ is not rich enough to contain $p$, and the approximation is misspecified even when it is a unique global minimizer of $\text{KL}(q||p)$. In this paper, we analyze the robustness of VI to these misspecifications when $p$ exhibits certain symmetries and $Q$ is a location-scale family that shares these symmetries. We prove strong guarantees for VI not only under mild regularity conditions but also in the face of severe misspecifications. Namely, we show that (i) VI recovers the mean of $p$ when $p$ exhibits an \textit{even} symmetry, and (ii) it recovers the correlation matrix of $p$ when in addition~$p$ exhibits an \textit{elliptical} symmetry. These guarantees hold for the mean even when $q$ is factorized and $p$ is not, and for the correlation matrix even when~$q$ and~$p$ behave differently in their tails. We analyze various regimes of Bayesian inference where these symmetries are useful idealizations, and we also investigate experimentally how VI behaves in their absence.
♻ ☆ A Free Probabilistic Framework for Denoising Diffusion Models: Entropy, Transport, and Reverse Processes
This paper develops a rigorous probabilistic framework that extends denoising diffusion models to the setting of noncommutative random variables. Building on Voiculescu's theory of free entropy and free Fisher information, we formulate diffusion and reverse processes governed by operator-valued stochastic dynamics whose spectral measures evolve by additive convolution. Using tools from free stochastic analysis -- including a Malliavin calculus and a Clark--Ocone representation -- we derive the reverse-time stochastic differential equation driven by the conjugate variable, the analogue of the classical score function. The resulting dynamics admit a gradient-flow structure in the noncommutative Wasserstein space, establishing an information-geometric link between entropy production, transport, and deconvolution. We further construct a variational scheme analogous to the Jordan--Kinderlehrer--Otto (JKO) formulation and prove convergence toward the semicircular equilibrium. The framework provides functional inequalities (free logarithmic Sobolev, Talagrand, and HWI) that quantify entropy dissipation and Wasserstein contraction. These results unify diffusion-based generative modeling with the geometry of operator-valued information, offering a mathematical foundation for generative learning on structured and high-dimensional data.
♻ ☆ TiRex: Zero-Shot Forecasting Across Long and Short Horizons with Enhanced In-Context Learning NeurIPS 2025
In-context learning, the ability of large language models to perform tasks using only examples provided in the prompt, has recently been adapted for time series forecasting. This paradigm enables zero-shot prediction, where past values serve as context for forecasting future values, making powerful forecasting tools accessible to non-experts and increasing the performance when training data are scarce. Most existing zero-shot forecasting approaches rely on transformer architectures, which, despite their success in language, often fall short of expectations in time series forecasting, where recurrent models like LSTMs frequently have the edge. Conversely, while LSTMs are well-suited for time series modeling due to their state-tracking capabilities, they lack strong in-context learning abilities. We introduce TiRex that closes this gap by leveraging xLSTM, an enhanced LSTM with competitive in-context learning skills. Unlike transformers, state-space models, or parallelizable RNNs such as RWKV, TiRex retains state-tracking, a critical property for long-horizon forecasting. To further facilitate its state-tracking ability, we propose a training-time masking strategy called CPM. TiRex sets a new state of the art in zero-shot time series forecasting on the HuggingFace benchmarks GiftEval and Chronos-ZS, outperforming significantly larger models including TabPFN-TS (Prior Labs), Chronos Bolt (Amazon), TimesFM (Google), and Moirai (Salesforce) across both short- and long-term forecasts.
comment: presented at NeurIPS 2025
♻ ☆ How to Train Your LLM Web Agent: A Statistical Diagnosis
LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
♻ ☆ Solving Inequality Proofs with Large Language Models NeurIPS 2025
Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.
comment: 50 pages, 24 figures, accepted as a Spotlight at NeurIPS 2025
♻ ☆ Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning
Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory, significantly limiting real-world deployment. Recently, zeroth-order (ZO) optimization stood out as a promising memory-efficient training paradigm, avoiding backward passes and relying solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO method lags far behind FO method in both convergence speed and accuracy. To bridge the gap, we introduce a novel layer-wise divergence analysis that uncovers the distinct update pattern of FO and ZO optimization. Aiming to resemble the learning capacity of FO method from the findings, we propose Divergence-driven Zeroth-Order (DiZO) optimization. DiZO conducts divergence-driven layer adaptation by incorporating projections to ZO updates, generating diverse-magnitude updates precisely scaled to layer-wise individual optimization needs. Our results demonstrate that DiZO significantly reduces the needed iterations for convergence without sacrificing throughput, cutting training GPU hours by up to 48\% on various datasets. Moreover, DiZO consistently outperforms the representative ZO baselines in fine-tuning RoBERTa-large, OPT-series, and Llama-series on downstream tasks and, in some cases, even surpasses memory-intensive FO fine-tuning. Our code is released at https://github.com/Skilteee/DiZO.
♻ ☆ Differential privacy guarantees of Markov chain Monte Carlo algorithms
This paper aims to provide differential privacy (DP) guarantees for Markov chain Monte Carlo (MCMC) algorithms. In a first part, we establish DP guarantees on samples output by MCMC algorithms as well as Monte Carlo estimators associated with these methods under assumptions on the convergence properties of the underlying Markov chain. In particular, our results highlight the critical condition of ensuring the target distribution is differentially private itself. In a second part, we specialise our analysis to the unadjusted Langevin algorithm and stochastic gradient Langevin dynamics and establish guarantees on their (R\'enyi) DP. To this end, we develop a novel methodology based on Girsanov's theorem combined with a perturbation trick to obtain bounds for an unbounded domain and in a non-convex setting. We establish: (i) uniform in $n$ privacy guarantees when the state of the chain after $n$ iterations is released, (ii) bounds on the privacy of the entire chain trajectory. These findings provide concrete guidelines for privacy-preserving MCMC.
♻ ☆ Over-squashing in Spatiotemporal Graph Neural Networks NeurIPS 2025
Graph Neural Networks (GNNs) have achieved remarkable success across various domains. However, recent theoretical advances have identified fundamental limitations in their information propagation capabilities, such as over-squashing, where distant nodes fail to effectively exchange information. While extensively studied in static contexts, this issue remains unexplored in Spatiotemporal GNNs (STGNNs), which process sequences associated with graph nodes. Nonetheless, the temporal dimension amplifies this challenge by increasing the information that must be propagated. In this work, we formalize the spatiotemporal over-squashing problem and demonstrate its distinct characteristics compared to the static case. Our analysis reveals that, counterintuitively, convolutional STGNNs favor information propagation from points temporally distant rather than close in time. Moreover, we prove that architectures that follow either time-and-space or time-then-space processing paradigms are equally affected by this phenomenon, providing theoretical justification for computationally efficient implementations. We validate our findings on synthetic and real-world datasets, providing deeper insights into their operational dynamics and principled guidance for more effective designs.
comment: Accepted at NeurIPS 2025
♻ ☆ Understanding Endogenous Data Drift in Adaptive Models with Recourse-Seeking Users
Deep learning models are widely used in decision-making and recommendation systems, where they typically rely on the assumption of a static data distribution between training and deployment. However, real-world deployment environments often violate this assumption. Users who receive negative outcomes may adapt their features to meet model criteria, i.e., recourse action. These adaptive behaviors create shifts in the data distribution and when models are retrained on this shifted data, a feedback loop emerges: user behavior influences the model, and the updated model in turn reshapes future user behavior. Despite its importance, this bidirectional interaction between users and models has received limited attention. In this work, we develop a general framework to model user strategic behaviors and their interactions with decision-making systems under resource constraints and competitive dynamics. Both the theoretical and empirical analyses show that user recourse behavior tends to push logistic and MLP models toward increasingly higher decision standards, resulting in higher recourse costs and less reliable recourse actions over time. To mitigate these challenges, we propose two methods--Fair-top-k and Dynamic Continual Learning (DCL)--which significantly reduce recourse cost and improve model robustness. Our findings draw connections to economic theories, highlighting how algorithmic decision-making can unintentionally reinforce a higher standard and generate endogenous barriers to entry.
comment: 13 pages,4 figures, 3 tables
♻ ☆ Exploring the limits of strong membership inference attacks on large language models NeurIPS 2025
State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has either relied on weaker attacks that avoid training references (e.g., fine-tuning attacks), or on stronger attacks applied to small models and datasets. However, weaker attacks have been shown to be brittle and insights from strong attacks in simplified settings do not translate to today's LLMs. These challenges prompt an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA--one of the strongest MIAs--to GPT-2 architectures ranging from 10M to 1B parameters, training references on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in four key ways. While (1) strong MIAs can succeed on pre-trained LLMs, (2) their effectiveness, remains limited (e.g., AUC<0.7) in practical settings. (3) Even when strong MIAs achieve better-than-random AUC, aggregate metrics can conceal substantial per-sample MIA decision instability: due to training randomness, many decisions are so unstable that they are statistically indistinguishable from a coin flip. Finally, (4) the relationship between MIA success and related LLM privacy metrics is not as straightforward as prior work has suggested.
comment: NeurIPS 2025
♻ ☆ Optimizing Token Choice for Code Watermarking: An RL Approach
Protecting intellectual property on LLM-generated code necessitates effective watermarking systems that can operate within code's highly structured, syntactically constrained nature. In this work, we introduce CodeTracer, an innovative adaptive code watermarking framework underpinned by a novel reinforcement learning training paradigm. At its core, CodeTracer features a policy-driven approach that utilizes a parameterized model to intelligently bias token choices during next-token prediction. This strategy ensures that embedded watermarks maintain code functionality while exhibiting subtle yet statistically detectable deviations from typical token distributions. To facilitate policy learning, we devise a comprehensive reward system that seamlessly integrates execution feedback with watermark embedding signals, balancing process-level and outcome-level rewards. Additionally, we employ Gumbel Top-k reparameterization to enable gradient-based optimization of discrete watermarking decisions. Extensive comparative evaluations demonstrate CodeTracer's significant superiority over state-of-the-art baselines in both watermark detectability and the preservation of generated code's functionality.
comment: 18 pages, 3 figures
♻ ☆ Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models
Although multimodal large language models (MLLMs) exhibit remarkable reasoning capabilities on complex multimodal understanding tasks, they still suffer from the notorious hallucination issue: generating outputs misaligned with obvious visual or factual evidence. Currently, training-based solutions, like direct preference optimization (DPO), leverage paired preference data to suppress hallucinations. However, they risk sacrificing general reasoning capabilities due to the likelihood displacement. Meanwhile, training-free solutions, like contrastive decoding, achieve this goal by subtracting the estimated hallucination pattern from a distorted input. Yet, these handcrafted perturbations (e.g., add noise to images) may poorly capture authentic hallucination patterns. To avoid these weaknesses of existing methods, and realize robust hallucination mitigation (i.e., maintaining general reasoning performance), we propose a novel framework: Decoupling Contrastive Decoding (DCD). Specifically, DCD decouples the learning of positive and negative samples in preference datasets, and trains separate positive and negative image projections within the MLLM. The negative projection implicitly models real hallucination patterns, which enables vision-aware negative images in the contrastive decoding inference stage. Our DCD alleviates likelihood displacement by avoiding pairwise optimization and generalizes robustly without handcrafted degradation. Extensive ablations across hallucination benchmarks and general reasoning tasks demonstrate the effectiveness of DCD, i.e., it matches DPO's hallucination suppression while preserving general capabilities and outperforms the handcrafted contrastive decoding methods.
comment: 17 pages, 4 figures
♻ ☆ Physics-Informed Extreme Learning Machine (PIELM): Opportunities and Challenges
We are delighted to see the recent development of physics-informed extreme learning machine (PIELM) for its higher computational efficiency and accuracy compared to other physics-informed machine learning (PIML) paradigms. Since a comprehensive summary or review of PIELM is currently unavailable, we would like to take this opportunity to share our perspectives and experiences on this promising research direction. We can see that many efforts have been made to solve ordinary/partial differential equations (ODEs/PDEs) characterized by sharp gradients, nonlinearities, high-frequency behavior, hard constraints, uncertainty, multiphysics coupling, and interpretability. Despite these encouraging successes, many pressing challenges remain to be tackled, which also provides opportunities to develop more robust, interpretable, and generalizable PIELM frameworks for scientific and engineering applications.
♻ ☆ ORFit: One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares
While large machine learning models have shown remarkable performance in various domains, their training typically requires iterating for many passes over the training data. However, due to computational and memory constraints and potential privacy concerns, storing and accessing all the data is impractical in many real-world scenarios where the data arrives in a stream. In this paper, we investigate the problem of one-pass learning, in which a model is trained on sequentially arriving data without retraining on previous datapoints. Motivated by the demonstrated effectiveness of overparameterized models and the phenomenon of benign overfitting, we propose Orthogonal Recursive Fitting (ORFit), an algorithm for one-pass learning which seeks to perfectly fit each new datapoint while minimally altering the predictions on previous datapoints. ORFit updates the parameters in a direction orthogonal to past gradients, similar to orthogonal gradient descent (OGD) in continual learning. We show that, interestingly, ORFit's update leads to an operation similar to the recursive least-squares (RLS) algorithm in adaptive filtering but with significantly improved memory and computational efficiency, i.e., linear, instead of quadratic, in the number of parameters. To further reduce memory usage, we leverage the structure of the streaming data via an incremental principal component analysis (IPCA). We show that using the principal components is minimax optimal, i.e., it minimizes the worst-case forgetting of previous predictions for unknown future updates. Further, we prove that, for overparameterized linear models, the parameter vector obtained by ORFit matches what the standard multi-pass stochastic gradient descent (SGD) would converge to. Finally, we extend our results to the nonlinear setting for highly overparameterized models, relevant for deep learning.
comment: Journal extension of v1: Y. Min, K, Ahn, N. Azizan, "One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares," IEEE Conference on Decision and Control, 2022
♻ ☆ Co-MTP: A Cooperative Trajectory Prediction Framework with Multi-Temporal Fusion for Autonomous Driving ICRA 2025
Vehicle-to-everything technologies (V2X) have become an ideal paradigm to extend the perception range and see through the occlusion. Exiting efforts focus on single-frame cooperative perception, however, how to capture the temporal cue between frames with V2X to facilitate the prediction task even the planning task is still underexplored. In this paper, we introduce the Co-MTP, a general cooperative trajectory prediction framework with multi-temporal fusion for autonomous driving, which leverages the V2X system to fully capture the interaction among agents in both history and future domains to benefit the planning. In the history domain, V2X can complement the incomplete history trajectory in single-vehicle perception, and we design a heterogeneous graph transformer to learn the fusion of the history feature from multiple agents and capture the history interaction. Moreover, the goal of prediction is to support future planning. Thus, in the future domain, V2X can provide the prediction results of surrounding objects, and we further extend the graph transformer to capture the future interaction among the ego planning and the other vehicles' intentions and obtain the final future scenario state under a certain planning action. We evaluate the Co-MTP framework on the real-world dataset V2X-Seq, and the results show that Co-MTP achieves state-of-the-art performance and that both history and future fusion can greatly benefit prediction.
comment: 8 pages, 3 figures, ICRA 2025
♻ ☆ Gymnasium: A Standard Interface for Reinforcement Learning Environments NeurIPS
Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. However, despite its promise, RL research is often hindered by the lack of standardization in environment and algorithm implementations. This makes it difficult for researchers to compare and build upon each other's work, slowing down progress in the field. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. Gymnasium is available online at https://github.com/Farama-Foundation/Gymnasium
comment: Accepted at NeurIPS Datasets and Benchmarks 2025
♻ ☆ Evaluating Federated Learning for At-Risk Student Prediction: A Comparative Analysis of Model Complexity and Data Balancing
This study proposes and validates a Federated Learning (FL) framework to proactively identify at-risk students while preserving data privacy. Persistently high dropout rates in distance education remain a pressing institutional challenge. Using the large-scale OULAD dataset, we simulate a privacy-centric scenario where models are trained on early academic performance and digital engagement patterns. Our work investigates the practical trade-offs between model complexity (Logistic Regression vs. a Deep Neural Network) and the impact of local data balancing. The resulting federated model achieves strong predictive power (ROC AUC approximately 85%), demonstrating that FL is a practical and scalable solution for early-warning systems that inherently respects student data sovereignty.
comment: This article has been prepared to be submitted to the Fundamenta Informaticae Journal
♻ ☆ Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
comment: Preprint
♻ ☆ Hyper-Transforming Latent Diffusion Models
We introduce a novel generative framework for functions by integrating Implicit Neural Representations (INRs) and Transformer-based hypernetworks into latent variable models. Unlike prior approaches that rely on MLP-based hypernetworks with scalability limitations, our method employs a Transformer-based decoder to generate INR parameters from latent variables, addressing both representation capacity and computational efficiency. Our framework extends latent diffusion models (LDMs) to INR generation by replacing standard decoders with a Transformer-based hypernetwork, which can be trained either from scratch or via hyper-transforming: a strategy that fine-tunes only the decoder while freezing the pre-trained latent space. This enables efficient adaptation of existing generative models to INR-based representations without requiring full retraining. We validate our approach across multiple modalities, demonstrating improved scalability, expressiveness, and generalization over existing INR-based generative models. Our findings establish a unified and flexible framework for learning structured function representations.
♻ ☆ REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning NeurIPS 2025
Recent rehearsal-free continual learning (CL) methods guided by prompts achieve strong performance on vision tasks with non-stationary data but remain resource-intensive, hindering real-world edge deployment. We introduce resource-efficient prompting (REP), which improves the computational and memory efficiency of prompt-based rehearsal-free continual learning methods while minimizing accuracy trade-offs. Our approach employs swift prompt selection to refine input data using a carefully provisioned model and introduces adaptive token merging (AToM) and adaptive layer dropping (ALD) for efficient prompt updates. AToM and ALD selectively skip data and model layers while preserving task-specific features during the learning of new tasks. Extensive experiments on multiple image classification datasets demonstrate REP's superior resource efficiency over state-of-the-art rehearsal-free CL methods.
comment: accepted to NeurIPS 2025
♻ ☆ Computational Budget Should Be Considered in Data Selection
Data selection improves computational efficiency by choosing informative subsets of training samples. However, existing methods ignore the compute budget, treating data selection and importance evaluation independently of compute budget constraints. Yet empirical studies show no algorithm can consistently outperform others (or even random selection) across varying budgets. We therefore argue that compute budget must be integral to data-selection strategies, since different budgets impose distinct requirements on data quantity, quality, and distribution for effective training. To this end, we propose a novel Computational budget-Aware Data Selection (CADS) method and naturally formulate it into a bilevel optimization framework, where the inner loop trains the model within the constraints of the computational budget on some selected subset of training data, while the outer loop optimizes data selection based on model evaluation. Our technical contributions lie in addressing two main challenges in solving this bilevel optimization problem: the expensive Hessian matrix estimation for outer-loop gradients and the computational burden of achieving inner-loop optimality during iterations. To solve the first issue, we propose a probabilistic reparameterization strategy and compute the gradient using a Hessian-free policy gradient estimator. To address the second challenge, we transform the inner optimization problem into a penalty term in the outer objective, further discovering that we only need to estimate the minimum of a one-dimensional loss to calculate the gradient, significantly improving efficiency. Extensive experiments show that our method achieves performance gains of up to 14.42% over baselines in vision and language benchmarks.
♻ ☆ Let's Grow an Unbiased Community: Guiding the Fairness of Graphs via New Links
Graph Neural Networks (GNNs) have achieved remarkable success across diverse applications. However, due to the biases in the graph structures, graph neural networks face significant challenges in fairness. Although the original user graph structure is generally biased, it is promising to guide these existing structures toward unbiased ones by introducing new links. The fairness guidance via new links could foster unbiased communities, thereby enhancing fairness in downstream applications. To address this issue, we propose a novel framework named FairGuide. Specifically, to ensure fairness in downstream tasks trained on fairness-guided graphs, we introduce a differentiable community detection task as a pseudo downstream task. Our theoretical analysis further demonstrates that optimizing fairness within this pseudo task effectively enhances structural fairness, promoting fairness generalization across diverse downstream applications. Moreover, FairGuide employs an effective strategy which leverages meta-gradients derived from the fairness-guidance objective to identify new links that significantly enhance structural fairness. Extensive experimental results demonstrate the effectiveness and generalizability of our proposed method across a variety of graph-based fairness tasks.
♻ ☆ Ocean Wave Forecasting with Deep Learning as Alternative to Conventional Models
This study presents OceanCastNet (OCN), a machine learning approach for wave forecasting that incorporates wind and wave fields to predict significant wave height, mean wave period, and mean wave direction.We evaluate OCN's performance against the operational ECWAM model using two independent datasets: NDBC buoy and Jason-3 satellite observations. NDBC station validation indicates OCN performs better at 24 stations compared to ECWAM's 10 stations, and Jason-3 satellite validation confirms similar accuracy across 228-hour forecasts. OCN successfully captures wave patterns during extreme weather conditions, demonstrated through Typhoon Goni with prediction errors typically within $\pm$0.5 m. The approach also offers computational efficiency advantages. The results suggest that machine learning approaches can achieve performance comparable to conventional wave forecasting systems for operational wave prediction applications.
comment: Accepted manuscript. Final version published in Journal of Advances in Modeling Earth Systems: https://doi.org/10.1029/2025MS005285
♻ ☆ Information-Theoretic Framework for Understanding Modern Machine-Learning
We introduce an information-theoretic framework that views learning as universal prediction under log loss, characterized through regret bounds. Central to the framework is an effective notion of architecture-based model complexity, defined by the probability mass or volume of models in the vicinity of the data-generating process, or its projection on the model class. This volume is related to spectral properties of the expected Hessian or the Fisher Information Matrix, leading to tractable approximations. We argue that successful architectures possess a broad complexity range, enabling learning in highly over-parameterized model classes. The framework sheds light on the role of inductive biases, the effectiveness of stochastic gradient descent, and phenomena such as flat minima. It unifies online, batch, supervised, and generative settings, and applies across the stochastic-realizable and agnostic regimes. Moreover, it provides insights into the success of modern machine-learning architectures, such as deep neural networks and transformers, suggesting that their broad complexity range naturally arises from their layered structure. These insights open the door to the design of alternative architectures with potentially comparable or even superior performance.
♻ ☆ When Models Don't Collapse: On the Consistency of Iterative MLE
The widespread use of generative models has created a feedback loop, in which each generation of models is trained on data partially produced by its predecessors. This process has raised concerns about model collapse: A critical degradation in performance caused by repeated training on synthetic data. However, different analyses in the literature have reached different conclusions as to the severity of model collapse. As such, it remains unclear how concerning this phenomenon is, and under which assumptions it can be avoided. To address this, we theoretically study model collapse for maximum likelihood estimation (MLE), in a natural setting where synthetic data is gradually added to the original data set. Under standard assumptions (similar to those long used for proving asymptotic consistency and normality of MLE), we establish non-asymptotic bounds showing that collapse can be avoided even as the fraction of real data vanishes. On the other hand, we prove that some assumptions (beyond MLE consistency) are indeed necessary: Without them, model collapse can occur arbitrarily quickly, even when the original data is still present in the training set. To the best of our knowledge, these are the first rigorous examples of iterative generative modeling with accumulating data that rapidly leads to model collapse.
♻ ☆ Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators
Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today's computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating low-resource languages, such as specialized tensor accelerator code still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three distinct hardware platforms, we demonstrate that Autocomp-optimized code runs 5.6x faster than the vendor-provided library (Gemmini), outperforms expert-level hand-tuned code by 1.9x (AWS Trainium), and achieves 3.8x higher performance than a machine learning-based cost model for GPUs (NVIDIA L40S). Additionally, we demonstrate that optimization schedules generated from Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.
comment: 10 pages + appendices
♻ ☆ FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs EMNLP
The accurate trust assessment of multimodal large language models (MLLMs) generated predictions, which can enable selective prediction and improve user confidence, is challenging due to the diverse multi-modal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs, that generates an uncertainty measure based on the equivalent and complementary input samplings. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA uses only input-output access of the model (black-box), and does not require ground truth (unsupervised). The experiments are conducted with various off-the-shelf multi-modal LLMs, on both visual and audio reasoning tasks. The proposed FESTA uncertainty estimate achieves significant improvement (33.3% relative improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) in selective prediction performance, based on area-under-receiver-operating-characteristic curve (AUROC) metric in detecting mispredictions. The code implementation is open-sourced.
comment: Accepted in the Findings of EMNLP, 2025
♻ ☆ Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning
We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method. Our code can be found on https://github.com/divelab/E2H-Reasoning.
♻ ☆ Disentangled Lottery Tickets: Identifying and Assembling Core and Specialist Subnetworks
The Lottery Ticket Hypothesis (LTH) suggests that within large neural networks, there exist sparse, trainable "winning tickets" capable of matching the performance of the full model, but identifying them through Iterative Magnitude Pruning (IMP) is computationally expensive. Recent work introduced COLT, an accelerator that discovers a "consensus" subnetwork by intersecting masks from models trained on disjoint data partitions; however, this approach discards all non-overlapping weights, assuming they are unimportant. This paper challenges that assumption and proposes the Disentangled Lottery Ticket (DiLT) Hypothesis, which posits that the intersection mask represents a universal, task-agnostic "core" subnetwork, while the non-overlapping difference masks capture specialized, task-specific "specialist" subnetworks. A framework is developed to identify and analyze these components using the Gromov-Wasserstein (GW) distance to quantify functional similarity between layer representations and reveal modular structures through spectral clustering. Experiments on ImageNet and fine-grained datasets such as Stanford Cars, using ResNet and Vision Transformer architectures, show that the "core" ticket provides superior transfer learning performance, the "specialist" tickets retain domain-specific features enabling modular assembly, and the full re-assembled "union" ticket outperforms COLT - demonstrating that non-consensus weights play a critical functional role. This work reframes pruning as a process for discovering modular, disentangled subnetworks rather than merely compressing models.
♻ ☆ Characterization and Learning of Causal Graphs from Hard Interventions NeurIPS 2025
A fundamental challenge in the empirical sciences involves uncovering causal structure through observation and experimentation. Causal discovery entails linking the conditional independence (CI) invariances in observational data to their corresponding graphical constraints via d-separation. In this paper, we consider a general setting where we have access to data from multiple experimental distributions resulting from hard interventions, as well as potentially from an observational distribution. By comparing different interventional distributions, we propose a set of graphical constraints that are fundamentally linked to Pearl's do-calculus within the framework of hard interventions. These graphical constraints associate each graphical structure with a set of interventional distributions that are consistent with the rules of do-calculus. We characterize the interventional equivalence class of causal graphs with latent variables and introduce a graphical representation that can be used to determine whether two causal graphs are interventionally equivalent, i.e., whether they are associated with the same family of hard interventional distributions, where the elements of the family are indistinguishable using the invariances from do-calculus. We also propose a learning algorithm to integrate multiple datasets from hard interventions, introducing new orientation rules. The learning objective is a tuple of augmented graphs which entails a set of causal graphs. We also prove the soundness of the proposed algorithm.
comment: Accepted at NeurIPS 2025
♻ ☆ Lyapunov Neural ODE State-Feedback Control Policies
Deep neural networks are increasingly used as an effective parameterization of control policies in various learning-based control paradigms. For continuous-time optimal control problems (OCPs), which are central to many decision-making tasks, control policy learning can be cast as a neural ordinary differential equation (NODE) problem wherein state and control constraints are naturally accommodated. This paper presents a NODE approach to solving continuous-time OCPs for the case of stabilizing a known constrained nonlinear system around a target state. The approach, termed Lyapunov-NODE control (L-NODEC), uses a novel Lyapunov loss formulation that incorporates an exponentially-stabilizing control Lyapunov function to learn a state-feedback neural control policy, bridging the gap of solving continuous-time OCPs via NODEs with stability guarantees. The proposed Lyapunov loss allows L-NODEC to guarantee exponential stability of the controlled system, as well as its adversarial robustness to perturbations to the initial state. The performance of L-NODEC is illustrated in two problems, including a dose delivery problem in plasma medicine. In both cases, L-NODEC effectively stabilizes the controlled system around the target state despite perturbations to the initial state and reduces the inference time necessary to reach the target.
♻ ☆ FairAIED: Navigating Fairness, Bias, and Ethics in Educational AI Applications
The integration of AI in education holds immense potential for personalizing learning experiences and transforming instructional practices. However, AI systems can inadvertently encode and amplify biases present in educational data, leading to unfair or discriminatory outcomes. As researchers have sought to understand and mitigate these biases, a growing body of work has emerged examining fairness in educational AI. These studies, though expanding rapidly, remain fragmented due to differing assumptions, methodologies, and application contexts. Moreover, existing surveys either focus on algorithmic fairness without an educational setting or emphasize educational methods while overlooking fairness. To this end, this survey provides a comprehensive systematic review of algorithmic fairness within educational AI, explicitly bridging the gap between technical fairness research and educational applications. We integrate multiple dimensions, including bias sources, fairness definitions, mitigation strategies, evaluation resources, and ethical considerations, into a harmonized, education-centered framework. In addition, we explicitly examine practical challenges such as censored or partially observed learning outcomes and the persistent difficulty in quantifying and managing the trade-off between fairness and predictive utility, enhancing the applicability of fairness frameworks to real-world educational AI systems. Finally, we outline an emerging pathway toward fair AI-driven education and by situating these technologies and practical insights within broader educational and ethical contexts, this review establishes a comprehensive foundation for advancing fairness, accountability, and inclusivity in the field of AI education.
♻ ☆ Enhancing Sequential Model Performance with Squared Sigmoid TanH (SST) Activation Under Data Constraints
Activation functions enable neural networks to learn complex representations by introducing non-linearities. While feedforward models commonly use rectified linear units, sequential models like recurrent neural networks, long short-term memory (LSTMs) and gated recurrent units (GRUs) still rely on Sigmoid and TanH activation functions. However, these classical activation functions often struggle to model sparse patterns when trained on small sequential datasets to effectively capture temporal dependencies. To address this limitation, we propose squared Sigmoid TanH (SST) activation specifically tailored to enhance the learning capability of sequential models under data constraints. SST applies mathematical squaring to amplify differences between strong and weak activations as signals propagate over time, facilitating improved gradient flow and information filtering. We evaluate SST-powered LSTMs and GRUs for diverse applications, such as sign language recognition, regression, and time-series classification tasks, where the dataset is limited. Our experiments demonstrate that SST models consistently outperform RNN-based models with baseline activations, exhibiting improved test accuracy.
comment: 10 pages,9 figures
♻ ☆ APALU: A Trainable, Adaptive Activation Function for Deep Learning Networks
Activation function is a pivotal component of deep learning, facilitating the extraction of intricate data patterns. While classical activation functions like ReLU and its variants are extensively utilized, their static nature and simplicity, despite being advantageous, often limit their effectiveness in specialized tasks. The trainable activation functions also struggle sometimes to adapt to the unique characteristics of the data. Addressing these limitations, we introduce a novel trainable activation function, adaptive piecewise approximated activation linear unit (APALU), to enhance the learning performance of deep learning across a broad range of tasks. It presents a unique set of features that enable it to maintain stability and efficiency in the learning process while adapting to complex data representations. Experiments reveal significant improvements over widely used activation functions for different tasks. In image classification, APALU increases MobileNet and GoogleNet accuracy by 0.37% and 0.04%, respectively, on the CIFAR10 dataset. In anomaly detection, it improves the average area under the curve of One-CLASS Deep SVDD by 0.8% on the MNIST dataset, 1.81% and 1.11% improvements with DifferNet, and knowledge distillation, respectively, on the MVTech dataset. Notably, APALU achieves 100% accuracy on a sign language recognition task with a limited dataset. For regression tasks, APALU enhances the performance of deep neural networks and recurrent neural networks on different datasets. These improvements highlight the robustness and adaptability of APALU across diverse deep-learning applications.
comment: 9 pages, 4 figures
♻ ☆ Tool and Tutor? Experimental evidence from AI deployment in cancer diagnosis
Numerous countries globally face shortages of medical experts, deepening inequalities in access to healthcare. Artificial Intelligence (AI)-based diagnostic tools hold considerable promise to tackle this challenge by enabling even novices to deliver expert-level medical services. However, reliance on AI for task completion may hinder the learning required for novices to develop expertise. We thus explore whether AI-based diagnostic tools can be used to enhance not only performance but also learning in the context of lung cancer diagnosis. We examine the distinct effects of AI input during training (i.e., learning how to diagnose) versus in practice (i.e., completing diagnostic tasks) on novice medical professionals' performance. In two field experiments, 576 medical students were randomly assigned across conditions, manipulating the access to AI input during their training, during a test of their diagnostic capabilities, or both. During practice, participants diagnosed potential lung cancer cases using chest CT scans, and their diagnoses were evaluated against the ground truth obtained through histopathological examinations. Study 1 (N = 336) revealed that AI input in training alone improved human diagnostic accuracy by 3.2 percentage points over the control, while AI input during practice alone increased human accuracy by 7.9 percentage points. Combined deployment in both training and practice yielded an improvement of 13.7 percentage points--significantly exceeding either approach alone. Study 2 (N = 240) showed that AI input in practice alone improved accuracy in subsequent practice, unaided by AI, by 9.9 percentage points over the control. Even minimally informative AI input in training improved diagnostic accuracy by 5.3 percentage points over the control. These results reveal AI's dual role: As a tool, it could rapidly improve novices' performance.
♻ ☆ Distributionally Robust Optimization with Adversarial Data Contamination
Distributionally Robust Optimization (DRO) provides a framework for decision-making under distributional uncertainty, yet its effectiveness can be compromised by outliers in the training data. This paper introduces a principled approach to simultaneously address both challenges. We focus on optimizing Wasserstein-1 DRO objectives for generalized linear models with convex Lipschitz loss functions, where an $\epsilon$-fraction of the training data is adversarially corrupted. Our primary contribution lies in a novel modeling framework that integrates robustness against training data contamination with robustness against distributional shifts, alongside an efficient algorithm inspired by robust statistics to solve the resulting optimization problem. We prove that our method achieves an estimation error of $O(\sqrt{\epsilon})$ for the true DRO objective value using only the contaminated data under the bounded covariance assumption. This work establishes the first rigorous guarantees, supported by efficient computation, for learning under the dual challenges of data contamination and distributional shifts.
♻ ☆ On the Variance, Admissibility, and Stability of Empirical Risk Minimization
It is well known that Empirical Risk Minimization (ERM) may attain minimax suboptimal rates in terms of the mean squared error (Birg\'e and Massart, 1993). In this paper, we prove that, under relatively mild assumptions, the suboptimality of ERM must be due to its large bias. Namely, the variance error term of ERM is bounded by the minimax rate. In the fixed design setting, we provide an elementary proof of this result using the probabilistic method. Then, we extend our proof to the random design setting for various models. In addition, we provide a simple proof of Chatterjee's admissibility theorem (Chatterjee, 2014, Theorem 1.4), which states that in the fixed design setting, ERM cannot be ruled out as an optimal method, and then we extend this result to the random design setting. We also show that our estimates imply the stability of ERM, complementing the main result of Caponnetto and Rakhlin (2006) for non-Donsker classes. Finally, we highlight the somewhat irregular nature of the loss landscape of ERM in the non-Donsker regime, by showing that functions can be close to ERM, in terms of $L_2$ distance, while still being far from almost-minimizers of the empirical loss.
comment: Appeared at Neurips 2023 (spotlight)
♻ ☆ AI-Powered Energy Algorithmic Trading: Integrating Hidden Markov Models with Neural Networks
In quantitative finance, machine learning methods are essential for alpha generation. This study introduces a new approach that combines Hidden Markov Models (HMM) and neural networks, integrated with Black-Litterman portfolio optimization. During the COVID period (2019-2022), this dual-model approach achieved a 83% return with a Sharpe ratio of 0.77. It incorporates two risk models to enhance risk management, showing efficiency during volatile periods. The methodology was implemented on the QuantConnect platform, which was chosen for its robust framework and experimental reproducibility. The system, which predicts future price movements, includes a three-year warm-up to ensure proper algorithm function. It targets highly liquid, large-cap energy stocks to ensure stable and predictable performance while also considering broker payments. The dual-model alpha system utilizes log returns to select the optimal state based on the historical performance. It combines state predictions with neural network outputs, which are based on historical data, to generate trading signals. This study examined the architecture of the trading system, data pre-processing, training, and performance. The full code and backtesting data are available under the QuantConnect terms: https://github.com/tiagomonteiro0715/AI-Powered-Energy-Algorithmic-Trading-Integrating-Hidden-Markov-Models-with-Neural-Networks
comment: 14 pages, 4 figures, 2 tables
♻ ☆ Wait-Less Offline Tuning and Re-solving for Online Decision Making
Online linear programming (OLP) has found broad applications in revenue management and resource allocation. State-of-the-art OLP algorithms achieve low regret by repeatedly solving linear programming (LP) subproblems that incorporate updated resource information. However, LP-based methods are computationally expensive and often inefficient for large-scale applications. In contrast, recent first-order OLP algorithms are more computationally efficient but typically suffer from worse regret guarantees. To address these shortcomings, we propose a new algorithm that combines the strengths of LP-based and first-order OLP methods. The algorithm re-solves the LP subproblems periodically at a predefined frequency $f$ and uses the latest dual prices to guide online decision-making. In addition, a first-order method runs in parallel during each interval between LP re-solves, smoothing resource consumption. Our algorithm achieves $\mathscr{O}(\log (T/f) + \sqrt{f})$ regret, delivering a "wait-less" online decision-making process that balances the computational efficiency of first-order methods and the superior regret guarantee of LP-based methods.
comment: In this version, we achieve a tighter regret bound with the warm start for the first batch. We also make the proof more elegant by manually accepting all subsequent orders once the constraint is violated. In this way, we do not need to introduce the concept of stopping time for the analysis of the LP-based method
♻ ☆ Mechanism Learning: reverse causal inference in the presence of multiple unknown confounding through causally weighted Gaussian mixture models
A major limitation of machine learning (ML) prediction models is that they recover associational, rather than causal, predictive relationships between variables. In high-stakes automation applications of ML this is problematic, as the model often learns spurious, non-causal associations. This paper proposes mechanism learning, a simple method which uses causally weighted Gaussian Mixture Models (CW-GMMs) to deconfound observational data such that any appropriate ML model is forced to learn predictive relationships between effects and their causes (reverse causal inference), despite the potential presence of multiple unknown and unmeasured confounding. Effect variables can be very high-dimensional, and the predictive relationship nonlinear, as is common in ML applications. This novel method is widely applicable, the only requirement is the existence of a set of mechanism variables mediating the cause (prediction target) and effect (feature data), which is independent of the (unmeasured) confounding variables. We test our method on fully synthetic, semi-synthetic and real-world datasets, demonstrating that it can discover reliable, unbiased, causal ML predictors where by contrast, the same ML predictor trained naively using classical supervised learning on the original observational data, is heavily biased by spurious associations. We provide code to implement the results in the paper, online.
comment: 15 pages, 7 figures
♻ ☆ PromptWise: Online Learning for Cost-Aware Prompt Assignment in Generative Models
The rapid advancement of generative AI has provided users with a wide range of well-trained models to address diverse prompts. When selecting a model for a given prompt, users should weigh not only its performance but also its service cost. However, existing model-selection methods typically emphasize performance while overlooking cost differences. In this paper, we introduce PromptWise, an online learning framework that assigns prompts to generative models in a cost-aware manner. PromptWise estimates prompt-model compatibility to select the least expensive model expected to deliver satisfactory outputs. Unlike standard contextual bandits that make a one-shot decision per prompt, PromptWise employs a cost-aware bandit structure that allows sequential model assignments per prompt to reduce total service cost. Through numerical experiments on tasks such as code generation and translation, we demonstrate that PromptWise can achieve performance comparable to baseline selection methods while incurring substantially lower costs. The code is available at: github.com/yannxiaoyanhu/PromptWise.
comment: 39 pages
♻ ☆ SST: Multi-Scale Hybrid Mamba-Transformer Experts for Time Series Forecasting CIKM 2025
Time series forecasting has made significant advances, including with Transformer-based models. The attention mechanism in Transformer effectively captures temporal dependencies by attending to all past inputs simultaneously. However, its quadratic complexity with respect to sequence length limits the scalability for long-range modeling. Recent state space models (SSMs) such as Mamba offer a promising alternative by achieving linear complexity without attention. Yet, Mamba compresses historical information into a fixed-size latent state, potentially causing information loss and limiting representational effectiveness. This raises a key research question: Can we design a hybrid Mamba-Transformer architecture that is both effective and efficient for time series forecasting? To address it, we adapt a hybrid Mamba-Transformer architecture Mambaformer, originally proposed for language modeling, to the time series domain. Preliminary experiments reveal that naively stacking Mamba and Transformer layers in Mambaformer is suboptimal for time series forecasting, due to an information interference problem. To mitigate this issue, we introduce a new time series decomposition strategy that separates time series into long-range patterns and short-range variations. Then we show that Mamba excels at capturing long-term structures, while Transformer is more effective at modeling short-term dynamics. Building on this insight, we propose State Space Transformer (SST), a multi-scale hybrid model with expert modules: a Mamba expert for long-range patterns and a Transformer expert for short-term variations. SST also employs a multi-scale patching mechanism to adaptively adjust time series resolution: low resolution for long-term patterns and high resolution for short-term variations. Experiments show that SST obtains SOTA performance with linear scalability. The code is at https://github.com/XiongxiaoXu/SST.
comment: CIKM 2025
♻ ☆ Distributionally Robust Wireless Semantic Communication with Large AI Models
Semantic communication (SemCom) has emerged as a promising paradigm for 6G wireless systems by transmitting task-relevant information rather than raw bits, yet existing approaches remain vulnerable to dual sources of uncertainty: semantic misinterpretation arising from imperfect feature extraction and transmission-level perturbations from channel noise. Current deep learning based SemCom systems typically employ domain-specific architectures that lack robustness guarantees and fail to generalize across diverse noise conditions, adversarial attacks, and out-of-distribution data. In this paper, a novel and generalized semantic communication framework called WaSeCom is proposed to systematically address uncertainty and enhance robustness. In particular, Wasserstein distributionally robust optimization is employed to provide resilience against semantic misinterpretation and channel perturbations. A rigorous theoretical analysis is performed to establish the robust generalization guarantees of the proposed framework. Experimental results on image and text transmission demonstrate that WaSeCom achieves improved robustness under noise and adversarial perturbations. These results highlight its effectiveness in preserving semantic fidelity across varying wireless conditions.
comment: Under Review
♻ ☆ Hybrid-Task Meta-Learning: A GNN Approach for Scalable and Transferable Bandwidth Allocation
In this paper, we develop a deep learning-based bandwidth allocation policy that is: 1) scalable with the number of users and 2) transferable to different communication scenarios, such as non-stationary wireless channels, different quality-of-service (QoS) requirements, and dynamically available resources. To support scalability, the bandwidth allocation policy is represented by a graph neural network (GNN), with which the number of training parameters does not change with the number of users. To enable the generalization of the GNN, we develop a hybrid-task meta-learning (HML) algorithm that trains the initial parameters of the GNN with different communication scenarios during meta-training. Next, during meta-testing, a few samples are used to fine-tune the GNN with unseen communication scenarios. Simulation results demonstrate that our HML approach can improve the initial performance by 8.79%, and sample efficiency by 73%, compared with existing benchmarks. After fine-tuning, our near-optimal GNN-based policy can achieve close to the same reward with much lower inference complexity compared to the optimal policy obtained using iterative optimization. Numerical results validate that our HML can reduce the computation time by approximately 200 to 2000 times than the optimal iterative algorithm.
♻ ☆ Federated Vision-Language-Recommendation with Personalized Fusion
Applying large pre-trained Vision-Language Models to recommendation is a burgeoning field, a direction we term Vision-Language-Recommendation (VLR). Bringing VLR to user-oriented on-device intelligence within a federated learning framework is a crucial step for enhancing user privacy and delivering personalized experiences. This paper introduces FedVLR, a federated VLR framework specially designed for user-specific personalized fusion of vision-language representations. At its core is a novel bi-level fusion mechanism: The server-side multi-view fusion module first generates a diverse set of pre-fused multimodal views. Subsequently, each client employs a user-specific mixture-of-expert mechanism to adaptively integrate these views based on individual user interaction history. This designed lightweight personalized fusion module provides an efficient solution to implement a federated VLR system. The effectiveness of our proposed FedVLR has been validated on seven benchmark datasets.
comment: 15 pages, 10 figures, 7 tables, conference
Human-Computer Interaction 11
☆ Defining a Role-Centered Terminology for Physical Representations and Controls
Previous classifications advanced research through a better understanding of the field and the variety of tangible user interfaces and related physical user interfaces, especially by discretizing a degree of tangibility based on the specimens produced by the community over the years, since the conceptualization of Tangible User Interface initiated a research effort to deepen the exploration of the concept. However, no taxonomy enables the classification of tangible user interfaces at the application level. This article proposes to refine the description of tangible user interfaces' interactional components through a terminological approach. The resulting terms are blended words, built from known words, that self-contain what digital role is represented or controlled and how it becomes physical. This holistic terminology then enables the definition of applications' hallmarks and four classes of tangibility for applications, which surpass the description of physical user interface specimens' morphology by abstracting and discriminating specimens at the applicative level. The descriptiveness and holisticness of the new terminology, as well as the clustering and discriminative power of the limited number of four classes, are showed on a corpus of applicative tangible user interfaces' specimens from the literature. Promising future work will benefit from the holistic terminology, the applications' hallmarks, and the tangibility classes, to describe applicative tangible user interfaces and related physical user interfaces to better understand the dozens of specimens that were produced by the field over three decades. Indeed, describing and classifying this whole set would deepen our understanding to provide tools for future developers and designers.
comment: 16 pages. Approximately 7000 words. 4 figures and 7 tables
☆ Dynamic Theater: Location-Based Immersive Dance Theater, Investigating User Guidance and Experience
Dynamic Theater explores the use of augmented reality (AR) in immersive theater as a platform for digital dance performances. The project presents a locomotion-based experience that allows for full spatial exploration. A large indoor AR theater space was designed to allow users to freely explore the augmented environment. The curated wide-area experience employs various guidance mechanisms to direct users to the main content zones. Results from our 20-person user study show how users experience the performance piece while using a guidance system. The importance of stage layout, guidance system, and dancer placement in immersive theater experiences are highlighted as they cater to user preferences while enhancing the overall reception of digital content in wide-area AR. Observations after working with dancers and choreographers, as well as their experience and feedback are also discussed.
comment: Conference Paper, 11 pages. Published at the 2023 ACM Symposium on Virtual Reality Software and Technology (VRST)
☆ "Less is More": Reducing Cognitive Load and Task Drift in Real-Time Multimodal Assistive Agents for the Visually Impaired
Vision-Language Models (VLMs) enable on-demand visual assistance, yet current applications for people with visual impairments (PVI) impose high cognitive load and exhibit task drift, limiting real-world utility. We first conducted a formative study with 15 PVI and identified three requirements for visually impaired assistance (VIA): low latency for real-time use, minimal cognitive load, and hallucination-resistant responses to sustain trust. Informed by the formative study, we present VIA-Agent, a prototype that co-optimizes its cognitive 'brain' and interactive 'body'. The brain implements a goal-persistent design with calibrated conciseness to produce brief, actionable guidance; the body adopts a real-time communication (RTC) embodiment-evolving from a request-response model Context Protocol (MCP) pipeline-to-support fluid interaction. We evaluated VIA-Agent with 9 PVI across navigation and object retrieval in the wild against BeMyAI and Doubao. VIA-Agent significantly outperformed BeMyAI both quantitatively and qualitatively. While achieving success rates comparable to Doubao, it reduced mean task time by 39.9% (70.1 s vs. 110.7 s), required fewer conversational turns (4.3 vs. 5.0), and lowered perceived cognitive load and task drift. System Usability Scale (SUS) results aligned with these findings, with VIA-Agent achieving the highest usability. We hope this work inspires the development of more human-centered VIA systems.
comment: 20 pages
☆ Exploring Human-AI Interaction with Patient-Generated Health Data Sensemaking for Cardiac Risk Reduction IEEE VIS 2025
Patient-generated health data (PGHD) allows healthcare professionals to have a holistic and objective view of their patients. However, its integration in cardiac risk reduction remains unexplored. Through co-design with experienced healthcare professionals (n=5) in cardiac rehabilitation, we designed a dashboard, INSIGHT (INvestigating the potentialS of PatIent Generated Health data for CVD Prevention and ReHabiliTation), integrating multi-modal PGHD to support healthcare professionals in physical activity planning in cardiac risk reduction. To further augment healthcare professionals' (HCPs') data sensemaking and exploration capabilities, we integrate large language models (LLMs) for generating summaries and insights and for using natural language interaction to perform personalized data analysis. The aim of this integration is to explore the potential of AI in augmenting HCPs' data sensemaking and analysis capabilities.
comment: Presented as demonstration at the workshop on visual analytics in healthcare (VAHC) (in conjunction with IEEE VIS 2025)
☆ Towards Data-Enabled Physical Activity Planning: An Exploratory Study of HCP Perspectives On The Integration Of Patient-Generated Health Data
Physical activity planning is an essential part of cardiovascular rehabilitation. Through a two-part formative design exploration, we investigated integrating patient-generated health data (PGHD) into clinical workflows supporting shared decision-making (SDM) in physical activity planning. In part one, during a two-week situated study, to reduce risk of working with cardiovascular disease patients, we recruited healthy participants who self-tracked health and physical activity data and attended a physical activity planning session with a healthcare professional (HCP). Subsequently both HCPs and participants were interviewed. In part two, findings from part one were presented to HCPs in a card-sorting workshop to corroborate findings and identify information needs of HCPs alongside patient journeys and clinical workflows. Our outcomes highlight HCP information needs around patient risk factors, vital signs, and adherence to physical activity. Enablers for PGHD integration include adaptive data sense-making, standardization and organizational support for integration. Barriers include lack of time, data quality, trust and liability concerns. Our research highlights implications for designing digital health technologies that support PGHD in physical activity planning during cardiac rehabilitation.
comment: Published at 19th EAI International Conference on Pervasive Computing Technologies for Healthcare
☆ Learning with Category-Equivariant Representations for Human Activity Recognition
Human activity recognition is challenging because sensor signals shift with context, motion, and environment; effective models must therefore remain stable as the world around them changes. We introduce a categorical symmetry-aware learning framework that captures how signals vary over time, scale, and sensor hierarchy. We build these factors into the structure of feature representations, yielding models that automatically preserve the relationships between sensors and remain stable under realistic distortions such as time shifts, amplitude drift, and device orientation changes. On the UCI Human Activity Recognition benchmark, this categorical symmetry-driven design improves out-of-distribution accuracy by approx. 46 percentage points (approx. 3.6x over the baseline), demonstrating that abstract symmetry principles can translate into concrete performance gains in everyday sensing tasks via category-equivariant representation theory.
☆ Portal UX Agent -- A Plug-and-Play Engine for Rendering UIs from Natural Language Specifications
The rapid appearance of large language models (LLMs) has led to systems that turn natural-language intent into real user interfaces (UIs). Free-form code generation maximizes expressiveness but often hurts reliability, security, and design-system compliance. In contrast, fully static UIs are easy to govern but lack adaptability. We present the Portal UX Agent, a practical middle way that makes bounded generation work: an LLM plans the UI at a high level, and a deterministic renderer assembles the final interface from a vetted set of components and layout templates. The agent maps intents to a typed composition-template and component specifications-constrained by a schema. This enables auditability, reuse, and safety while preserving flexibility. We also introduce a mixed-methods evaluation framework that combines automatic checks (coverage, property fidelity, layout, accessibility, performance) with an LLM-as-a-Judge rubric to assess semantic alignment and visual polish. Experiments on multi-domain portal scenarios show that the Portal UX Agent reliably turns intent into coherent, usable UIs and performs well on compositionality and clarity. This work advances agentic UI design by combining model-driven representations, plug-and-play rendering, and structured evaluation, paving the way for controllable and trustworthy UI generation.
☆ GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding
Graphical user interface (GUI) grounding is a key function of computer-use agents, which maps natural-language instructions to actionable screen regions. Existing approaches based on Multimodal Large Language Models (MLLMs) typically formulate it as a text-based coordinate generation task, yet directly generating precise coordinates from visual inputs remains challenging and computationally intensive. An intuitive way to implement GUI grounding is to first select visual patches relevant to the instructions and then determine the precise click location within those patches. Based on the observations that general MLLMs have some native grounding capability, nested within their attentions, we propose GUI-AIMA, an attention-based and coordinate-free supervised fine-tuning framework for efficient GUI grounding. GUI-AIMA aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals. These signals are calculated adaptively for diverse user instructions by multi-head aggregation on simplified query-visual attention matrices. Besides, its coordinate-free manner can easily integrate a plug-and-play zoom-in stage. GUI-AIMA-3B was trained with only 85k screenshots, demonstrating exceptional data efficiency and verifying that light training can trigger the native grounding capability of MLLMs. It achieves state-of-the-art performance among 3B models, attaining an average accuracy of 58.6% on ScreenSpot-Pro and 62.2% on OSWorld-G. Project page: https://github.com/sjz5202/GUI-AIMA
☆ Quantifying truth and authenticity in AI-assisted candidate evaluation: A multi-domain pilot analysis
This paper presents a retrospective analysis of anonymized candidate-evaluation data collected during pilot hiring campaigns conducted through AlteraSF, an AI-native resume-verification platform. The system evaluates resume claims, generates context-sensitive verification questions, and measures performance along quantitative axes of factual validity and job fit, complemented by qualitative integrity detection. Across six job families and 1,700 applications, the platform achieved a 90-95% reduction in screening time and detected measurable linguistic patterns consistent with AI-assisted or copied responses. The analysis demonstrates that candidate truthfulness can be assessed not only through factual accuracy but also through patterns of linguistic authenticity. The results suggest that a multi-dimensional verification framework can improve both hiring efficiency and trust in AI-mediated evaluation systems.
comment: 10 pages, 10 tables, 2 figures, and 1 page of supplemental materials
♻ ☆ Time-Masked Transformers with Lightweight Test-Time Adaptation for Neural Speech Decoding
Speech neuroprostheses aim to restore communication for people with severe paralysis by decoding speech directly from neural activity. To accelerate algorithmic progress, a recent benchmark released intracranial recordings from a paralyzed participant attempting to speak, along with a baseline decoding algorithm. Prior work on the benchmark showed impressive accuracy gains. However, these gains increased computational costs and were not demonstrated in a real-time decoding setting. Here, we make three contributions that pave the way towards accurate, efficient, and real-time neural speech decoding. First, we incorporate large amounts of time-masking during training. On average, over $50\%$ of each trial is masked. Second, we replace the gated recurrent unit (GRU) architecture used in the baseline algorithm with a compact Transformer. The Transformer architecture uses $83\%$ fewer parameters, cuts peak GPU memory usage by $52\%$, and is significantly faster to calibrate relative to the GRU. Third, we design a lightweight variant of an existing test-time adaptation method developed for decoding handwriting from neural activity. Our variant adapts the model using multiple time-masked augmentations of a single trial and requires only one gradient step per trial. Together, these contributions reduce word error rate by over $20\%$ and effectively mitigate performance degradations across held-out days in a real-time decoding setting while substantially lowering computational costs.
comment: 10 pages, 2 figures
♻ ☆ Tool and Tutor? Experimental evidence from AI deployment in cancer diagnosis
Numerous countries globally face shortages of medical experts, deepening inequalities in access to healthcare. Artificial Intelligence (AI)-based diagnostic tools hold considerable promise to tackle this challenge by enabling even novices to deliver expert-level medical services. However, reliance on AI for task completion may hinder the learning required for novices to develop expertise. We thus explore whether AI-based diagnostic tools can be used to enhance not only performance but also learning in the context of lung cancer diagnosis. We examine the distinct effects of AI input during training (i.e., learning how to diagnose) versus in practice (i.e., completing diagnostic tasks) on novice medical professionals' performance. In two field experiments, 576 medical students were randomly assigned across conditions, manipulating the access to AI input during their training, during a test of their diagnostic capabilities, or both. During practice, participants diagnosed potential lung cancer cases using chest CT scans, and their diagnoses were evaluated against the ground truth obtained through histopathological examinations. Study 1 (N = 336) revealed that AI input in training alone improved human diagnostic accuracy by 3.2 percentage points over the control, while AI input during practice alone increased human accuracy by 7.9 percentage points. Combined deployment in both training and practice yielded an improvement of 13.7 percentage points--significantly exceeding either approach alone. Study 2 (N = 240) showed that AI input in practice alone improved accuracy in subsequent practice, unaided by AI, by 9.9 percentage points over the control. Even minimally informative AI input in training improved diagnostic accuracy by 5.3 percentage points over the control. These results reveal AI's dual role: As a tool, it could rapidly improve novices' performance.
Software Engineering 15
☆ HarnessLLM: Automatic Testing Harness Generation via Reinforcement Learning
Existing LLM-based automatic test generation methods mainly produce input and expected output pairs to categorize the intended behavior of correct programs. Although straightforward, these methods have limited diversity in generated tests and cannot provide enough debugging information. We propose HarnessLLM, a two-stage training pipeline that enables LLMs to write harness code for testing. Particularly, LLMs generate code that synthesizes inputs and validates the observed outputs, allowing complex test cases and flexible output validation such as invariant checking. To achieve this, we train LLMs with SFT followed by RLVR with a customized reward design. Experiments show that HarnessLLM outperforms input-output-based testing in bug finding and testing strategy diversity. HarnessLLM further benefits the code generation performance through test-time scaling with our generated test cases as inference-phase validation. Our code is available at https://github.com/UCSB-NLP-Chang/HarnessLLM.git.
☆ HAFixAgent: History-Aware Automated Program Repair Agent
Automated program repair (APR) has recently shifted toward large language models and agent-based systems, yet most systems rely on local snapshot context, overlooking repository history. Prior work shows that repository history helps repair single-line bugs, since the last commit touching the buggy line is often the bug-introducing one. In this paper, we investigate whether repository history can also improve agentic APR systems at scale, especially for complex multi-hunk bugs. We present HAFixAgent, a History-Aware Bug-Fixing Agent that injects blame-derived repository heuristics into its repair loop. A preliminary study of all 854 real-world bugs from Defects4J motivates our design, showing that bug-relevant history is both widely available and highly concentrated. Empirical comparison of HAFixAgent with two state-of-the-art baselines shows: (1) Effectiveness: HAFixAgent significantly improves over the agent-based baseline (by 212.3%) and the multi-hunk baseline (by 29.9%). (2) Efficiency: history does not significantly increase agent steps and keeps token costs comparable, with notably lower median costs for complex multi-file-multi-hunk bugs. (3) Practicality: combining different historical heuristics repairs more bugs, offering a clear cost-benefit trade-off. HAFixAgent offers a practical recipe for history-aware agentic APR: ground the agent in version control history, prioritize diff-based historical context, and integrate complementary heuristics when needed.
comment: 31 pages, 6 figures
☆ DPO-F+: Aligning Code Repair Feedback with Developers' Preferences
Large Language Models (LLMs) are increasingly applied to software engineering tasks, especially code repair. However, developers often struggle to interpret model outputs, limiting effective human-AI teaming. Prior work largely optimizes repaired code while under-addressing the natural-language feedback that enables comprehension and iterative improvement. We present DPO-f+, a novel framework that aligns code-repair feedback with developer needs and profiles. It (1) formalizes developer-profiled, domain-specific metrics for feedback alignment; (2) automatically constructs pairwise preference datasets from code-repair tasks; (3) fine-tunes using Direct Preference Optimization (DPO) augmented with a lightweight margin signal; and (4) provides an automated feedback evaluation protocol. Empirically, DPO-f+ outperforms both the baseline and standard DPO on generated-code accuracy and overall feedback alignment. On novice programming tasks, DPO-f+ raises the top-1 pass rate by 5.71 percentage points (pp) over the baseline and by 3.30 pp over DPO. On the more challenging SWE-bench Lite benchmark, it increases the issue-resolution rate by 1.67 pp over DPO and by 4.67 pp over the baseline. It also achieves the largest improvement in feedback alignment, outperforming DPO and the baseline. By aligning feedback more closely with developer needs, DPO-f+ turns LLM-assisted repair from one-shot outputs into a collaborative sensemaking workflow, providing a practical approach to enhancing code comprehension and fostering more effective human-AI teaming in software engineering.
comment: 10 pages, 2 figures
☆ Empirical Derivations from an Evolving Test Suite
The paper presents a longitudinal empirical analysis of the automated, continuous, and virtualization-based software test suite of the NetBSD operating system. The longitudinal period observed spans from the initial roll out of the test suite in the early 2010s to late 2025. According to the results, the test suite has grown continuously, currently covering over ten thousand individual test cases. Failed test cases exhibit overall stability, although there have been shorter periods marked with more frequent failures. A similar observation applies to build failures, failures of the test suite to complete, and installation failures, all of which are also captured by the NetBSD's testing framework. Finally, code churn and kernel modifications do not provide longitudinally consistent statistical explanations for the failures. Although some periods exhibit larger effects, including particularly with respect to the kernel modifications, the effects are small on average. Even though only in an exploratory manner, these empirical observations contribute to efforts to draw conclusions from large-scale and evolving software test suites.
comment: Submitted
☆ Sustainability of Machine Learning-Enabled Systems: The Machine Learning Practitioner's Perspective
Software sustainability is a key multifaceted non-functional requirement that encompasses environmental, social, and economic concerns, yet its integration into the development of Machine Learning (ML)-enabled systems remains an open challenge. While previous research has explored high-level sustainability principles and policy recommendations, limited empirical evidence exists on how sustainability is practically managed in ML workflows. Existing studies predominantly focus on environmental sustainability, e.g., carbon footprint reduction, while missing the broader spectrum of sustainability dimensions and the challenges practitioners face in real-world settings. To address this gap, we conduct an empirical study to characterize sustainability in ML-enabled systems from a practitioner's perspective. We investigate (1) how ML engineers perceive and describe sustainability, (2) the software engineering practices they adopt to support it, and (3) the key challenges hindering its adoption. We first perform a qualitative analysis based on interviews with eight experienced ML engineers, followed by a large-scale quantitative survey with 203 ML practitioners. Our key findings reveal a significant disconnection between sustainability awareness and its systematic implementation, highlighting the need for more structured guidelines, measurement frameworks, and regulatory support.
☆ A Comprehensive Empirical Evaluation of Agent Frameworks on Code-centric Software Engineering Tasks
Unlike traditional automation tools or static LLM-based systems, agents combine decision-making and tool utilization to accomplish complex tasks, showing great potential in software engineering. However, existing studies largely focus on specific tasks or isolated aspects, providing an incomplete picture of agents' practical capabilities. To address this, we conduct a comprehensive empirical study evaluating seven general-purpose agent frameworks across three representative code-centric tasks: software development, vulnerability detection, and program repair. Each task is assessed using standard, widely adopted benchmarks to ensure objective and comparable evaluation. Agent performance is systematically analyzed from three complementary perspectives: effectiveness (task success), efficiency (execution process), and overhead (token consumption). Our findings reveal distinct capability patterns and trade-offs among the evaluated frameworks. In terms of effectiveness, agents achieve moderate overall performance. Regarding efficiency, AgentOrchestra tends to exhibit the longest trajectories and the most correction attempts due to coordination overhead, whereas OpenHands demonstrate stronger reflective reasoning abilities. For overhead, software development incurs the highest monetary cost, while GPTswarm remains the most cost-efficient. Furthermore, we conduct an in-depth cross-analysis of the relationship between effectiveness and efficiency, exploring the underlying reasons behind their interplay. These findings guide both practical adoption and future research toward more efficient software engineering agents.
☆ CodeClash: Benchmarking Goal-Oriented Software Engineering
Current benchmarks for coding evaluate language models (LMs) on concrete, well-specified tasks such as fixing specific bugs or writing targeted tests. However, human programmers do not spend all day incessantly addressing isolated tasks. Instead, real-world software development is grounded in the pursuit of high-level goals, like improving user retention or reducing costs. Evaluating whether LMs can also iteratively develop code to better accomplish open-ended objectives without any explicit guidance remains an open challenge. To address this, we introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, then their codebases compete head-to-head in a code arena that determines winners based on objectives like score maximization, resource acquisition, or survival. Whether it's writing notes, scrutinizing documentation, analyzing competition logs, or creating test suites, models must decide for themselves how to improve their codebases both absolutely and against their opponents. We run 1680 tournaments (25,200 rounds total) to evaluate 8 LMs across 6 arenas. Our results reveal that while models exhibit diverse development styles, they share fundamental limitations in strategic reasoning. Models also struggle with long-term codebase maintenance, as repositories become progressively messy and redundant. These limitations are stark: top models lose every round against expert human programmers. We open-source CodeClash to advance the study of autonomous, goal-oriented code development.
☆ GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents
With the software industry shifting toward a data-driven culture, online A/B testing is a key tool for evaluating new technologies. However, deploying such experiments requires substantial resources, may negatively impact users, and involves long data collection periods. To address this, \textit{off-policy evaluation (OPE)}, or offline A/B testing, uses logged data to assess technologies and is fundamental in Reinforcement Learning, making it crucial in domains where online testing is costly or risky, such as healthcare, recommender systems, education, dialog systems, and robotics. Despite advances in coding LLMs and agentic AI, little is known about leveraging them to optimize OPE results. We investigate whether LLMs and LLM-based agents can improve OPE performance via code optimization. We propose \textit{GrowthHacker}, a benchmark with agent and baseline methods on large-scale real-world datasets, which iteratively optimizes code, evaluates results, and begins new optimization cycles. We collected datasets, established protocols, implemented baselines for OPE on the Open Bandit Pipeline (OBP)~\cite{saito2021openbanditdatasetpipeline} and Scope-RL~\cite{kiyohara2023scope}, and developed the \textit{two_agent} framework, which reduces system complexity while preserving optimization effectiveness. Results show the two_agent framework achieves 100% reliability and the highest average improvement of 106.7% among positive outcomes. Both two_agent and CrewAI reach 45% success rates, outperforming AutoGen's 34%. These findings demonstrate the feasibility of LLM-based agents as automated "growth hackers" to enhance OPE systems, with implications for scaling data-driven decision-making in production.
☆ Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems
Large language models (LLMs) have shown growing potential in software engineering, yet few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs). Cross-ISA migration, such as between x86_64 and aarch64, requires handling complex dependencies, heterogeneous toolchains, and long build logs while ensuring executable verification. To address this challenge, we present Build-bench, an end-to-end benchmark that systematically evaluates the capability of LLMs to repair build failures in cross-ISA settings. Build-bench collects 268 real-world failed packages and integrates auxiliary tools including Structure Extraction, File Content Extraction, Content Modification, and Build Verification to support autonomous, tool-augmented reasoning. The repair process operates in an iterative loop where, upon failure, the model receives updated build logs and previous repair outcomes to refine subsequent attempts. Through a comparative evaluation of six representative LLMs, Build-bench reveals that current models achieve a maximum build success rate of 63% and tool usage patterns differ significantly across models. By coupling real build environments with verifiable outcomes, Build-bench establishes the first architecture-aware benchmark for studying LLM-based software build and repair.
☆ A Systematic Literature Review of Code Hallucinations in LLMs: Characterization, Mitigation Methods, Challenges, and Future Directions for Reliable AI
Model hallucination is one of the most critical challenges faced by Large Language Models (LLMs), especially in high-stakes code intelligence tasks. As LLMs become increasingly integrated into software engineering tasks, understanding and mitigating hallucination in code becomes essential. In this survey, we provide a systematic review of hallucination phenomena in code-oriented LLMs from four key perspectives. First, we begin by surveying 60 papers to define hallucination in the context of code and summarize its primary causes, such as data noise, exposure bias, and insufficient semantic grounding, while also tracing recent trends in literature across natural language processing (NLP) and software engineering communities. Second, we review model hallucination surveys in a broader span and summarize representative hallucination mitigation strategies, such as knowledge-enhanced generation, constrained decoding, and post-editing. Third, we review approaches targeted for code intelligence and highlight code-specific challenges that aggravate hallucination, including syntax sensitivity, strict type systems, and dependence on external libraries. Meanwhile, we analyze how emerging code intelligence tasks, e.g., program analysis, symbolic execution, and unit testing, are utilized to detect and mitigate hallucinations. Fourth, we summarize current evaluation benchmarks, ranging from static metrics to dynamic checks, e.g., compilation and execution correctness, and emphasize the need for hallucination-oriented benchmarks.
♻ ☆ Why Attention Fails: A Taxonomy of Faults in Attention-Based Neural Networks
Attention mechanisms are at the core of modern neural architectures, powering systems ranging from ChatGPT to autonomous vehicles and driving a major economic impact. However, high-profile failures, such as ChatGPT's nonsensical outputs or Google's suspension of Gemini's image generation due to attention weight errors, highlight a critical gap: existing deep learning fault taxonomies might not adequately capture the unique failures introduced by attention mechanisms. This gap leaves practitioners without actionable diagnostic guidance. To address this gap, we present the first comprehensive empirical study of faults in attention-based neural networks (ABNNs). Our work is based on a systematic analysis of 555 real-world faults collected from 96 projects across ten frameworks, including GitHub, Hugging Face, and Stack Overflow. Through our analysis, we develop a novel taxonomy comprising seven attention-specific fault categories, not captured by existing work. Our results show that over half of the ABNN faults arise from mechanisms unique to attention architectures. We further analyze the root causes and manifestations of these faults through various symptoms. Finally, by analyzing symptom-root cause associations, we identify four evidence-based diagnostic heuristics that explain 33.0% of attention-specific faults, offering the first systematic diagnostic guidance for attention-based models.
♻ ☆ PredicateFix: Repairing Static Analysis Alerts with Bridging Predicates ICSE 2026
Fixing static analysis alerts in source code with Large Language Models (LLMs) is becoming increasingly popular. However, LLMs often hallucinate and perform poorly for complex and less common alerts. Retrieval-augmented generation (RAG) aims to solve this problem by providing the model with a relevant example, but existing approaches face the challenge of unsatisfactory quality of such examples. To address this challenge, we utilize the predicates in the analysis rule, which serve as a bridge between the alert and relevant code snippets within a clean code corpus, called key examples. Based on this insight, we propose an algorithm to retrieve key examples for an alert automatically, and build PredicateFix as a RAG pipeline to fix alerts from two static code analyzers: CodeQL and GoInsight. Evaluation with multiple LLMs shows that PredicateFix increases the number of correct repairs by 27.1% ~ 69.3%, significantly outperforming other baseline RAG approaches.
comment: 13 pages, 5 figures; accepted for ICSE 2026
♻ ☆ Bitcoin Cross-Chain Bridge: A Taxonomy and Its Promise in Artificial Intelligence of Things
Bitcoin's limited scripting capabilities and lack of native interoperability mechanisms have constrained its integration into the broader blockchain ecosystem, especially decentralized finance (DeFi) and multi-chain applications. This paper presents a comprehensive taxonomy of Bitcoin cross-chain bridge protocols, systematically analyzing their trust assumptions, performance characteristics, and applicability to the Artificial Intelligence of Things (AIoT) scenarios. We categorize bridge designs into three main types: naive token swapping, pegged-asset bridges, and arbitrary-message bridges. Each category is evaluated across key metrics such as trust model, latency, capital efficiency, and DeFi composability. Emerging innovations like BitVM and recursive sidechains are highlighted for their potential to enable secure, scalable, and programmable Bitcoin interoperability. Furthermore, we explore practical use cases of cross-chain bridges in AIoT applications, including decentralized energy trading, healthcare data integration, and supply chain automation. This taxonomy provides a foundational framework for researchers and practitioners seeking to design secure and efficient cross-chain infrastructures in AIoT systems.
comment: Blockchain Cross-Chain Bridge Survey
♻ ☆ RepoScope: Leveraging Call Chain-Aware Multi-View Context for Repository-Level Code Generation ICSE 2026
Repository-level code generation aims to generate code within the context of a specified repository. Existing approaches typically employ retrieval-augmented generation (RAG) techniques to provide LLMs with relevant contextual information extracted from the repository. However, these approaches often struggle with effectively identifying truly relevant contexts that capture the rich semantics of the repository, and their contextual perspectives remains narrow. Moreover, most approaches fail to account for the structural relationships in the retrieved code during prompt construction, hindering the LLM's ability to accurately interpret the context. To address these issues, we propose RepoScope, which leverages call chain-aware multi-view context for repository-level code generation. RepoScope constructs a Repository Structural Semantic Graph (RSSG) and retrieves a comprehensive four-view context, integrating both structural and similarity-based contexts. We propose a novel call chain prediction method that utilizes the repository's structural semantics to improve the identification of callees in the target function. Additionally, we present a structure-preserving serialization algorithm for prompt construction, ensuring the coherence of the context for the LLM. Notably, RepoScope relies solely on static analysis, eliminating the need for additional training or multiple LLM queries, thus ensuring both efficiency and generalizability. Evaluation on widely-used repository-level code generation benchmarks (CoderEval and DevEval) demonstrates that RepoScope outperforms state-of-the-art methods, achieving up to a 36.35% relative improvement in pass@1 scores. Further experiments emphasize RepoScope's potential to improve code generation across different tasks and its ability to integrate effectively with existing approaches.
comment: Accepted by ICSE 2026
♻ ☆ Beyond Autoregression: An Empirical Study of Diffusion Large Language Models for Code Generation
LLMs have become the mainstream approaches to code generation. Existing LLMs mainly employ autoregressive generation, i.e. generating code token-by-token from left to right. However, the underlying autoregressive generation has two limitations in code generation. First, autoregressive LLMs only generate a token at each step, showing low efficiency in practice. Second, programming is a non-sequential process involving back-and-forth editing, while autoregressive LLMs only employ the left-to-right generation order. These two intrinsic limitations hinder the further development of LLMs in code generation. Recently, diffusion LLMs have emerged as a promising alternative. Diffusion LLMs address the above limitations with two advances, including multi-token prediction (i.e. generating multiple tokens at each step) and flexible generation order (i.e. flexibly determining which positions to generate tokens). However, there is no systematic study exploring diffusion LLMs in code generation. To bridge the knowledge gap, we present the first empirical study of diffusion LLMs for code generation. Our study involves 9 representative diffusion LLMs and conduct experiments on 4 widely used benchmarks. Based on the results, we summarize the following findings. (1) Existing diffusion LLMs are competitive with autoregressive LLMs with similar sizes. (2) Diffusion LLMs have a stronger length extrapolation ability than autoregressive LLMs and perform better in long code understanding. (3) We explore factors impacting the effectiveness and efficiency of diffusion LLMs, and provide practical guidance. (4) We discuss several promising further directions to improve diffusion LLMs on code generation. We open-source all source code, data, and results to facilitate the following research. The code is publicly available at https://github.com/zhangyitonggg/dllm4code.
Systems and Control 20
☆ Universal Barrier Functions for Safety and Stability of Constrained Nonlinear Systems
In this paper, we address the problem of synthesizing safe and stabilizing controllers for nonlinear systems subject to complex safety specifications and input constraints. We introduce the Universal Barrier Function (UBF), a single continuously differentiable scalar-valued function that encodes both stability and safety criteria while accounting for input constraints. Using the UBF, we formulate a Quadratic Program (UBF-QP) to generate control inputs that are both safe and stabilizing under input constraints. We demonstrate that the UBF-QP is feasible if a UBF exists. Furthermore, under mild conditions, we prove that a UBF always exists. The proposed framework is then extended to systems with higher relative degrees. Finally, numerical simulations illustrate the effectiveness of our proposed approach.
comment: 16 pages, 14 figures
☆ Robust Self-Triggered Control Approaches Optimizing Sampling Sequences with Synchronous Measurements
Feedback control algorithms traditionally rely on periodic execution on digital platforms. While this simplifies design and analysis, it often leads to inefficient resource usage (e.g., CPU, network bandwidth) in embedded control and shared networks. This work investigates self-triggering implementations of linear controllers in sampled-data systems with synchronous measurements. Our approach precomputes the next sampling sequence over a finite horizon based on current state information. We introduce a novel optimal self-triggering scheme that guarantees exponential stability for unperturbed systems and global uniform ultimate boundedness for perturbed systems. This ensures robustness against external disturbances with explicit performance guarantees. Simulations demonstrate the benefits of our approach.
comment: Note: This research was conducted in 2017--2018. The literature review has not been updated and may not reflect subsequent or concurrent developments in the field
☆ GOSPA-Driven Non-Myopic Multi-Sensor Management with Multi-Bernoulli Filtering
In this paper, we propose a non-myopic sensor management algorithm for multi-target tracking, with multiple sensors operating in the same surveillance area. The algorithm is based on multi-Bernoulli filtering and selects the actions that solve a non-myopic minimisation problem, where the cost function is the mean square generalised optimal sub-pattern assignment (GOSPA) error, over a future time window. For tractability, the sensor management algorithm actually uses an upper bound of the GOSPA error and is implemented via Monte Carlo Tree Search (MCTS). The sensors have the ability to jointly optimise and select their actions with the considerations of all other sensors in the surveillance area. The benefits of the proposed algorithm are analysed via simulations.
comment: submitted to IEEE Transactions on Aerospace and Electronic Systems November 2025
☆ Online Energy Storage Arbitrage under Imperfect Predictions: A Conformal Risk-Aware Approach
This work proposes a conformal approach for energy storage arbitrage to control the downside risks arose from imperfect price forecasts. Energy storage arbitrage relies solely on predictions of future market prices, while inaccurate price predictions may lead to significant profit losses. Based on conformal decision theory, we describe a controller that dynamically adjusts decision conservativeness through prediction sets without distributional assumptions. To enable online calibration when online profit loss feedback is unobservable, we establish that a temporal difference error serves as a measurable proxy. Building on this insight, we develop two online calibration strategies: prediction error-based adaptation targeting forecast accuracy, and value error-based calibration focusing on decision quality. Analysis of the conformal controller proves bounded long-term risk with convergence guarantees in temporal difference error, which further effectively manages risk exposure in potential profit losses. Case studies demonstrate superior performance in balancing risk and opportunity compared to benchmarks under varying forecast conditions.
☆ On Structural Properties of Risk-Averse Optimal Stopping Problems
We establish structural properties of optimal stopping problems under time-consistent dynamic (coherent) risk measures, focusing on value function monotonicity and the existence of control limit (threshold) optimal policies. While such results are well developed for risk-neutral (expected-value) models, they remain underexplored in risk-averse settings. Coherent risk measures typically lack the tower property and are subadditive rather than additive, complicating structural analysis. We show that value function monotonicity mirrors the risk-neutral case. Moreover, if the risk envelope associated with each coherent risk measure admits a minimal element, the risk-averse optimal stopping problem reduces to an equivalent risk-neutral formulation. We also develop a general procedure for identifying control limit optimal policies and use it to derive practical, verifiable conditions on the risk measures and MDP structure that guarantee their existence. We illustrate the theory and verify these conditions through optimal stopping problems arising in operations, marketing, and finance.
☆ Autonomous Vehicle front steering control computation saving
For autonomous vehicles lane keeping purposes it is crucial to control the vehicle yaw rate. As it is known a vehicle yaw rate control can be achieved handling the steering angle. One option is to consider a robust controller and depending of the requirements the synthesis can drive to a high order controller. Nowadays this kind of vehicles needs a networked based control (IVN -Intelligent Vehicle Network-)with a considerable amount of control loops for different vehicle components. Therefore, in this environment the controllers computation saving could be a good option for unload the network and digital processors. That is the main target of this contribution; in order to accomplish this goal a interlacing implementation technique is considered. Results in a real path tracking illustrates viability of this procedure.
comment: 17 pages, 6 figures
☆ Secure Distributed Consensus Estimation under False Data Injection Attacks: A Defense Strategy Based on Partial Channel Coding
This article investigates the security issue caused by false data injection attacks in distributed estimation, wherein each sensor can construct two types of residues based on local estimates and neighbor information, respectively. The resource-constrained attacker can select partial channels from the sensor network and arbitrarily manipulate the transmitted data. We derive necessary and sufficient conditions to reveal system vulnerabilities, under which the attacker is able to diverge the estimation error while preserving the stealthiness of all residues. We propose two defense strategies with mechanisms of exploiting the Euclidean distance between local estimates to detect attacks, and adopting the coding scheme to protect the transmitted data, respectively. It is proven that the former has the capability to address the majority of security loopholes, while the latter can serve as an additional enhancement to the former. By employing the time-varying coding matrix to mitigate the risk of being cracked, we demonstrate that the latter can safeguard against adversaries injecting stealthy sequences into the encoded channels. Hence, drawing upon the security analysis, we further provide a procedure to select security-critical channels that need to be encoded, thereby achieving a trade-off between security and coding costs. Finally, some numerical simulations are conducted to demonstrate the theoretical results.
☆ Parallel KKT Solver in PIQP for Multistage Optimization
This paper presents an efficient parallel Cholesky factorization and triangular solve algorithm for the Karush-Kuhn-Tucker (KKT) systems arising in multistage optimization problems, with a focus on model predictive control and trajectory optimization for racing. The proposed approach directly parallelizes solving the KKT systems with block-tridiagonal-arrow KKT matrices on the linear algebra level arising in interior-point methods. The algorithm is implemented as a new backend of the PIQP solver and released as open source. Numerical experiments on the chain-of-masses benchmarks and a minimum curvature race line optimization problem demonstrate substantial performance gains compared to other state-of-the-art solvers.
☆ Traffic-Aware Grid Planning for Dynamic Wireless Electric Vehicle Charging
Dynamic Wireless Electric Vehicle Charging (DWC) on electrified roadways is an emerging technology that can significantly reduce battery sizes, eliminate charging downtime, and alleviate range anxiety, specially for long-haul transportation and fleet operations of electric vehicles (EVs). However, these systems introduce new challenges for power system planning due to their short-duration and high-power demands which can strain the grid if not properly managed. As the energy demands from DWC depend on vehicle speed, density, dwell time in charging zones, and load profiles along road segments, there is a need for integrated planning of such systems, jointly considering both traffic behavior and EV energy consumption. In this paper, we propose a traffic-aware grid planning framework for DWC. We leverage a macroscopic Cell Transmission Model of traffic flow to estimate real-time, spatiotemporal EV charging demand from DWC corridors. The demand model is then integrated into an AC Optimal Power Flow based formulation to optimally size a microgrid that supports DWC under varying traffic conditions while minimizing the cost of operation. Our framework explicitly models how spatiotemporal traffic patterns affect the utilization of grid resources to obtain system designs that achieve lower costs and are easier to operationalize as compared to planning models that rely on worst-case traffic data. We demonstrate the framework on data from a 14-mile segment of the I-210W highway in California, USA, evaluating multiple traffic scenarios like free-flow, severe congestion, accidents of varying severity, and natural disasters like forest fires. Our results demonstrate that traffic-aware grid planning significantly reduces infrastructure costs as compared to worst-scenario based modeling, while ensuring reliability of service in terms of meeting charging demands under diverse traffic conditions.
☆ Low-Cost Carriers in Aviation: Significance and Developments
This paper aims to discuss the impacts of low-cost airlines on the air transport market and, in particular, to present the most recent findings from the specialized literature in this field. To this end, several papers published on the topic since 2015 were selected and analyzed. Based on this analysis, the main subjects addressed in the studies were categorized into five groups: (i) impacts of low-cost airlines on competing carriers; (ii) impacts on airports; (iii) general effects on air transport demand; (iv) effects on passengers' choice processes; and (v) broader effects on geographical regions.
☆ Minimizing Maximum Latency of Task Offloading for Multi-UAV-assisted Maritime Search and Rescue
Unmanned Aerial Vehicles (UAVs) play a crucial role in Maritime Search and Rescue (MSAR), contributing to the improvement of rescue efficiency and reduction of casualties. Typically, UAVs equipped with cameras collect data from disaster areas and transmit it to the shore-based rescue command centers. By deploying Mobile Edge Computing (MEC) servers, UAVs can pre-process video footage to reduce data transmission volume, thus reducing transmission delays. However, the limited computational capacity and energy of UAVs pose significant challenges to the efficiency of UAV-assisted MSAR systems. To address these problems, in this paper, we investigate a multi-UAV assisted MSAR system consisting of multiple Surveillance UAVs (S-UAVs) and a Relay UAV (R-UAV). Then, we formulate a joint optimization problem to minimize the maximum total latency among all S-UAVs via jointly making the computing offloading decisions, R-UAV deployment, and the association between a S-UAV and rescue targets while ensuring that all targets are monitored by S-UAVs. Since the formulated optimization problem is typically hard to solve due to its non-convexity, we propose an effective iterative algorithm by breaking it into three sub-problems. Numerical simulation results show the effectiveness of the proposed algorithm with various performance parameters.
☆ Real-Time Learning of Predictive Dynamic Obstacle Models for Robotic Motion Planning ICRA
Autonomous systems often must predict the motions of nearby agents from partial and noisy data. This paper asks and answers the question: "can we learn, in real-time, a nonlinear predictive model of another agent's motions?" Our online framework denoises and forecasts such dynamics using a modified sliding-window Hankel Dynamic Mode Decomposition (Hankel-DMD). Partial noisy measurements are embedded into a Hankel matrix, while an associated Page matrix enables singular-value hard thresholding (SVHT) to estimate the effective rank. A Cadzow projection enforces structured low-rank consistency, yielding a denoised trajectory and local noise variance estimates. From this representation, a time-varying Hankel-DMD lifted linear predictor is constructed for multi-step forecasts. The residual analysis provides variance-tracking signals that can support downstream estimators and risk-aware planning. We validate the approach in simulation under Gaussian and heavy-tailed noise, and experimentally on a dynamic crane testbed. Results show that the method achieves stable variance-aware denoising and short-horizon prediction suitable for integration into real-time control frameworks.
comment: 10 pages, 6 figures, submitted to IEEE International Conference on Robotics and Automation (ICRA) 2025
☆ When Semantics Connect the Swarm: LLM-Driven Fuzzy Control for Cooperative Multi-Robot Underwater Coverage
Underwater multi-robot cooperative coverage remains challenging due to partial observability, limited communication, environmental uncertainty, and the lack of access to global localization. To address these issues, this paper presents a semantics-guided fuzzy control framework that couples Large Language Models (LLMs) with interpretable control and lightweight coordination. Raw multimodal observations are compressed by the LLM into compact, human-interpretable semantic tokens that summarize obstacles, unexplored regions, and Objects Of Interest (OOIs) under uncertain perception. A fuzzy inference system with pre-defined membership functions then maps these tokens into smooth and stable steering and gait commands, enabling reliable navigation without relying on global positioning. Then, we further coordinate multiple robots by introducing semantic communication that shares intent and local context in linguistic form, enabling agreement on who explores where while avoiding redundant revisits. Extensive simulations in unknown reef-like environments show that, under limited sensing and communication, the proposed framework achieves robust OOI-oriented navigation and cooperative coverage with improved efficiency and adaptability, narrowing the gap between semantic cognition and distributed underwater control in GPS-denied, map-free conditions.
comment: This paper has been submitted to IEEE Transactions on Mobile Computing
☆ Deep Q-Network for Optimizing NOMA-Aided Resource Allocation in Smart Factories with URLLC Constraints
This paper presents a Deep Q-Network (DQN)- based algorithm for NOMA-aided resource allocation in smart factories, addressing the stringent requirements of Ultra-Reliable Low-Latency Communication (URLLC). The proposed algorithm dynamically allocates sub-channels and optimizes power levels to maximize throughput while meeting strict latency constraints. By incorporating a tunable parameter {\lambda}, the algorithm balances the trade-off between throughput and latency, making it suitable for various devices, including robots, sensors, and controllers, each with distinct communication needs. Simulation results show that robots achieve higher throughput, while sensors and controllers meet the low-latency requirements of URLLC, ensuring reliable communication for real-time industrial applications.
comment: Accepted for presentation at the IEEE Wireless Communications and Networking Conference (WCNC) 2025. This is the preprint version of the paper
☆ High-Power Dual-Channel Field Chamber for High-Frequency Magnetic Neuromodulation
Several novel methods, including magnetogenetics and magnetoelectric stimulation, use high frequency alternating magnetic fields to precisely manipulate neural activity. To quantify the behavioral effects of such interventions in a freely moving mouse, we developed a dual-channel magnetic chamber, specifically designed for rate-sensitive magnetothermal-genetic stimulation, and adaptable for other uses of alternating magnetic fields. Through an optimized coil design, the system allows independent control of two spatially orthogonal uniform magnetic fields delivered at different frequencies within a 10 cm x 10 cm x 6 cm chamber. The two channels have nominal frequencies of 50 and 550 kHz with peak magnetic field strengths of 88 and 12.5 mT, achieved with resonant coil drives having peak voltages of 1.6 and 1.8 kV and currents of 1.0 and 0.26 kA, respectively. Additionally, a liquid cooling system enables magnetic field generation for second-level duration, and an observation port and camera allow video capture of the animal's behavior within the chamber. The system generates high-amplitude magnetic fields across two widely separated frequency channels with negligible interference (< 1%). Relatively uniform magnetic field distribution (+/-10% across 94% of the chamber volume) is maintained throughout the chamber, and temperature increase of the inner side of the coil enclosure during the operation is limited to < 0.35 {\deg}C/s to ensure in vivo safety. Using cobalt-doped and undoped iron oxide nanoparticles, we demonstrate channel-specific heating rates of 3.5 {\deg}C/s and 1.5 {\deg}C/s, respectively, validating frequency-selectivity. Both channels can run continuously for four seconds stably.
comment: 25 pages, 8 figures
☆ Magnetic Materials for Transcranial Magnetic Stimulation (TMS)
Various coils for transcranial magnetic stimulation (TMS) are widely available for clinical and research use. These coils are almost all designed as air coils, which require large levels of energy to achieve a given magnetic flux density and in turn electric field strength, whereas in other sectors, such as power electronics or electrical machines, magnetic materials have been used for a long time to achieve higher efficiencies. We tested the impact on the electric and magnetic properties of different soft magnetic materials, including various ferrite cores, laminated sheet materials of nonisotropic corn-oriented silicon-steel, non-oriented silicon-steel, as well as cobalt-iron, and soft magnetic compound powder cores with insulated particles. Every material led to a reduction in coil current and voltage for the same target electric field strength. For the same field energy, every material yielded lower losses. Most common materials saturated already at very low currents. More material in thicker layers could shift the saturation point but at the cost of high weight. Due to their low saturation flux density, ferrites appear unsuitable for the high amplitude requirements of TMS. Laminated sheet materials and powder cores reduce the pulse energy, but the laminated sheet material adds more weight for the same effect than powder cores. Thus, appropriate magnetic materials can reduce the required pulse energy. Saturation flux density is the most relevant parameter, whereas the permeability beyond a certain base level is practically irrelevant. Most importantly, the weight of a magnetic-core coil may always be increased compared to an air coil for the same target field.
comment: 30 pages, 14 figures, 2 tables
♻ ☆ Iterative Cut-Based PWA Approximation of Multi-Dimensional Nonlinear Systems
PieceWise Affine (PWA) approximations for nonlinear functions have been extensively used for tractable, computationally efficient control of nonlinear systems. However, reaching a desired approximation accuracy without prior information about the behavior of the nonlinear systems remains a challenge in the function approximation and control literature. As the name suggests, PWA approximation aims at approximating a nonlinear function or system by dividing the domain into multiple subregions where the nonlinear function or dynamics is approximated locally by an affine function also called local mode. Without prior knowledge of the form of the nonlinearity, the required number of modes, the locations of the subregions, and the local approximations need to be optimized simultaneously, which becomes highly complex for large-scale systems with multi-dimensional nonlinear functions. This paper introduces a novel approach for PWA approximation of multi-dimensional nonlinear systems, utilizing a hinging hyperplane formalism for cut-based partitioning of the domain. The complexity of the PWA approximation is iteratively increased until reaching the desired accuracy level. Further, the tractable cut definitions allow for different forms of subregions, as well as the ability to impose continuity constraints on the PWA approximation. The methodology is explained via multiple examples and its performance is compared to two existing approaches through case studies, showcasing its efficacy.
comment: 9 pages, 4 figures, submitted to journal
♻ ☆ ORFit: One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares
While large machine learning models have shown remarkable performance in various domains, their training typically requires iterating for many passes over the training data. However, due to computational and memory constraints and potential privacy concerns, storing and accessing all the data is impractical in many real-world scenarios where the data arrives in a stream. In this paper, we investigate the problem of one-pass learning, in which a model is trained on sequentially arriving data without retraining on previous datapoints. Motivated by the demonstrated effectiveness of overparameterized models and the phenomenon of benign overfitting, we propose Orthogonal Recursive Fitting (ORFit), an algorithm for one-pass learning which seeks to perfectly fit each new datapoint while minimally altering the predictions on previous datapoints. ORFit updates the parameters in a direction orthogonal to past gradients, similar to orthogonal gradient descent (OGD) in continual learning. We show that, interestingly, ORFit's update leads to an operation similar to the recursive least-squares (RLS) algorithm in adaptive filtering but with significantly improved memory and computational efficiency, i.e., linear, instead of quadratic, in the number of parameters. To further reduce memory usage, we leverage the structure of the streaming data via an incremental principal component analysis (IPCA). We show that using the principal components is minimax optimal, i.e., it minimizes the worst-case forgetting of previous predictions for unknown future updates. Further, we prove that, for overparameterized linear models, the parameter vector obtained by ORFit matches what the standard multi-pass stochastic gradient descent (SGD) would converge to. Finally, we extend our results to the nonlinear setting for highly overparameterized models, relevant for deep learning.
comment: Journal extension of v1: Y. Min, K, Ahn, N. Azizan, "One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares," IEEE Conference on Decision and Control, 2022
♻ ☆ Spatio-Temporal Consistent Soft Sensor Modeling and Monitoring of Thermal Power Plants based on Physical Knowledge
Data-driven soft sensors have been widely applied in complex industrial processes. However, the interpretable spatio-temporal features extraction by soft sensors remains a challenge. In this light, this work introduces a novel method termed spatio-temporal consistent and interpretable model (STCIM). First, temporal and spatial features are captured and aligned by a far topological spatio-temporal consistency extraction block. Then, the features are mapped into an interpretable latent space for further prediction by explicitly giving physical meanings to latent variables. The efficacy of the proposed STCIM is demonstrated through the modeling of two generated datasets and a real-life dataset of coal-fired power plants. The corresponding experiments show: 1) The generalization of STCIM outperforms other methods, especially in different operation situations. 2) The far topological spatio-temporal consistency is vital for feature alignment. 3) The hyper-parameters of physics-informed interpretable latent space loss decide the performance of STCIM.
♻ ☆ Lyapunov Neural ODE State-Feedback Control Policies
Deep neural networks are increasingly used as an effective parameterization of control policies in various learning-based control paradigms. For continuous-time optimal control problems (OCPs), which are central to many decision-making tasks, control policy learning can be cast as a neural ordinary differential equation (NODE) problem wherein state and control constraints are naturally accommodated. This paper presents a NODE approach to solving continuous-time OCPs for the case of stabilizing a known constrained nonlinear system around a target state. The approach, termed Lyapunov-NODE control (L-NODEC), uses a novel Lyapunov loss formulation that incorporates an exponentially-stabilizing control Lyapunov function to learn a state-feedback neural control policy, bridging the gap of solving continuous-time OCPs via NODEs with stability guarantees. The proposed Lyapunov loss allows L-NODEC to guarantee exponential stability of the controlled system, as well as its adversarial robustness to perturbations to the initial state. The performance of L-NODEC is illustrated in two problems, including a dose delivery problem in plasma medicine. In both cases, L-NODEC effectively stabilizes the controlled system around the target state despite perturbations to the initial state and reduces the inference time necessary to reach the target.
Graphics 6
☆ Detecting Coverage Holes in Wireless Sensor Networks Using Connected Component Labeling and Force-Directed Algorithms
Contour detection in Wireless Sensor Networks (WSNs) is crucial for tasks like energy saving and network optimization, especially in security and surveillance applications. Coverage holes, where data transmission is not achievable, are a significant issue caused by factors such as energy depletion and physical damage. Traditional methods for detecting these holes often suffer from inaccuracy, low processing speed, and high energy consumption, relying heavily on physical information like node coordinates and sensing range. To address these challenges, we propose a novel, coordinate-free coverage hole detection method using Connected Component Labeling (CCL) and Force-Directed (FD) algorithms, termed FD-CCL. This method does not require node coordinates or sensing range information. We also investigate Suzuki's Contour Tracing (CT) algorithm and compare its performance with CCL on various FD graphs. Our experiments demonstrate the effectiveness of FD-CCL in terms of processing time and accuracy. Simulation results confirm the superiority of FD-CCL in detecting and locating coverage holes in WSNs.
☆ G2rammar: Bilingual Grammar Modeling for Enhanced Text-attributed Graph Learning
Text-attributed graphs require models to effectively integrate both structural topology and semantic content. Recent approaches apply large language models to graphs by linearizing structures into token sequences through random walks. These methods create concise graph vocabularies to replace verbose natural language descriptions. However, they overlook a critical component that makes language expressive: grammar. In natural language, grammar assigns syntactic roles to words and defines their functions within sentences. Similarly, nodes in graphs play distinct structural roles as hubs, bridges, or peripheral members. Current graph language methods provide tokens without grammatical annotations to indicate these structural or semantic roles. This absence limits language models' ability to reason about graph topology effectively. We propose \textbf{G2rammar}, a bilingual grammar framework that explicitly encodes both structural and semantic grammar for text-attributed graphs. Structural grammar characterizes topological roles through centrality and neighborhood patterns. Semantic grammar captures content relationships through textual informativity. The framework implements two-stage learning with structural grammar pre-training followed by semantic grammar fine-tuning. Extensive experiments on real-world datasets demonstrate that G2rammar consistently outperforms competitive baselines by providing language models with the grammatical context needed to understand graph structures.
☆ GraphGeo: Multi-Agent Debate Framework for Visual Geo-localization with Heterogeneous Graph Neural Networks
Visual geo-localization requires extensive geographic knowledge and sophisticated reasoning to determine image locations without GPS metadata. Traditional retrieval methods are constrained by database coverage and quality. Recent Large Vision-Language Models (LVLMs) enable direct location reasoning from image content, yet individual models struggle with diverse geographic regions and complex scenes. Existing multi-agent systems improve performance through model collaboration but treat all agent interactions uniformly. They lack mechanisms to handle conflicting predictions effectively. We propose \textbf{GraphGeo}, a multi-agent debate framework using heterogeneous graph neural networks for visual geo-localization. Our approach models diverse debate relationships through typed edges, distinguishing supportive collaboration, competitive argumentation, and knowledge transfer. We introduce a dual-level debate mechanism combining node-level refinement and edge-level argumentation modeling. A cross-level topology refinement strategy enables co-evolution between graph structure and agent representations. Experiments on multiple benchmarks demonstrate GraphGeo significantly outperforms state-of-the-art methods. Our framework transforms cognitive conflicts between agents into enhanced geo-localization accuracy through structured debate.
☆ Empowering LLMs with Structural Role Inference for Zero-Shot Graph Learning
Large Language Models have emerged as a promising approach for graph learning due to their powerful reasoning capabilities. However, existing methods exhibit systematic performance degradation on structurally important nodes such as bridges and hubs. We identify the root cause of these limitations. Current approaches encode graph topology into static features but lack reasoning scaffolds to transform topological patterns into role-based interpretations. This limitation becomes critical in zero-shot scenarios where no training data establishes structure-semantics mappings. To address this gap, we propose DuoGLM, a training-free dual-perspective framework for structure-aware graph reasoning. The local perspective constructs relation-aware templates capturing semantic interactions between nodes and neighbors. The global perspective performs topology-to-role inference to generate functional descriptions of structural positions. These complementary perspectives provide explicit reasoning mechanisms enabling LLMs to distinguish topologically similar but semantically different nodes. Extensive experiments across eight benchmark datasets demonstrate substantial improvements. DuoGLM achieves 14.3\% accuracy gain in zero-shot node classification and 7.6\% AUC improvement in cross-domain transfer compared to existing methods. The results validate the effectiveness of explicit role reasoning for graph understanding with LLMs.
♻ ☆ MOSPA: Human Motion Generation Driven by Spatial Audio NeurIPS 2025
Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As of yet, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our code and model are publicly available at https://github.com/xsy27/Mospa-Acoustic-driven-Motion-Generation
comment: NeurIPS 2025 (Spotlight)
♻ ☆ Improved visual-information-driven model for crowd simulation and its modular application
Data-driven crowd simulation models offer advantages in enhancing the accuracy and realism of simulations, and improving their generalizability is essential for promoting application. Current data-driven approaches are primarily designed for a single scenario, with very few models validated across more than two scenarios. It is still an open question to develop data-driven crowd simulation models with strong generalizibility. We notice that the key to addressing this challenge lies in effectively and accurately capturing the core common influential features that govern pedestrians' navigation across diverse scenarios. Particularly, we believe that visual information is one of the most dominant influencing features. In light of this, this paper proposes a data-driven model incorporating a refined visual information extraction method and exit cues to enhance generalizability. The proposed model is examined on four common fundamental modules: bottleneck, corridor, corner and T-junction. The evaluation results demonstrate that our model performs excellently across these scenarios, aligning with pedestrian movement in real-world experiments, and significantly outperforms the classical knowledge-driven model. Furthermore, we introduce a modular approach to apply our proposed model in composite scenarios, and the results regarding trajectories and fundamental diagrams indicate that our simulations closely match real-world patterns in the composite scenario. The research outcomes can provide inspiration for the development of data-driven crowd simulation models with high generalizability and advance the application of data-driven approaches.This work has been submitted to Elsevier for possible publication.
Computation and Language 29
♻ ☆ Learning to Steer: Input-dependent Steering for Multimodal LLMs NeurIPS 2025
Steering has emerged as a practical approach to enable post-hoc guidance of LLMs towards enforcing a specific behavior. However, it remains largely underexplored for multimodal LLMs (MLLMs); furthermore, existing steering techniques, such as mean steering, rely on a single steering vector, applied independently of the input query. This paradigm faces limitations when the desired behavior is dependent on the example at hand. For example, a safe answer may consist in abstaining from answering when asked for an illegal activity, or may point to external resources or consultation with an expert when asked about medical advice. In this paper, we investigate a fine-grained steering that uses an input-specific linear shift. This shift is computed using contrastive input-specific prompting. However, the input-specific prompts required for this approach are not known at test time. Therefore, we propose to train a small auxiliary module to predict the input-specific steering vector. Our approach, dubbed as L2S (Learn-to-Steer), demonstrates that it reduces hallucinations and enforces safety in MLLMs, outperforming other static baselines. Our code is publicly available at https://jayneelparekh.github.io/learn-to-steer/
comment: NeurIPS 2025
♻ ☆ Multi-Step Reasoning with Large Language Models, a Survey
Large language models (LLMs) with billions of parameters exhibit in-context learning abilities, enabling few-shot learning on tasks that the model was not specifically trained for. Traditional models achieve breakthrough performance on language tasks, but do not perform well on basic reasoning benchmarks. However, a new in-context learning approach, Chain-of-thought, has demonstrated strong multi-step reasoning abilities on these benchmarks. The research on LLM reasoning abilities started with the question whether LLMs can solve grade school math word problems, and has expanded to other tasks in the past few years. This article reviews the field of multi-step reasoning with LLMs. We propose a taxonomy that identifies different ways to generate, evaluate, and control multi-step reasoning. We provide an in-depth coverage of core approaches and open problems, and we propose a research agenda for the near future. We find that multi-step reasoning approaches have progressed beyond math word problems, and can now successfully solve challenges in logic, combinatorial games, and robotics, sometimes by first generating code that is then executed by external tools. Many studies in multi-step methods use reinforcement learning for finetuning, external optimization loops, in-context reinforcement learning, and self-reflection.
comment: ACM Computing Surveys
♻ ☆ Spatial Knowledge Graph-Guided Multimodal Synthesis
Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. Our approach addresses this critical gap by providing a systematic framework for generating spatially coherent data. In this work, we introduce SKG2DATA, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2DATA employs an automated pipeline for constructing Spatial Knowledge Graph (SKG) that effectively captures human-like spatial cognition, including directional and distance relationships. These structured representations then serve as precise guidance for our integrated synthesis pipeline, where a diffusion model generates spatially-consistent images while a MLLM produces corresponding textual descriptions. The automated construction of SKG enables scalable generation of diverse yet realistic spatial configurations, overcoming the limitations of manual data collection and annotation. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly, albeit with a slight cost to their general capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence. Code is available at https://github.com/zjunlp/Knowledge2Data.
comment: IEEE/ACM Transactions on Audio, Speech and Language Processing
♻ ☆ Natural Language Generation
This article provides a brief overview of the field of Natural Language Generation. The term Natural Language Generation (NLG), in its broadest definition, refers to the study of systems that verbalize some form of information through natural language. That information could be stored in a large database or knowledge graph (in data-to-text applications), but NLG researchers may also study summarisation (text-to-text) or image captioning (image-to-text), for example. As a subfield of Natural Language Processing, NLG is closely related to other sub-disciplines such as Machine Translation (MT) and Dialog Systems. Some NLG researchers exclude MT from their definition of the field, since there is no content selection involved where the system has to determine what to say. Conversely, dialog systems do not typically fall under the header of Natural Language Generation since NLG is just one component of dialog systems (the others being Natural Language Understanding and Dialog Management). However, with the rise of Large Language Models (LLMs), different subfields of Natural Language Processing have converged on similar methodologies for the production of natural language and the evaluation of automatically generated text.
comment: 4 pages + references. Submitted for publication in the Encyclopedia of Language & Linguistics
♻ ☆ An Exploration of Knowledge Editing for Arabic
While Knowledge Editing (KE) has been widely explored in English, its behavior in morphologically rich languages like Arabic remains underexamined. In this work, we present the first study of Arabic KE. We evaluate four methods (ROME, MEMIT, ICE, and LTE) on Arabic translations of the ZsRE and Counterfact benchmarks, analyzing both multilingual and cross-lingual settings. Our experiments on Llama-2-7B-chat show that parameter-based methods struggle with cross-lingual generalization, while instruction-tuned methods perform more robustly. We extend Learning-To-Edit (LTE) to a multilingual setting and show that joint Arabic-English training improves both editability and transfer. We release Arabic KE benchmarks and multilingual training for LTE data to support future research.
♻ ☆ Solving Inequality Proofs with Large Language Models NeurIPS 2025
Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.
comment: 50 pages, 24 figures, accepted as a Spotlight at NeurIPS 2025
♻ ☆ Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning
Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory, significantly limiting real-world deployment. Recently, zeroth-order (ZO) optimization stood out as a promising memory-efficient training paradigm, avoiding backward passes and relying solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO method lags far behind FO method in both convergence speed and accuracy. To bridge the gap, we introduce a novel layer-wise divergence analysis that uncovers the distinct update pattern of FO and ZO optimization. Aiming to resemble the learning capacity of FO method from the findings, we propose Divergence-driven Zeroth-Order (DiZO) optimization. DiZO conducts divergence-driven layer adaptation by incorporating projections to ZO updates, generating diverse-magnitude updates precisely scaled to layer-wise individual optimization needs. Our results demonstrate that DiZO significantly reduces the needed iterations for convergence without sacrificing throughput, cutting training GPU hours by up to 48\% on various datasets. Moreover, DiZO consistently outperforms the representative ZO baselines in fine-tuning RoBERTa-large, OPT-series, and Llama-series on downstream tasks and, in some cases, even surpasses memory-intensive FO fine-tuning. Our code is released at https://github.com/Skilteee/DiZO.
♻ ☆ Self-correction is Not An Innate Capability in Large Language Models
Although there has been growing interest in the self-correction capability of Large Language Models (LLMs), there are varying conclusions about its effectiveness. Prior research has largely concentrated on intrinsic self-correction, extrinsic self-correction, particularly the interplay between internal knowledge and external feedback, remains underexplored. In this paper, we aim to comprehensively investigate the underlying mechanism of moral self-correction by addressing a fundamental question: is moral self-correction an innate capability of LLMs? Specifically, we conduct: (1) a behavioral analysis of LLMs' moral sensitivity based on a self-distinguishing task; and (2) a mechanistic analysis of the hidden states to examine how key components of self-correction, such as Chain-of-Thought (CoT) and external feedback, interact to facilitate moral self-correction. Drawing on empirical evidence from both behavioral and mechanistic analyses, we demonstrate that moral self-correction is not an inherent capability of LLMs, as they are neither morally sensitive nor able to effectively incorporate external feedback during the self-correction process.
♻ ☆ Discourse Heuristics For Paradoxically Moral Self-Correction
Moral self-correction has emerged as a promising approach for aligning the output of Large Language Models (LLMs) with human moral values. However, moral self-correction techniques are subject to two primary paradoxes. First, despite empirical and theoretical evidence to support the effectiveness of self-correction, this LLM capability only operates at a superficial level. Second, while LLMs possess the capability of self-diagnosing immoral aspects of their output, they struggle to identify the cause of this moral inconsistency during their self-correction process. To better understand and address these paradoxes, we analyze the discourse constructions in fine-tuning corpora designed to enhance moral self-correction, uncovering the existence of the heuristics underlying effective constructions. We demonstrate that moral self-correction relies on discourse constructions that reflect heuristic shortcuts, and that the presence of these heuristic shortcuts during self-correction leads to inconsistency when attempting to enhance both self-correction and self-diagnosis capabilities jointly. Based on our findings, we propose a solution to improve moral self-correction by leveraging the heuristics of curated datasets. We also highlight the generalization challenges of this capability, particularly in terms of learning from situated context and model scales.
♻ ☆ Debiasing LLMs by Masking Unfairness-Driving Attention Heads
Large language models (LLMs) increasingly mediate decisions in domains where unfair treatment of demographic groups is unacceptable. Existing work probes when biased outputs appear, but gives little insight into the mechanisms that generate them, leaving existing mitigations largely fragile. In this paper, we conduct a systematic investigation LLM unfairness and propose DiffHeads, a lightweight debiasing framework for LLMs. We first compare Direct-Answer (DA) prompting to Chain-of-Thought (CoT) prompting across eight representative open- and closed-source LLMs. DA will trigger the nature bias part of LLM and improve measured unfairness by 534.5%-391.9% in both one-turn and two-turn dialogues. Next, we define a token-to-head contribution score that traces each token's influence back to individual attention heads. This reveals a small cluster of bias heads that activate under DA but stay largely dormant with CoT, providing the first causal link between prompting strategy and bias emergence. Finally, building on this insight, we propose DiffHeads that identifies bias heads through differential activation analysis between DA and CoT, and selectively masks only those heads. DiffHeads reduces unfairness by 49.4%, and 40.3% under DA and CoT, respectively, without harming model utility.
♻ ☆ Self-Adaptive Cognitive Debiasing for Large Language Models in Decision-Making
Large language models (LLMs) have shown potential in supporting decision-making applications, particularly as personal assistants in the financial, healthcare, and legal domains. While prompt engineering strategies have enhanced the capabilities of LLMs in decision-making, cognitive biases inherent to LLMs present significant challenges. Cognitive biases are systematic patterns of deviation from norms or rationality in decision-making that can lead to the production of inaccurate outputs. Existing cognitive bias mitigation strategies assume that input prompts only contain one type of cognitive bias, limiting their effectiveness in more challenging scenarios involving multiple cognitive biases. To fill this gap, we propose a cognitive debiasing approach, self-adaptive cognitive debiasing (SACD), that enhances the reliability of LLMs by iteratively refining prompts. Our method follows three sequential steps - bias determination, bias analysis, and cognitive debiasing - to iteratively mitigate potential cognitive biases in prompts. We evaluate SACD on finance, healthcare, and legal decision-making tasks using both open-weight and closed-weight LLMs. Compared to advanced prompt engineering methods and existing cognitive debiasing techniques, SACD achieves the lowest average bias scores in both single-bias and multi-bias settings.
♻ ☆ PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts NeurIPS 2025
In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation for advanced LLMs and find that even Qwen-3-235B-A22B-Thinking and Gemini-2.5-pro, achieve only 54.6 and 52.2 benchmark scores, with about 40% accuracy under the highest level From a language perspective, our benchmark reveals several key challenges of LLMs in multilingual reasoning: (1) Reasoning performance varies widely across languages for current LLMs; (2) Input-output language consistency is low in reasoning LLMs and may be correlated with performance; (3) The thinking length differs significantly by language for current LLMs. Additionally, we demonstrate that controlling the output language in the instructions has the potential to affect reasoning performance, especially for some low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.
comment: Accepted by NeurIPS 2025
♻ ☆ Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding NeurIPS 2025
Test-time scaling enhances large language model performance by allocating additional compute resources during inference. Best-of-N (BoN) sampling serves as a common sampling-based scaling technique, broadening the search space in parallel to find better solutions from the model distribution. However, its cost-performance trade-off is still underexplored. Two main challenges limit the efficiency of BoN sampling: (1) Generating N full samples consumes substantial GPU memory, reducing inference capacity under limited resources. (2) Reward models add extra memory and latency overhead, and training strong reward models introduces potential training data costs. Although some studies have explored efficiency improvements, none have addressed both challenges at once. To address this gap, we propose Self-Truncation Best-of-N (ST-BoN), a decoding method that avoids fully generating all N samples and eliminates the need for reward models. It leverages early sampling consistency in the model's internal states to identify the most promising path and truncate suboptimal ones. In terms of cost, ST-BoN reduces dynamic GPU memory usage by over 80% and inference latency by 50%. In terms of cost-performance trade-off, ST-BoN achieves the same performance as Full-BoN while saving computational cost by 70%-80%, and under the same cost, it can improve accuracy by 3-4 points.
comment: Accepted by NeurIPS 2025 (Spotlight)
♻ ☆ Optimizing Token Choice for Code Watermarking: An RL Approach
Protecting intellectual property on LLM-generated code necessitates effective watermarking systems that can operate within code's highly structured, syntactically constrained nature. In this work, we introduce CodeTracer, an innovative adaptive code watermarking framework underpinned by a novel reinforcement learning training paradigm. At its core, CodeTracer features a policy-driven approach that utilizes a parameterized model to intelligently bias token choices during next-token prediction. This strategy ensures that embedded watermarks maintain code functionality while exhibiting subtle yet statistically detectable deviations from typical token distributions. To facilitate policy learning, we devise a comprehensive reward system that seamlessly integrates execution feedback with watermark embedding signals, balancing process-level and outcome-level rewards. Additionally, we employ Gumbel Top-k reparameterization to enable gradient-based optimization of discrete watermarking decisions. Extensive comparative evaluations demonstrate CodeTracer's significant superiority over state-of-the-art baselines in both watermark detectability and the preservation of generated code's functionality.
comment: 18 pages, 3 figures
♻ ☆ From BERT to LLMs: Comparing and Understanding Chinese Classifier Prediction in Language Models
Classifiers are an important and defining feature of the Chinese language, and their correct prediction is key to numerous educational applications. Yet, whether the most popular Large Language Models (LLMs) possess proper knowledge the Chinese classifiers is an issue that has largely remain unexplored in the Natural Language Processing (NLP) literature. To address such a question, we employ various masking strategies to evaluate the LLMs' intrinsic ability, the contribution of different sentence elements, and the working of the attention mechanisms during prediction. Besides, we explore fine-tuning for LLMs to enhance the classifier performance. Our findings reveal that LLMs perform worse than BERT, even with fine-tuning. The prediction, as expected, greatly benefits from the information about the following noun, which also explains the advantage of models with a bidirectional attention mechanism such as BERT.
♻ ☆ CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward
We propose Fuzzy Group Relative Policy Reward (FGRPR), a novel framework that integrates Group Relative Policy Optimization (GRPO) with a fuzzy reward function to enhance learning efficiency. Unlike the conventional binary 0/1 accuracy reward, our fuzzy reward model provides nuanced incentives, encouraging more precise outputs. Experimental results demonstrate that GRPO with a standard 0/1 accuracy reward underperforms compared to supervised fine-tuning (SFT). In contrast, FGRPR, applied to Qwen2.5-VL(3B and 7B), surpasses all baseline models, including GPT4o, LLaMA2(90B), and SFT, across five in-domain datasets. On an out-of-domain dataset, FGRPR achieves performance comparable to SFT but excels when target values are larger, as its fuzzy reward function assigns higher rewards to closer approximations. This approach is broadly applicable to tasks where the precision of the answer is critical. Code and data: https://github.com/yeyimilk/CrowdVLM-R1
comment: 10 pages, 6 figures and 4 tables
♻ ☆ Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios
Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. The recent emergence of multimodal reasoning models has significantly transformed the landscape of solving complex tasks. Although such models have achieved notable success in mathematics and science, their application to medical domains remains underexplored. In this work, we propose \textit{MedE$^2$}, a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for medical domains. In Stage-I, we fine-tune models using 2,000 text-only data samples containing precisely orchestrated reasoning demonstrations to elicit reasoning behaviors. In Stage-II, we further enhance the model's reasoning capabilities using 1,500 rigorously curated multimodal medical cases, aligning model reasoning outputs with our proposed multimodal medical reasoning preference. Extensive experiments demonstrate the efficacy and reliability of \textit{MedE$^2$} in improving the reasoning performance of medical multimodal models. Notably, models trained with \textit{MedE$^2$} consistently outperform baselines across multiple medical multimodal benchmarks. Additional validation on larger models and under inference-time scaling further confirms the robustness and practical utility of our approach.
♻ ☆ Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness - whether performance remains stable across varying input resolutions. To address this gap, we introduce \textbf{Res-Bench}, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
comment: The authors have discovered a significant error in the paper subsequent to submission, and are withdrawing the manuscript for substantial correction
♻ ☆ KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization
This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems, consisting of Automatic Speech Recognition (ASR) and Machine Translation (MT) models, and end-to-end (E2E) Speech Translation (ST) systems for three language pairs: Bemba, North Levantine Arabic, and Tunisian Arabic into English. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently. This study further explores system enhancement with synthetic data and model regularization. Specifically, we investigate MT-augmented ST by generating translations from ASR data using MT models. For North Levantine, which lacks parallel ST training data, a system trained solely on synthetic data slightly surpasses the cascaded system trained on real data. We also explore augmentation using text-to-speech models by generating synthetic speech from MT data, demonstrating the benefits of synthetic data in improving both ASR and ST performance for Bemba. Additionally, we apply intra-distillation to enhance model performance. Our experiments show that this approach consistently improves results across ASR, MT, and ST tasks, as well as across different pre-trained models. Finally, we apply Minimum Bayes Risk decoding to combine the cascaded and end-to-end systems, achieving an improvement of approximately 1.5 BLEU points.
♻ ☆ SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks
With the rapid advancement of Large Language Models (LLMs), the safety of LLMs has been a critical concern requiring precise assessment. Current benchmarks primarily concentrate on single-turn dialogues or a single jailbreak attack method to assess the safety. Additionally, these benchmarks have not taken into account the LLM's capability of identifying and handling unsafe information in detail. To address these issues, we propose a fine-grained benchmark SafeDialBench for evaluating the safety of LLMs across various jailbreak attacks in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that considers 6 safety dimensions and generates more than 4000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios. We employ 7 jailbreak attack strategies, such as reference attack and purpose reverse, to enhance the dataset quality for dialogue generation. Notably, we construct an innovative assessment framework of LLMs, measuring capabilities in detecting, and handling unsafe information and maintaining consistency when facing jailbreak attacks. Experimental results across 17 LLMs reveal that Yi-34B-Chat and GLM4-9B-Chat demonstrate superior safety performance, while Llama3.1-8B-Instruct and o3-mini exhibit safety vulnerabilities.
♻ ☆ Medical Hallucinations in Foundation Models and Their Impact on Healthcare
Hallucinations in foundation models arise from autoregressive training objectives that prioritize token-likelihood optimization over epistemic accuracy, fostering overconfidence and poorly calibrated uncertainty. We define medical hallucination as any model-generated output that is factually incorrect, logically inconsistent, or unsupported by authoritative clinical evidence in ways that could alter clinical decisions. We evaluated 11 foundation models (7 general-purpose, 4 medical-specialized) across seven medical hallucination tasks spanning medical reasoning and biomedical information retrieval. General-purpose models achieved significantly higher proportions of hallucination-free responses than medical-specialized models (median: 76.6% vs 51.3%, difference = 25.2%, 95% CI: 18.7-31.3%, Mann-Whitney U = 27.0, p = 0.012, rank-biserial r = -0.64). Top-performing models such as Gemini-2.5 Pro exceeded 97% accuracy when augmented with chain-of-thought prompting (base: 87.6%), while medical-specialized models like MedGemma ranged from 28.6-61.9% despite explicit training on medical corpora. Chain-of-thought reasoning significantly reduced hallucinations in 86.4% of tested comparisons after FDR correction (q < 0.05), demonstrating that explicit reasoning traces enable self-verification and error detection. Physician audits confirmed that 64-72% of residual hallucinations stemmed from causal or temporal reasoning failures rather than knowledge gaps. A global survey of clinicians (n = 70) validated real-world impact: 91.8% had encountered medical hallucinations, and 84.7% considered them capable of causing patient harm. The underperformance of medical-specialized models despite domain training indicates that safety emerges from sophisticated reasoning capabilities and broad knowledge integration developed during large-scale pre-training, not from narrow optimization.
♻ ☆ AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), lack inductive bias to constrain visual features within the linguistic structure of the LLM's embedding space, making them data-hungry and prone to cross-modal misalignment. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where visual and textual modalities are highly correlated. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods, with larger gains on document understanding tasks and under low-resource setups. We provide further analysis demonstrating its efficiency and robustness to noise.
♻ ☆ Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists' First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.
comment: 82 pages
♻ ☆ Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse
Amidst the rapid normalization of generative artificial intelligence (GAI), intelligent systems have come to dominate political discourse across information media. However, internalized political biases stemming from training data skews, human prejudice, and algorithmic flaws continue to plague this novel technology. This study employs a zero-shot classification approach to evaluate algorithmic political partisanship through a methodical combination of ideological alignment, topicality, response sentiment, and objectivity. A total of 1800 model responses across six mainstream large language models (LLMs) were individually input into four distinct fine-tuned classification algorithms, each responsible for computing one of the aforementioned metrics. The results show an amplified liberal-authoritarian alignment across the six LLMs evaluated, with notable instances of reasoning supersessions and canned refusals. The study subsequently highlights the psychological influences underpinning human-computer interactions and how intrinsic biases can permeate public discourse. The resulting distortion of the political landscape can ultimately manifest as conformity or polarization, depending on the region's pre-existing socio-political structures.
comment: 19 pages, 7 figures
♻ ☆ Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP
The critical lack of structured terminological data for South Africa's official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. Mafoko addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational Mafoko dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. Mafoko provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa's rich linguistic diversity is represented in the digital age.
comment: Accepted for Sixth Workshop on Resources for African Indigenous Languages (RAIL) 2025
♻ ☆ FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs EMNLP
The accurate trust assessment of multimodal large language models (MLLMs) generated predictions, which can enable selective prediction and improve user confidence, is challenging due to the diverse multi-modal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs, that generates an uncertainty measure based on the equivalent and complementary input samplings. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA uses only input-output access of the model (black-box), and does not require ground truth (unsupervised). The experiments are conducted with various off-the-shelf multi-modal LLMs, on both visual and audio reasoning tasks. The proposed FESTA uncertainty estimate achieves significant improvement (33.3% relative improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) in selective prediction performance, based on area-under-receiver-operating-characteristic curve (AUROC) metric in detecting mispredictions. The code implementation is open-sourced.
comment: Accepted in the Findings of EMNLP, 2025
♻ ☆ Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning
We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method. Our code can be found on https://github.com/divelab/E2H-Reasoning.
♻ ☆ Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access NeurIPS 2025
A key advantage of Recurrent Neural Networks (RNNs) over Transformers is their linear computational and space complexity enables faster training and inference for long sequences. However, RNNs are fundamentally unable to randomly access historical context, and simply integrating attention mechanisms may undermine their efficiency advantages. To overcome this limitation, we propose Hierarchical Sparse Attention (HSA), a novel attention mechanism that enhances RNNs with long-range random access flexibility while preserving their merits in efficiency and length generalization. HSA divides inputs into chunks, selects the top-$k$ chunks and hierarchically aggregates information. The core innovation lies in learning token-to-chunk relevance based on fine-grained token-level information inside each chunk. This approach enhances the precision of chunk selection across both in-domain and out-of-domain context lengths. To make HSA efficient, we further introduce a hardware-aligned kernel design. By combining HSA with Mamba, we introduce RAMba, which achieves perfect accuracy in passkey retrieval across 64 million contexts despite pre-training on only 4K-length contexts, and significant improvements on various downstream tasks, with nearly constant memory footprint. These results show RAMba's huge potential in long-context modeling.
comment: Accepted to NeurIPS 2025
♻ ☆ BrainLLM: Generative Language Decoding from Brain Recordings
Generating human language through non-invasive brain-computer interfaces (BCIs) has the potential to unlock many applications, such as serving disabled patients and improving communication. Currently, however, generating language via BCIs has been previously successful only within a classification setup for selecting pre-generated sentence continuation candidates with the most likely cortical semantic representation. Inspired by recent research that revealed associations between the brain and the large computational language models, we propose a generative language BCI that utilizes the capacity of a large language model (LLM) jointly with a semantic brain decoder to directly generate language from functional magnetic resonance imaging (fMRI) input. The proposed model can generate coherent language sequences aligned with the semantic content of visual or auditory language stimuli perceived, without prior knowledge of any pre-generated candidates. We compare the language generated from the presented model with a random control, pre-generated language selection approach, and a standard LLM, which generates common coherent text solely based on the next word likelihood according to statistical language training data. The proposed model is found to generate language that is more aligned with semantic stimulus in response to which brain input is sampled. Our findings demonstrate the potential and feasibility of employing BCIs in direct language generation.
comment: Nature Communications Biology
Information Retrieval 8
☆ Controlling Gender Bias in Retrieval via a Backpack Architecture
The presence of social biases in large language models (LLMs) has become a significant concern in AI research. These biases, often embedded in training data, can perpetuate harmful stereotypes and distort decision-making processes. When LLMs are integrated into ranking systems, they can propagate these biases, leading to unfair outcomes in critical applications such as search engines and recommendation systems. Backpack Language Models, unlike traditional transformer-based models that treat text sequences as monolithic structures, generate outputs as weighted combinations of non-contextual, learned word aspects, also known as senses. Leveraging this architecture, we propose a framework for debiasing ranking tasks. Our experimental results show that this framework effectively mitigates gender bias in text retrieval and ranking with minimal degradation in performance.
☆ REaR: Retrieve, Expand and Refine for Effective Multitable Retrieval
Answering natural language queries over relational data often requires retrieving and reasoning over multiple tables, yet most retrievers optimize only for query-table relevance and ignore table table compatibility. We introduce REAR (Retrieve, Expand and Refine), a three-stage, LLM-free framework that separates semantic relevance from structural joinability for efficient, high-fidelity multi-table retrieval. REAR (i) retrieves query-aligned tables, (ii) expands these with structurally joinable tables via fast, precomputed column-embedding comparisons, and (iii) refines them by pruning noisy or weakly related candidates. Empirically, REAR is retriever-agnostic and consistently improves dense/sparse retrievers on complex table QA datasets (BIRD, MMQA, and Spider) by improving both multi-table retrieval quality and downstream SQL execution. Despite being LLM-free, it delivers performance competitive with state-of-the-art LLM-augmented retrieval systems (e.g.,ARM) while achieving much lower latency and cost. Ablations confirm complementary gains from expansion and refinement, underscoring REAR as a practical, scalable building block for table-based downstream tasks (e.g., Text-to-SQL).
comment: 13 pages, 2 figures, 8 tables
♻ ☆ Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals NeurIPS 2025
Retrieval-augmented generation (RAG) has shown impressive capabilities in mitigating hallucinations in large language models (LLMs). However, LLMs struggle to maintain consistent reasoning when exposed to misleading or conflicting evidence, especially in real-world domains such as politics, where information is polarized or selectively framed. Mainstream RAG benchmarks evaluate models under clean retrieval settings, where systems generate answers from gold-standard documents, or under synthetically perturbed settings, where documents are artificially injected with noise. These assumptions fail to reflect real-world conditions, often leading to an overestimation of RAG system performance. To address this gap, we introduce RAGuard, the first benchmark to evaluate the robustness of RAG systems against misleading retrievals. Unlike prior benchmarks that rely on synthetic noise, our fact-checking dataset captures naturally occurring misinformation by constructing its retrieval corpus from Reddit discussions. It categorizes retrieved evidence into three types: supporting, misleading, and unrelated, providing a realistic and challenging testbed for assessing how well RAG systems navigate different types of evidence. Our experiments reveal that, when exposed to potentially misleading retrievals, all tested LLM-powered RAG systems perform worse than their zero-shot baselines (i.e., no retrieval at all), while human annotators consistently perform better, highlighting LLMs' susceptibility to noisy environments. To our knowledge, RAGuard is the first benchmark to systematically assess the robustness of the RAG against misleading evidence. We expect this benchmark to drive future research toward improving RAG systems beyond idealized datasets, making them more reliable for real-world applications. The dataset is available at https://huggingface.co/datasets/UCSC-IRKM/RAGuard.
comment: Advances in Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Gated Rotary-Enhanced Linear Attention for Long-term Sequential Recommendation
In Sequential Recommendation Systems (SRSs), Transformer models have demonstrated remarkable performance but face computational and memory cost challenges, especially when modeling long-term user behavior sequences. Due to its quadratic complexity, the dot-product attention mechanism in Transformers becomes expensive for processing long sequences. By approximating the dot-product attention using elaborate mapping functions, linear attention provides a more efficient option with linear complexity. However, existing linear attention methods face three limitations: 1) they often use learnable position encodings, which incur extra computational costs in long-term sequence scenarios, 2) they may not sufficiently account for user's fine-grained local preferences (short-lived burst of interest), and 3) they try to capture some temporary activities, but often confuse these with stable and long-term interests. This can result in unclear or less effective recommendations. To remedy these drawbacks, we propose a long-term sequential Recommendation model with Gated Rotary Enhanced Linear Attention (RecGRELA). Specifically, we first propose a Rotary-Enhanced Linear Attention (RELA) module to efficiently model long-range dependency within the user's historical information using rotary position encodings. Then, we introduce a local short operation to add the local preferences of interactions and show the theoretical insight. We further introduce a SiLU-based Gated mechanism for RELA (GRELA) to help the model tell if a user behavior shows a short-term, local interest or a real change in their long-term tastes. Experimental results on four public benchmark datasets show that our RecGRELA achieves state-of-the-art performance compared with existing SRSs based on Recurrent Neural Networks, Transformer, and Mamba while keeping low memory overhead.
comment: 14 pages,9 figures
♻ ☆ SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model
Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and business scenarios, such as the limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. In the optimization procedure, we propose a multi-stage training scheme to boost the multifaceted effectiveness of representation learning. Specifically, the content-aware progressive training aims to enhance the model's adaptability to diverse downstream tasks and master enriched cross-modal proficiency. The collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings while mining user historical interests. Concurrently, we develop the stochastic specialization and dataset-driven pattern matching to strengthen model training flexibility and generalizability. Experimental results show that SAIL-Embedding achieves SOTA performance compared to other methods in different retrieval tasks. In online experiments across various real-world scenarios integrated with our model, we observe a significant increase in Lifetime (LT), which is a crucial indicator for the recommendation experience. For instance, the model delivers the 7-day LT gain of +0.5% in the Douyin-Selected scenario. For the Douyin feed rank model, the match features produced by SAIL-Embedding yield a +0.1% AUC gain.
comment: Technical Report
♻ ☆ LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts
Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset link at https://github.com/TechNomad-ds/LoVR-benchmark
♻ ☆ HCT-QA: A Benchmark for Question Answering on Human-Centric Tables
Tabular data embedded within PDF files, web pages, and other document formats are prevalent across numerous sectors such as government, engineering, science, and business. These human-centric tables (HCTs) possess a unique combination of high business value, intricate layouts, limited operational power at scale, and sometimes serve as the only data source for critical insights. However, their complexity poses significant challenges to traditional data extraction, processing, and querying methods. While current solutions focus on transforming these tables into relational formats for SQL queries, they fall short in handling the diverse and complex layouts of HCTs and hence being amenable to querying. This paper describes HCT-QA, an extensive benchmark of HCTs, natural language queries, and related answers on thousands of tables. Our dataset includes 2,188 real-world HCTs with 9,835 QA pairs and 4,679 synthetic tables with 67.5K QA pairs. While HCTs can be potentially processed by different type of query engines, in this paper, we focus on Large Language Models as potential engines and assess their ability in processing and querying such tables.
♻ ☆ Federated Vision-Language-Recommendation with Personalized Fusion
Applying large pre-trained Vision-Language Models to recommendation is a burgeoning field, a direction we term Vision-Language-Recommendation (VLR). Bringing VLR to user-oriented on-device intelligence within a federated learning framework is a crucial step for enhancing user privacy and delivering personalized experiences. This paper introduces FedVLR, a federated VLR framework specially designed for user-specific personalized fusion of vision-language representations. At its core is a novel bi-level fusion mechanism: The server-side multi-view fusion module first generates a diverse set of pre-fused multimodal views. Subsequently, each client employs a user-specific mixture-of-expert mechanism to adaptively integrate these views based on individual user interaction history. This designed lightweight personalized fusion module provides an efficient solution to implement a federated VLR system. The effectiveness of our proposed FedVLR has been validated on seven benchmark datasets.
comment: 15 pages, 10 figures, 7 tables, conference
Multimedia 5
☆ Deciphering Personalization: Towards Fine-Grained Explainability in Natural Language for Personalized Image Generation Models
Image generation models are usually personalized in practical uses in order to better meet the individual users' heterogeneous needs, but most personalized models lack explainability about how they are being personalized. Such explainability can be provided via visual features in generated images, but is difficult for human users to understand. Explainability in natural language is a better choice, but the existing approaches to explainability in natural language are limited to be coarse-grained. They are unable to precisely identify the multiple aspects of personalization, as well as the varying levels of personalization in each aspect. To address such limitation, in this paper we present a new technique, namely \textbf{FineXL}, towards \textbf{Fine}-grained e\textbf{X}plainability in natural \textbf{L}anguage for personalized image generation models. FineXL can provide natural language descriptions about each distinct aspect of personalization, along with quantitative scores indicating the level of each aspect of personalization. Experiment results show that FineXL can improve the accuracy of explainability by 56\%, when different personalization scenarios are applied to multiple types of image generation models.
☆ Med-Banana-50K: A Cross-modality Large-Scale Dataset for Text-guided Medical Image Editing
Recent advances in multimodal large language models have enabled remarkable medical image editing capabilities. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built specifically for medical image editing with strict anatomical and clinical constraints. We introduce Med-Banana-50K, a comprehensive 50K-image dataset for instruction-based medical image editing spanning three modalities (chest X-ray, brain MRI, fundus photography) and 23 disease types. Our dataset is constructed by leveraging Gemini-2.5-Flash-Image to generate bidirectional edits (lesion addition and removal) from real medical images. What distinguishes Med-Banana-50K from general-domain editing datasets is our systematic approach to medical quality control: we employ LLM-as-Judge with a medically grounded rubric (instruction compliance, structural plausibility, realism, and fidelity preservation) and history-aware iterative refinement up to five rounds. Beyond single-turn editing, Med-Banana-50K includes 37K failed attempts with full conversation logs for preference learning and alignment research. By providing this large-scale, medically validated, and fully documented resource, Med-Banana-50K establishes a foundation for training and evaluating the next generation of medical image editing models.Our dataset and code are publicly available at [https://github.com/richardChenzhihui/med-banana-50k].
☆ Rhythm in the Air: Vision-based Real-Time Music Generation through Gestures
Gesture recognition is an essential component of human-computer interaction (HCI), facilitating seamless interconnectivity between users and computer systems without physical touch. This paper introduces an innovative application of vision-based dynamic gesture recognition (VDGR) for real-time music composition through gestures. To implement this application, we generate a custom gesture dataset that encompasses over 15000 samples across 21 classes, incorporating 7 musical notes each manifesting at three distinct pitch levels. To effectively deal with the modest volume of training data and to accurately discern and prioritize complex gesture sequences for music creation, we develop a multi-layer attention-based gated recurrent unit (MLA-GRU) model, in which gated recurrent unit (GRU) is used to learn temporal patterns from the observed sequence and an attention layer is employed to focus on musically pertinent gesture segments. Our empirical studies demonstrate that MLA-GRU significantly surpasses the classical GRU model, achieving a remarkable accuracy of 96.83% compared to the baseline's 86.7%. Moreover, our approach exhibits superior efficiency and processing speed, which are crucial for interactive applications. Using our proposed system, we believe that people will interact with music in a new and exciting way. It not only advances HCI experiences but also highlights MLA-GRU's effectiveness in scenarios demanding swift and precise gesture recognition.
comment: 8 pages, 7 figures
♻ ☆ Spatial Knowledge Graph-Guided Multimodal Synthesis
Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. Our approach addresses this critical gap by providing a systematic framework for generating spatially coherent data. In this work, we introduce SKG2DATA, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2DATA employs an automated pipeline for constructing Spatial Knowledge Graph (SKG) that effectively captures human-like spatial cognition, including directional and distance relationships. These structured representations then serve as precise guidance for our integrated synthesis pipeline, where a diffusion model generates spatially-consistent images while a MLLM produces corresponding textual descriptions. The automated construction of SKG enables scalable generation of diverse yet realistic spatial configurations, overcoming the limitations of manual data collection and annotation. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly, albeit with a slight cost to their general capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence. Code is available at https://github.com/zjunlp/Knowledge2Data.
comment: IEEE/ACM Transactions on Audio, Speech and Language Processing
♻ ☆ Music Arena: Live Evaluation for Text-to-Music NeurIPS 2025
We present Music Arena, an open platform for scalable human preference evaluation of text-to-music (TTM) models. Soliciting human preferences via listening studies is the gold standard for evaluation in TTM, but these studies are expensive to conduct and difficult to compare, as study protocols may differ across systems. Moreover, human preferences might help researchers align their TTM systems or improve automatic evaluation metrics, but an open and renewable source of preferences does not currently exist. We aim to fill these gaps by offering *live* evaluation for TTM. In Music Arena, real-world users input text prompts of their choosing and compare outputs from two TTM systems, and their preferences are used to compile a leaderboard. While Music Arena follows recent evaluation trends in other AI domains, we also design it with key features tailored to music: an LLM-based routing system to navigate the heterogeneous type signatures of TTM systems, and the collection of *detailed* preferences including listening data and natural language feedback. We also propose a rolling data release policy with user privacy guarantees, providing a renewable source of preference data and increasing platform transparency. Through its standardized evaluation protocol, transparent data access policies, and music-specific features, Music Arena not only addresses key challenges in the TTM ecosystem but also demonstrates how live evaluation can be thoughtfully adapted to unique characteristics of specific AI domains. Music Arena is available at: https://music-arena.org . Preference data is available at: https://huggingface.co/music-arena .
comment: NeurIPS 2025 Creative AI Track
Image and Video Processing 3
☆ Evaluating Video Quality Metrics for Neural and Traditional Codecs using 4K/UHD-1 Videos
With neural video codecs (NVCs) emerging as promising alternatives for traditional compression methods, it is increasingly important to determine whether existing quality metrics remain valid for evaluating their performance. However, few studies have systematically investigated this using well-designed subjective tests. To address this gap, this paper presents a subjective quality assessment study using two traditional (AV1 and VVC) and two variants of a neural video codec (DCVC-FM and DCVC-RT). Six source videos (8-10 seconds each, 4K/UHD-1, 60 fps) were encoded at four resolutions (360p to 2160p) using nine different QP values, resulting in 216 sequences that were rated in a controlled environment by 30 participants. These results were used to evaluate a range of full-reference, hybrid, and no-reference quality metrics to assess their applicability to the induced quality degradations. The objective quality assessment results show that VMAF and AVQBits|H0|f demonstrate strong Pearson correlation, while FasterVQA performed best among the tested no-reference metrics. Furthermore, PSNR shows the highest Spearman rank order correlation for within-sequence comparisons across the different codecs. Importantly, no significant performance differences in metric reliability are observed between traditional and neural video codecs across the tested metrics. The dataset, consisting of source videos, encoded videos, and both subjective and quality metric scores will be made publicly available following an open-science approach (https://github.com/Telecommunication-Telemedia-Assessment/AVT-VQDB-UHD-1-NVC).
comment: Accepted for the 2025 Picture Coding Symposium (PCS)
♻ ☆ A Racing Dataset and Baseline Model for Track Detection in Autonomous Racing
A significant challenge in racing-related research is the lack of publicly available datasets containing raw images with corresponding annotations for the downstream task. In this paper, we introduce RoRaTrack, a novel dataset that contains annotated multi-camera image data from racing scenarios for track detection. The data is collected on a Dallara AV-21 at a racing circuit in Indiana, in collaboration with the Indy Autonomous Challenge (IAC). RoRaTrack addresses common problems such as blurriness due to high speed, color inversion from the camera, and absence of lane markings on the track. Consequently, we propose RaceGAN, a baseline model based on a Generative Adversarial Network (GAN) that effectively addresses these challenges. The proposed model demonstrates superior performance compared to current state-of-the-art machine learning models in track detection. The dataset and code for this work are available at https://github.com/ghosh64/RaceGAN.
comment: Currently Under Review
♻ ☆ Rethinking Glaucoma Calibration: Voting-Based Binocular and Metadata Integration
Glaucoma is a major cause of irreversible blindness, with significant diagnostic subjectivity. This inherent uncertainty, combined with the overconfidence of models optimized solely for accuracy can lead to fatal issues such as overdiagnosis or missing critical diseases. To ensure clinical trust, model calibration is essential for reliable predictions, yet study in this field remains limited. Existing calibration study have overlooked glaucoma's systemic associations and high diagnostic subjectivity. To overcome these limitations, we propose V-ViT (Voting-based ViT), a framework that enhances calibration by integrating a patient's binocular information and metadata. Furthermore, to mitigate diagnostic subjectivity, V-ViT utilizes an iterative dropout-based Voting System to maximize calibration performance. The proposed framework achieved state-of-the-art performance across all metrics, including the primary calibration metrics. Our results demonstrate that V-ViT effectively resolves the issue of overconfidence in predictions in glaucoma diagnosis, providing highly reliable predictions for clinical use. Our source code is available at https://github.com/starforTJ/V-ViT.
Computer Vision and Pattern Recognition 43
♻ ☆ ERA-Solver: Error-Robust Adams Solver for Fast Sampling of Diffusion Probabilistic Models
Though denoising diffusion probabilistic models (DDPMs) have achieved remarkable generation results, the low sampling efficiency of DDPMs still limits further applications. Since DDPMs can be formulated as diffusion ordinary differential equations (ODEs), various fast sampling methods can be derived from solving diffusion ODEs. However, we notice that previous fast sampling methods with fixed analytical form are not able to robust with the various error patterns in the noise estimated from pretrained diffusion models. In this work, we construct an error-robust Adams solver (ERA-Solver), which utilizes the implicit Adams numerical method that consists of a predictor and a corrector. Different from the traditional predictor based on explicit Adams methods, we leverage a Lagrange interpolation function as the predictor, which is further enhanced with an error-robust strategy to adaptively select the Lagrange bases with lower errors in the estimated noise. The proposed solver can be directly applied to any pretrained diffusion models, without extra training. Experiments on Cifar10, CelebA, LSUN-Church, and ImageNet 64 x 64 (conditional) datasets demonstrate that our proposed ERA-Solver achieves 3.54, 5.06, 5.02, and 5.11 Frechet Inception Distance (FID) for image generation, with only 10 network evaluations.
♻ ☆ AI-Generated Video Detection via Perceptual Straightening NeurIPS 2025
The rapid advancement of generative AI enables highly realistic synthetic videos, posing significant challenges for content authentication and raising urgent concerns about misuse. Existing detection methods often struggle with generalization and capturing subtle temporal inconsistencies. We propose ReStraV(Representation Straightening Video), a novel approach to distinguish natural from AI-generated videos. Inspired by the "perceptual straightening" hypothesis -- which suggests real-world video trajectories become more straight in neural representation domain -- we analyze deviations from this expected geometric property. Using a pre-trained self-supervised vision transformer (DINOv2), we quantify the temporal curvature and stepwise distance in the model's representation domain. We aggregate statistics of these measures for each video and train a classifier. Our analysis shows that AI-generated videos exhibit significantly different curvature and distance patterns compared to real videos. A lightweight classifier achieves state-of-the-art detection performance (e.g., 97.17% accuracy and 98.63% AUROC on the VidProM benchmark), substantially outperforming existing image- and video-based methods. ReStraV is computationally efficient, it is offering a low-cost and effective detection solution. This work provides new insights into using neural representation geometry for AI-generated video detection.
comment: Advances in Neural Information Processing Systems 38 (NeurIPS 2025)
♻ ☆ A Study in Dataset Distillation for Image Super-Resolution
Dataset distillation aims to compress large datasets into compact yet highly informative subsets that preserve the training behavior of the original data. While this concept has gained traction in classification, its potential for image Super-Resolution (SR) remains largely untapped. In this work, we conduct the first systematic study of dataset distillation for SR, evaluating both pixel- and latent-space formulations. We show that a distilled dataset, occupying only 8.88% of the original size, can train SR models that retain nearly the same reconstruction fidelity as those trained on full datasets. Furthermore, we analyze how initialization strategies and distillation objectives affect efficiency, convergence, and visual quality. Our findings highlight the feasibility of SR dataset distillation and establish foundational insights for memory- and compute-efficient generative restoration models.
♻ ☆ Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Lung Nodule Malignancy Prediction
Machine learning models have utilized semantic features, deep features, or both to assess lung nodule malignancy. However, their reliance on manual annotation during inference, limited interpretability, and sensitivity to imaging variations hinder their application in real-world clinical settings. Thus, this research aims to integrate semantic features derived from radiologists' assessments of nodules, guiding the model to learn clinically relevant, robust, and explainable imaging features for predicting lung cancer. We obtained 938 low-dose CT scans from the National Lung Screening Trial (NLST) with 1,261 nodules and semantic features. Additionally, the Lung Image Database Consortium dataset contains 1,018 CT scans, with 2,625 lesions annotated for nodule characteristics. Three external datasets were obtained from UCLA Health, the LUNGx Challenge, and the Duke Lung Cancer Screening. We fine-tuned a pretrained Contrastive Language-Image Pretraining (CLIP) model with a parameter-efficient fine-tuning approach to align imaging and semantic text features and predict the one-year lung cancer diagnosis. Our model outperformed state-of-the-art (SOTA) models in the NLST test set with an AUROC of 0.901 and AUPRC of 0.776. It also showed robust results in external datasets. Using CLIP, we also obtained predictions on semantic features through zero-shot inference, such as nodule margin (AUROC: 0.807), nodule consistency (0.812), and pleural attachment (0.840). Our approach surpasses the SOTA models in predicting lung cancer across datasets collected from diverse clinical settings, providing explainable outputs, aiding clinicians in comprehending the underlying meaning of model predictions. This approach also prevents the model from learning shortcuts and generalizes across clinical settings. The code is available at https://github.com/luotingzhuang/CLIP_nodule.
♻ ☆ AdFair-CLIP: Adversarial Fair Contrastive Language-Image Pre-training for Chest X-rays MICCAI 2025
Contrastive Language-Image Pre-training (CLIP) models have demonstrated superior performance across various visual tasks including medical image classification. However, fairness concerns, including demographic biases, have received limited attention for CLIP models. This oversight leads to critical issues, particularly those related to race and gender, resulting in disparities in diagnostic outcomes and reduced reliability for underrepresented groups. To address these challenges, we introduce AdFair-CLIP, a novel framework employing adversarial feature intervention to suppress sensitive attributes, thereby mitigating spurious correlations and improving prediction fairness. We conduct comprehensive experiments on chest X-ray (CXR) datasets, and show that AdFair-CLIP significantly enhances both fairness and diagnostic accuracy, while maintaining robust generalization in zero-shot and few-shot scenarios. These results establish new benchmarks for fairness-aware learning in CLIP-based medical diagnostic models, particularly for CXR analysis.
comment: This preprint has been accepted by MICCAI 2025
♻ ☆ REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders
We introduce the Region Encoder Network (REN), a fast and effective model for generating region-based image representations using point prompts. Recent methods combine class-agnostic segmenters (e.g., SAM) with patch-based image encoders (e.g., DINO) to produce compact and effective region representations, but they suffer from high computational cost due to the segmentation step. REN bypasses this bottleneck using a lightweight module that directly generates region tokens, enabling 60x faster token generation with 35x less memory, while also improving token quality. It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values to produce region tokens that correspond to the prompted objects. We train REN with three popular encoders-DINO, DINOv2, and OpenCLIP-and show that it can be extended to other encoders without dedicated training. We evaluate REN on semantic segmentation and retrieval tasks, where it consistently outperforms the original encoders in both performance and compactness, and matches or exceeds SAM-based region methods while being significantly faster. Notably, REN achieves state-of-the-art results on the challenging Ego4D VQ2D benchmark and outperforms proprietary LMMs on Visual Haystacks' single-needle challenge. Code and models are available at: https://github.com/savya08/REN.
♻ ☆ Autoadaptive Medical Segment Anything Model
Medical image segmentation is a key task in the imaging workflow, influencing many image-based decisions. Traditional, fully-supervised segmentation models rely on large amounts of labeled training data, typically obtained through manual annotation, which can be an expensive, time-consuming, and error-prone process. This signals a need for accurate, automatic, and annotation-efficient methods of training these models. We propose ADA-SAM (automated, domain-specific, and adaptive segment anything model), a novel multitask learning framework for medical image segmentation that leverages class activation maps from an auxiliary classifier to guide the predictions of the semi-supervised segmentation branch, which is based on the Segment Anything (SAM) framework. Additionally, our ADA-SAM model employs a novel gradient feedback mechanism to create a learnable connection between the segmentation and classification branches by using the segmentation gradients to guide and improve the classification predictions. We validate ADA-SAM on real-world clinical data collected during rehabilitation trials, and demonstrate that our proposed method outperforms both fully-supervised and semi-supervised baselines by double digits in limited label settings. Our code is available at: https://github.com/tbwa233/ADA-SAM.
comment: 11 pages, 2 figures, 3 tables
♻ ☆ RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision-Language-Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.
♻ ☆ Combinative Matching for Geometric Shape Assembly ICCV 2025
This paper introduces a new shape-matching methodology, combinative matching, to combine interlocking parts for geometric shape assembly. Previous methods for geometric assembly typically rely on aligning parts by finding identical surfaces between the parts as in conventional shape matching and registration. In contrast, we explicitly model two distinct properties of interlocking shapes: 'identical surface shape' and 'opposite volume occupancy.' Our method thus learns to establish correspondences across regions where their surface shapes appear identical but their volumes occupy the inverted space to each other. To facilitate this process, we also learn to align regions in rotation by estimating their shape orientations via equivariant neural networks. The proposed approach significantly reduces local ambiguities in matching and allows a robust combination of parts in assembly. Experimental results on geometric assembly benchmarks demonstrate the efficacy of our method, consistently outperforming the state of the art. Project page: https://nahyuklee.github.io/cmnet.
comment: Accepted to ICCV 2025 (Highlight)
♻ ☆ Audio Driven Real-Time Facial Animation for Social Telepresence SIGGRAPH
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency, designed for social interactions in virtual reality for anyone. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time, which are then decoded as photorealistic 3D facial avatars. Leveraging the generative capabilities of diffusion models, we capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance (<15ms GPU time). Our novel architecture minimizes latency through two key innovations: an online transformer that eliminates dependency on future inputs and a distillation pipeline that accelerates iterative denoising into a single step. We further address critical design challenges in live scenarios for processing continuous audio signals frame-by-frame while maintaining consistent animation quality. The versatility of our framework extends to multimodal applications, including semantic modalities such as emotion conditions and multimodal sensors with head-mounted eye cameras on VR headsets. Experimental results demonstrate significant improvements in facial animation accuracy over existing offline state-of-the-art baselines, achieving 100 to 1000 times faster inference speed. We validate our approach through live VR demonstrations and across various scenarios such as multilingual speeches.
comment: SIGGRAPH Asia 2025. Project page: https://jiyewise.github.io/projects/AudioRTA
♻ ☆ BEN: Using Confidence-Guided Matting for Dichotomous Image Segmentation
Current approaches to dichotomous image segmentation (DIS) treat image matting and object segmentation as fundamentally different tasks. As improvements in image segmentation become increasingly challenging to achieve, combining image matting and grayscale segmentation techniques offers promising new directions for architectural innovation. Inspired by the possibility of aligning these two model tasks, we propose a new architectural approach for DIS called Confidence-Guided Matting (CGM). We created the first CGM model called Background Erase Network (BEN). BEN consists of two components: BEN Base for initial segmentation and BEN Refiner for confidence-based refinement. Our approach achieves substantial improvements over current state-of-the-art methods on the DIS5K validation dataset, demonstrating that matting-based refinement can significantly enhance segmentation quality. This work introduces a new paradigm for integrating matting and segmentation techniques, improving fine-grained object boundary prediction in computer vision.
comment: 6 pages, 2 figures, 3 tables, and 1 algorithms
♻ ☆ Joint Lossless Compression and Steganography for Medical Images via Large Language Models
Recently, large language models (LLMs) have driven promising progress in lossless image compression. However, directly adopting existing paradigms for medical images suffers from an unsatisfactory trade-off between compression performance and efficiency. Moreover, existing LLM-based compressors often overlook the security of the compression process, which is critical in modern medical scenarios. To this end, we propose a novel joint lossless compression and steganography framework. Inspired by bit plane slicing (BPS), we find it feasible to securely embed privacy messages into medical images in an invisible manner. Based on this insight, an adaptive modalities decomposition strategy is first devised to partition the entire image into two segments, providing global and local modalities for subsequent dual-path lossless compression. During this dual-path stage, we innovatively propose a segmented message steganography algorithm within the local modality path to ensure the security of the compression process. Coupled with the proposed anatomical priors-based low-rank adaptation (A-LoRA) fine-tuning strategy, extensive experimental results demonstrate the superiority of our proposed method in terms of compression ratios, efficiency, and security. The source code will be made publicly available.
♻ ☆ OpenFACADES: An Open Framework for Architectural Caption and Attribute Data Enrichment via Street View Imagery
Building properties, such as height, usage, and material, play a crucial role in spatial data infrastructures, supporting various urban applications. Despite their importance, comprehensive building attribute data remain scarce in many urban areas. Recent advances have enabled the extraction of objective building attributes using remote sensing and street-level imagery. However, establishing a pipeline that integrates diverse open datasets, acquires holistic building imagery, and infers comprehensive building attributes at scale remains a significant challenge. Among the first, this study bridges the gaps by introducing OpenFACADES, an open framework that leverages multimodal crowdsourced data to enrich building profiles with both objective attributes and semantic descriptors through multimodal large language models. First, we integrate street-level image metadata from Mapillary with OpenStreetMap geometries via isovist analysis, identifying images that provide suitable vantage points for observing target buildings. Second, we automate the detection of building facades in panoramic imagery and tailor a reprojection approach to convert objects into holistic perspective views that approximate real-world observation. Third, we introduce an innovative approach that harnesses and investigates the capabilities of open-source large vision-language models (VLMs) for multi-attribute prediction and open-vocabulary captioning in building-level analytics, leveraging a globally sourced dataset of 31,180 labeled images from seven cities. Evaluation shows that fine-tuned VLM excel in multi-attribute inference, outperforming single-attribute computer vision models and zero-shot ChatGPT-4o. Further experiments confirm its superior generalization and robustness across culturally distinct region and varying image conditions.
♻ ☆ Multi-scale Latent Point Consistency Models for 3D Shape Generation
Consistency Models (CMs) have significantly accelerated the sampling process in diffusion models, yielding impressive results in synthesizing high-resolution images. To explore and extend these advancements to point-cloud-based 3D shape generation, we propose a novel Multi-scale Latent Point Consistency Model (MLPCM). Our MLPCM follows a latent diffusion framework and introduces hierarchical levels of latent representations, ranging from point-level to super-point levels, each corresponding to a different spatial resolution. We design a multi-scale latent integration module along with 3D spatial attention to effectively denoise the point-level latent representations conditioned on those from multiple super-point levels. Additionally, we propose a latent consistency model, learned through consistency distillation, that compresses the prior into a one-step generator. This significantly improves sampling efficiency while preserving the performance of the original teacher model. Extensive experiments on standard benchmarks ShapeNet and ShapeNet-Vol demonstrate that MLPCM achieves a 100x speedup in the generation process, while surpassing state-of-the-art diffusion models in terms of both shape quality and diversity.
♻ ☆ Robust Atypical Mitosis Classification with DenseNet121: Stain-Aware Augmentation and Hybrid Loss for Domain Generalization MICCAI
Atypical mitotic figures are important biomarkers of tumor aggressiveness in histopathology, yet reliable recognition remains challenging due to severe class imbalance and variability across imaging domains. We present a DenseNet-121-based framework tailored for atypical mitosis classification in the MIDOG 2025 (Track 2) setting. Our method integrates stain-aware augmentation (Macenko), geometric and intensity transformations, and imbalance-aware learning via weighted sampling with a hybrid objective combining class-weighted binary cross-entropy and focal loss. Trained end-to-end with AdamW and evaluated across multiple independent domains, the model demonstrates strong generalization under scanner and staining shifts, achieving balanced accuracy 85.0%, AUROC 0.927, sensitivity 89.2%, and specificity 80.9% on the official test set. These results indicate that combining DenseNet-121 with stain-aware augmentation and imbalance-adaptive objectives yields a robust, domain-generalizable framework for atypical mitosis classification suitable for real-world computational pathology workflows.
comment: MIDOG 2025 MICCAI Workshop accepted
♻ ☆ A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift
Transformer-based architectures have recently advanced the image reconstruction quality of super-resolution (SR) models. Yet, their scalability remains limited by quadratic attention costs and coarse patch embeddings that weaken pixel-level fidelity. We propose TaylorIR, a plug-and-play framework that enforces 1x1 patch embeddings for true pixel-wise reasoning and replaces conventional self-attention with TaylorShift, a Taylor-series-based attention mechanism enabling full token interactions with near-linear complexity. Across multiple SR benchmarks, TaylorIR delivers state-of-the-art performance while reducing memory consumption by up to 60%, effectively bridging the gap between fine-grained detail restoration and efficient transformer scaling.
♻ ☆ Knolling Bot: Teaching Robots the Human Notion of Tidiness NeurIPS 2025
For robots to truly collaborate and assist humans, they must understand not only logic and instructions, but also the subtle emotions, aesthetics, and feelings that define our humanity. Human art and aesthetics are among the most elusive concepts-often difficult even for people to articulate-and without grasping these fundamentals, robots will be unable to help in many spheres of daily life. Consider the long-promised robotic butler: automating domestic chores demands more than motion planning. It requires an internal model of cleanliness and tidiness-a challenge largely unexplored by AI. To bridge this gap, we propose an approach that equips domestic robots to perform simple tidying tasks via knolling, the practice of arranging scattered items into neat, space-efficient layouts. Unlike the uniformity of industrial settings, household environments feature diverse objects and highly subjective notions of tidiness. Drawing inspiration from NLP, we treat knolling as a sequential prediction problem and employ a transformer based model to forecast each object's placement. Our method learns a generalizable concept of tidiness, generates diverse solutions adaptable to varying object sets, and incorporates human preferences for personalized arrangements. This work represents a step forward in building robots that internalize human aesthetic sense and can genuinely co-create in our living spaces.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Creative AI Track
♻ ☆ mmCooper: A Multi-agent Multi-stage Communication-efficient and Collaboration-robust Cooperative Perception Framework
Collaborative perception significantly enhances individual vehicle perception performance through the exchange of sensory information among agents. However, real-world deployment faces challenges due to bandwidth constraints and inevitable calibration errors during information exchange. To address these issues, we propose mmCooper, a novel multi-agent, multi-stage, communication-efficient, and collaboration-robust cooperative perception framework. Our framework leverages a multi-stage collaboration strategy that dynamically and adaptively balances intermediate- and late-stage information to share among agents, enhancing perceptual performance while maintaining communication efficiency. To support robust collaboration despite potential misalignments and calibration errors, our framework prevents misleading low-confidence sensing information from transmission and refines the received detection results from collaborators to improve accuracy. The extensive evaluation results on both real-world and simulated datasets demonstrate the effectiveness of the mmCooper framework and its components.
♻ ☆ FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views
We present FLARE, a feed-forward model designed to infer high-quality camera poses and 3D geometry from uncalibrated sparse-view images (i.e., as few as 2-8 inputs), which is a challenging yet practical setting in real-world applications. Our solution features a cascaded learning paradigm with camera pose serving as the critical bridge, recognizing its essential role in mapping 3D structures onto 2D image planes. Concretely, FLARE starts with camera pose estimation, whose results condition the subsequent learning of geometric structure and appearance, optimized through the objectives of geometry reconstruction and novel-view synthesis. Utilizing large-scale public datasets for training, our method delivers state-of-the-art performance in the tasks of pose estimation, geometry reconstruction, and novel view synthesis, while maintaining the inference efficiency (i.e., less than 0.5 seconds). The project page and code can be found at: https://zhanghe3z.github.io/FLARE/
♻ ☆ Anti-Aliased 2D Gaussian Splatting NeurIPS 2025
2D Gaussian Splatting (2DGS) has recently emerged as a promising method for novel view synthesis and surface reconstruction, offering better view-consistency and geometric accuracy than volumetric 3DGS. However, 2DGS suffers from severe aliasing artifacts when rendering at different sampling rates than those used during training, limiting its practical applications in scenarios requiring camera zoom or varying fields of view. We identify that these artifacts stem from two key limitations: the lack of frequency constraints in the representation and an ineffective screen-space clamping approach. To address these issues, we present AA-2DGS, an anti-aliased formulation of 2D Gaussian Splatting that maintains its geometric benefits while significantly enhancing rendering quality across different scales. Our method introduces a world-space flat smoothing kernel that constrains the frequency content of 2D Gaussian primitives based on the maximal sampling frequency from training views, effectively eliminating high-frequency artifacts when zooming in. Additionally, we derive a novel object-space Mip filter by leveraging an affine approximation of the ray-splat intersection mapping, which allows us to efficiently apply proper anti-aliasing directly in the local space of each splat.
comment: NeurIPS 2025. Code will be available at https://github.com/maeyounes/AA-2DGS
♻ ☆ Keep It Real: Challenges in Attacking Compression-Based Adversarial Purification NeurIPS 2025
Previous work has suggested that preprocessing images through lossy compression can defend against adversarial perturbations, but comprehensive attack evaluations have been lacking. In this paper, we construct strong white-box and adaptive attacks against various compression models and identify a critical challenge for attackers: high realism in reconstructed images significantly increases attack difficulty. Through rigorous evaluation across multiple attack scenarios, we demonstrate that compression models capable of producing realistic, high-fidelity reconstructions are substantially more resistant to our attacks. In contrast, low-realism compression models can be broken. Our analysis reveals that this is not due to gradient masking. Rather, realistic reconstructions maintaining distributional alignment with natural images seem to offer inherent robustness. This work highlights a significant obstacle for future adversarial attacks and suggests that developing more effective techniques to overcome realism represents an essential challenge for comprehensive security evaluation.
comment: Accepted to the Reliable ML from Unreliable Data workshop at NeurIPS 2025 (ReliableML@NeurIPS)
♻ ☆ SViM3D: Stable Video Material Diffusion for Single Image 3D Generation ICCV 2025
We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. Recently, video diffusion models have been successfully used to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or needs to be estimated in additional steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially varying PBR parameters and surface normals jointly with each generated view based on explicit camera control. This unique setup allows for relighting and generating a 3D asset using our model as neural prior. We introduce various mechanisms to this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games and other visual media.
comment: Accepted by International Conference on Computer Vision (ICCV 2025). Project page: http://svim3d.aengelhardt.com
♻ ☆ Spike Imaging Velocimetry: Dense Motion Estimation of Fluids Using Spike Cameras
The need for accurate and non-intrusive flow measurement methods has led to the widespread adoption of Particle Image Velocimetry (PIV), a powerful diagnostic tool in fluid motion estimation. This study investigates the tremendous potential of spike cameras (a type of ultra-high-speed, high-dynamic-range camera) in PIV. We propose a deep learning framework, Spike Imaging Velocimetry (SIV), designed specifically for highly turbulent and intricate flow fields. To aggregate motion features from the spike stream while minimizing information loss, we incorporate a Detail-Preserving Hierarchical Transform (DPHT) module. Additionally, we introduce a Graph Encoder (GE) to extract contextual features from highly complex fluid flows. Furthermore, we present a spike-based PIV dataset, Particle Scenes with Spike and Displacement (PSSD), which provides labeled data for three challenging fluid dynamics scenarios. Our proposed method achieves superior performance compared to existing baseline methods on PSSD. The datasets and our implementation of SIV are open-sourced in the supplementary materials.
♻ ☆ FIPER: Factorized Features for Robust Image Super-Resolution and Compression NeurIPS 2025
In this work, we propose using a unified representation, termed Factorized Features, for low-level vision tasks, where we test on Single Image Super-Resolution (SISR) and \textbf{Image Compression}. Motivated by the shared principles between these tasks, they require recovering and preserving fine image details, whether by enhancing resolution for SISR or reconstructing compressed data for Image Compression. Unlike previous methods that mainly focus on network architecture, our proposed approach utilizes a basis-coefficient decomposition as well as an explicit formulation of frequencies to capture structural components and multi-scale visual features in images, which addresses the core challenges of both tasks. We replace the representation of prior models from simple feature maps with Factorized Features to validate the potential for broad generalizability. In addition, we further optimize the compression pipeline by leveraging the mergeable-basis property of our Factorized Features, which consolidates shared structures on multi-frame compression. Extensive experiments show that our unified representation delivers state-of-the-art performance, achieving an average relative improvement of 204.4% in PSNR over the baseline in Super-Resolution (SR) and 9.35% BD-rate reduction in Image Compression compared to the previous SOTA. Project page: https://jayisaking.github.io/FIPER/
comment: NeurIPS 2025. Project page: https://jayisaking.github.io/FIPER/
♻ ☆ Policy Optimized Text-to-Image Pipeline Design
Text-to-image generation has evolved beyond single monolithic models to complex multi-component pipelines. These combine fine-tuned generators, adapters, upscaling blocks and even editing steps, leading to significant improvements in image quality. However, their effective design requires substantial expertise. Recent approaches have shown promise in automating this process through large language models (LLMs), but they suffer from two critical limitations: extensive computational requirements from generating images with hundreds of predefined pipelines, and poor generalization beyond memorized training examples. We introduce a novel reinforcement learning-based framework that addresses these inefficiencies. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations, eliminating the need for costly image generation during training. We then implement a two-phase training strategy: initial workflow vocabulary training followed by GRPO-based optimization that guides the model toward higher-performing regions of the workflow space. Additionally, we incorporate a classifier-free guidance based enhancement technique that extrapolates along the path between the initial and GRPO-tuned models, further improving output quality. We validate our approach through a set of comparisons, showing that it can successfully create new flows with greater diversity and lead to superior image quality compared to existing baselines.
♻ ☆ VideoExplorer: Think With Videos For Agentic Long-Video Understanding
Long-video understanding~(LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of ``thinking with video'', which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository(https://github.com/yhy-2000/VideoDeepResearch).
♻ ☆ OneVision: An End-to-End Generative Framework for Multi-view E-commerce Vision Search
Traditional vision search, similar to search and recommendation systems, follows the multi-stage cascading architecture (MCA) paradigm to balance efficiency and conversion. Specifically, the query image undergoes feature extraction, recall, pre-ranking, and ranking stages, ultimately presenting the user with semantically similar products that meet their preferences. This multi-view representation discrepancy of the same object in the query and the optimization objective collide across these stages, making it difficult to achieve Pareto optimality in both user experience and conversion. In this paper, an end-to-end generative framework, OneVision, is proposed to address these problems. OneVision builds on VRQ, a vision-aligned residual quantization encoding, which can align the vastly different representations of an object across multiple viewpoints while preserving the distinctive features of each product as much as possible. Then a multi-stage semantic alignment scheme is adopted to maintain strong visual similarity priors while effectively incorporating user-specific information for personalized preference generation. In offline evaluations, OneVision performs on par with online MCA, while improving inference efficiency by 21% through dynamic pruning. In A/B tests, it achieves significant online improvements: +2.15% item CTR, +2.27% CVR, and +3.12% order volume. These results demonstrate that a semantic ID centric, generative architecture can unify retrieval and personalization while simplifying the serving pathway.
comment: Some of the online experimental results in the paper are significantly different from the actual results, and need to be re-experimented and revised before submission. The current version is prone to misunderstanding
♻ ☆ A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation
Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent \textit{multi-step iterations} and \textit{complex backbone networks} lead to prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, \textbf{Diffusion Caching} offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process. By enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching evolves from \textit{static reuse} to \textit{dynamic prediction}. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time and efficient generative AI, injecting new vitality into both theory and practice of \textit{Efficient Generative Intelligence}.
comment: 22 pages,2 figures
♻ ☆ Consistent Supervised-Unsupervised Alignment for Generalized Category Discovery NeurIPS 2025
Generalized Category Discovery (GCD) focuses on classifying known categories while simultaneously discovering novel categories from unlabeled data. However, previous GCD methods face challenges due to inconsistent optimization objectives and category confusion. This leads to feature overlap and ultimately hinders performance on novel categories. To address these issues, we propose the Neural Collapse-inspired Generalized Category Discovery (NC-GCD) framework. By pre-assigning and fixing Equiangular Tight Frame (ETF) prototypes, our method ensures an optimal geometric structure and a consistent optimization objective for both known and novel categories. We introduce a Consistent ETF Alignment Loss that unifies supervised and unsupervised ETF alignment and enhances category separability. Additionally, a Semantic Consistency Matcher (SCM) is designed to maintain stable and consistent label assignments across clustering iterations. Our method achieves strong performance on multiple GCD benchmarks, significantly enhancing novel category accuracy and demonstrating its effectiveness.
comment: Accepted by NeurIPS 2025
♻ ☆ PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.
comment: 22 pages, 13 figures
♻ ☆ Style-Aware Blending and Prototype-Based Cross-Contrast Consistency for Semi-Supervised Medical Image Segmentation
Weak-strong consistency learning strategies are widely employed in semi-supervised medical image segmentation to train models by leveraging limited labeled data and enforcing weak-to-strong consistency. However, existing methods primarily focus on designing and combining various perturbation schemes, overlooking the inherent potential and limitations within the framework itself. In this paper, we first identify two critical deficiencies: (1) separated training data streams, which lead to confirmation bias dominated by the labeled stream; and (2) incomplete utilization of supervisory information, which limits exploration of strong-to-weak consistency. To tackle these challenges, we propose a style-aware blending and prototype-based cross-contrast consistency learning framework. Specifically, inspired by the empirical observation that the distribution mismatch between labeled and unlabeled data can be characterized by statistical moments, we design a style-guided distribution blending module to break the independent training data streams. Meanwhile, considering the potential noise in strong pseudo-labels, we introduce a prototype-based cross-contrast strategy to encourage the model to learn informative supervisory signals from both weak-to-strong and strong-to-weak predictions, while mitigating the adverse effects of noise. Experimental results demonstrate the effectiveness and superiority of our framework across multiple medical segmentation benchmarks under various semi-supervised settings.
♻ ☆ Exploring Effective Factors for Improving Visual In-Context Learning
The In-Context Learning (ICL) is to understand a new task via a few demonstrations (aka. prompt) and predict new inputs without tuning the models. While it has been widely studied in NLP, it is still a relatively new area of research in computer vision. To reveal the factors influencing the performance of visual in-context learning, this paper shows that prompt selection and prompt fusion are two major factors that have a direct impact on the inference performance of visual context learning. Prompt selection is the process of identifying the most appropriate prompt or example to help the model understand new tasks. This is important because providing the model with relevant prompts can help it learn more effectively and efficiently. Prompt fusion involves combining knowledge from different positions within the large-scale visual model. By doing this, the model can leverage the diverse knowledge stored in different parts of the model to improve its performance on new tasks. Based these findings, we propose a simple framework prompt-SelF for visual in-context learning. Specifically, we first use the pixel-level retrieval method to select a suitable prompt, and then use different prompt fusion methods to activate all the knowledge stored in the large-scale model, and finally ensemble the prediction results obtained from different prompt fusion methods to obtain the final prediction results. And we conduct extensive experiments on single-object segmentation and detection tasks to demonstrate the effectiveness of prompt-SelF. Remarkably, the prompt-SelF has outperformed OSLSM based meta-learning in 1-shot segmentation for the first time. This indicated the great potential of visual in-context learning. The source code and models will be available at https://github.com/syp2ysy/prompt-SelF.
♻ ☆ VRP-SAM: SAM with Visual Reference Prompt CVPR 2024
In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation, creating the VRP-SAM model. In essence, VRP-SAM can utilize annotated reference images to comprehend specific objects and perform segmentation of specific objects in target image. It is note that the VRP encoder can support a variety of annotation formats for reference images, including \textbf{point}, \textbf{box}, \textbf{scribble}, and \textbf{mask}. VRP-SAM achieves a breakthrough within the SAM framework by extending its versatility and applicability while preserving SAM's inherent strengths, thus enhancing user-friendliness. To enhance the generalization ability of VRP-SAM, the VRP encoder adopts a meta-learning strategy. To validate the effectiveness of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO datasets. Remarkably, VRP-SAM achieved state-of-the-art performance in visual reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM demonstrates strong generalization capabilities, allowing it to perform segmentation of unseen objects and enabling cross-domain segmentation. The source code and models will be available at https://github.com/syp2ysy/VRP-SAM
comment: Accepted by CVPR 2024; The camera-ready version
♻ ☆ MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) struggle frequently with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM with this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence gathered during the interactive exploration. Without any fine-tuning, our MindJourney achieves over an average 7.7% performance boost on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Meanwhile, our method also improves upon the test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of our method that utilizes world models for test-time scaling.
comment: Project Page: https://umass-embodied-agi.github.io/MindJourney
♻ ☆ Distribution-aware Knowledge Unification and Association for Non-exemplar Lifelong Person Re-identification
Lifelong person re-identification (LReID) encounters a key challenge: balancing the preservation of old knowledge with adaptation to new information. Existing LReID methods typically employ knowledge distillation to enforce representation alignment. However, these approaches ignore two crucial aspects: specific distribution awareness and cross-domain unified knowledge learning, both of which are essential for addressing this challenge. To overcome these limitations, we propose a novel distribution-aware knowledge unification and association (DKUA) framework where domain-style modeling is performed for each instance to propagate domain-specific representations, enhancing anti-forgetting and generalization capacity. Specifically, we design a distribution-aware model to transfer instance-level representations of the current domain into the domain-specific representations with the different domain styles, preserving learned knowledge without storing old samples. Next, we propose adaptive knowledge consolidation (AKC) to dynamically generate the unified representation as a cross-domain representation center. To further mitigate forgetting, we develop a unified knowledge association (UKA) mechanism, which explores the unified representation as a bridge to explicitly model inter-domain associations, reducing inter-domain gaps. Finally, distribution-based knowledge transfer (DKT) is proposed to prevent the current domain distribution from deviating from the cross-domain distribution center, improving adaptation capacity. Experimental results show our DKUA outperforms the existing methods by 7.6%/5.3% average mAP/R@1 improvement on anti-forgetting and generalization capacity, respectively. Our code is available at https://github.com/LiuShiBen/DKUA.
comment: 11 papges, 6 figures
♻ ☆ TESGNN: Temporal Equivariant Scene Graph Neural Networks for Efficient and Robust Multi-View 3D Scene Understanding
Scene graphs have proven to be highly effective for various scene understanding tasks due to their compact and explicit representation of relational information. However, current methods often overlook the critical importance of preserving symmetry when generating scene graphs from 3D point clouds, which can lead to reduced accuracy and robustness, particularly when dealing with noisy, multi-view data. Furthermore, a major limitation of prior approaches is the lack of temporal modeling to capture time-dependent relationships among dynamically evolving entities in a scene. To address these challenges, we propose Temporal Equivariant Scene Graph Neural Network (TESGNN), consisting of two key components: (1) an Equivariant Scene Graph Neural Network (ESGNN), which extracts information from 3D point clouds to generate scene graph while preserving crucial symmetry properties, and (2) a Temporal Graph Matching Network, which fuses scene graphs generated by ESGNN across multiple time sequences into a unified global representation using an approximate graph-matching algorithm. Our combined architecture TESGNN shown to be effective compared to existing methods in scene graph generation, achieving higher accuracy and faster training convergence. Moreover, we show that leveraging the symmetry-preserving property produces a more stable and accurate global scene representation compared to existing approaches. Finally, it is computationally efficient and easily implementable using existing frameworks, making it well-suited for real-time applications in robotics and computer vision. This approach paves the way for more robust and scalable solutions to complex multi-view scene understanding challenges. Our source code is publicly available at: https://github.com/HySonLab/TESGraph
comment: arXiv admin note: text overlap with arXiv:2407.00609
♻ ☆ Diffusion Classifiers Understand Compositionality, but Conditions Apply NeurIPS 2025
Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary due to a small number of benchmarks and a relatively shallow analysis of conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark \textsc{Self-Bench} comprised of images created by diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at https://github.com/eugene6923/Diffusion-Classifiers-Compositionality.
comment: NeurIPS 2025 Datasets and Benchmarks
♻ ☆ Space Object Detection using Multi-frame Temporal Trajectory Completion Method
Space objects in Geostationary Earth Orbit (GEO) present significant detection challenges in optical imaging due to weak signals, complex stellar backgrounds, and environmental interference. In this paper, we enhance high-frequency features of GEO targets while suppressing background noise at the single-frame level through wavelet transform. Building on this, we propose a multi-frame temporal trajectory completion scheme centered on the Hungarian algorithm for globally optimal cross-frame matching. To effectively mitigate missing and false detections, a series of key steps including temporal matching and interpolation completion, temporal-consistency-based noise filtering, and progressive trajectory refinement are designed in the post-processing pipeline. Experimental results on the public SpotGEO dataset demonstrate the effectiveness of the proposed method, achieving an F_1 score of 90.14%.
♻ ☆ Split Gibbs Discrete Diffusion Posterior Sampling NeurIPS 2025
We study the problem of posterior sampling in discrete-state spaces using discrete diffusion models. While posterior sampling methods for continuous diffusion models have achieved remarkable progress, analogous methods for discrete diffusion models remain challenging. In this work, we introduce a principled plug-and-play discrete diffusion posterior sampling algorithm based on split Gibbs sampling, which we call SGDD. Our algorithm enables reward-guided generation and solving inverse problems in discrete-state spaces. We demonstrate the convergence of SGDD to the target posterior distribution and verify this through controlled experiments on synthetic benchmarks. Our method enjoys state-of-the-art posterior sampling performance on a range of benchmarks for discrete data, including DNA sequence design, discrete image inverse problems, and music infilling, achieving more than 30% improved performance compared to existing baselines. Our code is available at https://github.com/chuwd19/Split-Gibbs-Discrete-Diffusion-Posterior-Sampling.
comment: Accepted to NeurIPS 2025
♻ ☆ Scalable Autoregressive Image Generation with Mamba
We introduce AiM, an autoregressive (AR) image generative model based on Mamba architecture. AiM employs Mamba, a novel state-space model characterized by its exceptional performance for long-sequence modeling with linear time complexity, to supplant the commonly utilized Transformers in AR image generation models, aiming to achieve both superior generation quality and enhanced inference speed. Unlike existing methods that adapt Mamba to handle two-dimensional signals via multi-directional scan, AiM directly utilizes the next-token prediction paradigm for autoregressive image generation. This approach circumvents the need for extensive modifications to enable Mamba to learn 2D spatial representations. By implementing straightforward yet strategically targeted modifications for visual generative tasks, we preserve Mamba's core structure, fully exploiting its efficient long-sequence modeling capabilities and scalability. We provide AiM models in various scales, with parameter counts ranging from 148M to 1.3B. On the ImageNet1K 256*256 benchmark, our best AiM model achieves a FID of 2.21, surpassing all existing AR models of comparable parameter counts and demonstrating significant competitiveness against diffusion models, with 2 to 10 times faster inference speed. Code is available at https://github.com/hp-l33/AiM
comment: 9 pages, 8 figures
♻ ☆ Lattice Boltzmann Model for Learning Real-World Pixel Dynamicity NeurIPS 2025
This work proposes the Lattice Boltzmann Model (LBM) to learn real-world pixel dynamicity for visual tracking. LBM decomposes visual representations into dynamic pixel lattices and solves pixel motion states through collision-streaming processes. Specifically, the high-dimensional distribution of the target pixels is acquired through a multilayer predict-update network to estimate the pixel positions and visibility. The predict stage formulates lattice collisions among the spatial neighborhood of target pixels and develops lattice streaming within the temporal visual context. The update stage rectifies the pixel distributions with online visual representations. Compared with existing methods, LBM demonstrates practical applicability in an online and real-time manner, which can efficiently adapt to real-world visual tracking tasks. Comprehensive evaluations of real-world point tracking benchmarks such as TAP-Vid and RoboTAP validate LBM's efficiency. A general evaluation of large-scale open-world object tracking benchmarks such as TAO, BFT, and OVT-B further demonstrates LBM's real-world practicality.
comment: NeurIPS 2025. Project page: https://george-zhuang.github.io/lbm/
♻ ☆ Efficiency vs. Efficacy: Assessing the Compression Ratio-Dice Score Relationship through a Simple Benchmarking Framework for Cerebrovascular 3D Segmentation
The increasing size and complexity of medical imaging datasets, particularly in 3D formats, present significant barriers to collaborative research and transferability. This study investigates whether the ZFP compression technique can mitigate these challenges without compromising the performance of automated cerebrovascular segmentation, a critical first step in intracranial aneurysm detection. We apply ZFP in both its error tolerance and fixed-rate modes to a large scale, and one of the most recent, datasets in the literature, 3D medical dataset containing ground-truth vascular segmentations. The segmentation quality on the compressed volumes is rigorously compared to the uncompressed baseline (Dice approximately equals 0.8774). Our findings reveal that ZFP can achieve substantial data reduction--up to a 22.89:1 ratio in error tolerance mode--while maintaining a high degree of fidelity, with the mean Dice coefficient remaining high at 0.87656. These results demonstrate that ZFP is a viable and powerful tool for enabling more efficient and accessible research on large-scale medical datasets, fostering broader collaboration across the community.
♻ ☆ Targeted Attack Improves Protection against Unauthorized Diffusion Customization ICLR 2025
Diffusion models build a new milestone for image generation yet raising public concerns, for they can be fine-tuned on unauthorized images for customization. Protection based on adversarial attacks rises to encounter this unauthorized diffusion customization, by adding protective watermarks to images and poisoning diffusion models. However, current protection, leveraging untargeted attacks, does not appear to be effective enough. In this paper, we propose a simple yet effective improvement for the protection against unauthorized diffusion customization by introducing targeted attacks. We show that by carefully selecting the target, targeted attacks significantly outperform untargeted attacks in poisoning diffusion models and degrading the customization image quality. Extensive experiments validate the superiority of our method on two mainstream customization methods of diffusion models, compared to existing protections. To explain the surprising success of targeted attacks, we delve into the mechanism of attack-based protections and propose a hypothesis based on our observation, which enhances the comprehension of attack-based protections. To the best of our knowledge, we are the first to both reveal the vulnerability of diffusion models to targeted attacks and leverage targeted attacks to enhance protection against unauthorized diffusion customization. Our code is available on GitHub: https://github.com/psyker-team/mist-v2.
comment: ICLR 2025 (Spotlight)
Machine Learning 4
♻ ☆ Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models
Diffusion Large Language Models (DLLMs) have emerged as a new paradigm of language modeling beyond autoregressive next-token prediction. Thanks to their bidirectional attention mechanism, DLLMs are more capable of capturing the connection of context, and thus show unique advantages in challenges like the famous "reversal curse" or learning under data-constrained scenarios. In addition, taking advantage of their inherent modeling foundations, DLLMs have the great potential of efficient inference with parallel decoding algorithms, which enable multi-token prediction per step. However, the high generation quality often requires the number of decoding steps equal to the sequence length, which performs a one-token-per-step decoding, and existing parallel decoding algorithms, which yield suboptimal decoding paths, bring inference speedup at the cost of non-negligible performance degradation. To overcome this challenge, we introduce Free Draft-and-Verification (FreeDave), a novel fast decoding algorithm tailored for DLLMs that achieves lossless parallel decoding without any model modification or extra modules. Specifically, we propose an algorithm of parallel-decoded candidate generation and verification, which is theoretically guaranteed to use the fewest model forward calls to reproduce the same sequence generated by static decoding when enough computation and memory budget is provided. By extensive evaluations on math reasoning and code generation benchmarks across different DLLMs, FreeDave is proven to boost the inference throughput up to $3.78\times$ without performance degradation.
♻ ☆ Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks
Large language models remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. Defending against novel jailbreaks represents a critical challenge in AI safety. Adversarial training -- designed to make models robust against worst-case perturbations -- has been the dominant paradigm for adversarial robustness. However, due to optimization challenges and difficulties in defining realistic threat models, adversarial training methods often fail on newly developed jailbreaks in practice. This paper proposes a new paradigm for improving robustness against unseen jailbreaks, centered on the Adversarial D\'ej\`a Vu hypothesis: novel jailbreaks are not fundamentally new, but largely recombinations of adversarial skills from previous attacks. We study this hypothesis through a large-scale analysis of 32 attack papers published over two years. Using an automated pipeline, we extract and compress adversarial skills into a sparse dictionary of primitives, with LLMs generating human-readable descriptions. Our analysis reveals that unseen attacks can be effectively explained as sparse compositions of earlier skills, with explanatory power increasing monotonically as skill coverage grows. Guided by this insight, we introduce Adversarial Skill Compositional Training (ASCoT), which trains on diverse compositions of skill primitives rather than isolated attack instances. ASCoT substantially improves robustness to unseen attacks, including multi-turn jailbreaks, while maintaining low over-refusal rates. We also demonstrate that expanding adversarial skill coverage, not just data scale, is key to defending against novel attacks. \textcolor{red}{\textbf{Warning: This paper contains content that may be harmful or offensive in nature.
♻ ☆ CausalARC: Abstract Reasoning with Causal World Models
On-the-fly reasoning often requires adaptation to novel problems under limited data and distribution shift. This work introduces CausalARC: an experimental testbed for AI reasoning in low-data and out-of-distribution regimes, modeled after the Abstraction and Reasoning Corpus (ARC). Each CausalARC reasoning task is sampled from a fully specified causal world model, formally expressed as a structural causal model. Principled data augmentations provide observational, interventional, and counterfactual feedback about the world model in the form of few-shot, in-context learning demonstrations. As a proof-of-concept, we illustrate the use of CausalARC for four language model evaluation settings: (1) abstract reasoning with test-time training, (2) counterfactual reasoning with in-context learning, (3) program synthesis, and (4) causal discovery with logical reasoning. Within- and between-model performance varied heavily across tasks, indicating room for significant improvement in language model reasoning.
comment: Peer-reviewed workshop paper
♻ ☆ LLM Strategic Reasoning: Agentic Study through Behavioral Game Theory NeurIPS 2025
Strategic decision-making involves interactive reasoning where agents adapt their choices in response to others, yet existing evaluations of large language models (LLMs) often emphasize Nash Equilibrium (NE) approximation, overlooking the mechanisms driving their strategic choices. To bridge this gap, we introduce an evaluation framework grounded in behavioral game theory, disentangling reasoning capability from contextual effects. Testing 22 state-of-the-art LLMs, we find that GPT-o3-mini, GPT-o1, and DeepSeek-R1 dominate most games yet also demonstrate that the model scale alone does not determine performance. In terms of prompting enhancement, Chain-of-Thought (CoT) prompting is not universally effective, as it increases strategic reasoning only for models at certain levels while providing limited gains elsewhere. Additionally, we investigate the impact of encoded demographic features on the models, observing that certain assignments impact the decision-making pattern. For instance, GPT-4o shows stronger strategic reasoning with female traits than males, while Gemma assigns higher reasoning levels to heterosexual identities compared to other sexual orientations, indicating inherent biases. These findings underscore the need for ethical standards and contextual alignment to balance improved reasoning with fairness.
comment: Accepted by NeurIPS 2025
Human-Computer Interaction 16
☆ Teaching LLMs to See and Guide: Context-Aware Real-Time Assistance in Augmented Reality
The growing adoption of augmented and virtual reality (AR and VR) technologies in industrial training and on-the-job assistance has created new opportunities for intelligent, context-aware support systems. As workers perform complex tasks guided by AR and VR, these devices capture rich streams of multimodal data, including gaze, hand actions, and task progression, that can reveal user intent and task state in real time. Leveraging this information effectively remains a major challenge. In this work, we present a context-aware large language model (LLM) assistant that integrates diverse data modalities, such as hand actions, task steps, and dialogue history, into a unified framework for real-time question answering. To systematically study how context influences performance, we introduce an incremental prompting framework, where each model version receives progressively richer contextual inputs. Using the HoloAssist dataset, which records AR-guided task executions, we evaluate how each modality contributes to the assistant's effectiveness. Our experiments show that incorporating multimodal context significantly improves the accuracy and relevance of responses. These findings highlight the potential of LLM-driven multimodal integration to enable adaptive, intuitive assistance for AR and VR-based industrial training and assistance.
comment: This work is intended for submission to the IEEE Transactions on Systems, Man, and Cybernetics: Systems for possible publication
☆ A Voice-Enabled Virtual Patient System for Interactive Training in Standardized Clinical Assessment
Training mental health clinicians to conduct standardized clinical assessments is challenging due to a lack of scalable, realistic practice opportunities, which can impact data quality in clinical trials. To address this gap, we introduce a voice-enabled virtual patient simulation system powered by a large language model (LLM). This study describes the system's development and validates its ability to generate virtual patients who accurately adhere to pre-defined clinical profiles, maintain coherent narratives, and produce realistic dialogue. We implemented a system using a LLM to simulate patients with specified symptom profiles, demographics, and communication styles. The system was evaluated by 5 experienced clinical raters who conducted 20 simulated structured MADRS interviews across 4 virtual patient personas. The virtual patients demonstrated strong adherence to their clinical profiles, with a mean item difference between rater-assigned MADRS scores and configured scores of 0.52 (SD=0.75). Inter-rater reliability across items was 0.90 (95% CI=0.68-0.99). Expert raters consistently rated the qualitative realism and cohesiveness of the virtual patients favorably, giving average ratings between "Agree" and "Strongly Agree." Our findings suggest that LLM-powered virtual patient simulations are a viable and scalable tool for training clinicians, capable of producing high-fidelity, clinically relevant practice scenarios.
☆ Measuring Machine Companionship: Scale Development and Validation for AI Companions
The mainstreaming of companionable machines--customizable artificial agents designed to participate in ongoing, idiosyncratic, socioemotional relationships--is met with relative theoretical and empirical disarray, according to recent systematic reviews. In particular, the conceptualization and measurement of machine companionship (MC) is inconsistent or sometimes altogether missing. This study starts to bridge that gap by developing and initially validating a novel measurement to capture MC experiences--the unfolding, autotelic, positively experienced, coordinated connection between human and machine--with AI companions (AICs). After systematic generation and expert review of an item pool (including items pertaining to dyadism, coordination, autotelicity, temporality, and positive valence), N = 467 people interacting with AICs responded to the item pool and to construct validation measures. Through exploratory factor analysis, two factors were induced: Eudaimonic Exchange and Connective Coordination. Construct validation analyses (confirmed in a second sample; N = 249) indicate the factors function largely as expected. Post-hoc analyses of deviations suggest two different templates for MC with AICs: One socioinstrumental and one autotelic.
☆ On Improvisation and Open-Endedness: Insights for Experiential AI AAAI 2026
Improvisation-the art of spontaneous creation that unfolds moment-to-moment without a scripted outcome-requires practitioners to continuously sense, adapt, and create anew. It is a fundamental mode of human creativity spanning music, dance, and everyday life. The open-ended nature of improvisation produces a stream of novel, unrepeatable moments-an aspect highly valued in artistic creativity. In parallel, open-endedness (OE)-a system's capacity for unbounded novelty and endless "interestingness"-is exemplified in natural or cultural evolution and has been considered "the last grand challenge" in artificial life (ALife). The rise of generative AI now raises the question in computational creativity (CC) research: What makes a "good" improvisation for AI? Can AI learn to improvise in a genuinely open-ended way? In this work-in-progress paper, we report insights from in-depth interviews with 6 experts in improvisation across dance, music, and contact improvisation. We draw systemic connections between human improvisational arts and the design of future experiential AI agents that could improvise alone or alongside humans-or even with other AI agents-embodying qualities of improvisation drawn from practice: active listening (umwelt and awareness), being in the time (mindfulness and ephemerality), embracing the unknown (source of randomness and serendipity), non-judgmental flow (acceptance and dynamical stability, balancing structure and surprise (unpredictable criticality at edge of chaos), imaginative metaphor (synaesthesia and planning), empathy, trust, boundary, and care (mutual theory of mind), and playfulness and intrinsic motivation (maintaining interestingness).
comment: Submitted to AAAI 2026 Creative AI for Live Interactive Performances Workshop (CLIP) as a work-in-progress paper
☆ Human-AI Programming Role Optimization: Developing a Personality-Driven Self-Determination Framework CCS 2012
As artificial intelligence transforms software development, a critical question emerges: how can developers and AI systems collaborate most effectively? This dissertation optimizes human-AI programming roles through self-determination theory and personality psychology, introducing the Role Optimization Motivation Alignment (ROMA) framework. Through Design Science Research spanning five cycles, this work establishes empirically-validated connections between personality traits, programming role preferences, and collaborative outcomes, engaging 200 experimental participants and 46 interview respondents. Key findings demonstrate that personality-driven role optimization significantly enhances self-determination and team dynamics, yielding 23% average motivation increases among professionals and up to 65% among undergraduates. Five distinct personality archetypes emerge: The Explorer (high Openness/low Agreeableness), The Orchestrator (high Extraversion/Agreeableness), The Craftsperson (high Neuroticism/low Extraversion), The Architect (high Conscientiousness), and The Adapter (balanced profile). Each exhibits distinct preferences for programming roles (Co-Pilot, Co-Navigator, Agent), with assignment modes proving crucial for satisfaction. The dissertation contributes: (1) an empirically-validated framework linking personality traits to role preferences and self-determination outcomes; (2) a taxonomy of AI collaboration modalities mapped to personality profiles while preserving human agency; and (3) an ISO/IEC 29110 extension enabling Very Small Entities to implement personality-driven role optimization within established standards. Keywords: artificial intelligence, human-computer interaction, behavioral software engineering, self-determination theory, personality psychology, phenomenology, intrinsic motivation, pair programming, design science research, ISO/IEC 29110
comment: PhD Dissertation, Prague University of Economics and Business, 2025. 323 pages. ACM CCS 2012: Human-computer interaction, Collaborative interaction, Human-AI collaborative systems, Pair programming, AI-assisted software engineering
☆ Reducing students' misconceptions about video game development. A mixed-method study
This study examines students' na\"ive mindset (misconceptions) about video game development, idealized and inaccurate beliefs that shape an unrealistic understanding of the field. The research evaluated the effectiveness of a fifteen-hour-long lecture series delivered by industry professionals, designed to challenge this mindset and expose students to the complexities and realities of game production. A mixed-methods approach was employed, combining qualitative analysis with a prototype quantitative tool developed to measure levels of misconception. Participants included students (n = 91) from diverse academic backgrounds interested in game creation and professionals (n = 94) working in the video game industry. Findings show that the intervention significantly reduced students' na\"ive beliefs while enhancing their motivation to pursue careers in the industry. Exposure to professional perspectives fostered a more realistic and informed mindset, taking into account the understanding of the technical, collaborative, and business aspects of game development. The results suggest that incorporating similar expert-led interventions early in game development education can improve learning outcomes, support informed career choices, and mitigate future professional disappointment.
☆ VisionCAD: An Integration-Free Radiology Copilot Framework
Widespread clinical deployment of computer-aided diagnosis (CAD) systems is hindered by the challenge of integrating with existing hospital IT infrastructure. Here, we introduce VisionCAD, a vision-based radiological assistance framework that circumvents this barrier by capturing medical images directly from displays using a camera system. The framework operates through an automated pipeline that detects, restores, and analyzes on-screen medical images, transforming camera-captured visual data into diagnostic-quality images suitable for automated analysis and report generation. We validated VisionCAD across diverse medical imaging datasets, demonstrating that our modular architecture can flexibly utilize state-of-the-art diagnostic models for specific tasks. The system achieves diagnostic performance comparable to conventional CAD systems operating on original digital images, with an F1-score degradation typically less than 2\% across classification tasks, while natural language generation metrics for automated reports remain within 1\% of those derived from original images. By requiring only a camera device and standard computing resources, VisionCAD offers an accessible approach for AI-assisted diagnosis, enabling the deployment of diagnostic capabilities in diverse clinical settings without modifications to existing infrastructure.
♻ ☆ Nirvana AI Governance: How AI Policymaking Is Committing Three Old Fallacies
This research applies Harold Demsetz's concept of the nirvana approach to the realm of AI governance and debunks three common fallacies in various AI policy proposals--"the grass is always greener on the other side," "free lunch," and "the people could be different." Through this, I expose fundamental flaws in the current AI regulatory proposal. First, some commentators intuitively believe that people are more reliable than machines and that government works better in risk control than companies' self-regulation, but they do not fully compare the differences between the status quo and the proposed replacements. Second, when proposing some regulatory tools, some policymakers and researchers do not realize and even gloss over the fact that harms and costs are also inherent in their proposals. Third, some policy proposals are initiated based on a false comparison between the AI-driven world, where AI does lead to some risks, and an entirely idealized world, where no risk exists at all. However, the appropriate approach is to compare the world where AI causes risks to the real world where risks are everywhere, but people can live well with these risks. The prevalence of these fallacies in AI governance underscores a broader issue: the tendency to idealize potential solutions without fully considering their real-world implications. This idealization can lead to regulatory proposals that are not only impractical but potentially harmful to innovation and societal progress.
comment: 9 pages
♻ ☆ Balancing Caregiving and Self-Care: Exploring Mental Health Needs of Alzheimer's and Dementia Caregivers SC
Alzheimer's Disease and Related Dementias (AD/ADRD) are progressive neurodegenerative conditions that impair memory, thought processes, and functioning. Family caregivers of individuals with AD/ADRD face significant mental health challenges due to long-term caregiving responsibilities. Yet, current support systems often overlook the evolving nature of their mental wellbeing needs. Our study examines caregivers' mental wellbeing concerns, focusing on the practices they adopt to manage the burden of caregiving and the technologies they use for support. Through semi-structured interviews with 25 family caregivers of individuals with AD/ADRD, we identified the key causes and effects of mental health challenges, and developed a temporal mapping of how caregivers' mental wellbeing evolves across three distinct stages of the caregiving journey. Additionally, our participants shared insights into improvements for existing mental health technologies, emphasizing the need for accessible, scalable, and personalized solutions that adapt to caregivers' changing needs over time. These findings offer a foundation for designing dynamic, stage-sensitive interventions that holistically support caregivers' mental wellbeing, benefiting both caregivers and care recipients.
comment: This work is published at the Proceedings of the ACM on Human-Computer Interaction (CSCW), 2025
♻ ☆ SplashNet: Split-and-Share Encoders for Accurate and Efficient Typing with Surface Electromyography
Surface electromyography (sEMG) at the wrists could enable natural, keyboard-free text entry, yet the state-of-the-art emg2qwerty baseline still misrecognizes $51.8\%$ of characters in the zero-shot setting on unseen users and $7.0\%$ after user-specific fine-tuning. We trace many of these errors to mismatched cross-user signal statistics, fragile reliance on high-order feature dependencies, and the absence of architectural inductive biases aligned with the bilateral nature of typing. To address these issues, we introduce three simple modifications: (i) Rolling Time Normalization, which adaptively aligns input distributions across users; (ii) Aggressive Channel Masking, which encourages reliance on low-order feature combinations more likely to generalize across users; and (iii) a Split-and-Share encoder that processes each hand independently with weight-shared streams to reflect the bilateral symmetry of the neuromuscular system. Combined with a five-fold reduction in spectral resolution ($33\!\rightarrow\!6$ frequency bands), these components yield a compact Split-and-Share model, SplashNet-mini, which uses only $\tfrac14$ the parameters and $0.6\times$ the FLOPs of the baseline while reducing character-error rate (CER) to $36.4\%$ zero-shot and $5.9\%$ after fine-tuning. An upscaled variant, SplashNet ($\tfrac12$ the parameters, $1.15\times$ the FLOPs of the baseline), further lowers error to $35.7\%$ and $5.5\%$, representing relative improvements of $31\%$ and $21\%$ in the zero-shot and fine-tuned settings, respectively. SplashNet therefore establishes a new state of the art without requiring additional data.
♻ ☆ Unpacking Personal(?!) Health Informatics for Proactive Collective Care in India
Personal Health Informatics (PHI), which leverages digital tools and information systems to support health assessment and self-care, promises more proactive, user-centered care, yet adoption and meaningful utilization barriers in India remain underexplored. Through a sequential mixed-methods study in urban India (Initial survey (n=87) and semi-structured interviews (n=22), follow-up survey = 116, and co-design workshops (n=6)), we surface practices, perceptions, and behaviors to identify ways PHI can be better utilized for proactive care in the Indian context. We find that PHI is valued for monitoring and enabling collective care; however, its adoption is constrained by low health and technology literacy, usability and integration issues, fragmented and costly technology ecosystems, and mistrust of digital health platforms. From triangulated evidence, we derive concrete design requirements, including user-controlled sharing, accessible analytics, and verifiable health information, and present a culturally grounded design vision for an integrated platform for collective care through design and evaluation of a figma prototype. The prototype evaluation provides further directions for design and development to better orient PHI for proactive care through the PHI-Proact operational map, which involves agency, elicitation, and engagement. Finally, using PHI-Proact, we conclude with concrete recommendations for designing and responsibly deploying PHI systems for proactive collective care in emerging contexts, which differ socially, culturally, infrastructurally, and technologically from WEIRD contexts.
comment: 32 pages, 3 figures, 6 tables; A qualitative HCI study with prototype evaluation
♻ ☆ 3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark EMNLP 25
Though Large Vision-Language Models (LVLMs) are being actively explored in medicine, their ability to conduct complex real-world telemedicine consultations combining accurate diagnosis with professional dialogue remains underexplored. This paper presents 3MDBench (Medical Multimodal Multi-agent Dialogue Benchmark), an open-source framework for simulating and evaluating LVLM-driven telemedical consultations. 3MDBench simulates patient variability through temperament-based Patient Agent and evaluates diagnostic accuracy and dialogue quality via Assessor Agent. It includes 2996 cases across 34 diagnoses from real-world telemedicine interactions, combining textual and image-based data. The experimental study compares diagnostic strategies for widely used open and closed-source LVLMs. We demonstrate that multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings, highlighting the importance of context-aware, information-seeking questioning. Moreover, injecting predictions from a diagnostic convolutional neural network into the LVLM's context boosts F1 by up to 20%. Source code is available at https://github.com/univanxx/3mdbench.
comment: EMNLP 25 (main)
♻ ☆ Towards Immersive Mixed Reality Street Play: Understanding Co-located Bodily Play with See-through Head-mounted Displays in Public Spaces SC
As see-through Mixed Reality Head-Mounted Displays (MRHMDs) proliferate, their usage is gradually shifting from controlled, private settings to spontaneous, public contexts. While location-based augmented reality mobile games such as Pokemon GO have been successful, the embodied interaction afforded by MRHMDs moves play beyond phone-based screen-tapping toward co-located, bodily, movement-based play. In anticipation of widespread MRHMD adoption, major technology companies have teased concept videos envisioning urban streets as vast mixed reality playgrounds-imagine Harry Potter-style wizard duels in city streets-which we term Immersive Mixed Reality Street Play (IMRSP). However, few real-world studies examine such scenarios. Through empirical, in-the-wild studies of our research-through-design game probe, Multiplayer Omnipresent Fighting Arena (MOFA), deployed across diverse public venues, we offer initial insights into the social implications, challenges, opportunities, and design recommendations of IMRSP. The MOFA framework, which includes three gameplay modes-"The Training," "The Duel," and "The Dragon"-is open-sourced at https://github.com/realitydeslab/mofa.
comment: Accepted in CSCW 2025
♻ ☆ The Narrative Continuity Test: A Conceptual Framework for Evaluating Identity Persistence in AI Systems
Artificial intelligence systems based on large language models (LLMs) can now generate coherent text, music, and images, yet they operate without a persistent state: each inference reconstructs context from scratch. This paper introduces the Narrative Continuity Test (NCT) -- a conceptual framework for evaluating identity persistence and diachronic coherence in AI systems. Unlike capability benchmarks that assess task performance, the NCT examines whether an LLM remains the same interlocutor across time and interaction gaps. The framework defines five necessary axes -- Situated Memory, Goal Persistence, Autonomous Self-Correction, Stylistic & Semantic Stability, and Persona/Role Continuity -- and explains why current architectures systematically fail to support them. Case analyses (Character.\,AI, Grok, Replit, Air Canada) show predictable continuity failures under stateless inference. The NCT reframes AI evaluation from performance to persistence, outlining conceptual requirements for future benchmarks and architectural designs that could sustain long-term identity and goal coherence in generative models.
comment: 33 pages, 127 references v2: Minor editorial revision: redundant phrasing reduced, punctuation and formatting improved; no conceptual or data changes
♻ ☆ Abstraction Alignment: Comparing Model-Learned and Human-Encoded Conceptual Relationships
While interpretability methods identify a model's learned concepts, they overlook the relationships between concepts that make up its abstractions and inform its ability to generalize to new data. To assess whether models' have learned human-aligned abstractions, we introduce abstraction alignment, a methodology to compare model behavior against formal human knowledge. Abstraction alignment externalizes domain-specific human knowledge as an abstraction graph, a set of pertinent concepts spanning levels of abstraction. Using the abstraction graph as a ground truth, abstraction alignment measures the alignment of a model's behavior by determining how much of its uncertainty is accounted for by the human abstractions. By aggregating abstraction alignment across entire datasets, users can test alignment hypotheses, such as which human concepts the model has learned and where misalignments recur. In evaluations with experts, abstraction alignment differentiates seemingly similar errors, improves the verbosity of existing model-quality metrics, and uncovers improvements to current human abstractions.
comment: 20 pages, 7 figures, published in CHI 2025
♻ ☆ CGM-Led Multimodal Tracking with Chatbot Support: An Autoethnography in Sub-Health
Metabolic disorders present a pressing global health challenge, with China carrying the world's largest burden. While continuous glucose monitoring (CGM) has transformed diabetes care, its potential for supporting sub-health populations -- such as individuals who are overweight, prediabetic, or anxious -- remains underexplored. At the same time, large language models (LLMs) are increasingly used in health coaching, yet CGM is rarely incorporated as a first-class signal. To address this gap, we conducted a six-week autoethnography, combining CGM with multimodal indicators captured via common digital devices and a chatbot that offered personalized reflections and explanations of glucose fluctuations. Our findings show how CGM-led, data-first multimodal tracking, coupled with conversational support, shaped everyday practices of diet, activity, stress, and wellbeing. This work contributes to HCI by extending CGM research beyond clinical diabetes and demonstrating how LLM-driven agents can support preventive health and reflection in at-risk populations.
comment: International Conference on Human-Engaged Computing (ICHEC 2025), Singapore
Software Engineering 16
☆ An Empirical Investigation of the Experiences of Dyslexic Software Engineers
Dyslexia is a common learning disorder that primarily impairs an individual's reading and writing abilities. In adults, dyslexia can affect both professional and personal lives, often leading to mental challenges and difficulties acquiring and keeping work. In Software Engineering (SE), reading and writing difficulties appear to pose substantial challenges for core tasks such as programming. However, initial studies indicate that these challenges may not significantly affect their performance compared to non-dyslexic colleagues. Conversely, strengths associated with dyslexia could be particularly valuable in areas like programming and design. However, there is currently no work that explores the experiences of dyslexic software engineers, and puts their strengths into relation with their difficulties. To address this, we present a qualitative study of the experiences of dyslexic individuals in SE. We followed the basic stage of the Socio-Technical Grounded Theory method and base our findings on data collected through 10 interviews with dyslexic software engineers, 3 blog posts and 153 posts on the social media platform Reddit. We find that dyslexic software engineers especially struggle at the programming learning stage, but can succeed and indeed excel at many SE tasks once they master this step. Common SE-specific support tools, such as code completion and linters are especially useful to these individuals and mitigate many of the experienced difficulties. Finally, dyslexic software engineers exhibit strengths in areas such as visual thinking and creativity. Our findings have implications to SE practice and motivate several areas of future research in SE, such as investigating what makes code less/more understandable to dyslexic individuals.
☆ Repairing Responsive Layout Failures Using Retrieval Augmented Generation
Responsive websites frequently experience distorted layouts at specific screen sizes, called Responsive Layout Failures (RLFs). Manually repairing these RLFs involves tedious trial-and-error adjustments of HTML elements and CSS properties. In this study, an automated repair approach, leveraging LLM combined with domain-specific knowledge is proposed. The approach is named ReDeFix, a Retrieval-Augmented Generation (RAG)-based solution that utilizes Stack Overflow (SO) discussions to guide LLM on CSS repairs. By augmenting relevant SO knowledge with RLF-specific contexts, ReDeFix creates a prompt that is sent to the LLM to generate CSS patches. Evaluation demonstrates that our approach achieves an 88\% accuracy in repairing RLFs. Furthermore, a study from software engineers reveals that generated repairs produce visually correct layouts while maintaining aesthetics.
comment: Accepted at the 41st IEEE International Conference on Software Maintenance and Evolution 2025 (ICSME'25)
☆ Lessons Learned from the Use of Generative AI in Engineering and Quality Assurance of a WEB System for Healthcare
The advances and availability of technologies involving Generative Artificial Intelligence (AI) are evolving clearly and explicitly, driving immediate changes in various work activities. Software Engineering (SE) is no exception and stands to benefit from these new technologies, enhancing productivity and quality in its software development processes. However, although the use of Generative AI in SE practices is still in its early stages, considering the lack of conclusive results from ongoing research and the limited technological maturity, we have chosen to incorporate these technologies in the development of a web-based software system to be used in clinical trials by a thoracic diseases research group at our university. For this reason, we decided to share this experience report documenting our development team's learning journey in using Generative AI during the software development process. Project management, requirements specification, design, development, and quality assurance activities form the scope of observation. Although we do not yet have definitive technological evidence to evolve our development process significantly, the results obtained and the suggestions shared here represent valuable insights for software organizations seeking to innovate their development practices to achieve software quality with generative AI.
comment: 11 pages, 2 figures, in Portuguese language
☆ AgentGit: A Version Control Framework for Reliable and Scalable LLM-Powered Multi-Agent Systems
With the rapid progress of large language models (LLMs), LLM-powered multi-agent systems (MAS) are drawing increasing interest across academia and industry. However, many current MAS frameworks struggle with reliability and scalability, especially on complex tasks. We present AgentGit, a framework that brings Git-like rollback and branching to MAS workflows. Built as an infrastructure layer on top of LangGraph, AgentGit supports state commit, revert, and branching, allowing agents to traverse, compare, and explore multiple trajectories efficiently. To evaluate AgentGit, we designed an experiment that optimizes target agents by selecting better prompts. We ran a multi-step A/B test against three baselines -- LangGraph, AutoGen, and Agno -- on a real-world task: retrieving and analyzing paper abstracts. Results show that AgentGit significantly reduces redundant computation, lowers runtime and token usage, and supports parallel exploration across multiple branches, enhancing both reliability and scalability in MAS development. This work offers a practical path to more robust MAS design and enables error recovery, safe exploration, iterative debugging, and A/B testing in collaborative AI systems.
☆ Can Large Language Models Detect Real-World Android Software Compliance Violations?
The rapid development of Large Language Models (LLMs) has transformed software engineering, showing promise in tasks like code generation, bug detection, and compliance checking. However, current models struggle to detect compliance violations in Android applications across diverse legal frameworks. We propose \emph{CompliBench}, a novel evaluation framework for assessing LLMs' ability to detect compliance violations under regulations like LGPD, PDPA, and PIPEDA. The framework defines two tasks: Task 1 evaluates \emph{retrieval and localization} at file, module, and line granularities, and Task 2 assesses \emph{multi-label judgment} for code snippets. These tasks mirror the audit process, where auditors locate problematic code and determine implicated provisions. Traditional metrics fail to capture important aspects like cross-granularity stability and jurisdictional consistency. Thus, we introduce stability-aware composites (SGS, RCS, CRGS, and OCS) for a more comprehensive assessment. Experiments with six models, including GPT-4O and Claude-3.5, show \emph{CompliBench} improves compliance detection, with Claude-3.5-sonnet-20241022 achieving the highest OCS score (0.3295), and Gemini-2.5-pro the lowest (0.0538). This work demonstrates \emph{CompliBench}'s potential for improving LLM performance in compliance tasks and provides a foundation for future tools aligned with data protection standards. Our project is available at https://github.com/Haoyi-Zhang/CompliBench.
☆ GDPR-Bench-Android: A Benchmark for Evaluating Automated GDPR Compliance Detection in Android
Automating the detection of EU General Data Protection Regulation (GDPR) violations in source code is a critical but underexplored challenge. We introduce \textbf{GDPR-Bench-Android}, the first comprehensive benchmark for evaluating diverse automated methods for GDPR compliance detection in Android applications. It contains \textbf{1951} manually annotated violation instances from \textbf{15} open-source repositories, covering 23 GDPR articles at file-, module-, and line-level granularities. To enable a multi-paradigm evaluation, we contribute \textbf{Formal-AST}, a novel, source-code-native formal method that serves as a deterministic baseline. We define two tasks: (1) \emph{multi-granularity violation localization}, evaluated via Accuracy@\textit{k}; and (2) \emph{snippet-level multi-label classification}, assessed by macro-F1 and other classification metrics. We benchmark 11 methods, including eight state-of-the-art LLMs, our Formal-AST analyzer, a retrieval-augmented (RAG) method, and an agentic (ReAct) method. Our findings reveal that no single paradigm excels across all tasks. For Task 1, the ReAct agent achieves the highest file-level Accuracy@1 (17.38%), while the Qwen2.5-72B LLM leads at the line level (61.60%), in stark contrast to the Formal-AST method's 1.86%. For the difficult multi-label Task 2, the Claude-Sonnet-4.5 LLM achieves the best Macro-F1 (5.75%), while the RAG method yields the highest Macro-Precision (7.10%). These results highlight the task-dependent strengths of different automated approaches and underscore the value of our benchmark in diagnosing their capabilities. All resources are available at: https://github.com/Haoyi-Zhang/GDPR-Bench-Android.
☆ Employee Performance when Implementing Agile Practices in an IT Workforce
Adoption of agile practices has increased in IT workforces. However, there is a lack of comprehensive studies in the African context on employee performance when implementing agile practices. This study addresses this gap by exploring employee performance in agile environments for IT workforces in South Africa. An interpretivist mono-method qualitative approach was used, with the use of interviews as a research strategy. Seventeen semi-structured interviews were conducted with agile practitioners from various roles. Our results indicated that agile practices influence employee performance significantly, with participants reporting on aspects which included planning, communication, employee development and well-being, collaboration, team culture and progress. Additionally, our results reported obstacles when using agile practices that included adoption, team engagement, leadership and instilling an agile mindset. Agile practices influence employee performance in IT workforces by fostering improved team dynamics, enhanced collaboration, improved efficiencies, risk management, planning, continuous improvement, learning, personal development and well-being. Conclusively, our findings suggest that if agile challenges are addressed and additional support is provided, employee performance can be significantly improved.
comment: 11 pages, 1 figure, 1 table, 7th World Symposium on Software Engineering (WSSE 2025)
☆ HIP-LLM: A Hierarchical Imprecise Probability Approach to Reliability Assessment of Large Language Models
Large Language Models (LLMs) are increasingly deployed across diverse domains, raising the need for rigorous reliability assessment methods. Existing benchmark-based evaluations primarily offer descriptive statistics of model accuracy over datasets, providing limited insight into the probabilistic behavior of LLMs under real operational conditions. This paper introduces HIP-LLM, a Hierarchical Imprecise Probability framework for modeling and inferring LLM reliability. Building upon the foundations of software reliability engineering, HIP-LLM defines LLM reliability as the probability of failure-free operation over a specified number of future tasks under a given Operational Profile (OP). HIP-LLM represents dependencies across (sub-)domains hierarchically, enabling multi-level inference from subdomain to system-level reliability. HIP-LLM embeds imprecise priors to capture epistemic uncertainty and incorporates OPs to reflect usage contexts. It derives posterior reliability envelopes that quantify uncertainty across priors and data. Experiments on multiple benchmark datasets demonstrate that HIP-LLM offers a more accurate and standardized reliability characterization than existing benchmark and state-of-the-art approaches. A publicly accessible repository of HIP-LLM is provided.
comment: under review
☆ Issue-Oriented Agent-Based Framework for Automated Review Comment Generation
Code review (CR) is a crucial practice for ensuring software quality. Various automated review comment generation techniques have been proposed to streamline the labor-intensive process. However, existing approaches heavily rely on a single model to identify various issues within the code, limiting the model's ability to handle the diverse, issue-specific nature of code changes and leading to non-informative comments, especially in complex scenarios such as bug fixes. To address these limitations, we propose RevAgent, a novel agent-based issue-oriented framework, decomposes the task into three stages: (1) Generation Stage, where five category-specific commentator agents analyze code changes from distinct issue perspectives and generate candidate comments; (2) Discrimination Stage, where a critic agent selects the most appropriate issue-comment pair; and (3) Training Stage, where all agents are fine-tuned on curated, category-specific data to enhance task specialization. Evaluation results show that RevAgent significantly outperforms state-of-the-art PLM- and LLM-based baselines, with improvements of 12.90\%, 10.87\%, 6.32\%, and 8.57\% on BLEU, ROUGE-L, METEOR, and SBERT, respectively. It also achieves relatively higher accuracy in issue-category identification, particularly for challenging scenarios. Human evaluations further validate the practicality of RevAgent in generating accurate, readable, and context-aware review comments. Moreover, RevAgent delivers a favorable trade-off between performance and efficiency.
☆ A Big Step Forward? A User-Centric Examination of iOS App Privacy Report and Enhancements
The prevalent engagement with mobile apps underscores the importance of understanding their data practices. Transparency plays a crucial role in this context, ensuring users to be informed and give consent before any data access occurs. Apple introduced a new feature since iOS 15.2, App Privacy Report, to inform users about detailed insights into apps' data access and sharing. This feature continues Apple's trend of privacy-focused innovations (following Privacy Nutrition Labels), and has been marketed as a big step forward in user privacy. However, its real-world impacts on user privacy and control remain unexamined. We thus proposed an end-to-end study involving systematic assessment of the App Privacy Report's real-world benefits and limitations, LLM-enabled and multi-technique synthesized enhancements, and comprehensive evaluation from both system and user perspectives. Through a structured focus group study with twelve everyday iOS users, we explored their experiences, understanding, and perceptions of the feature, suggesting its limited practical impact resulting from missing important details. We identified two primary user concerns: the clarity of data access purpose and domain description. In response, we proposed enhancements including a purpose inference framework and domain clarification pipeline. We demonstrated the effectiveness and benefits of such enhancements for mobile app users. This work provides practical insights that could help enhance user privacy transparency and discusses areas for future research.
comment: Accepted to S&P 2025
☆ SmartDoc: A Context-Aware Agentic Method Comment Generation Plugin
Context: The software maintenance phase involves many activities such as code refactoring, bug fixing, code review or testing. Program comprehension is key to all these activities, as it demands developers to grasp the knowledge (e.g., implementation details) required to modify the codebase. Methods as main building blocks in a program can offer developers this knowledge source for code comprehension. However, reading entire method statements can be challenging, which necessitates precise and up-to-date comments. Objective: We propose a solution as an IntelliJ IDEA plugin, named SmartDoc, that assists developers in generating context-aware method comments. Method: This plugin acts as an Artificial Intelligence (AI) agent that has its own memory and is augmented by target methods' context. When a request is initiated by the end-user, the method content and all its nested method calls are used in the comment generation. At the beginning, these nested methods are visited and a call graph is generated. This graph is then traversed using depth-first search (DFS), enabling the provision of full-context to enrich Large Language Model (LLM) prompts. Result: The product is a software, as a plugin, developed for Java codebase and installable on IntelliJ IDEA. This plugin can serve concurrently for methods whose comments are being updated , and it shares memory across all flows to avoid redundant calls. o measure the accuracy of this solution, a dedicated test case is run to record SmartDoc generated comments and their corresponding ground truth. For each collected result-set, three metrics are computed, BERTScore, BLEU and ROUGE-1. These metrics will determine how accurate the generated comments are in comparison to the ground truth. Result: The obtained accuracy, in terms of the precision, recall and F1, is promising, and lies in the range of 0.80 to 0.90 for BERTScore.
comment: 6 pages, Already submitted to The 3rd International Workshop on Integrated Development Environments (the IDE Workshop)
☆ Human-AI Programming Role Optimization: Developing a Personality-Driven Self-Determination Framework CCS 2012
As artificial intelligence transforms software development, a critical question emerges: how can developers and AI systems collaborate most effectively? This dissertation optimizes human-AI programming roles through self-determination theory and personality psychology, introducing the Role Optimization Motivation Alignment (ROMA) framework. Through Design Science Research spanning five cycles, this work establishes empirically-validated connections between personality traits, programming role preferences, and collaborative outcomes, engaging 200 experimental participants and 46 interview respondents. Key findings demonstrate that personality-driven role optimization significantly enhances self-determination and team dynamics, yielding 23% average motivation increases among professionals and up to 65% among undergraduates. Five distinct personality archetypes emerge: The Explorer (high Openness/low Agreeableness), The Orchestrator (high Extraversion/Agreeableness), The Craftsperson (high Neuroticism/low Extraversion), The Architect (high Conscientiousness), and The Adapter (balanced profile). Each exhibits distinct preferences for programming roles (Co-Pilot, Co-Navigator, Agent), with assignment modes proving crucial for satisfaction. The dissertation contributes: (1) an empirically-validated framework linking personality traits to role preferences and self-determination outcomes; (2) a taxonomy of AI collaboration modalities mapped to personality profiles while preserving human agency; and (3) an ISO/IEC 29110 extension enabling Very Small Entities to implement personality-driven role optimization within established standards. Keywords: artificial intelligence, human-computer interaction, behavioral software engineering, self-determination theory, personality psychology, phenomenology, intrinsic motivation, pair programming, design science research, ISO/IEC 29110
comment: PhD Dissertation, Prague University of Economics and Business, 2025. 323 pages. ACM CCS 2012: Human-computer interaction, Collaborative interaction, Human-AI collaborative systems, Pair programming, AI-assisted software engineering
☆ Reducing students' misconceptions about video game development. A mixed-method study
This study examines students' na\"ive mindset (misconceptions) about video game development, idealized and inaccurate beliefs that shape an unrealistic understanding of the field. The research evaluated the effectiveness of a fifteen-hour-long lecture series delivered by industry professionals, designed to challenge this mindset and expose students to the complexities and realities of game production. A mixed-methods approach was employed, combining qualitative analysis with a prototype quantitative tool developed to measure levels of misconception. Participants included students (n = 91) from diverse academic backgrounds interested in game creation and professionals (n = 94) working in the video game industry. Findings show that the intervention significantly reduced students' na\"ive beliefs while enhancing their motivation to pursue careers in the industry. Exposure to professional perspectives fostered a more realistic and informed mindset, taking into account the understanding of the technical, collaborative, and business aspects of game development. The results suggest that incorporating similar expert-led interventions early in game development education can improve learning outcomes, support informed career choices, and mitigate future professional disappointment.
☆ Reasoning Trajectories for Socratic Debugging of Student Code: From Misconceptions to Contradictions and Updated Beliefs
In Socratic debugging, instructors guide students towards identifying and fixing a bug on their own, instead of providing the bug fix directly. Most novice programmer bugs are caused by programming misconceptions, namely false beliefs about a programming concept. In this context, Socratic debugging can be formulated as a guided Reasoning Trajectory (RT) leading to a statement about the program behavior that contradicts the bug-causing misconception. Upon reaching this statement, the ensuing cognitive dissonance leads the student to first identify and then update their false belief. In this paper, we introduce the task of reasoning trajectory generation, together with a dataset of debugging problems manually annotated with RTs. We then describe LLM-based solutions for generating RTs and Socratic conversations that are anchored on them. A large-scale LLM-as-judge evaluation shows that frontier models can generate up to 91% correct reasoning trajectories and 98.7% valid conversation turns.
comment: 25 pages, 2 tables, 13 figures
☆ Sherlock: Reliable and Efficient Agentic Workflow Execution
With the increasing adoption of large language models (LLM), agentic workflows, which compose multiple LLM calls with tools, retrieval, and reasoning steps, are increasingly replacing traditional applications. However, such workflows are inherently error-prone: incorrect or partially correct output at one step can propagate or even amplify through subsequent stages, compounding the impact on the final output. Recent work proposes integrating verifiers that validate LLM output or actions, such as self-reflection, debate, or LLM-as-a-judge mechanisms. Yet, verifying every step introduces significant latency and cost overheads. In this work, we seek to answer three key questions: which nodes in a workflow are most error-prone and thus deserve costly verification, how to select the most appropriate verifier for each node, and how to use verification with minimal impact to latency? Our solution, Sherlock, addresses these using counterfactual analysis on agentic workflows to identify error-prone nodes and selectively attaching cost-optimal verifiers only where necessary. At runtime, Sherlock speculatively executes downstream tasks to reduce latency overhead, while verification runs in the background. If verification fails, execution is rolled back to the last verified output. Compared to the non-verifying baseline, Sherlock delivers an 18.3% accuracy gain on average across benchmarks. Sherlock reduces workflow execution time by up to 48.7% over non-speculative execution and lowers verification cost by 26.0% compared to the Monte Carlo search-based method, demonstrating that principled, fault-aware verification effectively balances efficiency and reliability in agentic workflows.
♻ ☆ Execution-Aware Program Reduction for WebAssembly via Record and Replay
WebAssembly (Wasm) programs may trigger bugs in their engine implementations. To aid debugging, program reduction techniques try to produce a smaller variant of the input program that still triggers the bug. However, existing execution-unaware program reduction techniques struggle with large and complex Wasm programs, because they rely on static information and apply syntactic transformations, while ignoring the valuable information offered by the input program's execution behavior. We present RR-Reduce and Hybrid-Reduce, novel execution-aware program reduction techniques that leverage execution behaviors via record and replay. RR-Reduce identifies a bug-triggering function as the target function, isolates that function from the rest of the program, and generates a reduced program that replays only the interactions between the target function and the rest of the program. Hybrid-Reduce combines a complementary execution-unaware reduction technique with RR-Reduce to further reduce program size. We evaluate RR-Reduce and Hybrid-Reduce on 28 Wasm programs that trigger a diverse set of bugs in three engines. On average, RR-Reduce reduces the programs to 1.20 percent of their original size in 14.5 minutes, which outperforms the state of the art by 33.15 times in terms of reduction time. Hybrid-Reduce reduces the programs to 0.13 percent of their original size in 3.5 hours, which outperforms the state of the art by 3.42 times in terms of reduced program size and 2.26 times in terms of reduction time. We envision RR-Reduce as the go-to tool for rapid, on-demand debugging in minutes, and Hybrid-Reduce for scenarios where developers require the smallest possible programs.
comment: Accepted at ASE 2025
Systems and Control 21
☆ Quantum Computing for EVs to Enhance Grid Resilience and Disaster Relief: Challenges and Opportunities
The power grid is the foundation of modern society, however extreme weather events have increasingly caused widespread outages. Enhancing grid resilience is therefore critical to maintaining secure and reliable operations. In disaster relief and restoration, vehicle-to-grid (V2G) technology allows electric vehicles (EVs) to serve as mobile energy resources by discharging to support critical loads or regulating grid frequency as needed. Effective V2G operation requires coordinated charging and discharging of many EVs through optimization. Similarly, in grid restoration, EVs must be strategically routed to affected areas, forming the mobile charging station placement (CSP) problem, which presents another complex optimization challenge. This work reviews state-of-the-art optimization methods for V2G and mobile CSP applications, outlines their limitations, and explores how quantum computing (QC) could overcome current computational bottlenecks. A QC-focused perspective is presented on enhancing grid resilience and accelerating restoration as extreme weather events grow more frequent and severe.
comment: 11 pages, 0 figures, 2 tables, Submitted to IEEE Transactions on Smart Grid
☆ Hybrid Quantum-Classical Optimization of the Resource Scheduling Problem
Resource scheduling is critical in many industries, especially in power systems. The Unit Commitment problem determines the on/off status and output levels of generators under many constraints. Traditional exact methods, such as mathematical programming methods or dynamic programming, remain the backbone of UC solution techniques, but they often rely on linear approximations or exhaustive search, leading to high computational burdens as system size grows. Metaheuristic approaches, such as genetic algorithms, particle swarm optimization, and other evolutionary methods, have been explored to mitigate this complexity; however, they typically lack optimality guarantees, exhibit sensitivity to initial conditions, and can become prohibitively time-consuming for large-scale systems. In this paper, we introduce a quantum-classical hybrid algorithm for UC and, by extension, other resource scheduling problems, that leverages Benders decomposition to decouple binary commitment decisions from continuous economic dispatch. The binary master problem is formulated as a quadratic unconstrained binary optimization model and solved on a quantum annealer. The continuous subproblem, which minimizes generation costs, with Lagrangian cuts feeding back to the master until convergence. We evaluate our hybrid framework on systems scaled from 10 to 1,000 generation units. Compared against a classical mixed-integer nonlinear programming baseline, the hybrid algorithm achieves a consistently lower computation-time growth rate and maintains an absolute optimality gap below 1.63%. These results demonstrate that integrating quantum annealing within a hybrid quantum-classical Benders decomposition loop can significantly accelerate large-scale resource scheduling without sacrificing solution quality, pointing toward a viable path for addressing the escalating complexity of modern power grids.
comment: 13 pages, 7 figures, 1 table Submitted to Next Research
☆ Unveiling Uniform Shifted Power Law in Stochastic Human and Autonomous Driving Behavior
Accurately simulating rare but safety-critical driving behaviors is essential for the evaluation and certification of autonomous vehicles (AVs). However, current models often fail to reproduce realistic collision rates when calibrated on real-world data, largely due to inadequate representation of long-tailed behavioral distributions. Here, we uncover a simple yet unifying shifted power law that robustly characterizes the stochasticity of both human-driven vehicle (HV) and AV behaviors, especially in the long-tail regime. The model adopts a parsimonious analytical form with only one or two parameters, enabling efficient calibration even under data sparsity. Analyzing large-scale, micro-level trajectory data from global HV and AV datasets, the shifted power law achieves an average R2 of 0.97 and a nearly identical tail distribution, uniformly fits both frequent behaviors and rare safety-critical deviations, significantly outperforming existing Gaussian-based baselines. When integrated into an agent-based traffic simulator, it enables forward-rolling simulations that reproduce realistic crash patterns for both HVs and AVs, achieving rates consistent with real-world statistics and improving the fidelity of safety assessment without post hoc correction. This discovery offers a unified and data-efficient foundation for modeling high-risk behavior and improves the fidelity of simulation-based safety assessments for mixed AV/HV traffic. The shifted power law provides a promising path toward simulation-driven validation and global certification of AV technologies.
☆ Frequency Quality Assessment of GFM and GFL Converters and Synchronous Condensers
This paper compares the impact of different conventional and emerging technologies and control strategies on frequency quality. We study, in particular, the long-term dynamic performance of grid-forming (GFM) and grid-following (GFL) inverter-based resources (IBRs) as well as conventional synchronous machines. Extensive simulations and several realistic scenarios consider both short-term and long-term aspects of frequency quality. It is shown that, while overall GFM IBRs significantly improve frequency quality, a combination of GFL IBRs providing frequency support such as wind and batteries, and synchronous condensers, might be enough to meet similar frequency quality standards. Another result of the paper is that the need for automatic generation control (AGC) becomes less clear in GFM IBR-dominated grids from a frequency quality perspective.
☆ Adaptive Federated Learning to Optimize the MultiCast flows in Data Centers
Data centers play an increasingly critical role in societal digitalization, yet their rapidly growing energy demand poses significant challenges for sustainable operation. To enhance the energy efficiency of geographically distributed data centers, this paper formulates a multi-period optimization model that captures the interdependence of electricity, heat, and data flows. The optimization of such multicast flows inherently involves mixed-integer formulations and the access to proprietary or sensitive datasets, which correspondingly exacerbate computational complexity and raise data-privacy concerns. To address these challenges, an adaptive federated learning-to-optimization approach is proposed, accounting for the heterogeneity of datasets across distributed data centers. To safeguard privacy, cryptography techniques are leveraged in both the learning and optimization processes. A model acceptance criterion with convergence guarantee is developed to improve learning performance and filter out potentially contaminated data, while a verifiable double aggregation mechanism is further proposed to simultaneously ensure privacy and integrity of shared data during optimization. Theoretical analysis and numerical simulations demonstrate that the proposed approach preserves the privacy and integrity of shared data, achieves near-optimal performance, and exhibits high computational efficiency, making it suitable for large-scale data center optimization under privacy constraints.
☆ Efficiency and Optimality in Electrochemical Battery Model Parameter Identification: A Comparative Study of Estimation Techniques
Parameter identification for electrochemical battery models has always been challenging due to the multitude of parameters involved, most of which cannot be directly measured. This paper evaluates the efficiency and optimality of three widely-used parameter identification methods for electrochemical battery models: Least Squares Method (LS), Particle Swarm Optimization (PSO), and Genetic Algorithm (GA). Therefore, a Single Particle Model (SPM) of a battery was developed and discretized. Battery parameter grouping was then performed to reduce the number of parameters required. Using a set of parameters previously identified from a real battery as a benchmark, we generated fitting and validation datasets to assess the methods' runtime and accuracy. The comparative analysis reveals that PSO outperforms the other methods in terms of accuracy and stability, making it highly effective for parameter identification when there is no prior knowledge of the battery's internal parameters. In contrast, LS is better suited for minor adjustments in parameters, particularly for aging batteries, whereas GA lags behind in both computational efficiency and optimality with respect to PSO.
comment: Accepted and published in the Proceedings of the 2024 10th International Conference on Optimization and Applications (ICOA), IEEE, 2024. Copyright 2024 IEEE. This is the author's accepted manuscript; the final version is available at IEEE Xplore (DOI: 10.1109/ICOA62581.2024.10754301)
☆ Digital Twin of Aerosol Jet Printing
Aerosol Jet (AJ) printing is a versatile additive manufacturing technique capable of producing high-resolution interconnects on both 2D and 3D substrates. The AJ process is complex and dynamic with many hidden and unobservable states that influence the machine performance, including aerosol particle diameter, aerosol carrier density, vial level, and ink deposition in the tube and nozzle. Despite its promising potential, the widespread adoption of AJ printing is limited by inconsistencies in print quality that often stem from variability in these hidden states. To address these challenges, we develop a digital twin model of the AJ process that offers real-time insights into the machine's operations. The digital twin is built around a physics-based macro-model created through simulation and experimentation. The states and parameters of the digital model are continuously updated using probabilistic sequential estimation techniques to closely align with real-time measurements extracted from the AJ system's sensor and video data. The result is a digital model of the AJ process that continuously evolves over a physical machine's lifecycle. The digital twin enables accurate monitoring of unobservable physical characteristics, detects and predicts anomalous behavior, and forecasts the effect of control adjustments. This work presents a comprehensive end-to-end digital twin framework that integrates customized computer vision techniques, physics-based macro-modeling, and advanced probabilistic estimation methods to construct an evolving digital representation of the AJ equipment and process. While the methodologies are customized for aerosol jet printing, the process for constructing the digital twin can be applied for other advanced manufacturing techniques.
☆ Towards Quantum Algorithms for the Optimization of Spanning Trees: The Power Distribution Grids Use Case
Optimizing the topology of networks is an important challenge across engineering disciplines. In energy systems, network reconfiguration can substantially reduce losses and costs and thus support the energy transition. Unfortunately, many related optimization problems are NP hard, restricting practical applications. In this article, we address the problem of minimizing losses in radial networks, a problem that routinely arises in distribution grid operation. We show that even the computation of approximate solutions is computationally hard and propose quantum optimization as a promising alternative. We derive two quantum algorithmic primitives based on the Quantum Alternating Operator Ansatz (QAOA) that differ in the sampling of network topologies: a tailored sampling of radial topologies and simple sampling with penalty terms to suppress non-radial topologies. We show how to apply these algorithmic primitives to distribution grid reconfiguration and quantify the necessary quantum resources.
☆ FTT-GRU: A Hybrid Fast Temporal Transformer with GRU for Remaining Useful Life Prediction
Accurate prediction of the remaining useful life (RUL) of industrial machinery is essential for reducing downtime and optimizing maintenance schedules. Existing approaches, such as long short-term memory (LSTM) networks and convolutional neural networks (CNNs), often struggle to model both global temporal dependencies and fine-grained degradation trends in multivariate sensor data. We propose a hybrid model, FTT-GRU, which combines a Fast Temporal Transformer (FTT) -- a lightweight Transformer variant using linearized attention via fast Fourier transform (FFT) -- with a gated recurrent unit (GRU) layer for sequential modeling. To the best of our knowledge, this is the first application of an FTT with a GRU for RUL prediction on NASA CMAPSS, enabling simultaneous capture of global and local degradation patterns in a compact architecture. On CMAPSS FD001, FTT-GRU attains RMSE 30.76, MAE 18.97, and $R^2=0.45$, with 1.12 ms CPU latency at batch=1. Relative to the best published deep baseline (TCN--Attention), it improves RMSE by 1.16\% and MAE by 4.00\%. Training curves averaged over $k=3$ runs show smooth convergence with narrow 95\% confidence bands, and ablations (GRU-only, FTT-only) support the contribution of both components. These results demonstrate that a compact Transformer-RNN hybrid delivers accurate and efficient RUL predictions on CMAPSS, making it suitable for real-time industrial prognostics.
comment: 5 pages, The 2025 International Conference on Computational Science and Computational Intelligence
☆ Rotatable Antenna System Empowered Low-Altitude Economy: Opportunities and Challenges
Low-altitude economy (LAE) is an emerging technological paradigm that enables continuous airspace coverage at multiple altitudes by providing highly reliable data connectivity for numerous low-altitude applications. However, existing networks cannot sufficiently support LAE development, as current base stations (BSs) are primarily designed for terrestrial users and lack the capability to provide continuous coverage at low altitudes. To overcome these challenges, rotatable antenna system (RAS) is introduced in LAE, enabling flexible beamforming by dynamically adjusting the boresight of directional antennas to extend low-altitude coverage and enhance the stability of data transmission. In this article, we first provide an overview of RAS-empowered LAE applications, including low-altitude communication, sensing, control, and computation. Then, we present two practical RAS deployment strategies for LAE scenarios, namely RAS-aided multi-BS and multi-unmanned aerial vehicle (UAV) cooperative coverages, as well as provide detailed discussions on their system architectures and performance benefits. Additionally, key design issues of RAS in LAE are discussed, including channel modeling and estimation, cellular access and interference cancellation, as well as RAS configuration and boresight optimization. Finally, we demonstrate the performance gains of RAS in LAE networks through experimental and simulation results.
comment: 8 pages, 5 figures, accepted in IEEE Wireless Communication (Early Access)
☆ Image-based ground distance detection for crop-residue-covered soil
Conservation agriculture features a soil surface covered with crop residues, which brings benefits of improving soil health and saving water. However, one significant challenge in conservation agriculture lies in precisely controlling the seeding depth on the soil covered with crop residues. This is constrained by the lack of ground distance information, since current distance measurement techniques, like laser, ultrasonic, or mechanical displacement sensors, are incapable of differentiating whether the distance information comes from the residue or the soil. This paper presents an image-based method to get the ground distance information for the crop-residues-covered soil. This method is performed with 3D camera and RGB camera, obtaining depth image and color image at the same time. The color image is used to distinguish the different areas of residues and soil and finally generates a mask image. The mask image is applied to the depth image so that only the soil area depth information can be used to calculate the ground distance, and residue areas can be recognized and excluded from ground distance detection. Experimentation shows that this distance measurement method is feasible for real-time implementation, and the measurement error is within plus or minus 3mm. It can be applied in conservation agriculture machinery for precision depth seeding, as well as other depth-control-demanding applications like transplant or tillage.
comment: under review at Computers and Electronics in Agriculture
☆ On Improvisation and Open-Endedness: Insights for Experiential AI AAAI 2026
Improvisation-the art of spontaneous creation that unfolds moment-to-moment without a scripted outcome-requires practitioners to continuously sense, adapt, and create anew. It is a fundamental mode of human creativity spanning music, dance, and everyday life. The open-ended nature of improvisation produces a stream of novel, unrepeatable moments-an aspect highly valued in artistic creativity. In parallel, open-endedness (OE)-a system's capacity for unbounded novelty and endless "interestingness"-is exemplified in natural or cultural evolution and has been considered "the last grand challenge" in artificial life (ALife). The rise of generative AI now raises the question in computational creativity (CC) research: What makes a "good" improvisation for AI? Can AI learn to improvise in a genuinely open-ended way? In this work-in-progress paper, we report insights from in-depth interviews with 6 experts in improvisation across dance, music, and contact improvisation. We draw systemic connections between human improvisational arts and the design of future experiential AI agents that could improvise alone or alongside humans-or even with other AI agents-embodying qualities of improvisation drawn from practice: active listening (umwelt and awareness), being in the time (mindfulness and ephemerality), embracing the unknown (source of randomness and serendipity), non-judgmental flow (acceptance and dynamical stability, balancing structure and surprise (unpredictable criticality at edge of chaos), imaginative metaphor (synaesthesia and planning), empathy, trust, boundary, and care (mutual theory of mind), and playfulness and intrinsic motivation (maintaining interestingness).
comment: Submitted to AAAI 2026 Creative AI for Live Interactive Performances Workshop (CLIP) as a work-in-progress paper
☆ Descriptive Model-based Learning and Control for Bipedal Locomotion
Bipedal balance is challenging due to its multi-phase, hybrid nature and high-dimensional state space. Traditional balance control approaches for bipedal robots rely on low-dimensional models for locomotion planning and reactive control, constraining the full robot to behave like these simplified models. This involves tracking preset reference paths for the Center of Mass and upper body obtained through low-dimensional models, often resulting in inefficient walking patterns with bent knees. However, we observe that bipedal balance is inherently low-dimensional and can be effectively described with simple state and action descriptors in a low-dimensional state space. This allows the robot's motion to evolve freely in its high-dimensional state space, only constraining its projection in the low-dimensional state space. In this work, we propose a novel control approach that avoids prescribing a low-dimensional model to the full model. Instead, our control framework uses a descriptive model with the minimum degrees of freedom necessary to maintain balance, allowing the remaining degrees of freedom to evolve freely in the high-dimensional space. This results in an efficient human-like walking gait and improved robustness.
comment: 8 pages, 15 figures
☆ Optimization of continuous-flow over traffic networks with fundamental diagram constraints
Optimal transport (OT) theory provides a principled framework for modeling mass movement in applications such as mobility, logistics, and economics. Classical formulations, however, generally ignore capacity limits that are intrinsic in applications, in particular in traffic flow problems. We address this limitation by incorporating fundamental diagrams into a dynamic continuous-flow OT model on graphs, thereby including empirical relations between local density and maximal flux. We adopt an Eulerian kinetic action on graphs that preserves displacement interpolation in direct analogy with the continuous theory. Momentum lives on edges and density on nodes, mirroring road-network semantics in which segments carry speed and intersections store mass. The resulting fundamental-diagram-constrained OT problem preserves mass conservation and admits a convex variational discretization, yielding optimal congestion-aware traffic flow over road networks. We establish the existence and uniqueness of the optimal flow with sources and sinks, and develop an efficient convex optimization method. Numerical studies begin with a single-lane line network and scale to a city-level road network.
comment: 13 pages, 5 figures
☆ Symbol Detection in a MIMO Wireless Communication System Using a FeFET-coupled CMOS Ring Oscillator Array
Symbol decoding in multiple-input multiple-output (MIMO) wireless communication systems requires the deployment of fast, energy-efficient computing hardware deployable at the edge. The brute-force, exact maximum likelihood (ML) decoder, solved on conventional classical digital hardware, has exponential time complexity. Approximate classical solvers implemented on the same hardware have polynomial time complexity at the best. In this article, we design an alternative ring-oscillator-based coupled oscillator array to act as an oscillator Ising machine (OIM) and heuristically solve the ML-based MIMO detection problem. Complementary metal oxide semiconductor (CMOS) technology is used to design the ring oscillators, and ferroelectric field effect transistor (FeFET) technology is chosen as the coupling element (X) between the oscillators in this CMOS + X OIM design. For this purpose, we experimentally report high linear range of conductance variation (1 micro-S to 60 micro-S) in a FeFET device fabricated at 28 nm high-K/ metal gate (HKMG) CMOS technology node. We incorporate the conductance modulation characteristic in SPICE simulation of the ring oscillators connected in an all-to-all fashion through a crossbar array of these FeFET devices. We show that the above range of conductance variation of the FeFET device is suitable to obtain optimum OIM performance with no significant performance drop up to a MIMO size of 100 transmitting and 100 receiving antennas, thereby making FeFET a suitable device for this application. Our simulations and associated analysis using the Kuramoto model of oscillators also predict that this designed classical analog OIM, if implemented experimentally, will offer logarithmic scaling of computation time with MIMO size, thereby offering a huge improvement (in terms of computation speed) over aforementioned MIMO decoders run on conventional digital hardware.
comment: 58 pages including supplementary information, 5 main figures, 4 main tables, 2 supplementary figures, 2 supplementary tables
☆ CT-ESKF: A General Framework of Covariance Transformation-Based Error-State Kalman Filter
Invariant extended Kalman filter (InEKF) possesses excellent trajectory-independent property and better consistency compared to conventional extended Kalman filter (EKF). However, when applied to scenarios involving both global-frame and body-frame observations, InEKF may fail to preserve its trajectory-independent property. This work introduces the concept of equivalence between error states and covariance matrices among different error-state Kalman filters, and shows that although InEKF exhibits trajectory independence, its covariance propagation is actually equivalent to EKF. A covariance transformation-based error-state Kalman filter (CT-ESKF) framework is proposed that unifies various error-state Kalman filtering algorithms. The framework gives birth to novel filtering algorithms that demonstrate improved performance in integrated navigation systems that incorporate both global and body-frame observations. Experimental results show that the EKF with covariance transformation outperforms both InEKF and original EKF in a representative INS/GNSS/Odometer integrated navigation system.
comment: 19 pages, 12 figures
☆ Large Language Models for Control
This paper investigates using large language models (LLMs) to generate control actions directly, without requiring control-engineering expertise or hand-tuned algorithms. We implement several variants: (i) prompt-only, (ii) tool-assisted with access to historical data, and (iii) prediction-assisted using learned or simple models to score candidate actions. We compare them on tracking accuracy and actuation effort, with and without a prompt that requests lower actuator usage. Results show prompt-only LLMs already produce viable control, while tool-augmented versions adapt better to changing objectives but can be more sensitive to constraints, supporting LLM-in-the-loop control for evolving cyber-physical systems today and operator and human inputs.
♻ ☆ Continuous Classification Aggregation
We prove that any optimal, independent, and zero unanimous fuzzy classification aggregation function of a continuum of individual classifications of $m\ge 3$ objects into $2\le p\le m$ types must be a weighted arithmetic mean. We also provide a characterization for the case when $m=p=2$.
comment: 18 pages
♻ ☆ Imperfect Competition in Markets for Short-Circuit Current Services
An important limitation of Inverter-Based Resources (IBR) is their reduced contribution to Short-Circuit Current (SCC), as compared to that of Synchronous Generators (SGs). With increasing penetration of IBR in most power systems, the reducing SCC poses challenges to a secure system operation, as line protections may not trip when required. In order to address this issue, the SCC ancillary service could be procured via an economic mechanism, aiming at securing adequate SCC on all buses. However, the suitability of markets for SCC services is not well understood, given that these could be prone to market-power issues: since the SCC contributions from various SGs to a certain bus are determined by the electrical topology of the grid, this is a highly local service. It is necessary to understand if SGs at advantageous electrical locations could exert market power and, if so, how it could be mitigated. In order to fill this gap, this paper adopts an SCC-constrained bilevel model to investigate strategic behaviors of SGs. To address the non-convexity due to unit commitment variables, the model is restructured through a primal-dual formulation. Based on a modified IEEE 30-bus system, cases with strategic SGs placed at different buses are analyzed. These studies demonstrate that agents exerting market power could achieve up to triple revenues from SCC provision, highlighting the need to carefully design these markets.
comment: Ancillary services, short-circuit current, market power, bilevel optimization, primal-dual formulation. A paper submitted to
♻ ☆ Distributionally Robust Control Synthesis for Stochastic Systems with Safety and Reach-Avoid Specifications
We investigate the problem of synthesizing distributionally robust control policies for stochastic systems under safety and reach-avoid specifications. Using a game-theoretical framework, we consider the setting where the probability distribution of the disturbance at each time step is selected from an ambiguity set defined by the Wasserstein distance. The goal is to synthesize a distributionally robust control policy that ensures the satisfaction probability exceeds a specified threshold under any distribution within the ambiguity set. First, for both safety and reach-avoid specifications, we establish the existence of optimal policies by leveraging the dynamic programming principles. Then we demonstrate how the associated optimization problem can be efficiently solved using the dual representation of Wasserstein distributionally robust optimization. Furthermore, for safety specifications in particular, we introduce a novel concept of distributionally robust control barrier certificates and show how these enable the efficient synthesis of controllers through sum-of-squares programming techniques. Finally, our experimental results reveal that incorporating distributional robustness during the synthesis phase significantly improves the satisfaction probability during online execution, even with limited statistical knowledge of the disturbance distribution.
♻ ☆ Introducing Coherent-Control Koopman Modeling to Reservoir Scale Porous Media Flow Studies
Accurate and robust surrogate modeling is essential for the real-time control and optimization of large-scale subsurface systems, such as geological CO2 storage and waterflood management. This study investigates the limits of classical Dynamic Mode Decomposition with control (DMDc) and introduces CCKM, as a robust alter-native, in enforcing control in pressure and water saturation reservoir dynamics under challenging prediction scenarios. We introduced a control-coherent incremental ({\Delta}) CCKM formulation, in which the field update is driven by actuator changes rather than rather than actuator levels as in the original level formulation and compared them both against DMDc and a Hybrid B-only surrogate that re-uses DMDcs bottom-B (same-step feed-through), showing that only CCKM remains stable and accurate under regime shifts. Two representative cases are considered: (i) an out-of-distribution shut-in and restart case, and (ii) an in-distribution bottomhole pressure (BHP) drawdown. Results show that only CCKM consistently maintains stability and accuracy across both scenarios, achieving sub-bar mean absolute error and sub-percent Frobenius norm percent change error (FPCE) even under regime shifts, while DMDc exhibit large unphysical errors during control transients. The findings demonstrate that strict control-coherence is critical for reliable surrogate modeling, particularly in settings with abrupt changes in control strategy. The proposed framework is broadly applicable to real-time reservoir optimization and can be integrated seamlessly into existing optimization and monitoring workflows, enabling fast and trustworthy deci-sion support in the presence of both expected and unexpected actuation regimes.
Graphics 6
☆ Applying Medical Imaging Tractography Techniques to Painterly Rendering of Images
Doctors and researchers routinely use diffusion tensor imaging (DTI) and tractography to visualize the fibrous structure of tissues in the human body. This paper explores the connection of these techniques to the painterly rendering of images. Using a tractography algorithm the presented method can place brush strokes that mimic the painting process of human artists, analogously to how fibres are tracked in DTI. The analogue to the diffusion tensor for image orientation is the structural tensor, which can provide better local orientation information than the gradient alone. I demonstrate this technique in portraits and general images, and discuss the parallels between fibre tracking and brush stroke placement, and frame it in the language of tractography. This work presents an exploratory investigation into the cross-domain application of diffusion tensor imaging techniques to painterly rendering of images. All the code is available at https://github.com/tito21/st-python
comment: Exploratory investigation applying medical imaging tractography techniques to painterly image rendering. Code available at https://github.com/tito21/st-python
☆ Image-based ground distance detection for crop-residue-covered soil
Conservation agriculture features a soil surface covered with crop residues, which brings benefits of improving soil health and saving water. However, one significant challenge in conservation agriculture lies in precisely controlling the seeding depth on the soil covered with crop residues. This is constrained by the lack of ground distance information, since current distance measurement techniques, like laser, ultrasonic, or mechanical displacement sensors, are incapable of differentiating whether the distance information comes from the residue or the soil. This paper presents an image-based method to get the ground distance information for the crop-residues-covered soil. This method is performed with 3D camera and RGB camera, obtaining depth image and color image at the same time. The color image is used to distinguish the different areas of residues and soil and finally generates a mask image. The mask image is applied to the depth image so that only the soil area depth information can be used to calculate the ground distance, and residue areas can be recognized and excluded from ground distance detection. Experimentation shows that this distance measurement method is feasible for real-time implementation, and the measurement error is within plus or minus 3mm. It can be applied in conservation agriculture machinery for precision depth seeding, as well as other depth-control-demanding applications like transplant or tillage.
comment: under review at Computers and Electronics in Agriculture
☆ Oitijjo-3D: Generative AI Framework for Rapid 3D Heritage Reconstruction from Street View Imagery CEC
Cultural heritage restoration in Bangladesh faces a dual challenge of limited resources and scarce technical expertise. Traditional 3D digitization methods, such as photogrammetry or LiDAR scanning, require expensive hardware, expert operators, and extensive on-site access, which are often infeasible in developing contexts. As a result, many of Bangladesh's architectural treasures, from the Paharpur Buddhist Monastery to Ahsan Manzil, remain vulnerable to decay and inaccessible in digital form. This paper introduces Oitijjo-3D, a cost-free generative AI framework that democratizes 3D cultural preservation. By using publicly available Google Street View imagery, Oitijjo-3D reconstructs faithful 3D models of heritage structures through a two-stage pipeline - multimodal visual reasoning with Gemini 2.5 Flash Image for structure-texture synthesis, and neural image-to-3D generation through Hexagen for geometry recovery. The system produces photorealistic, metrically coherent reconstructions in seconds, achieving significant speedups compared to conventional Structure-from-Motion pipelines, without requiring any specialized hardware or expert supervision. Experiments on landmarks such as Ahsan Manzil, Choto Sona Mosque, and Paharpur demonstrate that Oitijjo-3D preserves both visual and structural fidelity while drastically lowering economic and technical barriers. By turning open imagery into digital heritage, this work reframes preservation as a community-driven, AI-assisted act of cultural continuity for resource-limited nations.
comment: 6 Pages, 4 figures, 2 Tables, Submitted to ICECTE 2026
♻ ☆ Audio Driven Real-Time Facial Animation for Social Telepresence SIGGRAPH
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency, designed for social interactions in virtual reality for anyone. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time, which are then decoded as photorealistic 3D facial avatars. Leveraging the generative capabilities of diffusion models, we capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance (<15ms GPU time). Our novel architecture minimizes latency through two key innovations: an online transformer that eliminates dependency on future inputs and a distillation pipeline that accelerates iterative denoising into a single step. We further address critical design challenges in live scenarios for processing continuous audio signals frame-by-frame while maintaining consistent animation quality. The versatility of our framework extends to multimodal applications, including semantic modalities such as emotion conditions and multimodal sensors with head-mounted eye cameras on VR headsets. Experimental results demonstrate significant improvements in facial animation accuracy over existing offline state-of-the-art baselines, achieving 100 to 1000 times faster inference speed. We validate our approach through live VR demonstrations and across various scenarios such as multilingual speeches.
comment: SIGGRAPH Asia 2025. Project page: https://jiyewise.github.io/projects/AudioRTA
♻ ☆ Anti-Aliased 2D Gaussian Splatting NeurIPS 2025
2D Gaussian Splatting (2DGS) has recently emerged as a promising method for novel view synthesis and surface reconstruction, offering better view-consistency and geometric accuracy than volumetric 3DGS. However, 2DGS suffers from severe aliasing artifacts when rendering at different sampling rates than those used during training, limiting its practical applications in scenarios requiring camera zoom or varying fields of view. We identify that these artifacts stem from two key limitations: the lack of frequency constraints in the representation and an ineffective screen-space clamping approach. To address these issues, we present AA-2DGS, an anti-aliased formulation of 2D Gaussian Splatting that maintains its geometric benefits while significantly enhancing rendering quality across different scales. Our method introduces a world-space flat smoothing kernel that constrains the frequency content of 2D Gaussian primitives based on the maximal sampling frequency from training views, effectively eliminating high-frequency artifacts when zooming in. Additionally, we derive a novel object-space Mip filter by leveraging an affine approximation of the ray-splat intersection mapping, which allows us to efficiently apply proper anti-aliasing directly in the local space of each splat.
comment: NeurIPS 2025. Code will be available at https://github.com/maeyounes/AA-2DGS
♻ ☆ SViM3D: Stable Video Material Diffusion for Single Image 3D Generation ICCV 2025
We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. Recently, video diffusion models have been successfully used to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or needs to be estimated in additional steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially varying PBR parameters and surface normals jointly with each generated view based on explicit camera control. This unique setup allows for relighting and generating a 3D asset using our model as neural prior. We introduce various mechanisms to this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games and other visual media.
comment: Accepted by International Conference on Computer Vision (ICCV 2025). Project page: http://svim3d.aengelhardt.com
Computation and Language 42
♻ ☆ EmbeddingGemma: Powerful and Lightweight Text Representations
We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.
comment: 18 pages. Models are available in HuggingFace (at https://huggingface.co/collections/google/embeddinggemma-68b9ae3a72a82f0562a80dc4), Kaggle (at https://www.kaggle.com/models/google/embeddinggemma/), and Vertex AI (at https://pantheon.corp.google.com/vertex-ai/publishers/google/model-garden/embeddinggemma)
♻ ☆ CausalARC: Abstract Reasoning with Causal World Models
On-the-fly reasoning often requires adaptation to novel problems under limited data and distribution shift. This work introduces CausalARC: an experimental testbed for AI reasoning in low-data and out-of-distribution regimes, modeled after the Abstraction and Reasoning Corpus (ARC). Each CausalARC reasoning task is sampled from a fully specified causal world model, formally expressed as a structural causal model. Principled data augmentations provide observational, interventional, and counterfactual feedback about the world model in the form of few-shot, in-context learning demonstrations. As a proof-of-concept, we illustrate the use of CausalARC for four language model evaluation settings: (1) abstract reasoning with test-time training, (2) counterfactual reasoning with in-context learning, (3) program synthesis, and (4) causal discovery with logical reasoning. Within- and between-model performance varied heavily across tasks, indicating room for significant improvement in language model reasoning.
comment: Peer-reviewed workshop paper
♻ ☆ Exploring the Synergy of Quantitative Factors and Newsflow Representations from Large Language Models for Stock Return Prediction
In quantitative investing, return prediction supports various tasks, including stock selection, portfolio optimization, and risk management. Quantitative factors, such as valuation, quality, and growth, capture various characteristics of stocks. Unstructured financial data, like news and transcripts, has attracted growing attention, driven by recent advances in large language models (LLMs). This paper examines effective methods for leveraging multimodal factors and newsflow in return prediction and stock selection. First, we introduce a fusion learning framework to learn a unified representation from factors and newsflow representations generated by an LLM. Within this framework, we compare three representative methods: representation combination, representation summation, and attentive representations. Next, building on empirical observations from fusion learning, we explore the mixture model that adaptively combines predictions made by single modalities and their fusion. To mitigate the training instability observed in the mixture model, we introduce a decoupled training approach with theoretical insights. Finally, our experiments on real investment universes yield several insights into effective multimodal modeling of factors and news for stock return prediction and selection.
♻ ☆ Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems? AACL
The rapid escalation from elementary school-level to frontier problems of the difficulty for LLM benchmarks in recent years have weaved a miracle for researchers that we are only inches away from surpassing human intelligence. However, is the LLMs' remarkable reasoning ability indeed comes from true intelligence by human standards, or are they simply reciting solutions witnessed during training at an Internet level? To study this problem, we propose RoR-Bench, a novel, multi-modal benchmark for detecting LLM's recitation behavior when asked simple reasoning problems but with conditions subtly shifted, and conduct empirical analysis on our benchmark. Surprisingly, we found existing cutting-edge LLMs unanimously exhibits extremely severe recitation behavior; by changing one phrase in the condition, top models such as OpenAI-o1 and DeepSeek-R1 can suffer 60 percent performance loss on elementary school-level arithmetic and reasoning problems. Such findings are a wake-up call to the LLM community that compels us to re-evaluate the true intelligence level of cutting-edge LLMs.
comment: 24 pages, 3 figures, 13 tables. The paper is accepted at AACL-IJCNLP 2025 (main track), and the latest version adds modifications in camera-ready
♻ ☆ Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally NeurIPS 2025
Central banks around the world play a crucial role in maintaining economic stability. Deciphering policy implications in their communications is essential, especially as misinterpretations can disproportionately impact vulnerable populations. To address this, we introduce the World Central Banks (WCB) dataset, the most comprehensive monetary policy corpus to date, comprising over 380k sentences from 25 central banks across diverse geographic regions, spanning 28 years of historical data. After uniformly sampling 1k sentences per bank (25k total) across all available years, we annotate and review each sentence using dual annotators, disagreement resolutions, and secondary expert reviews. We define three tasks: Stance Detection, Temporal Classification, and Uncertainty Estimation, with each sentence annotated for all three. We benchmark seven Pretrained Language Models (PLMs) and nine Large Language Models (LLMs) (Zero-Shot, Few-Shot, and with annotation guide) on these tasks, running 15,075 benchmarking experiments. We find that a model trained on aggregated data across banks significantly surpasses a model trained on an individual bank's data, confirming the principle "the whole is greater than the sum of its parts." Additionally, rigorous human evaluations, error analyses, and predictive tasks validate our framework's economic utility. Our artifacts are accessible through the HuggingFace and GitHub under the CC-BY-NC-SA 4.0 license.
comment: Accepted at NeurIPS 2025 (main conference)
♻ ☆ SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding
Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels-global, page, and element-to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvement over both proprietary (+7.9 overall) and open-source models (+9.8 overall).
comment: https://slideagent.github.io/
♻ ☆ Exploring Large Language Models for Detecting Mental Disorders EMNLP 2025
This paper compares the effectiveness of traditional machine learning methods, encoder-based models, and large language models (LLMs) on the task of detecting depression and anxiety. Five Russian-language datasets were considered, each differing in format and in the method used to define the target pathology class. We tested AutoML models based on linguistic features, several variations of encoder-based Transformers such as BERT, and state-of-the-art LLMs as pathology classification models. The results demonstrated that LLMs outperform traditional methods, particularly on noisy and small datasets where training examples vary significantly in text length and genre. However, psycholinguistic features and encoder-based models can achieve performance comparable to language models when trained on texts from individuals with clinically confirmed depression, highlighting their potential effectiveness in targeted clinical applications.
comment: Accepted to EMNLP 2025
♻ ☆ OpinioRAG: Towards Generating User-Centric Opinion Highlights from Large-scale Online Reviews
We study the problem of opinion highlights generation from large volumes of user reviews, often exceeding thousands per entity, where existing methods either fail to scale or produce generic, one-size-fits-all summaries that overlook personalized needs. To tackle this, we introduce OpinioRAG, a scalable, training-free framework that combines RAG-based evidence retrieval with LLMs to efficiently produce tailored summaries. Additionally, we propose novel reference-free verification metrics designed for sentiment-rich domains, where accurately capturing opinions and sentiment alignment is essential. These metrics offer a fine-grained, context-sensitive assessment of factual consistency. To facilitate evaluation, we contribute the first large-scale dataset of long-form user reviews, comprising entities with over a thousand reviews each, paired with unbiased expert summaries and manually annotated queries. Through extensive experiments, we identify key challenges, provide actionable insights into improving systems, pave the way for future research, and position OpinioRAG as a robust framework for generating accurate, relevant, and structured summaries at scale.
comment: COLM 2025
♻ ☆ Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers
The use of copyrighted books for training AI models has led to numerous lawsuits from authors concerned about AI's ability to generate derivative content. Yet it's unclear if these models can generate high quality literary text while emulating authors' styles. To answer this we conducted a preregistered study comparing MFA-trained expert writers with three frontier AI models: ChatGPT, Claude & Gemini in writing up to 450 word excerpts emulating 50 award-winning authors' diverse styles. In blind pairwise evaluations by 159 representative expert & lay readers, AI-generated text from in-context prompting was strongly disfavored by experts for both stylistic fidelity (OR=0.16, p<10^-8) & writing quality (OR=0.13, p<10^-7) but showed mixed results with lay readers. However, fine-tuning ChatGPT on individual authors' complete works completely reversed these findings: experts now favored AI-generated text for stylistic fidelity (OR=8.16, p<10^-13) & writing quality (OR=1.87, p=0.010), with lay readers showing similar shifts. These effects generalize across authors & styles. The fine-tuned outputs were rarely flagged as AI-generated (3% rate v. 97% for in-context prompting) by best AI detectors. Mediation analysis shows this reversal occurs because fine-tuning eliminates detectable AI stylistic quirks (e.g., cliche density) that penalize in-context outputs. While we do not account for additional costs of human effort required to transform raw AI output into cohesive, publishable prose, the median fine-tuning & inference cost of $81 per author represents a dramatic 99.7% reduction compared to typical professional writer compensation. Author-specific fine-tuning thus enables non-verbatim AI writing that readers prefer to expert human writing, providing empirical evidence directly relevant to copyright's fourth fair-use factor, the "effect upon the potential market or value" of the source works.
comment: Preprint Under Review
♻ ☆ Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency
Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, but they remain susceptible to hallucinations--generating content that appears plausible but contains factual inaccuracies. We present Finch-Zk, a black-box framework that leverages fine-grained cross-model consistency to detect and mitigate hallucinations in LLM outputs without requiring external knowledge sources. Finch-Zk introduces two key innovations: 1) a cross-model consistency checking strategy that reveals fine-grained inaccuracies by comparing responses generated by diverse models from semantically-equivalent prompts, and 2) a targeted mitigation technique that applies precise corrections to problematic segments while preserving accurate content. Experiments on the FELM dataset show Finch-Zk improves hallucination detection F1 scores by 6-39\% compared to existing approaches. For mitigation, Finch-Zk achieves up to 9 absolute percentage points improvement in answer accuracy on the GPQA-diamond dataset when applied to state-of-the-art models like Llama 4 Maverick and Claude 4 Sonnet. Extensive evaluation on multiple datasets demonstrates that Finch-Zk provides a practical, deployment-ready safeguard for enhancing factual reliability in production LLM systems.
♻ ☆ Editing Across Languages: A Survey of Multilingual Knowledge Editing EMNLP 2025
While Knowledge Editing has been extensively studied in monolingual settings, it remains underexplored in multilingual contexts. This survey systematizes recent research on Multilingual Knowledge Editing (MKE), a growing subdomain of model editing focused on ensuring factual edits generalize reliably across languages. We present a comprehensive taxonomy of MKE methods, covering parameter-based, memory-based, fine-tuning, and hypernetwork approaches. We survey available benchmarks,summarize key findings on method effectiveness and transfer patterns, identify challenges in cross-lingual propagation, and highlight open problems related to language anisotropy, evaluation coverage, and edit scalability. Our analysis consolidates a rapidly evolving area and lays the groundwork for future progress in editable language-aware LLMs.
comment: Accepted at EMNLP 2025
♻ ☆ A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models EMNLP 2025
Chain-of-thought (CoT) reasoning enhances performance of large language models, but questions remain about whether these reasoning traces faithfully reflect the internal processes of the model. We present the first comprehensive study of CoT faithfulness in large vision-language models (LVLMs), investigating how both text-based and previously unexplored image-based biases affect reasoning and bias articulation. Our work introduces a novel, fine-grained evaluation pipeline for categorizing bias articulation patterns, enabling significantly more precise analysis of CoT reasoning than previous methods. This framework reveals critical distinctions in how models process and respond to different types of biases, providing new insights into LVLM CoT faithfulness. Our findings reveal that subtle image-based biases are rarely articulated compared to explicit text-based ones, even in models specialized for reasoning. Additionally, many models exhibit a previously unidentified phenomenon we term ``inconsistent'' reasoning - correctly reasoning before abruptly changing answers, serving as a potential canary for detecting biased reasoning from unfaithful CoTs. We then apply the same evaluation pipeline to revisit CoT faithfulness in LLMs across various levels of implicit cues. Our findings reveal that current language-only reasoning models continue to struggle with articulating cues that are not overtly stated.
comment: Accepted in EMNLP 2025, 34 pages, 25 figures
♻ ☆ RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision-Language-Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.
♻ ☆ Interpreting the Latent Structure of Operator Precedence in Language Models
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities but continue to struggle with arithmetic tasks. Prior works largely focus on outputs or prompting strategies, leaving the open question of the internal structure through which models do arithmetic computation. In this work, we investigate whether LLMs encode operator precedence in their internal representations via the open-source instruction-tuned LLaMA 3.2-3B model. We constructed a dataset of arithmetic expressions with three operands and two operators, varying the order and placement of parentheses. Using this dataset, we trace whether intermediate results appear in the residual stream of the instruction-tuned LLaMA 3.2-3B model. We apply interpretability techniques such as logit lens, linear classification probes, and UMAP geometric visualization. Our results show that intermediate computations are present in the residual stream, particularly after MLP blocks. We also find that the model linearly encodes precedence in each operator's embeddings post attention layer. We introduce partial embedding swap, a technique that modifies operator precedence by exchanging high-impact embedding dimensions between operators.
comment: 11 pages, 6 figures. An earlier version of this work was accepted to CoLM 2024. This is an extended version of our CoLM 2024 paper. Includes additional ablations; added Ved Shah as author for those contributions
♻ ☆ What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Jailbreaks have been a central focus of research regarding the safety and reliability of large language models (LLMs), yet the mechanisms underlying these attacks remain poorly understood. While previous studies have predominantly relied on linear methods to detect jailbreak attempts and model refusals, we take a different approach by examining both linear and non-linear features in prompts that lead to successful jailbreaks. First, we introduce a novel dataset comprising 10,800 jailbreak attempts spanning 35 diverse attack methods. Leveraging this dataset, we train linear and non-linear probes on hidden states of open-weight LLMs to predict jailbreak success. Probes achieve strong in-distribution accuracy but transfer is attack-family-specific, revealing that different jailbreaks are supported by distinct internal mechanisms rather than a single universal direction. To establish causal relevance, we construct probe-guided latent interventions that systematically shift compliance in the predicted direction. Interventions derived from non-linear probes produce larger and more reliable effects than those from linear probes, indicating that features linked to jailbreak success are encoded non-linearly in prompt representations. Overall, the results surface heterogeneous, non-linear structure in jailbreak mechanisms and provide a prompt-side methodology for recovering and testing the features that drive jailbreak outcomes.
♻ ☆ Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque EMNLP 2025
Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components evaluated on benchmarks and human preferences from 1,680 participants. Our conclusions show that target language corpora are essential, with synthetic instructions yielding robust models, and, most importantly, that using as backbone an instruction-tuned model outperforms using a base non-instructed model. Scaling up to Llama 3.1 Instruct 70B as backbone, our model comes near frontier models of much larger sizes for Basque, without using any Basque instructions. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation. https://github.com/hitz-zentroa/latxa-instruct
comment: Accepted at EMNLP 2025 Main Conference
♻ ☆ On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study
Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This benefit is not due to length alone-injecting arbitrary redundancy into reasoning traces fails to help and can even hurt performance. Instead, we find that generalization correlates with the model's confidence in next-token prediction, suggesting that long, coherent, and locally incremental traces make the training signal easier to optimize.
♻ ☆ Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse systems, these challenges have contributed to state-of-the-art research in the field. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions. From this analysis it emerges that: 1) Most participants use end-to-end (e2e) ASR systems, whereas hybrid systems were prevalent in previous CHiME challenges. This transition is mainly due to the availability of robust large-scale pre-trained models, which lowers the data burden for e2e-ASR. 2) Despite recent advances in neural speech separation and enhancement (SSE), all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios and different recording setups. 3) All best systems employ diarization refinement via target-speaker diarization techniques. Accurate speaker counting in the first diarization pass is thus crucial to avoid compounding errors and CHiME-8 DASR participants especially focused on this part. 4) Downstream evaluation via meeting summarization can correlate weakly with transcription quality due to the remarkable effectiveness of large-language models in handling errors. On the NOTSOFAR-1 scenario, even systems with over 50% time-constrained minimum permutation WER can perform roughly on par with the most effective ones (around 11%). 5) Despite recent progress, accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even when using computationally intensive system ensembles.
♻ ☆ Targeted Distillation for Sentiment Analysis
This paper explores targeted distillation methods for sentiment analysis, aiming to build compact and practical models that preserve strong and generalizable sentiment analysis capabilities. To this end, we conceptually decouple the distillation target into knowledge and alignment and accordingly propose a two-stage distillation framework. Moreover, we introduce SentiBench, a comprehensive and systematic sentiment analysis benchmark that covers a diverse set of tasks across 12 datasets. We evaluate a wide range of models on this benchmark. Experimental results show that our approach substantially enhances the performance of compact models across diverse sentiment analysis tasks, and the resulting models demonstrate strong generalization to unseen tasks, showcasing robust competitiveness against existing small-scale models.
♻ ☆ The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning
Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs). However, our study reveals a surprising contradiction to this prevailing perspective within the fundamental domain of pattern-based in-context learning (ICL). Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based ICL datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we designed extensive experiments to validate several hypothetical explanations. Our analysis uncovers a fundamental hybrid mechanism of explicit-implicit reasoning driving CoT's performance in pattern-based ICL: while explicit reasoning falters due to LLMs' struggles to infer underlying patterns from demonstrations, implicit reasoning-disrupted by the increased contextual distance of CoT rationales-often compensates, delivering correct answers despite flawed rationales. This hybrid mechanism explains CoT's relative underperformance, as noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel in abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions regarding the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs.
comment: Accepted by TMLR
♻ ☆ Comprehensive and Efficient Distillation for Lightweight Sentiment Analysis Models EMNLP 2025
Recent efforts leverage knowledge distillation techniques to develop lightweight and practical sentiment analysis models. These methods are grounded in human-written instructions and large-scale user texts. Despite the promising results, two key challenges remain: (1) manually written instructions are limited in diversity and quantity, making them insufficient to ensure comprehensive coverage of distilled knowledge; (2) large-scale user texts incur high computational cost, hindering the practicality of these methods. To this end, we introduce CompEffDist, a comprehensive and efficient distillation framework for sentiment analysis. Our framework consists of two key modules: attribute-based automatic instruction construction and difficulty-based data filtering, which correspondingly tackle the aforementioned challenges. Applying our method across multiple model series (Llama-3, Qwen-3, and Gemma-3), we enable 3B student models to match the performance of 20x larger teacher models on most tasks. In addition, our approach greatly outperforms baseline methods in data efficiency, attaining the same performance level with only 10% of the data.
comment: Accepted by EMNLP 2025. 22 pages, 9 figures. The first two authors contribute equally
♻ ☆ Transferring Linear Features Across Language Models With Model Stitching
In this work, we demonstrate that affine mappings between residual streams of language models is a cheap way to effectively transfer represented features between models. We apply this technique to transfer the weights of Sparse Autoencoders (SAEs) between models of different sizes to compare their representations. We find that small and large models learn similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring to a larger model at a FLOPs savings. In particular, using a small-to-large transferred SAE as initialization can lead to 50% cheaper training runs when training SAEs on larger models. Next, we show that transferred probes and steering vectors can effectively recover ground truth performance. Finally, we dive deeper into feature-level transferability, finding that semantic and structural features transfer noticeably differently while specific classes of functional features have their roles faithfully mapped. Overall, our findings illustrate similarities and differences in the linear representation spaces of small and large models and demonstrate a method for improving the training efficiency of SAEs.
♻ ☆ 3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark EMNLP 25
Though Large Vision-Language Models (LVLMs) are being actively explored in medicine, their ability to conduct complex real-world telemedicine consultations combining accurate diagnosis with professional dialogue remains underexplored. This paper presents 3MDBench (Medical Multimodal Multi-agent Dialogue Benchmark), an open-source framework for simulating and evaluating LVLM-driven telemedical consultations. 3MDBench simulates patient variability through temperament-based Patient Agent and evaluates diagnostic accuracy and dialogue quality via Assessor Agent. It includes 2996 cases across 34 diagnoses from real-world telemedicine interactions, combining textual and image-based data. The experimental study compares diagnostic strategies for widely used open and closed-source LVLMs. We demonstrate that multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings, highlighting the importance of context-aware, information-seeking questioning. Moreover, injecting predictions from a diagnostic convolutional neural network into the LVLM's context boosts F1 by up to 20%. Source code is available at https://github.com/univanxx/3mdbench.
comment: EMNLP 25 (main)
♻ ☆ MaiBaam Annotation Guidelines
This document provides the annotation guidelines for MaiBaam, a Bavarian corpus manually annotated with part-of-speech (POS) tags, syntactic dependencies, and German lemmas. MaiBaam belongs to the Universal Dependencies (UD) project, and our annotations elaborate on the general and German UD version 2 guidelines. In this document, we detail how to preprocess and tokenize Bavarian data, provide an overview of the POS tags and dependencies we use, explain annotation decisions that would also apply to closely related languages like German, and lastly we introduce and motivate decisions that are specific to Bavarian grammar.
comment: Updated for UD v2.17
♻ ☆ Kimi Linear: An Expressive, Efficient Attention Architecture
We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
comment: Kimi Linear tech report
♻ ☆ Incivility and Rigidity: Evaluating the Risks of Fine-Tuning LLMs for Political Argumentation
Incivility on platforms such as Twitter (now X) and Reddit complicates the development of AI systems that can support productive, rhetorically sound political argumentation. We present experiments with \textit{GPT-3.5 Turbo} fine-tuned on two contrasting datasets of political discourse: high-incivility Twitter replies to U.S. Congress and low-incivility posts from Reddit's \textit{r/ChangeMyView}. Our evaluation examines how data composition and prompting strategies affect the rhetorical framing and deliberative quality of model-generated arguments. Results show that Reddit-finetuned models generate safer but rhetorically rigid arguments, while cross-platform fine-tuning amplifies adversarial tone and toxicity. Prompt-based steering reduces overt toxicity (e.g., personal attacks) but cannot fully offset the influence of noisy training data. We introduce a rhetorical evaluation rubric - covering justification, reciprocity, alignment, and authority - and provide implementation guidelines for authoring, moderation, and deliberation-support systems.
♻ ☆ Language Native Lightly Structured Databases for Large Language Model Driven Composite Materials Research
The preparation procedures of materials are often embedded narratively in experimental protocols, research articles, patents, and laboratory notes, and are structured around procedural sequences, causal relationships, and conditional logic. The synthesis of boron nitride nanosheet (BNNS) polymer composites exemplifies this linguistically encoded decision-making system, where the practical experiments involve interdependent multistage and path-dependent processes such as exfoliation, functionalization, and dispersion, each governed by heterogeneous parameters and contextual contingencies, challenging conventional numerical optimization paradigms for experiment design. We reformulate this challenge into a text-reasoning problem through a framework centered on a text-first, lightly structured materials database and large language models (LLMs) as text reasoning engines. We constructed a database that captures evidence-linked narrative excerpts from the literature while normalizing only the minimum necessary entities, attributes, and relations to enable composite retrieval that unifies semantic matching, lexical cues, and explicit value filters. Building on this language-native, provenance-preserving foundation, the LLM operates in two complementary modes: retrieval-augmented generation (RAG), grounding outputs in retrieved evidence modules from the database, and experience-augmented reasoning (EAR), which leverages iteratively trained text guides derived from multi-source literature-based narrative data as external references to inform reasoning and decision-making. Applying this integration-and-reasoning framework, we demonstrate rapid, laboratory-scale optimization of BNNS preparation, highlighting how language-native data combined with LLM-based reasoning can significantly accelerate practical material preparation.
♻ ☆ MultiMatch: Multihead Consistency Regularization Matching for Semi-Supervised Text Classification EMNLP 2025
We introduce MultiMatch, a novel semi-supervised learning (SSL) algorithm combining the paradigms of co-training and consistency regularization with pseudo-labeling. At its core, MultiMatch features a pseudo-label weighting module designed for selecting and filtering pseudo-labels based on head agreement and model confidence, and weighting them according to the perceived classification difficulty. This novel module enhances and unifies three existing techniques -- heads agreement from Multihead Co-training, self-adaptive thresholds from FreeMatch, and Average Pseudo-Margins from MarginMatch -- resulting in a holistic approach that improves robustness and performance in SSL settings. Experimental results on benchmark datasets highlight the superior performance of MultiMatch, i.e., MultiMatch achieves state-of-the-art results on 8 out of 10 setups from 5 natural language processing datasets and ranks first according to the Friedman test among 21 methods. Furthermore, MultiMatch demonstrates exceptional robustness in highly imbalanced settings, outperforming the second-best approach by 3.26%, a critical advantage for real-world text classification tasks. Our code is available on GitHub.
comment: This is the camera-ready version of the paper, accepted for publication in the Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)
♻ ☆ Exploring the Hidden Capacity of LLMs for One-Step Text Generation EMNLP2025
A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts - up to thousands of tokens - via autoregressive generation from just one trained input embedding. In this work, we explore whether autoregressive decoding is essential for such reconstruction. We show that frozen LLMs can generate hundreds of accurate tokens in just one token-parallel forward pass, when provided with only two learned embeddings. This reveals a surprising and underexplored multi-token generation capability of autoregressive LLMs. We examine these embeddings and characterize the information they encode. We also empirically show that, although these representations are not unique for a given text, they form connected and local regions in embedding space - suggesting the potential to train a practical encoder. The existence of such representations hints that multi-token generation may be natively accessible in off-the-shelf LLMs via a learned input encoder, eliminating heavy retraining and helping to overcome the fundamental bottleneck of autoregressive decoding while reusing already-trained models.
comment: accepted to EMNLP2025 main
♻ ☆ GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations EMNLP 2025
We introduce GreekBarBench, a benchmark that evaluates LLMs on legal questions across five different legal areas from the Greek Bar exams, requiring citations to statutory articles and case facts. To tackle the challenges of free-text evaluation, we propose a three-dimensional scoring system combined with an LLM-as-a-judge approach. We also develop a meta-evaluation benchmark to assess the correlation between LLM-judges and human expert evaluations, revealing that simple, span-based rubrics improve their alignment. Our systematic evaluation of 13 proprietary and open-weight LLMs shows that even though the best models outperform average expert scores, they fall short of the 95th percentile of experts.
comment: 19 pages, 17 figures, accepted in EMNLP 2025
♻ ☆ VideoExplorer: Think With Videos For Agentic Long-Video Understanding
Long-video understanding~(LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of ``thinking with video'', which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository(https://github.com/yhy-2000/VideoDeepResearch).
♻ ☆ Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning
Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks. Additionally, few methods have developed automated pipelines to assist with the complex responsibilities of teachers leveraging MLLMs, largely due to model hallucination and instability, which lead to unreliable implementation. To address this gap, we introduce PBLBench, a novel benchmark designed to evaluate complex reasoning grounded in domain-specific knowledge and long-context understanding, thereby challenging models with tasks that closely resemble those handled by human experts. To establish reliable ground truth, we adopt the Analytic Hierarchy Process (AHP), utilizing expert-driven pairwise comparisons to derive structured and weighted evaluation criteria. We assess the performance of 15 leading MLLMs/LLMs using PBLBench and demonstrate that even the most advanced models achieve only 59% rank accuracy, underscoring the significant challenges presented by this benchmark. We believe PBLBench will serve as a catalyst for the development of more capable AI agents, ultimately aiming to alleviate teacher workload and enhance educational productivity.
♻ ☆ Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in Large Language Models NeurIPS 2025
Pre-training large language models (LLMs) on vast text corpora enhances natural language processing capabilities but risks encoding social biases, particularly gender bias. While parameter-modification methods like fine-tuning mitigate bias, they are resource-intensive, unsuitable for closed-source models, and lack adaptability to evolving societal norms. Instruction-based approaches offer flexibility but often compromise task performance. To address these limitations, we propose $\textbf{FaIRMaker}$, an automated and model-independent framework that employs an $\textbf{auto-search and refinement}$ paradigm to adaptively generate Fairwords, which act as instructions integrated into input queries to reduce gender bias and enhance response quality. Extensive experiments demonstrate that FaIRMaker automatically searches for and dynamically refines Fairwords, effectively mitigating gender bias while preserving task integrity and ensuring compatibility with both API-based and open-source LLMs.
comment: Accepted to NeurIPS 2025
♻ ☆ IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages? AACL 2025
Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately $\sim$47K sentences. Surprisingly, our probing analysis of surface, syntactic, and semantic properties reveals that while almost all multilingual models demonstrate consistent encoding performance for English, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages. We make our code and dataset publicly available [https://github.com/aforakhilesh/IndicBertology].
comment: 25 pages, 11 figures, Accepted at IJCNLP-AACL 2025 Findings
♻ ☆ Chain of Retrieval: Multi-Aspect Iterative Search Expansion and Post-Order Search Aggregation for Full Paper Retrieval
Scientific paper retrieval, particularly framed as document-to-document retrieval, aims to identify relevant papers in response to a long-form query paper, rather than a short query string. Previous approaches to this task have focused exclusively on abstracts, embedding them into dense vectors as surrogates for full documents and calculating similarity between them. Yet, abstracts offer only sparse and high-level summaries, and such methods primarily optimize one-to-one similarity, overlooking the dynamic relations that emerge among relevant papers during the retrieval process. To address this, we propose Chain of Retrieval(COR), a novel iterative framework for full-paper retrieval. Specifically, CoR decomposes each query paper into multiple aspect-specific views, matches them against segmented candidate papers, and iteratively expands the search by promoting top-ranked results as new queries, thereby forming a tree-structured retrieval process. The resulting retrieval tree is then aggregated in a post-order manner: descendants are first combined at the query level, then recursively merged with their parent nodes, to capture hierarchical relations across iterations. To validate this, we present SCIFULLBENCH, a large-scale benchmark providing both complete and segmented contexts of full papers for queries and candidates, and results show that CoR significantly outperforms existing retrieval baselines. Our code and dataset is available at https://github.com/psw0021/Chain-of-Retrieval.git.
♻ ☆ Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning
Large Language Models (LLMs) have shown impressive performance on complex tasks through Chain-of-Thought (CoT) reasoning. However, conventional CoT relies on explicitly verbalized intermediate steps, which constrains its broader applicability, particularly in abstract reasoning tasks beyond language. To address this, there has been growing research interest in \textit{latent CoT reasoning}, where the reasoning process is embedded within latent spaces. By decoupling reasoning from explicit language generation, latent CoT offers the promise of richer cognitive representations and facilitates more flexible, faster inference. This paper aims to present a comprehensive overview of this emerging paradigm and establish a systematic taxonomy. We analyze recent advances in methods, categorizing them from token-wise horizontal approaches to layer-wise vertical strategies. We then provide in-depth discussions of these methods, highlighting their design principles, applications, and remaining challenges. We hope that our survey provides a structured foundation for advancing this promising direction in LLM reasoning. The relevant papers will be regularly updated at https://github.com/EIT-NLP/Awesome-Latent-CoT.
♻ ☆ Abstraction Alignment: Comparing Model-Learned and Human-Encoded Conceptual Relationships
While interpretability methods identify a model's learned concepts, they overlook the relationships between concepts that make up its abstractions and inform its ability to generalize to new data. To assess whether models' have learned human-aligned abstractions, we introduce abstraction alignment, a methodology to compare model behavior against formal human knowledge. Abstraction alignment externalizes domain-specific human knowledge as an abstraction graph, a set of pertinent concepts spanning levels of abstraction. Using the abstraction graph as a ground truth, abstraction alignment measures the alignment of a model's behavior by determining how much of its uncertainty is accounted for by the human abstractions. By aggregating abstraction alignment across entire datasets, users can test alignment hypotheses, such as which human concepts the model has learned and where misalignments recur. In evaluations with experts, abstraction alignment differentiates seemingly similar errors, improves the verbosity of existing model-quality metrics, and uncovers improvements to current human abstractions.
comment: 20 pages, 7 figures, published in CHI 2025
♻ ☆ SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat NeurIPS 2025
We propose SPARTA ALIGNMENT, an algorithm to collectively align multiple LLMs through competition and combat. To complement a single model's lack of diversity in generation and biases in evaluation, multiple LLMs form a "sparta tribe" to compete against each other in fulfilling instructions while serving as judges for the competition of others. For each iteration, one instruction and two models are selected for a duel, the other models evaluate the two responses, and their evaluation scores are aggregated through a adapted elo-ranking based reputation system, where winners/losers of combat gain/lose weight in evaluating others. The peer-evaluated combat results then become preference pairs where the winning response is preferred over the losing one, and all models learn from these preferences at the end of each iteration. SPARTA ALIGNMENT enables the self-evolution of multiple LLMs in an iterative and collective competition process. Extensive experiments demonstrate that SPARTA ALIGNMENT outperforms initial models and 4 self-alignment baselines across 10 out of 12 tasks and datasets with 7.0% average improvement. Further analysis reveals that SPARTA ALIGNMENT generalizes more effectively to unseen tasks and leverages the expertise diversity of participating models to produce more logical, direct and informative outputs.
comment: NeurIPS 2025
♻ ☆ Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care NeurIPS 2025
Background: Medical coding structures healthcare data for research, quality monitoring, and policy. This study assesses the potential of large language models (LLMs) to assign ICPC-2 codes using the output of a domain-specific search engine. Methods: A dataset of 437 Brazilian Portuguese clinical expressions, each annotated with ICPC-2 codes, was used. A semantic search engine (OpenAI's text-embedding-3-large) retrieved candidates from 73,563 labeled concepts. Thirty-three LLMs were prompted with each query and retrieved results to select the best-matching ICPC-2 code. Performance was evaluated using F1-score, along with token usage, cost, response time, and format adherence. Results: Twenty-eight models achieved F1-score > 0.8; ten exceeded 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Retriever optimization can improve performance by up to 4 points. Most models returned valid codes in the expected format, with reduced hallucinations. Smaller models (<3B) struggled with formatting and input length. Conclusions: LLMs show strong potential for automating ICPC-2 coding, even without fine-tuning. This work offers a benchmark and highlights challenges, but findings are limited by dataset scope and setup. Broader, multilingual, end-to-end evaluations are needed for clinical validation.
comment: Accepted at NeurIPS 2025 as a poster presentation in The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance (https://openreview.net/forum?id=Kl7KZwJFEG). 33 pages, 10 figures (including appendix), 15 tables (including appendix). To be submitted to peer-reviewed journal. For associated code repository, see https://github.com/almeidava93/llm-as-code-selectors-paper
♻ ☆ The Language of Interoception: Examining Embodiment and Emotion Through a Corpus of Body Part Mentions
This paper is the first investigation of the connection between emotion, embodiment, and everyday language in a large sample of natural language data. We created corpora of body part mentions (BPMs) in online English text (blog posts and tweets). This includes a subset featuring human annotations for the emotions of the person whose body part is mentioned in the text. We show that BPMs are common in personal narratives and tweets (~5% to 10% of posts include BPMs) and that their usage patterns vary markedly by time and %geographic location. Using word-emotion association lexicons and our annotated data, we show that text containing BPMs tends to be more emotionally charged, even when the BPM is not explicitly used to describe a physical reaction to the emotion in the text. Finally, we discover a strong and statistically significant correlation between body-related language and a variety of poorer health outcomes. In sum, we argue that investigating the role of body-part related words in language can open up valuable avenues of future research at the intersection of NLP, the affective sciences, and the study of human wellbeing.
comment: 8 pages, 26 figures
♻ ☆ QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback
Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it remains underexplored in domains that require interaction with hardware devices, such as quantum programming, where human coders write Python code that is executed on a quantum computer. To address this gap, we introduce QCoder Benchmark, an evaluation framework that assesses LLMs on quantum programming with feedback from simulated hardware devices. Our benchmark offers two key features. First, it supports evaluation using a quantum simulator environment beyond conventional Python execution, allowing feedback of domain-specific metrics such as circuit depth, execution time, and error classification, which can be used to guide better generation. Second, it incorporates human-written code submissions collected from real programming contests, enabling both quantitative comparisons and qualitative analyses of LLM outputs against human-written codes. Our experiments reveal that even advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach up to 78% accuracy, outperforming averaged success rates of human-written codes (39.98%). We release the QCoder Benchmark dataset and public evaluation API to support further research. (Codes and datasets are available at https://qcoder-bench.github.io/ )
comment: Accepted to INLG2025
♻ ☆ Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting
The increase in computing power and the necessity of AI-assisted decision-making boost the growing application of large language models (LLMs). Along with this, the potential retention of sensitive data of LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: Aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs' reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression that attenuates attention to fact-bearing tokens without disrupting LLMs' linguistic structure; and (2) hallucination-resistant response shaping that discourages fabricated completions when queried about unlearning content. AS realizes these objectives through two attention-level interventions, which are importance-aware suppression applied to the unlearning set to reduce reliance on memorized knowledge and attention-guided retention enhancement that reinforces attention toward semantically essential tokens in the retained dataset to mitigate unintended degradation. These two components are jointly optimized via a dual-loss objective, which forms a soft boundary that localizes unlearning while preserving unrelated knowledge under representation superposition. Experimental results show that AS improves performance preservation over the state-of-the-art unlearning methods, achieving up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness. Compared to existing methods, AS demonstrates a superior balance between unlearning effectiveness, generalization, and response reliability.
comment: 22 pages, 10 figures
Information Retrieval 8
☆ Taxonomy-based Negative Sampling In Personalized Semantic Search for E-commerce
Large retail outlets offer products that may be domain-specific, and this requires having a model that can understand subtle differences in similar items. Sampling techniques used to train these models are most of the time, computationally expensive or logistically challenging. These models also do not factor in users' previous purchase patterns or behavior, thereby retrieving irrelevant items for them. We present a semantic retrieval model for e-commerce search that embeds queries and products into a shared vector space and leverages a novel taxonomy-based hard-negative sampling(TB-HNS) strategy to mine contextually relevant yet challenging negatives. To further tailor retrievals, we incorporate user-level personalization by modeling each customer's past purchase history and behavior. In offline experiments, our approach outperforms BM25, ANCE and leading neural baselines on Recall@K, while live A/B testing shows substantial uplifts in conversion rate, add-to-cart rate, and average order value. We also demonstrate that our taxonomy-driven negatives reduce training overhead and accelerate convergence, and we share practical lessons from deploying this system at scale.
comment: Accepted at 2025 IEEE International Conference on Big Data
☆ Object-Centric Analysis of XES Event Logs: Integrating OCED Modeling with SPARQL Queries
Object Centric Event Data (OCED) has gained attention in recent years within the field of process mining. However, there are still many challenges, such as connecting the XES format to object-centric approaches to enable more insightful analysis. It is important for a process miner to understand the insights and dependencies of events in the event log to see what is going on in our processes. In previous standards, the dependencies of event logs are only used to show events, but not their dependencies among each other and actions in detail as described in OCEDO. There is more information in the event log when it is revealed using the OCEDO model. It becomes more understandable and easier to grasp the concepts and deal with the processes. This paper proposes the use of Object-Centric Event Data Ontology (OCEDO) to overcome the limitations of the XES standard in event logs for process mining. We demonstrate how the OCEDO approach, integrated with SPARQL queries, can be applied to the BPIC 2013 dataset to make the relationships between events and objects more explicit. It describes dealing with the meta descriptions of the OCEDO model on a business process challenge as an event log. It improves the completeness and readability of process data, suggesting that object-centric modeling allows for richer analyses than traditional approaches.
comment: 12 pages, 4 figures, PROFES2025 conference
☆ Structurally Refined Graph Transformer for Multimodal Recommendation
Multimodal recommendation systems utilize various types of information, including images and text, to enhance the effectiveness of recommendations. The key challenge is predicting user purchasing behavior from the available data. Current recommendation models prioritize extracting multimodal information while neglecting the distinction between redundant and valuable data. They also rely heavily on a single semantic framework (e.g., local or global semantics), resulting in an incomplete or biased representation of user preferences, particularly those less expressed in prior interactions. Furthermore, these approaches fail to capture the complex interactions between users and items, limiting the model's ability to meet diverse users. To address these challenges, we present SRGFormer, a structurally optimized multimodal recommendation model. By modifying the transformer for better integration into our model, we capture the overall behavior patterns of users. Then, we enhance structural information by embedding multimodal information into a hypergraph structure to aid in learning the local structures between users and items. Meanwhile, applying self-supervised tasks to user-item collaborative signals enhances the integration of multimodal information, thereby revealing the representational features inherent to the data's modality. Extensive experiments on three public datasets reveal that SRGFormer surpasses previous benchmark models, achieving an average performance improvement of 4.47 percent on the Sports dataset. The code is publicly available online.
comment: Comment: 13 pages, 7 figures, accepted by IEEE Transactions on Multimedia 2025
☆ Listwise Preference Diffusion Optimization for User Behavior Trajectories Prediction
Forecasting multi-step user behavior trajectories requires reasoning over structured preferences across future actions, a challenge overlooked by traditional sequential recommendation. This problem is critical for applications such as personalized commerce and adaptive content delivery, where anticipating a user's complete action sequence enhances both satisfaction and business outcomes. We identify an essential limitation of existing paradigms: their inability to capture global, listwise dependencies among sequence items. To address this, we formulate User Behavior Trajectory Prediction (UBTP) as a new task setting that explicitly models long-term user preferences. We introduce Listwise Preference Diffusion Optimization (LPDO), a diffusion-based training framework that directly optimizes structured preferences over entire item sequences. LPDO incorporates a Plackett-Luce supervision signal and derives a tight variational lower bound aligned with listwise ranking likelihoods, enabling coherent preference generation across denoising steps and overcoming the independent-token assumption of prior diffusion methods. To rigorously evaluate multi-step prediction quality, we propose the task-specific metric Sequential Match (SeqMatch), which measures exact trajectory agreement, and adopt Perplexity (PPL), which assesses probabilistic fidelity. Extensive experiments on real-world user behavior benchmarks demonstrate that LPDO consistently outperforms state-of-the-art baselines, establishing a new benchmark for structured preference learning with diffusion models.
☆ LIR: The First Workshop on Late Interaction and Multi Vector Retrieval @ ECIR 2026 ECIR 2026
Late interaction retrieval methods, pioneered by ColBERT, have emerged as a powerful alternative to single-vector neural IR. By leveraging fine-grained, token-level representations, they have been demonstrated to deliver strong generalisation and robustness, particularly in out-of-domain settings. They have recently been shown to be particularly well-suited for novel use cases, such as reasoning-based or cross-modality retrieval. At the same time, these models pose significant challenges of efficiency, usability, and integrations into fully fledged systems; as well as the natural difficulties encountered while researching novel application domains. Recent years have seen rapid advances across many of these areas, but research efforts remain fragmented across communities and frequently exclude practitioners. The purpose of this workshop is to create an environment where all aspects of late interaction can be discussed, with a focus on early research explorations, real-world outcomes, and negative or puzzling results to be freely shared and discussed. The aim of LIR is to provide a highly-interactive environment for researchers from various backgrounds and practitioners to freely discuss their experience, fostering further collaboration.
comment: Accepted workshop at ECIR 2026
☆ PolyRecommender: A Multimodal Recommendation System for Polymer Discovery
We introduce PolyRecommender, a multimodal discovery framework that integrates chemical language representations from PolyBERT with molecular graph-based representations from a graph encoder. The system first retrieves candidate polymers using language-based similarity and then ranks them using fused multimodal embeddings according to multiple target properties. By leveraging the complementary knowledge encoded in both modalities, PolyRecommender enables efficient retrieval and robust ranking across related polymer properties. Our work establishes a generalizable multimodal paradigm, advancing AI-guided design for the discovery of next-generation polymers.
♻ ☆ OpinioRAG: Towards Generating User-Centric Opinion Highlights from Large-scale Online Reviews
We study the problem of opinion highlights generation from large volumes of user reviews, often exceeding thousands per entity, where existing methods either fail to scale or produce generic, one-size-fits-all summaries that overlook personalized needs. To tackle this, we introduce OpinioRAG, a scalable, training-free framework that combines RAG-based evidence retrieval with LLMs to efficiently produce tailored summaries. Additionally, we propose novel reference-free verification metrics designed for sentiment-rich domains, where accurately capturing opinions and sentiment alignment is essential. These metrics offer a fine-grained, context-sensitive assessment of factual consistency. To facilitate evaluation, we contribute the first large-scale dataset of long-form user reviews, comprising entities with over a thousand reviews each, paired with unbiased expert summaries and manually annotated queries. Through extensive experiments, we identify key challenges, provide actionable insights into improving systems, pave the way for future research, and position OpinioRAG as a robust framework for generating accurate, relevant, and structured summaries at scale.
comment: COLM 2025
♻ ☆ Chain of Retrieval: Multi-Aspect Iterative Search Expansion and Post-Order Search Aggregation for Full Paper Retrieval
Scientific paper retrieval, particularly framed as document-to-document retrieval, aims to identify relevant papers in response to a long-form query paper, rather than a short query string. Previous approaches to this task have focused exclusively on abstracts, embedding them into dense vectors as surrogates for full documents and calculating similarity between them. Yet, abstracts offer only sparse and high-level summaries, and such methods primarily optimize one-to-one similarity, overlooking the dynamic relations that emerge among relevant papers during the retrieval process. To address this, we propose Chain of Retrieval(COR), a novel iterative framework for full-paper retrieval. Specifically, CoR decomposes each query paper into multiple aspect-specific views, matches them against segmented candidate papers, and iteratively expands the search by promoting top-ranked results as new queries, thereby forming a tree-structured retrieval process. The resulting retrieval tree is then aggregated in a post-order manner: descendants are first combined at the query level, then recursively merged with their parent nodes, to capture hierarchical relations across iterations. To validate this, we present SCIFULLBENCH, a large-scale benchmark providing both complete and segmented contexts of full papers for queries and candidates, and results show that CoR significantly outperforms existing retrieval baselines. Our code and dataset is available at https://github.com/psw0021/Chain-of-Retrieval.git.
Multimedia 3
☆ Predicting Encoding Energy from Low-Pass Anchors for Green Video Streaming
Video streaming now represents the dominant share of Internet traffic, as ever-higher-resolution content is distributed across a growing range of heterogeneous devices to sustain user Quality of Experience (QoE). However, this trend raises significant concerns about energy efficiency and carbon emissions, requiring methods to provide a trade-off between energy and QoE. This paper proposes a lightweight energy prediction method that estimates the energy consumption of high-resolution video encodings using reference encodings generated at lower resolutions (so-called anchors), eliminating the need for exhaustive per-segment energy measurements, a process that is infeasible at scale. We automatically select encoding parameters, such as resolution and quantization parameter (QP), to achieve substantial energy savings while maintaining perceptual quality, as measured by the Video Multimethod Fusion Assessment (VMAF), within acceptable limits. We implement and evaluate our approach with the open-source VVenC encoder on 100 video sequences from the Inter4K dataset across multiple encoding settings. Results show that, for an average VMAF score reduction of only 1.68, which stays below the Just Noticeable Difference (JND) threshold, our method achieves 51.22% encoding energy savings and 53.54% decoding energy savings compared to a scenario with no quality degradation.
comment: 7 pages, 8 Figures, 4 tables, confernece paper
♻ ☆ A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift
Transformer-based architectures have recently advanced the image reconstruction quality of super-resolution (SR) models. Yet, their scalability remains limited by quadratic attention costs and coarse patch embeddings that weaken pixel-level fidelity. We propose TaylorIR, a plug-and-play framework that enforces 1x1 patch embeddings for true pixel-wise reasoning and replaces conventional self-attention with TaylorShift, a Taylor-series-based attention mechanism enabling full token interactions with near-linear complexity. Across multiple SR benchmarks, TaylorIR delivers state-of-the-art performance while reducing memory consumption by up to 60%, effectively bridging the gap between fine-grained detail restoration and efficient transformer scaling.
♻ ☆ Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning
Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks. Additionally, few methods have developed automated pipelines to assist with the complex responsibilities of teachers leveraging MLLMs, largely due to model hallucination and instability, which lead to unreliable implementation. To address this gap, we introduce PBLBench, a novel benchmark designed to evaluate complex reasoning grounded in domain-specific knowledge and long-context understanding, thereby challenging models with tasks that closely resemble those handled by human experts. To establish reliable ground truth, we adopt the Analytic Hierarchy Process (AHP), utilizing expert-driven pairwise comparisons to derive structured and weighted evaluation criteria. We assess the performance of 15 leading MLLMs/LLMs using PBLBench and demonstrate that even the most advanced models achieve only 59% rank accuracy, underscoring the significant challenges presented by this benchmark. We believe PBLBench will serve as a catalyst for the development of more capable AI agents, ultimately aiming to alleviate teacher workload and enhance educational productivity.
Image and Video Processing 12
☆ Been There, Scanned That: Nostalgia-Driven LiDAR Compression for Self-Driving Cars
An autonomous vehicle can generate several terabytes of sensor data per day. A significant portion of this data consists of 3D point clouds produced by depth sensors such as LiDARs. This data must be transferred to cloud storage, where it is utilized for training machine learning models or conducting analyses, such as forensic investigations in the event of an accident. To reduce network and storage costs, this paper introduces DejaView. Although prior work uses interframe redundancies to compress data, DejaView searches for and uses redundancies on larger temporal scales (days and months) for more effective compression. We designed DejaView with the insight that the operating area of autonomous vehicles is limited and that vehicles mostly traverse the same routes daily. Consequently, the 3D data they collect daily is likely similar to the data they have captured in the past. To capture this, the core of DejaView is a diff operation that compactly represents point clouds as delta w.r.t. 3D data from the past. Using two months of LiDAR data, an end-to-end implementation of DejaView can compress point clouds by a factor of 210 at a reconstruction error of only 15 cm.
☆ GDROS: A Geometry-Guided Dense Registration Framework for Optical-SAR Images under Large Geometric Transformations
Registration of optical and synthetic aperture radar (SAR) remote sensing images serves as a critical foundation for image fusion and visual navigation tasks. This task is particularly challenging because of their modal discrepancy, primarily manifested as severe nonlinear radiometric differences (NRD), geometric distortions, and noise variations. Under large geometric transformations, existing classical template-based and sparse keypoint-based strategies struggle to achieve reliable registration results for optical-SAR image pairs. To address these limitations, we propose GDROS, a geometry-guided dense registration framework leveraging global cross-modal image interactions. First, we extract cross-modal deep features from optical and SAR images through a CNN-Transformer hybrid feature extraction module, upon which a multi-scale 4D correlation volume is constructed and iteratively refined to establish pixel-wise dense correspondences. Subsequently, we implement a least squares regression (LSR) module to geometrically constrain the predicted dense optical flow field. Such geometry guidance mitigates prediction divergence by directly imposing an estimated affine transformation on the final flow predictions. Extensive experiments have been conducted on three representative datasets WHU-Opt-SAR dataset, OS dataset, and UBCv2 dataset with different spatial resolutions, demonstrating robust performance of our proposed method across different imaging resolutions. Qualitative and quantitative results show that GDROS significantly outperforms current state-of-the-art methods in all metrics. Our source code will be released at: https://github.com/Zi-Xuan-Sun/GDROS.
comment: To be published in IEEE Transactions on Geoscience and Remote Sensing (T-GRS) 2025
☆ Challenging DINOv3 Foundation Model under Low Inter-Class Variability: A Case Study on Fetal Brain Ultrasound
Purpose: This study provides the first comprehensive evaluation of foundation models in fetal ultrasound (US) imaging under low inter-class variability conditions. While recent vision foundation models such as DINOv3 have shown remarkable transferability across medical domains, their ability to discriminate anatomically similar structures has not been systematically investigated. We address this gap by focusing on fetal brain standard planes--transthalamic (TT), transventricular (TV), and transcerebellar (TC)--which exhibit highly overlapping anatomical features and pose a critical challenge for reliable biometric assessment. Methods: To ensure a fair and reproducible evaluation, all publicly available fetal ultrasound datasets were curated and aggregated into a unified multicenter benchmark, FetalUS-188K, comprising more than 188,000 annotated images from heterogeneous acquisition settings. DINOv3 was pretrained in a self-supervised manner to learn ultrasound-aware representations. The learned features were then evaluated through standardized adaptation protocols, including linear probing with frozen backbone and full fine-tuning, under two initialization schemes: (i) pretraining on FetalUS-188K and (ii) initialization from natural-image DINOv3 weights. Results: Models pretrained on fetal ultrasound data consistently outperformed those initialized on natural images, with weighted F1-score improvements of up to 20 percent. Domain-adaptive pretraining enabled the network to preserve subtle echogenic and structural cues crucial for distinguishing intermediate planes such as TV. Conclusion: Results demonstrate that generic foundation models fail to generalize under low inter-class variability, whereas domain-specific pretraining is essential to achieve robust and clinically reliable representations in fetal brain ultrasound imaging.
☆ Image-based ground distance detection for crop-residue-covered soil
Conservation agriculture features a soil surface covered with crop residues, which brings benefits of improving soil health and saving water. However, one significant challenge in conservation agriculture lies in precisely controlling the seeding depth on the soil covered with crop residues. This is constrained by the lack of ground distance information, since current distance measurement techniques, like laser, ultrasonic, or mechanical displacement sensors, are incapable of differentiating whether the distance information comes from the residue or the soil. This paper presents an image-based method to get the ground distance information for the crop-residues-covered soil. This method is performed with 3D camera and RGB camera, obtaining depth image and color image at the same time. The color image is used to distinguish the different areas of residues and soil and finally generates a mask image. The mask image is applied to the depth image so that only the soil area depth information can be used to calculate the ground distance, and residue areas can be recognized and excluded from ground distance detection. Experimentation shows that this distance measurement method is feasible for real-time implementation, and the measurement error is within plus or minus 3mm. It can be applied in conservation agriculture machinery for precision depth seeding, as well as other depth-control-demanding applications like transplant or tillage.
comment: under review at Computers and Electronics in Agriculture
☆ OmniTrack++: Omnidirectional Multi-Object Tracking by Learning Large-FoV Trajectory Feedback CVPR 2025
This paper investigates Multi-Object Tracking (MOT) in panoramic imagery, which introduces unique challenges including a 360{\deg} Field of View (FoV), resolution dilution, and severe view-dependent distortions. Conventional MOT methods designed for narrow-FoV pinhole cameras generalize unsatisfactorily under these conditions. To address panoramic distortion, large search space, and identity ambiguity under a 360{\deg} FoV, OmniTrack++ adopts a feedback-driven framework that progressively refines perception with trajectory cues. A DynamicSSM block first stabilizes panoramic features, implicitly alleviating geometric distortion. On top of normalized representations, FlexiTrack Instances use trajectory-informed feedback for flexible localization and reliable short-term association. To ensure long-term robustness, an ExpertTrack Memory consolidates appearance cues via a Mixture-of-Experts design, enabling recovery from fragmented tracks and reducing identity drift. Finally, a Tracklet Management module adaptively switches between end-to-end and tracking-by-detection modes according to scene dynamics, offering a balanced and scalable solution for panoramic MOT. To support rigorous evaluation, we establish the EmboTrack benchmark, a comprehensive dataset for panoramic MOT that includes QuadTrack, captured with a quadruped robot, and BipTrack, collected with a bipedal wheel-legged robot. Together, these datasets span wide-angle environments and diverse motion patterns, providing a challenging testbed for real-world panoramic perception. Extensive experiments on JRDB and EmboTrack demonstrate that OmniTrack++ achieves state-of-the-art performance, yielding substantial HOTA improvements of +25.5% on JRDB and +43.07% on QuadTrack over the original OmniTrack. Datasets and code will be made publicly available at https://github.com/xifen523/OmniTrack.
comment: Extended version of CVPR 2025 paper arXiv:2503.04565. Datasets and code will be made publicly available at https://github.com/xifen523/OmniTrack
☆ Investigating Label Bias and Representational Sources of Age-Related Disparities in Medical Segmentation
Algorithmic bias in medical imaging can perpetuate health disparities, yet its causes remain poorly understood in segmentation tasks. While fairness has been extensively studied in classification, segmentation remains underexplored despite its clinical importance. In breast cancer segmentation, models exhibit significant performance disparities against younger patients, commonly attributed to physiological differences in breast density. We audit the MAMA-MIA dataset, establishing a quantitative baseline of age-related bias in its automated labels, and reveal a critical Biased Ruler effect where systematically flawed labels for validation misrepresent a model's actual bias. However, whether this bias originates from lower-quality annotations (label bias) or from fundamentally more challenging image characteristics remains unclear. Through controlled experiments, we systematically refute hypotheses that the bias stems from label quality sensitivity or quantitative case difficulty imbalance. Balancing training data by difficulty fails to mitigate the disparity, revealing that younger patient cases are intrinsically harder to learn. We provide direct evidence that systemic bias is learned and amplified when training on biased, machine-generated labels, a critical finding for automated annotation pipelines. This work introduces a systematic framework for diagnosing algorithmic bias in medical segmentation and demonstrates that achieving fairness requires addressing qualitative distributional differences rather than merely balancing case counts.
comment: Submitted to ISBI 2026
☆ Towards Reliable Pediatric Brain Tumor Segmentation: Task-Specific nnU-Net Enhancements
Accurate segmentation of pediatric brain tumors in multi-parametric magnetic resonance imaging (mpMRI) is critical for diagnosis, treatment planning, and monitoring, yet faces unique challenges due to limited data, high anatomical variability, and heterogeneous imaging across institutions. In this work, we present an advanced nnU-Net framework tailored for BraTS 2025 Task-6 (PED), the largest public dataset of pre-treatment pediatric high-grade gliomas. Our contributions include: (1) a widened residual encoder with squeeze-and-excitation (SE) attention; (2) 3D depthwise separable convolutions; (3) a specificity-driven regularization term; and (4) small-scale Gaussian weight initialization. We further refine predictions with two postprocessing steps. Our models achieved first place on the Task-6 validation leaderboard, attaining lesion-wise Dice scores of 0.759 (CC), 0.967 (ED), 0.826 (ET), 0.910 (NET), 0.928 (TC) and 0.928 (WT).
♻ ☆ Autoadaptive Medical Segment Anything Model
Medical image segmentation is a key task in the imaging workflow, influencing many image-based decisions. Traditional, fully-supervised segmentation models rely on large amounts of labeled training data, typically obtained through manual annotation, which can be an expensive, time-consuming, and error-prone process. This signals a need for accurate, automatic, and annotation-efficient methods of training these models. We propose ADA-SAM (automated, domain-specific, and adaptive segment anything model), a novel multitask learning framework for medical image segmentation that leverages class activation maps from an auxiliary classifier to guide the predictions of the semi-supervised segmentation branch, which is based on the Segment Anything (SAM) framework. Additionally, our ADA-SAM model employs a novel gradient feedback mechanism to create a learnable connection between the segmentation and classification branches by using the segmentation gradients to guide and improve the classification predictions. We validate ADA-SAM on real-world clinical data collected during rehabilitation trials, and demonstrate that our proposed method outperforms both fully-supervised and semi-supervised baselines by double digits in limited label settings. Our code is available at: https://github.com/tbwa233/ADA-SAM.
comment: 11 pages, 2 figures, 3 tables
♻ ☆ BEN: Using Confidence-Guided Matting for Dichotomous Image Segmentation
Current approaches to dichotomous image segmentation (DIS) treat image matting and object segmentation as fundamentally different tasks. As improvements in image segmentation become increasingly challenging to achieve, combining image matting and grayscale segmentation techniques offers promising new directions for architectural innovation. Inspired by the possibility of aligning these two model tasks, we propose a new architectural approach for DIS called Confidence-Guided Matting (CGM). We created the first CGM model called Background Erase Network (BEN). BEN consists of two components: BEN Base for initial segmentation and BEN Refiner for confidence-based refinement. Our approach achieves substantial improvements over current state-of-the-art methods on the DIS5K validation dataset, demonstrating that matting-based refinement can significantly enhance segmentation quality. This work introduces a new paradigm for integrating matting and segmentation techniques, improving fine-grained object boundary prediction in computer vision.
comment: 6 pages, 2 figures, 3 tables, and 1 algorithms
♻ ☆ Joint Lossless Compression and Steganography for Medical Images via Large Language Models
Recently, large language models (LLMs) have driven promising progress in lossless image compression. However, directly adopting existing paradigms for medical images suffers from an unsatisfactory trade-off between compression performance and efficiency. Moreover, existing LLM-based compressors often overlook the security of the compression process, which is critical in modern medical scenarios. To this end, we propose a novel joint lossless compression and steganography framework. Inspired by bit plane slicing (BPS), we find it feasible to securely embed privacy messages into medical images in an invisible manner. Based on this insight, an adaptive modalities decomposition strategy is first devised to partition the entire image into two segments, providing global and local modalities for subsequent dual-path lossless compression. During this dual-path stage, we innovatively propose a segmented message steganography algorithm within the local modality path to ensure the security of the compression process. Coupled with the proposed anatomical priors-based low-rank adaptation (A-LoRA) fine-tuning strategy, extensive experimental results demonstrate the superiority of our proposed method in terms of compression ratios, efficiency, and security. The source code will be made publicly available.
♻ ☆ Keep It Real: Challenges in Attacking Compression-Based Adversarial Purification NeurIPS 2025
Previous work has suggested that preprocessing images through lossy compression can defend against adversarial perturbations, but comprehensive attack evaluations have been lacking. In this paper, we construct strong white-box and adaptive attacks against various compression models and identify a critical challenge for attackers: high realism in reconstructed images significantly increases attack difficulty. Through rigorous evaluation across multiple attack scenarios, we demonstrate that compression models capable of producing realistic, high-fidelity reconstructions are substantially more resistant to our attacks. In contrast, low-realism compression models can be broken. Our analysis reveals that this is not due to gradient masking. Rather, realistic reconstructions maintaining distributional alignment with natural images seem to offer inherent robustness. This work highlights a significant obstacle for future adversarial attacks and suggests that developing more effective techniques to overcome realism represents an essential challenge for comprehensive security evaluation.
comment: Accepted to the Reliable ML from Unreliable Data workshop at NeurIPS 2025 (ReliableML@NeurIPS)
♻ ☆ FIPER: Factorized Features for Robust Image Super-Resolution and Compression NeurIPS 2025
In this work, we propose using a unified representation, termed Factorized Features, for low-level vision tasks, where we test on Single Image Super-Resolution (SISR) and \textbf{Image Compression}. Motivated by the shared principles between these tasks, they require recovering and preserving fine image details, whether by enhancing resolution for SISR or reconstructing compressed data for Image Compression. Unlike previous methods that mainly focus on network architecture, our proposed approach utilizes a basis-coefficient decomposition as well as an explicit formulation of frequencies to capture structural components and multi-scale visual features in images, which addresses the core challenges of both tasks. We replace the representation of prior models from simple feature maps with Factorized Features to validate the potential for broad generalizability. In addition, we further optimize the compression pipeline by leveraging the mergeable-basis property of our Factorized Features, which consolidates shared structures on multi-frame compression. Extensive experiments show that our unified representation delivers state-of-the-art performance, achieving an average relative improvement of 204.4% in PSNR over the baseline in Super-Resolution (SR) and 9.35% BD-rate reduction in Image Compression compared to the previous SOTA. Project page: https://jayisaking.github.io/FIPER/
comment: NeurIPS 2025. Project page: https://jayisaking.github.io/FIPER/
Computer Vision and Pattern Recognition 120
☆ LifWavNet: Lifting Wavelet-based Network for Non-contact ECG Reconstruction from Radar
Non-contact electrocardiogram (ECG) reconstruction from radar signals offers a promising approach for unobtrusive cardiac monitoring. We present LifWavNet, a lifting wavelet network based on a multi-resolution analysis and synthesis (MRAS) model for radar-to-ECG reconstruction. Unlike prior models that use fixed wavelet approaches, LifWavNet employs learnable lifting wavelets with lifting and inverse lifting units to adaptively capture radar signal features and synthesize physiologically meaningful ECG waveforms. To improve reconstruction fidelity, we introduce a multi-resolution short-time Fourier transform (STFT) loss, that enforces consistency with the ground-truth ECG in both temporal and spectral domains. Evaluations on two public datasets demonstrate that LifWavNet outperforms state-of-the-art methods in ECG reconstruction and downstream vital sign estimation (heart rate and heart rate variability). Furthermore, intermediate feature visualization highlights the interpretability of multi-resolution decomposition and synthesis in radar-to-ECG reconstruction. These results establish LifWavNet as a robust framework for radar-based non-contact ECG measurement.
☆ Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Distribution Matching Distillation (DMD) distills score-based generative models into efficient one-step generators, without requiring a one-to-one correspondence with the sampling trajectories of their teachers. However, limited model capacity causes one-step distilled models underperform on complex generative tasks, e.g., synthesizing intricate object motions in text-to-video generation. Directly extending DMD to multi-step distillation increases memory usage and computational depth, leading to instability and reduced efficiency. While prior works propose stochastic gradient truncation as a potential solution, we observe that it substantially reduces the generation diversity of multi-step distilled models, bringing it down to the level of their one-step counterparts. To address these limitations, we propose Phased DMD, a multi-step distillation framework that bridges the idea of phase-wise distillation with Mixture-of-Experts (MoE), reducing learning difficulty while enhancing model capacity. Phased DMD is built upon two key ideas: progressive distribution matching and score matching within subintervals. First, our model divides the SNR range into subintervals, progressively refining the model to higher SNR levels, to better capture complex distributions. Next, to ensure the training objective within each subinterval is accurate, we have conducted rigorous mathematical derivations. We validate Phased DMD by distilling state-of-the-art image and video generation models, including Qwen-Image (20B parameters) and Wan2.2 (28B parameters). Experimental results demonstrate that Phased DMD preserves output diversity better than DMD while retaining key generative capabilities. We will release our code and models.
☆ PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
Recent advances in vision-language models (VLMs) have enabled impressive multimodal reasoning, yet most medical applications remain limited to 2D imaging. In this work, we extend VLMs to 3D positron emission tomography and computed tomography (PET/CT), a domain characterized by large volumetric data, small and dispersed lesions, and lengthy radiology reports. We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams, extracted via a hybrid rule-based and large language model (LLM) pipeline. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. PETAR bridges global contextual reasoning with fine-grained lesion awareness, producing clinically coherent and localized findings. Comprehensive automated and human evaluations demonstrate that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.
☆ Dark-Field X-Ray Imaging Significantly Improves Deep-Learning based Detection of Synthetic Early-Stage Lung Tumors in Preclinical Models
Low-dose computed tomography (LDCT) is the current standard for lung cancer screening, yet its adoption and accessibility remain limited. Many regions lack LDCT infrastructure, and even among those screened, early-stage cancer detection often yield false positives, as shown in the National Lung Screening Trial (NLST) with a sensitivity of 93.8 percent and a false-positive rate of 26.6 percent. We aim to investigate whether X-ray dark-field imaging (DFI) radiograph, a technique sensitive to small-angle scatter from alveolar microstructure and less susceptible to organ shadowing, can significantly improve early-stage lung tumor detection when coupled with deep-learning segmentation. Using paired attenuation (ATTN) and DFI radiograph images of euthanized mouse lungs, we generated realistic synthetic tumors with irregular boundaries and intensity profiles consistent with physical lung contrast. A U-Net segmentation network was trained on small patches using either ATTN, DFI, or a combination of ATTN and DFI channels.Results show that the DFI-only model achieved a true-positive detection rate of 83.7 percent, compared with 51 percent for ATTN-only, while maintaining comparable specificity (90.5 versus 92.9 percent). The combined ATTN and DFI input achieved 79.6 percent sensitivity and 97.6 percent specificity. In conclusion, DFI substantially improves early-tumor detectability in comparison to standard attenuation radiography and shows potential as an accessible, low-cost, low-dose alternative for pre-clinical or limited-resource screening where LDCT is unavailable.
☆ Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes
Person re-identification (ReID) in surveillance is challenged by occlusion, viewpoint distortion, and poor image quality. Most existing methods rely on complex modules or perform well only on clear frontal images. We propose Sh-ViT (Shuffling Vision Transformer), a lightweight and robust model for occluded person ReID. Built on ViT-Base, Sh-ViT introduces three components: First, a Shuffle module in the final Transformer layer to break spatial correlations and enhance robustness to occlusion and blur; Second, scenario-adapted augmentation (geometric transforms, erasing, blur, and color adjustment) to simulate surveillance conditions; Third, DeiT-based knowledge distillation to improve learning with limited labels.To support real-world evaluation, we construct the MyTT dataset, containing over 10,000 pedestrians and 30,000+ images from base station inspections, with frequent equipment occlusion and camera variations. Experiments show that Sh-ViT achieves 83.2% Rank-1 and 80.1% mAP on MyTT, outperforming CNN and ViT baselines, and 94.6% Rank-1 and 87.5% mAP on Market1501, surpassing state-of-the-art methods.In summary, Sh-ViT improves robustness to occlusion and blur without external modules, offering a practical solution for surveillance-based personnel monitoring.
comment: 12 pages,conference
☆ Deep learning denoising unlocks quantitative insights in operando materials microscopy
Operando microscopy provides direct insight into the dynamic chemical and physical processes that govern functional materials, yet measurement noise limits the effective resolution and undermines quantitative analysis. Here, we present a general framework for integrating unsupervised deep learning-based denoising into quantitative microscopy workflows across modalities and length scales. Using simulated data, we demonstrate that deep denoising preserves physical fidelity, introduces minimal bias, and reduces uncertainty in model learning with partial differential equation (PDE)-constrained optimization. Applied to experiments, denoising reveals nanoscale chemical and structural heterogeneity in scanning transmission X-ray microscopy (STXM) of lithium iron phosphate (LFP), enables automated particle segmentation and phase classification in optical microscopy of graphite electrodes, and reduces noise-induced variability by nearly 80% in neutron radiography to resolve heterogeneous lithium transport. Collectively, these results establish deep denoising as a powerful, modality-agnostic enhancement that advances quantitative operando imaging and extends the reach of previously noise-limited techniques.
☆ Imbalanced Classification through the Lens of Spurious Correlations
Class imbalance poses a fundamental challenge in machine learning, frequently leading to unreliable classification performance. While prior methods focus on data- or loss-reweighting schemes, we view imbalance as a data condition that amplifies Clever Hans (CH) effects by underspecification of minority classes. In a counterfactual explanations-based approach, we propose to leverage Explainable AI to jointly identify and eliminate CH effects emerging under imbalance. Our method achieves competitive classification performance on three datasets and demonstrates how CH effects emerge under imbalance, a perspective largely overlooked by existing approaches.
☆ Gaussian Combined Distance: A Generic Metric for Object Detection
In object detection, a well-defined similarity metric can significantly enhance model performance. Currently, the IoU-based similarity metric is the most commonly preferred choice for detectors. However, detectors using IoU as a similarity metric often perform poorly when detecting small objects because of their sensitivity to minor positional deviations. To address this issue, recent studies have proposed the Wasserstein Distance as an alternative to IoU for measuring the similarity of Gaussian-distributed bounding boxes. However, we have observed that the Wasserstein Distance lacks scale invariance, which negatively impacts the model's generalization capability. Additionally, when used as a loss function, its independent optimization of the center attributes leads to slow model convergence and unsatisfactory detection precision. To address these challenges, we introduce the Gaussian Combined Distance (GCD). Through analytical examination of GCD and its gradient, we demonstrate that GCD not only possesses scale invariance but also facilitates joint optimization, which enhances model localization performance. Extensive experiments on the AI-TOD-v2 dataset for tiny object detection show that GCD, as a bounding box regression loss function and label assignment metric, achieves state-of-the-art performance across various detectors. We further validated the generalizability of GCD on the MS-COCO-2017 and Visdrone-2019 datasets, where it outperforms the Wasserstein Distance across diverse scales of datasets. Code is available at https://github.com/MArKkwanGuan/mmdet-GCD.
comment: This paper is accepted by the GRSL in 2025
☆ NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception NeurIPS 2025
Collaborative perception improves task performance by expanding the perception range through information sharing among agents. . Immutable heterogeneity poses a significant challenge in collaborative perception, as participating agents may employ different and fixed perception models. This leads to domain gaps in the intermediate features shared among agents, consequently degrading collaborative performance. Aligning the features of all agents to a common representation can eliminate domain gaps with low training cost. However, in existing methods, the common representation is designated as the representation of a specific agent, making it difficult for agents with significant domain discrepancies from this specific agent to achieve proper alignment. This paper proposes NegoCollab, a heterogeneous collaboration method based on the negotiated common representation. It introduces a negotiator during training to derive the common representation from the local representations of each modality's agent, effectively reducing the inherent domain gap with the various local representations. In NegoCollab, the mutual transformation of features between the local representation space and the common representation space is achieved by a pair of sender and receiver. To better align local representations to the common representation containing multimodal information, we introduce structural alignment loss and pragmatic alignment loss in addition to the distribution alignment loss to supervise the training. This enables the knowledge in the common representation to be fully distilled into the sender.
comment: 19 pages, Accepted by NeurIPS 2025
☆ VessShape: Few-shot 2D blood vessel segmentation by leveraging shape priors from synthetic images
Semantic segmentation of blood vessels is an important task in medical image analysis, but its progress is often hindered by the scarcity of large annotated datasets and the poor generalization of models across different imaging modalities. A key aspect is the tendency of Convolutional Neural Networks (CNNs) to learn texture-based features, which limits their performance when applied to new domains with different visual characteristics. We hypothesize that leveraging geometric priors of vessel shapes, such as their tubular and branching nature, can lead to more robust and data-efficient models. To investigate this, we introduce VessShape, a methodology for generating large-scale 2D synthetic datasets designed to instill a shape bias in segmentation models. VessShape images contain procedurally generated tubular geometries combined with a wide variety of foreground and background textures, encouraging models to learn shape cues rather than textures. We demonstrate that a model pre-trained on VessShape images achieves strong few-shot segmentation performance on two real-world datasets from different domains, requiring only four to ten samples for fine-tuning. Furthermore, the model exhibits notable zero-shot capabilities, effectively segmenting vessels in unseen domains without any target-specific training. Our results indicate that pre-training with a strong shape bias can be an effective strategy to overcome data scarcity and improve model generalization in blood vessel segmentation.
☆ Sketch-to-Layout: Sketch-Guided Multimodal Layout Generation ICCV 2025
Graphic layout generation is a growing research area focusing on generating aesthetically pleasing layouts ranging from poster designs to documents. While recent research has explored ways to incorporate user constraints to guide the layout generation, these constraints often require complex specifications which reduce usability. We introduce an innovative approach exploiting user-provided sketches as intuitive constraints and we demonstrate empirically the effectiveness of this new guidance method, establishing the sketch-to-layout problem as a promising research direction, which is currently under-explored. To tackle the sketch-to-layout problem, we propose a multimodal transformer-based solution using the sketch and the content assets as inputs to produce high quality layouts. Since collecting sketch training data from human annotators to train our model is very costly, we introduce a novel and efficient method to synthetically generate training sketches at scale. We train and evaluate our model on three publicly available datasets: PubLayNet, DocLayNet and SlidesVQA, demonstrating that it outperforms state-of-the-art constraint-based methods, while offering a more intuitive design experience. In order to facilitate future sketch-to-layout research, we release O(200k) synthetically-generated sketches for the public datasets above. The datasets are available at https://github.com/google-deepmind/sketch_to_layout.
comment: 15 pages, 18 figures, GitHub link: https://github.com/google-deepmind/sketch_to_layout, accept at ICCV 2025 Workshop (HiGen)
☆ Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning
Multimodal large language models (MLLMs) have advanced embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs. However, such vision driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy. We introduce BEAT, the first framework to inject such visual backdoors into MLLM-based embodied agents using objects in the environments as triggers. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation. Across various embodied agent benchmarks and MLLMs, BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance, and generalizes reliably to out-of-distribution trigger placements. Notably, compared to naive SFT, CTL boosts backdoor activation accuracy up to 39% under limited backdoor data. These findings expose a critical yet unexplored security risk in MLLM-based embodied agents, underscoring the need for robust defenses before real-world deployment.
☆ Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Based on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, where action and vision tokens evolve asynchronously at different rates. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, while our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.
comment: 20 pages, 10 figures
☆ Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
comment: preprint
☆ Who Made This? Fake Detection and Source Attribution with Diffusion Features
The rapid progress of generative diffusion models has enabled the creation of synthetic images that are increasingly difficult to distinguish from real ones, raising concerns about authenticity, copyright, and misinformation. Existing supervised detectors often struggle to generalize across unseen generators, requiring extensive labeled data and frequent retraining. We introduce FRIDA (Fake-image Recognition and source Identification via Diffusion-features Analysis), a lightweight framework that leverages internal activations from a pre-trained diffusion model for deepfake detection and source generator attribution. A k-nearest-neighbor classifier applied to diffusion features achieves state-of-the-art cross-generator performance without fine-tuning, while a compact neural model enables accurate source attribution. These results show that diffusion representations inherently encode generator-specific patterns, providing a simple and interpretable foundation for synthetic image forensics.
☆ ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning
Neural networks have changed the way machines interpret the world. At their core, they learn by following gradients, adjusting their parameters step by step until they identify the most discriminant patterns in the data. This process gives them their strength, yet it also opens the door to a hidden flaw. The very gradients that help a model learn can also be used to produce small, imperceptible tweaks that cause the model to completely alter its decision. Such tweaks are called adversarial attacks. These attacks exploit this vulnerability by adding tiny, imperceptible changes to images that, while leaving them identical to the human eye, cause the model to make wrong predictions. In this work, we propose Adversarially-trained Contrastive Hard-mining for Optimized Robustness (ANCHOR), a framework that leverages the power of supervised contrastive learning with explicit hard positive mining to enable the model to learn representations for images such that the embeddings for the images, their augmentations, and their perturbed versions cluster together in the embedding space along with those for other images of the same class while being separated from images of other classes. This alignment helps the model focus on stable, meaningful patterns rather than fragile gradient cues. On CIFAR-10, our approach achieves impressive results for both clean and robust accuracy under PGD-20 (epsilon = 0.031), outperforming standard adversarial training methods. Our results indicate that combining adversarial guidance with hard-mined contrastive supervision helps models learn more structured and robust representations, narrowing the gap between accuracy and robustness.
comment: 11 pages, 1 figure
☆ Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it particularly well-for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
☆ Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.
☆ MapSAM2: Adapting SAM2 for Automatic Segmentation of Historical Map Images and Time Series
Historical maps are unique and valuable archives that document geographic features across different time periods. However, automated analysis of historical map images remains a significant challenge due to their wide stylistic variability and the scarcity of annotated training data. Constructing linked spatio-temporal datasets from historical map time series is even more time-consuming and labor-intensive, as it requires synthesizing information from multiple maps. Such datasets are essential for applications such as dating buildings, analyzing the development of road networks and settlements, studying environmental changes etc. We present MapSAM2, a unified framework for automatically segmenting both historical map images and time series. Built on a visual foundation model, MapSAM2 adapts to diverse segmentation tasks with few-shot fine-tuning. Our key innovation is to treat both historical map images and time series as videos. For images, we process a set of tiles as a video, enabling the memory attention mechanism to incorporate contextual cues from similar tiles, leading to improved geometric accuracy, particularly for areal features. For time series, we introduce the annotated Siegfried Building Time Series Dataset and, to reduce annotation costs, propose generating pseudo time series from single-year maps by simulating common temporal transformations. Experimental results show that MapSAM2 learns temporal associations effectively and can accurately segment and link buildings in time series under limited supervision or using pseudo videos. We will release both our dataset and code to support future research.
☆ Deep Neural Watermarking for Robust Copyright Protection in 3D Point Clouds
The protection of intellectual property has become critical due to the rapid growth of three-dimensional content in digital media. Unlike traditional images or videos, 3D point clouds present unique challenges for copyright enforcement, as they are especially vulnerable to a range of geometric and non-geometric attacks that can easily degrade or remove conventional watermark signals. In this paper, we address these challenges by proposing a robust deep neural watermarking framework for 3D point cloud copyright protection and ownership verification. Our approach embeds binary watermarks into the singular values of 3D point cloud blocks using spectral decomposition, i.e. Singular Value Decomposition (SVD), and leverages the extraction capabilities of Deep Learning using PointNet++ neural network architecture. The network is trained to reliably extract watermarks even after the data undergoes various attacks such as rotation, scaling, noise, cropping and signal distortions. We validated our method using the publicly available ModelNet40 dataset, demonstrating that deep learning-based extraction significantly outperforms traditional SVD-based techniques under challenging conditions. Our experimental evaluation demonstrates that the deep learning-based extraction approach significantly outperforms existing SVD-based methods with deep learning achieving bitwise accuracy up to 0.83 and Intersection over Union (IoU) of 0.80, compared to SVD achieving a bitwise accuracy of 0.58 and IoU of 0.26 for the Crop (70%) attack, which is the most severe geometric distortion in our experiment. This demonstrates our method's ability to achieve superior watermark recovery and maintain high fidelity even under severe distortions.
☆ Context-Gated Cross-Modal Perception with Visual Mamba for PET-CT Lung Tumor Segmentation
Accurate lung tumor segmentation is vital for improving diagnosis and treatment planning, and effectively combining anatomical and functional information from PET and CT remains a major challenge. In this study, we propose vMambaX, a lightweight multimodal framework integrating PET and CT scan images through a Context-Gated Cross-Modal Perception Module (CGM). Built on the Visual Mamba architecture, vMambaX adaptively enhances inter-modality feature interaction, emphasizing informative regions while suppressing noise. Evaluated on the PCLT20K dataset, the model outperforms baseline models while maintaining lower computational complexity. These results highlight the effectiveness of adaptive cross-modal gating for multimodal tumor segmentation and demonstrate the potential of vMambaX as an efficient and scalable framework for advanced lung cancer analysis. The code is available at https://github.com/arco-group/vMambaX.
☆ NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding NeurIPS 2025
Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, etc. We study the underwater scene understanding methods, which aim to achieve automated underwater exploration. The underwater scene understanding task demands multi-task perceptions from multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders the progress of this research. To bridge this gap, we construct NautData, a dataset containing 1.45 M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of the underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into renowned baselines LLaVA-1.5 and Qwen2.5-VL and build our underwater LMM, NAUTILUS. Experiments conducted on the NautData and public underwater datasets demonstrate the effectiveness of the VFE module, consistently improving the performance of both baselines on the majority of supported tasks, thus ensuring the superiority of NAUTILUS in the underwater scene understanding area. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS.
comment: Accepted to NeurIPS 2025. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS
☆ Referee: Reference-aware Audiovisual Deepfake Detection
Since deepfakes generated by advanced generative models have rapidly posed serious threats, existing audiovisual deepfake detection approaches struggle to generalize to unseen forgeries. We propose a novel reference-aware audiovisual deepfake detection method, called Referee. Speaker-specific cues from only one-shot examples are leveraged to detect manipulations beyond spatiotemporal artifacts. By matching and aligning identity-related queries from reference and target content into cross-modal features, Referee jointly reasons about audiovisual synchrony and identity consistency. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art performance on cross-dataset and cross-language evaluation protocols. Experimental results highlight the importance of cross-modal identity verification for future deepfake detection. The code is available at https://github.com/ewha-mmai/referee.
comment: In Progress
☆ A Multi-tiered Human-in-the-loop Approach for Interactive School Mapping Using Earth Observation and Machine Learning
This paper presents a multi-tiered human-in-the-loop framework for interactive school mapping designed to improve the accuracy and completeness of educational facility records, particularly in developing regions where such data may be scarce and infrequently updated. The first tier involves a machine learning based analysis of population density, land cover, and existing infrastructure compared with known school locations. The first tier identifies potential gaps and "mislabelled" schools. In subsequent tiers, medium-resolution satellite imagery (Sentinel-2) is investigated to pinpoint regions with a high likelihood of school presence, followed by the application of very high-resolution (VHR) imagery and deep learning models to generate detailed candidate locations for schools within these prioritised areas. The medium-resolution approach was later removed due to insignificant improvements. The medium and VHR resolution models build upon global pre-trained steps to improve generalisation. A key component of the proposed approach is an interactive interface to allow human operators to iteratively review, validate, and refine the mapping results. Preliminary evaluations indicate that the multi-tiered strategy provides a scalable and cost-effective solution for educational infrastructure mapping to support planning and resource allocation.
☆ From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration
Scientific illustrations demand both high information density and post-editability. However, current generative models have two major limitations: Frist, image generation models output rasterized images lacking semantic structure, making it impossible to access, edit, or rearrange independent visual components in the images. Second, code-based generation methods (TikZ or SVG), although providing element-level control, force users into the cumbersome cycle of "writing-compiling-reviewing" and lack the intuitiveness of manipulation. Neither of these two approaches can well meet the needs for efficiency, intuitiveness, and iterative modification in scientific creation. To bridge this gap, we introduce VisPainter, a multi-agent framework for scientific illustration built upon the model context protocol. VisPainter orchestrates three specialized modules-a Manager, a Designer, and a Toolbox-to collaboratively produce diagrams compatible with standard vector graphics software. This modular, role-based design allows each element to be explicitly represented and manipulated, enabling true element-level control and any element can be added and modified later. To systematically evaluate the quality of scientific illustrations, we introduce VisBench, a benchmark with seven-dimensional evaluation metrics. It assesses high-information-density scientific illustrations from four aspects: content, layout, visual perception, and interaction cost. To this end, we conducted extensive ablation experiments to verify the rationality of our architecture and the reliability of our evaluation methods. Finally, we evaluated various vision-language models, presenting fair and credible model rankings along with detailed comparisons of their respective capabilities. Additionally, we isolated and quantified the impacts of role division, step control,and description on the quality of illustrations.
☆ CoMViT: An Efficient Vision Backbone for Supervised Classification in Medical Imaging MICCAI 2025
Vision Transformers (ViTs) have demonstrated strong potential in medical imaging; however, their high computational demands and tendency to overfit on small datasets limit their applicability in real-world clinical scenarios. In this paper, we present CoMViT, a compact and generalizable Vision Transformer architecture optimized for resource-constrained medical image analysis. CoMViT integrates a convolutional tokenizer, diagonal masking, dynamic temperature scaling, and pooling-based sequence aggregation to improve performance and generalization. Through systematic architectural optimization, CoMViT achieves robust performance across twelve MedMNIST datasets while maintaining a lightweight design with only ~4.5M parameters. It matches or outperforms deeper CNN and ViT variants, offering up to 5-20x parameter reduction without sacrificing accuracy. Qualitative Grad-CAM analyses show that CoMViT consistently attends to clinically relevant regions despite its compact size. These results highlight the potential of principled ViT redesign for developing efficient and interpretable models in low-resource medical imaging settings.
comment: Preprint (submitted manuscript). Accepted at the MICCAI 2025 MIRASOL Workshop; to appear in the Springer proceedings volume. This is the pre-review version (not the Version of Record). DOI will be added after publication. [Optional: 8 pages, 4 figures, 4 tables.]
☆ DeblurSDI: Blind Image Deblurring Using Self-diffusion
Blind image deconvolution is a challenging ill-posed inverse problem, where both the latent sharp image and the blur kernel are unknown. Traditional methods often rely on handcrafted priors, while modern deep learning approaches typically require extensive pre-training on large external datasets, limiting their adaptability to real-world scenarios. In this work, we propose DeblurSDI, a zero-shot, self-supervised framework based on self-diffusion (SDI) that requires no prior training. DeblurSDI formulates blind deconvolution as an iterative reverse self-diffusion process that starts from pure noise and progressively refines the solution. At each step, two randomly-initialized neural networks are optimized continuously to refine the sharp image and the blur kernel. The optimization is guided by an objective function combining data consistency with a sparsity-promoting L1-norm for the kernel. A key innovation is our noise scheduling mechanism, which stabilizes the optimization and provides remarkable robustness to variations in blur kernel size. These allow DeblurSDI to dynamically learn an instance-specific prior tailored to the input image. Extensive experiments demonstrate that DeblurSDI consistently achieves superior performance, recovering sharp images and accurate kernels even in highly degraded scenarios.
☆ Mitigating Semantic Collapse in Partially Relevant Video Retrieval NeurIPS 2025
Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all others as negatives, ignoring the rich semantic variation both within a single video and across different videos. Consequently, embeddings of both queries and their corresponding video-clip segments for distinct events within the same video collapse together, while embeddings of semantically similar queries and segments from different videos are driven apart. This limits retrieval performance when videos contain multiple, diverse events. This paper addresses the aforementioned problems, termed as semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships encoded by the foundation model across text queries. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. Subsequently, we introduce order-preserving token merging and adaptive CBVA to enhance alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.
comment: Accpeted to NeurIPS 2025. Code is available at https://github.com/admins97/MSC_PRVR
☆ Who Does Your Algorithm Fail? Investigating Age and Ethnic Bias in the MAMA-MIA Dataset NeurIPS
Deep learning models aim to improve diagnostic workflows, but fairness evaluation remains underexplored beyond classification, e.g., in image segmentation. Unaddressed segmentation bias can lead to disparities in the quality of care for certain populations, potentially compounded across clinical decision points and amplified through iterative model development. Here, we audit the fairness of the automated segmentation labels provided in the breast cancer tumor segmentation dataset MAMA-MIA. We evaluate automated segmentation quality across age, ethnicity, and data source. Our analysis reveals an intrinsic age-related bias against younger patients that continues to persist even after controlling for confounding factors, such as data source. We hypothesize that this bias may be linked to physiological factors, a known challenge for both radiologists and automated systems. Finally, we show how aggregating data from multiple data sources influences site-specific ethnic biases, underscoring the necessity of investigating data at a granular level.
comment: Medical Imaging Meets EurIPS (NeurIPS-endorsed workshop) - MedEurIPS
☆ A Hybrid Deep Learning and Forensic Approach for Robust Deepfake Detection ACSA
The rapid evolution of generative adversarial networks (GANs) and diffusion models has made synthetic media increasingly realistic, raising societal concerns around misinformation, identity fraud, and digital trust. Existing deepfake detection methods either rely on deep learning, which suffers from poor generalization and vulnerability to distortions, or forensic analysis, which is interpretable but limited against new manipulation techniques. This study proposes a hybrid framework that fuses forensic features, including noise residuals, JPEG compression traces, and frequency-domain descriptors, with deep learning representations from convolutional neural networks (CNNs) and vision transformers (ViTs). Evaluated on benchmark datasets (FaceForensics++, Celeb-DF v2, DFDC), the proposed model consistently outperformed single-method baselines and demonstrated superior performance compared to existing state-of-the-art hybrid approaches, achieving F1-scores of 0.96, 0.82, and 0.77, respectively. Robustness tests demonstrated stable performance under compression (F1 = 0.87 at QF = 50), adversarial perturbations (AUC = 0.84), and unseen manipulations (F1 = 0.79). Importantly, explainability analysis showed that Grad-CAM and forensic heatmaps overlapped with ground-truth manipulated regions in 82 percent of cases, enhancing transparency and user trust. These findings confirm that hybrid approaches provide a balanced solution, combining the adaptability of deep models with the interpretability of forensic cues, to develop resilient and trustworthy deepfake detection systems.
comment: 11 pages, 13 figures, 9 tables, Published with International Journal of Advanced Computer Science and Applications (IJACSA)
☆ Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds
Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
☆ Fine-Tuning Open Video Generators for Cinematic Scene Synthesis: A Small-Data Pipeline with LoRA and Wan2.1 I2V
We present a practical pipeline for fine-tuning open-source video diffusion transformers to synthesize cinematic scenes for television and film production from small datasets. The proposed two-stage process decouples visual style learning from motion generation. In the first stage, Low-Rank Adaptation (LoRA) modules are integrated into the cross-attention layers of the Wan2.1 I2V-14B model to adapt its visual representations using a compact dataset of short clips from Ay Yapim's historical television film El Turco. This enables efficient domain transfer within hours on a single GPU. In the second stage, the fine-tuned model produces stylistically consistent keyframes that preserve costume, lighting, and color grading, which are then temporally expanded into coherent 720p sequences through the model's video decoder. We further apply lightweight parallelization and sequence partitioning strategies to accelerate inference without quality degradation. Quantitative and qualitative evaluations using FVD, CLIP-SIM, and LPIPS metrics, supported by a small expert user study, demonstrate measurable improvements in cinematic fidelity and temporal stability over the base model. The complete training and inference pipeline is released to support reproducibility and adaptation across cinematic domains.
comment: video generation, image-to-video, dif- fusion transformer, LoRA, fine-tuning, cinematic scene synthesis, multi-GPU inference, fully sharded data parallelism, computational efficiency
☆ FPS: Feedforward-based Parameter Selection For Efficient Fine-Tuning
Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key strategy for adapting large-scale pre-trained models to downstream tasks, but existing approaches face notable limitations. Addition-based methods, such as Adapters [1], introduce inference latency and engineering complexity, while selection-based methods like Gradient-based Parameter Selection (GPS) [2] require a full backward pass, which results in the same peak memory usage as full fine-tuning. To address this dilemma, we propose Feedforward-based Parameter Selection (FPS), a gradient-free method that identifies an optimal parameter subset in a single forward pass. FPS ranks parameters by the product of their magnitudes and corresponding input activations, leveraging both pre-trained knowledge and downstream data. Evaluated on $24$ visual tasks from FGVC and VTAB-1k, FPS achieves performance comparable to state-of-the-art methods while reducing peak memory usage by nearly $9 \times$ and accelerating parameter selection by about $2 \times$, offering a genuinely memory-efficient and practical solution for fine-tuning large-scale pre-trained models.
☆ RzenEmbed: Towards Comprehensive Multimodal Retrieval
The rapid advancement of Multimodal Large Language Models (MLLMs) has extended CLIP-based frameworks to produce powerful, universal embeddings for retrieval tasks. However, existing methods primarily focus on natural images, offering limited support for other crucial visual modalities such as videos and visual documents. To bridge this gap, we introduce RzenEmbed, a unified framework to learn embeddings across a diverse set of modalities, including text, images, videos, and visual documents. We employ a novel two-stage training strategy to learn discriminative representations. The first stage focuses on foundational text and multimodal retrieval. In the second stage, we introduce an improved InfoNCE loss, incorporating two key enhancements. Firstly, a hardness-weighted mechanism guides the model to prioritize challenging samples by assigning them higher weights within each batch. Secondly, we implement an approach to mitigate the impact of false negatives and alleviate data noise. This strategy not only enhances the model's discriminative power but also improves its instruction-following capabilities. We further boost performance with learnable temperature parameter and model souping. RzenEmbed sets a new state-of-the-art on the MMEB benchmark. It not only achieves the best overall score but also outperforms all prior work on the challenging video and visual document retrieval tasks. Our models are available in https://huggingface.co/qihoo360/RzenEmbed.
☆ Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing
Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs), which involves very high computational complexity and training cost. To address this issue, we propose a new method, called \textbf{C}omplex \textbf{I}mage \textbf{E}diting via \textbf{L}LM \textbf{R}easoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions, eliminating the need for jointly fine-tuning the large language models and diffusion models. Specifically, we first construct a structured semantic representation of the input image using foundation models. Then, we introduce an iterative update mechanism that can progressively refine this representation, obtaining a fine-grained visual representation of the image scene. This allows us to perform complex and flexible image editing tasks. Extensive experiments on the SmartEdit Reasoning Scenario Set show that our method surpasses the previous state-of-the-art by 9.955 dB in PSNR, indicating its superior preservation of regions that should remain consistent. Due to the limited number of samples of public datasets of complex image editing with reasoning, we construct a benchmark named CIEBench, containing 86 image samples, together with a metric specifically for reasoning-based image editing. CIELR also outperforms previous methods on this benchmark. The code and dataset are available at \href{https://github.com/Jia-shao/Reasoning-Editing}{https://github.com/Jia-shao/Reasoning-Editing}.
☆ MeisenMeister: A Simple Two Stage Pipeline for Breast Cancer Classification on MRI MICCAI 2025
The ODELIA Breast MRI Challenge 2025 addresses a critical issue in breast cancer screening: improving early detection through more efficient and accurate interpretation of breast MRI scans. Even though methods for general-purpose whole-body lesion segmentation as well as multi-time-point analysis exist, breast cancer detection remains highly challenging, largely due to the limited availability of high-quality segmentation labels. Therefore, developing robust classification-based approaches is crucial for the future of early breast cancer detection, particularly in applications such as large-scale screening. In this write-up, we provide a comprehensive overview of our approach to the challenge. We begin by detailing the underlying concept and foundational assumptions that guided our work. We then describe the iterative development process, highlighting the key stages of experimentation, evaluation, and refinement that shaped the evolution of our solution. Finally, we present the reasoning and evidence that informed the design choices behind our final submission, with a focus on performance, robustness, and clinical relevance. We release our full implementation publicly at https://github.com/MIC-DKFZ/MeisenMeister
comment: Winning Solution of the MICCAI 2025 ODELIA Breast MRI Classification Challenge
☆ Generative Semantic Coding for Ultra-Low Bitrate Visual Communication and Analysis
We consider the problem of ultra-low bit rate visual communication for remote vision analysis, human interactions and control in challenging scenarios with very low communication bandwidth, such as deep space exploration, battlefield intelligence, and robot navigation in complex environments. In this paper, we ask the following important question: can we accurately reconstruct the visual scene using only a very small portion of the bit rate in existing coding methods while not sacrificing the accuracy of vision analysis and performance of human interactions? Existing text-to-image generation models offer a new approach for ultra-low bitrate image description. However, they can only achieve a semantic-level approximation of the visual scene, which is far insufficient for the purpose of visual communication and remote vision analysis and human interactions. To address this important issue, we propose to seamlessly integrate image generation with deep image compression, using joint text and coding latent to guide the rectified flow models for precise generation of the visual scene. The semantic text description and coding latent are both encoded and transmitted to the decoder at a very small bit rate. Experimental results demonstrate that our method can achieve the same image reconstruction quality and vision analysis accuracy as existing methods while using much less bandwidth. The code will be released upon paper acceptance.
☆ SAGS: Self-Adaptive Alias-Free Gaussian Splatting for Dynamic Surgical Endoscopic Reconstruction
Surgical reconstruction of dynamic tissues from endoscopic videos is a crucial technology in robot-assisted surgery. The development of Neural Radiance Fields (NeRFs) has greatly advanced deformable tissue reconstruction, achieving high-quality results from video and image sequences. However, reconstructing deformable endoscopic scenes remains challenging due to aliasing and artifacts caused by tissue movement, which can significantly degrade visualization quality. The introduction of 3D Gaussian Splatting (3DGS) has improved reconstruction efficiency by enabling a faster rendering pipeline. Nevertheless, existing 3DGS methods often prioritize rendering speed while neglecting these critical issues. To address these challenges, we propose SAGS, a self-adaptive alias-free Gaussian splatting framework. We introduce an attention-driven, dynamically weighted 4D deformation decoder, leveraging 3D smoothing filters and 2D Mip filters to mitigate artifacts in deformable tissue reconstruction and better capture the fine details of tissue movement. Experimental results on two public benchmarks, EndoNeRF and SCARED, demonstrate that our method achieves superior performance in all metrics of PSNR, SSIM, and LPIPS compared to the state of the art while also delivering better visualization quality.
☆ Overcoming Prompts Pool Confusion via Parameterized Prompt for Incremental Object Detection
Recent studies have demonstrated that incorporating trainable prompts into pretrained models enables effective incremental learning. However, the application of prompts in incremental object detection (IOD) remains underexplored. Existing prompts pool based approaches assume disjoint class sets across incremental tasks, which are unsuitable for IOD as they overlook the inherent co-occurrence phenomenon in detection images. In co-occurring scenarios, unlabeled objects from previous tasks may appear in current task images, leading to confusion in prompts pool. In this paper, we hold that prompt structures should exhibit adaptive consolidation properties across tasks, with constrained updates to prevent catastrophic forgetting. Motivated by this, we introduce Parameterized Prompts for Incremental Object Detection (P$^2$IOD). Leveraging neural networks global evolution properties, P$^2$IOD employs networks as the parameterized prompts to adaptively consolidate knowledge across tasks. To constrain prompts structure updates, P$^2$IOD further engages a parameterized prompts fusion strategy. Extensive experiments on PASCAL VOC2007 and MS COCO datasets demonstrate that P$^2$IOD's effectiveness in IOD and achieves the state-of-the-art performance among existing baselines.
☆ CASR-Net: An Image Processing-focused Deep Learning-based Coronary Artery Segmentation and Refinement Network for X-ray Coronary Angiogram
Early detection of coronary artery disease (CAD) is critical for reducing mortality and improving patient treatment planning. While angiographic image analysis from X-rays is a common and cost-effective method for identifying cardiac abnormalities, including stenotic coronary arteries, poor image quality can significantly impede clinical diagnosis. We present the Coronary Artery Segmentation and Refinement Network (CASR-Net), a three-stage pipeline comprising image preprocessing, segmentation, and refinement. A novel multichannel preprocessing strategy combining CLAHE and an improved Ben Graham method provides incremental gains, increasing Dice Score Coefficient (DSC) by 0.31-0.89% and Intersection over Union (IoU) by 0.40-1.16% compared with using the techniques individually. The core innovation is a segmentation network built on a UNet with a DenseNet121 encoder and a Self-organized Operational Neural Network (Self-ONN) based decoder, which preserves the continuity of narrow and stenotic vessel branches. A final contour refinement module further suppresses false positives. Evaluated with 5-fold cross-validation on a combination of two public datasets that contain both healthy and stenotic arteries, CASR-Net outperformed several state-of-the-art models, achieving an IoU of 61.43%, a DSC of 76.10%, and clDice of 79.36%. These results highlight a robust approach to automated coronary artery segmentation, offering a valuable tool to support clinicians in diagnosis and treatment planning.
☆ A fragile zero-watermarking method based on dual quaternion matrix decomposition
Medical images play a crucial role in assisting diagnosis, remote consultation, and academic research. However, during the transmission and sharing process, they face serious risks of copyright ownership and content tampering. Therefore, protecting medical images is of great importance. As an effective means of image copyright protection, zero-watermarking technology focuses on constructing watermarks without modifying the original carrier by extracting its stable features, which provides an ideal approach for protecting medical images. This paper aims to propose a fragile zero-watermarking model based on dual quaternion matrix decomposition, which utilizes the operational relationship between the standard part and the dual part of dual quaternions to correlate the original carrier image with the watermark image, and generates zero-watermarking information based on the characteristics of dual quaternion matrix decomposition, ultimately achieving copyright protection and content tampering detection for medical images.
comment: 18 pages, 6 figures, 3 tables
☆ Versatile and Efficient Medical Image Super-Resolution Via Frequency-Gated Mamba
Medical image super-resolution (SR) is essential for enhancing diagnostic accuracy while reducing acquisition cost and scanning time. However, modeling both long-range anatomical structures and fine-grained frequency details with low computational overhead remains challenging. We propose FGMamba, a novel frequency-aware gated state-space model that unifies global dependency modeling and fine-detail enhancement into a lightweight architecture. Our method introduces two key innovations: a Gated Attention-enhanced State-Space Module (GASM) that integrates efficient state-space modeling with dual-branch spatial and channel attention, and a Pyramid Frequency Fusion Module (PFFM) that captures high-frequency details across multiple resolutions via FFT-guided fusion. Extensive evaluations across five medical imaging modalities (Ultrasound, OCT, MRI, CT, and Endoscopic) demonstrate that FGMamba achieves superior PSNR/SSIM while maintaining a compact parameter footprint ($<$0.75M), outperforming CNN-based and Transformer-based SOTAs. Our results validate the effectiveness of frequency-aware state-space modeling for scalable and accurate medical image enhancement.
☆ Rethinking Robust Adversarial Concept Erasure in Diffusion Models
Concept erasure aims to selectively unlearning undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. As a novel paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, they will disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which grace leveraging semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.
☆ FOCUS: Efficient Keyframe Selection for Long Video Understanding
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.
☆ HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration
Autonomous Graphical User Interface (GUI) agents rely on accurate GUI grounding, which maps language instructions to on-screen coordinates, to execute user commands. However, current models, whether trained via supervised fine-tuning (SFT) or reinforcement fine-tuning (RFT), lack self-awareness of their capability boundaries, leading to overconfidence and unreliable predictions. We first systematically evaluate probabilistic and verbalized confidence in general and GUI-specific models, revealing a misalignment between confidence and actual accuracy, which is particularly critical in dynamic GUI automation tasks, where single errors can cause task failure. To address this, we propose HyperClick, a novel framework that enhances reliable GUI grounding through uncertainty calibration. HyperClick introduces a dual reward mechanism, combining a binary reward for correct actions with a truncated Gaussian-based spatial confidence modeling, calibrated using the Brier score. This approach jointly optimizes grounding accuracy and confidence reliability, fostering introspective self-criticism. Extensive experiments on seven challenge benchmarks show that HyperClick achieves state-of-the-art performance while providing well-calibrated confidence. By enabling explicit confidence calibration and introspective self-criticism, HyperClick reduces overconfidence and supports more reliable GUI automation.
☆ T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis
In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific characteristics, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T^3), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen-Shannon divergence between the two models' output distributions. T^3 dynamically preserves local precision when models agree and defers to generalist robustness under drift. To overcome the inference costs of sample-wise merging, we further propose a batch-wise extension, T^3_B, that computes a merging coefficient across a batch of samples, dramatically reducing computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruptions across four modalities. Empirically, T^3 sets new state-of-the-art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive MVLM deployment in clinical settings. Our code is available at https://github.com/Razaimam45/TCube.
comment: Main: 11 pages, Supplementary: 9 pages 10 tables, 10 figures
☆ RegionRAG: Region-level Retrieval-Augumented Generation for Visually-Rich Documents
Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods consider the entire document as the basic retrieval unit, introducing substantial irrelevant visual content in two ways: 1) Relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) Retrieving multiple documents to increase recall further introduces redundant and irrelevant documents. These redundant contexts distract the model's attention and further degrade the performance. To address this challenge, we propose \modelname, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy from both labeled data and unlabeled data to pinpoint relevant patches. During inference, we propose a dynamic pipeline that intelligently groups salient patches into complete semantic regions. By delegating the task of identifying relevant regions to the retriever, \modelname enables the generator to focus solely on concise visual content relevant to queries, improving both efficiency and accuracy. Experiments on six benchmarks demonstrate that RegionRAG achieves state-of-the-art performance. Improves retrieval accuracy by 10.02\% in R@1 on average and increases question answering accuracy by 3.56\% while using only 71.42\% visual tokens compared to prior methods. The code will be available at https://github.com/Aeryn666/RegionRAG.
☆ Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large-language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model's adaptability and effectiveness across various downstream tasks.
☆ C-LEAD: Contrastive Learning for Enhanced Adversarial Defense
Deep neural networks (DNNs) have achieved remarkable success in computer vision tasks such as image classification, segmentation, and object detection. However, they are vulnerable to adversarial attacks, which can cause incorrect predictions with small perturbations in input images. Addressing this issue is crucial for deploying robust deep-learning systems. This paper presents a novel approach that utilizes contrastive learning for adversarial defense, a previously unexplored area. Our method leverages the contrastive loss function to enhance the robustness of classification models by training them with both clean and adversarially perturbed images. By optimizing the model's parameters alongside the perturbations, our approach enables the network to learn robust representations that are less susceptible to adversarial attacks. Experimental results show significant improvements in the model's robustness against various types of adversarial perturbations. This suggests that contrastive loss helps extract more informative and resilient features, contributing to the field of adversarial robustness in deep learning.
☆ Trans-defense: Transformer-based Denoiser for Adversarial Defense with Spatial-Frequency Domain Representation
In recent times, deep neural networks (DNNs) have been successfully adopted for various applications. Despite their notable achievements, it has become evident that DNNs are vulnerable to sophisticated adversarial attacks, restricting their applications in security-critical systems. In this paper, we present two-phase training methods to tackle the attack: first, training the denoising network, and second, the deep classifier model. We propose a novel denoising strategy that integrates both spatial and frequency domain approaches to defend against adversarial attacks on images. Our analysis reveals that high-frequency components of attacked images are more severely corrupted compared to their lower-frequency counterparts. To address this, we leverage Discrete Wavelet Transform (DWT) for frequency analysis and develop a denoising network that combines spatial image features with wavelets through a transformer layer. Next, we retrain the classifier using the denoised images, which enhances the classifier's robustness against adversarial attacks. Experimental results across the MNIST, CIFAR-10, and Fashion-MNIST datasets reveal that the proposed method remarkably elevates classification accuracy, substantially exceeding the performance by utilizing a denoising network and adversarial training approaches. The code is available at https://github.com/Mayank94/Trans-Defense.
☆ Fusion of Heterogeneous Pathology Foundation Models for Whole Slide Image Analysis
Whole slide image (WSI) analysis has emerged as an increasingly essential technique in computational pathology. Recent advances in the pathological foundation models (FMs) have demonstrated significant advantages in deriving meaningful patch-level or slide-level feature representations from WSIs. However, current pathological FMs have exhibited substantial heterogeneity caused by diverse private training datasets and different network architectures. This heterogeneity introduces performance variability when we utilize the extracted features from different FMs in the downstream tasks. To fully explore the advantage of multiple FMs effectively, in this work, we propose a novel framework for the fusion of heterogeneous pathological FMs, called FuseCPath, yielding a model with a superior ensemble performance. The main contributions of our framework can be summarized as follows: (i) To guarantee the representativeness of the training patches, we propose a multi-view clustering-based method to filter out the discriminative patches via multiple FMs' embeddings. (ii) To effectively fuse the heterogeneous patch-level FMs, we devise a cluster-level re-embedding strategy to online capture patch-level local features. (iii) To effectively fuse the heterogeneous slide-level FMs, we devise a collaborative distillation strategy to explore the connections between slide-level FMs. Extensive experiments conducted on lung cancer, bladder cancer, and colorectal cancer datasets from The Cancer Genome Atlas (TCGA) have demonstrated that the proposed FuseCPath achieves state-of-the-art performance across multiple tasks on these public datasets.
comment: 22 pages, 9 figures
☆ Object-IR: Leveraging Object Consistency and Mesh Deformation for Self-Supervised Image Retargeting
Eliminating geometric distortion in semantically important regions remains an intractable challenge in image retargeting. This paper presents Object-IR, a self-supervised architecture that reformulates image retargeting as a learning-based mesh warping optimization problem, where the mesh deformation is guided by object appearance consistency and geometric-preserving constraints. Given an input image and a target aspect ratio, we initialize a uniform rigid mesh at the output resolution and use a convolutional neural network to predict the motion of each mesh grid and obtain the deformed mesh. The retargeted result is generated by warping the input image according to the rigid mesh in the input image and the deformed mesh in the output resolution. To mitigate geometric distortion, we design a comprehensive objective function incorporating a) object-consistent loss to ensure that the important semantic objects retain their appearance, b) geometric-preserving loss to constrain simple scale transform of the important meshes, and c) boundary loss to enforce a clean rectangular output. Notably, our self-supervised paradigm eliminates the need for manually annotated retargeting datasets by deriving supervision directly from the input's geometric and semantic properties. Extensive evaluations on the RetargetMe benchmark demonstrate that our Object-IR achieves state-of-the-art performance, outperforming existing methods in quantitative metrics and subjective visual quality assessments. The framework efficiently processes arbitrary input resolutions (average inference time: 0.009s for 1024x683 resolution) while maintaining real-time performance on consumer-grade GPUs. The source code will soon be available at https://github.com/tlliao/Object-IR.
comment: Publish in Pattern Recognition
☆ MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts
Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks. In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations. However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary data aspects and enhance both scalability and adaptability. Aiming to improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation. In addition, it integrates dense semantic features with globally aligned 3D backbone representations for high-fidelity surface normal prediction. MoRE is further optimized with tailored loss functions to ensure robust learning across diverse inputs and multiple geometric tasks. Extensive experiments demonstrate that MoRE achieves state-of-the-art performance across multiple benchmarks and supports effective downstream applications without extra computation.
comment: Project Page: https://g-1nonly.github.io/MoRE_Website/, Code: https://github.com/alibaba/Taobao3D
☆ Mask-to-Height: A YOLOv11-Based Architecture for Joint Building Instance Segmentation and Height Classification from Satellite Imagery
Accurate building instance segmentation and height classification are critical for urban planning, 3D city modeling, and infrastructure monitoring. This paper presents a detailed analysis of YOLOv11, the recent advancement in the YOLO series of deep learning models, focusing on its application to joint building extraction and discrete height classification from satellite imagery. YOLOv11 builds on the strengths of earlier YOLO models by introducing a more efficient architecture that better combines features at different scales, improves object localization accuracy, and enhances performance in complex urban scenes. Using the DFC2023 Track 2 dataset -- which includes over 125,000 annotated buildings across 12 cities -- we evaluate YOLOv11's performance using metrics such as precision, recall, F1 score, and mean average precision (mAP). Our findings demonstrate that YOLOv11 achieves strong instance segmentation performance with 60.4\% mAP@50 and 38.3\% mAP@50--95 while maintaining robust classification accuracy across five predefined height tiers. The model excels in handling occlusions, complex building shapes, and class imbalance, particularly for rare high-rise structures. Comparative analysis confirms that YOLOv11 outperforms earlier multitask frameworks in both detection accuracy and inference speed, making it well-suited for real-time, large-scale urban mapping. This research highlights YOLOv11's potential to advance semantic urban reconstruction through streamlined categorical height modeling, offering actionable insights for future developments in remote sensing and geospatial intelligence.
☆ Soft Task-Aware Routing of Experts for Equivariant Representation Learning NeurIPS 2025
Equivariant representation learning aims to capture variations induced by input transformations in the representation space, whereas invariant representation learning encodes semantic information by disregarding such transformations. Recent studies have shown that jointly learning both types of representations is often beneficial for downstream tasks, typically by employing separate projection heads. However, this design overlooks information shared between invariant and equivariant learning, which leads to redundant feature learning and inefficient use of model capacity. To address this, we introduce Soft Task-Aware Routing (STAR), a routing strategy for projection heads that models them as experts. STAR induces the experts to specialize in capturing either shared or task-specific information, thereby reducing redundant feature learning. We validate this effect by observing lower canonical correlations between invariant and equivariant embeddings. Experimental results show consistent improvements across diverse transfer learning tasks. The code is available at https://github.com/YonseiML/star.
comment: NeurIPS 2025
☆ SpecAware: A Spectral-Content Aware Foundation Model for Unifying Multi-Sensor Learning in Hyperspectral Remote Sensing Mapping
Hyperspectral imaging (HSI) is a vital tool for fine-grained land-use and land-cover (LULC) mapping. However, the inherent heterogeneity of HSI data has long posed a major barrier to developing generalized models via joint training. Although HSI foundation models have shown promise for different downstream tasks, the existing approaches typically overlook the critical guiding role of sensor meta-attributes, and struggle with multi-sensor training, limiting their transferability. To address these challenges, we propose SpecAware, which is a novel hyperspectral spectral-content aware foundation model for unifying multi-sensor learning for HSI mapping. We also constructed the Hyper-400K dataset to facilitate this research, which is a new large-scale, high-quality benchmark dataset with over 400k image patches from diverse airborne AVIRIS sensors. The core of SpecAware is a two-step hypernetwork-driven encoding process for HSI data. Firstly, we designed a meta-content aware module to generate a unique conditional input for each HSI patch, tailored to each spectral band of every sample by fusing the sensor meta-attributes and its own image content. Secondly, we designed the HyperEmbedding module, where a sample-conditioned hypernetwork dynamically generates a pair of matrix factors for channel-wise encoding, consisting of adaptive spatial pattern extraction and latent semantic feature re-projection. Thus, SpecAware gains the ability to perceive and interpret spatial-spectral features across diverse scenes and sensors. This, in turn, allows SpecAware to adaptively process a variable number of spectral channels, establishing a unified framework for joint pre-training. Extensive experiments on six datasets demonstrate that SpecAware can learn superior feature representations, excelling in land-cover semantic segmentation classification, change detection, and scene classification.
☆ Privacy-Aware Continual Self-Supervised Learning on Multi-Window Chest Computed Tomography for Domain-Shift Robustness
We propose a novel continual self-supervised learning (CSSL) framework for simultaneously learning diverse features from multi-window-obtained chest computed tomography (CT) images and ensuring data privacy. Achieving a robust and highly generalizable model in medical image diagnosis is challenging, mainly because of issues, such as the scarcity of large-scale, accurately annotated datasets and domain shifts inherent to dynamic healthcare environments. Specifically, in chest CT, these domain shifts often arise from differences in window settings, which are optimized for distinct clinical purposes. Previous CSSL frameworks often mitigated domain shift by reusing past data, a typically impractical approach owing to privacy constraints. Our approach addresses these challenges by effectively capturing the relationship between previously learned knowledge and new information across different training stages through continual pretraining on unlabeled images. Specifically, by incorporating a latent replay-based mechanism into CSSL, our method mitigates catastrophic forgetting due to domain shifts during continual pretraining while ensuring data privacy. Additionally, we introduce a feature distillation technique that integrates Wasserstein distance-based knowledge distillation (WKD) and batch-knowledge ensemble (BKE), enhancing the ability of the model to learn meaningful, domain-shift-robust representations. Finally, we validate our approach using chest CT images obtained across two different window settings, demonstrating superior performance compared with other approaches.
☆ GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation NeurIPS 2025
While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization. The structured reasoning component generates coherent Chain-of-Thought analyses combining progress estimation and decision reasoning, which inform both immediate action predictions and compact history summaries for future steps. Based on this framework, we train a GUI agent, \textbf{GUI-Rise}, through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO). This framework employs specialized rewards, including a history-aware objective, directly linking summary quality to subsequent action performance. Comprehensive evaluations on standard benchmarks demonstrate state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios. These findings validate our framework's ability to maintain robust reasoning and generalization across diverse GUI navigation tasks. Code is available at https://leon022.github.io/GUI-Rise.
comment: Published in NeurIPS 2025
☆ Multi-Modal Feature Fusion for Spatial Morphology Analysis of Traditional Villages via Hierarchical Graph Neural Networks
Villages areas hold significant importance in the study of human-land relationships. However, with the advancement of urbanization, the gradual disappearance of spatial characteristics and the homogenization of landscapes have emerged as prominent issues. Existing studies primarily adopt a single-disciplinary perspective to analyze villages spatial morphology and its influencing factors, relying heavily on qualitative analysis methods. These efforts are often constrained by the lack of digital infrastructure and insufficient data. To address the current research limitations, this paper proposes a Hierarchical Graph Neural Network (HGNN) model that integrates multi-source data to conduct an in-depth analysis of villages spatial morphology. The framework includes two types of nodes-input nodes and communication nodes-and two types of edges-static input edges and dynamic communication edges. By combining Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), the proposed model efficiently integrates multimodal features under a two-stage feature update mechanism. Additionally, based on existing principles for classifying villages spatial morphology, the paper introduces a relational pooling mechanism and implements a joint training strategy across 17 subtypes. Experimental results demonstrate that this method achieves significant performance improvements over existing approaches in multimodal fusion and classification tasks. Additionally, the proposed joint optimization of all sub-types lifts mean accuracy/F1 from 0.71/0.83 (independent models) to 0.82/0.90, driven by a 6% gain for parcel tasks. Our method provides scientific evidence for exploring villages spatial patterns and generative logic.
☆ Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions
As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Consequently, their capabilities in this crucial domain are mostly unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video, text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.
☆ Sparse Model Inversion: Efficient Inversion of Vision Transformers for Data-Free Applications
Model inversion, which aims to reconstruct the original training data from pre-trained discriminative models, is especially useful when the original training data is unavailable due to privacy, usage rights, or size constraints. However, existing dense inversion methods attempt to reconstruct the entire image area, making them extremely inefficient when inverting high-resolution images from large-scale Vision Transformers (ViTs). We further identify two underlying causes of this inefficiency: the redundant inversion of noisy backgrounds and the unintended inversion of spurious correlations--a phenomenon we term "hallucination" in model inversion. To address these limitations, we propose a novel sparse model inversion strategy, as a plug-and-play extension to speed up existing dense inversion methods with no need for modifying their original loss functions. Specifically, we selectively invert semantic foregrounds while stopping the inversion of noisy backgrounds and potential spurious correlations. Through both theoretical and empirical studies, we validate the efficacy of our approach in achieving significant inversion acceleration (up to 3.79 faster) while maintaining comparable or even enhanced downstream performance in data-free model quantization and data-free knowledge transfer. Code is available at https://github.com/Egg-Hu/SMI.
☆ Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization
Cross-view geo-localization (CVGL) between drone and satellite imagery remains challenging due to severe viewpoint gaps and the presence of hard negatives, which are visually similar but geographically mismatched samples. Existing mining or reweighting strategies often use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence. In this paper, we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy. At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. At the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a training-progress signal to attenuate noisy gradients during early optimization and progressively enhance hard-negative mining as training matures. Experiments on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness and robustness of the proposed DPHR, achieving consistent improvements over state-of-the-art methods.
comment: 5 pages, 3 figures
☆ SilhouetteTell: Practical Video Identification Leveraging Blurred Recordings of Video Subtitles
Video identification attacks pose a significant privacy threat that can reveal videos that victims watch, which may disclose their hobbies, religious beliefs, political leanings, sexual orientation, and health status. Also, video watching history can be used for user profiling or advertising and may result in cyberbullying, discrimination, or blackmail. Existing extensive video inference techniques usually depend on analyzing network traffic generated by streaming online videos. In this work, we observe that the content of a subtitle determines its silhouette displayed on the screen, and identifying each subtitle silhouette also derives the temporal difference between two consecutive subtitles. We then propose SilhouetteTell, a novel video identification attack that combines the spatial and time domain information into a spatiotemporal feature of subtitle silhouettes. SilhouetteTell explores the spatiotemporal correlation between recorded subtitle silhouettes of a video and its subtitle file. It can infer both online and offline videos. Comprehensive experiments on off-the-shelf smartphones confirm the high efficacy of SilhouetteTell for inferring video titles and clips under various settings, including from a distance of up to 40 meters.
comment: 16 pages, 29 figures. Accepted at 26th Privacy Enhancing Technologies Symposium (PETS 2026)
☆ H2-Cache: A Novel Hierarchical Dual-Stage Cache for High-Performance Acceleration of Generative Diffusion Models
Diffusion models have emerged as state-of-the-art in image generation, but their practical deployment is hindered by the significant computational cost of their iterative denoising process. While existing caching techniques can accelerate inference, they often create a challenging trade-off between speed and fidelity, suffering from quality degradation and high computational overhead. To address these limitations, we introduce H2-Cache, a novel hierarchical caching mechanism designed for modern generative diffusion model architectures. Our method is founded on the key insight that the denoising process can be functionally separated into a structure-defining stage and a detail-refining stage. H2-cache leverages this by employing a dual-threshold system, using independent thresholds to selectively cache each stage. To ensure the efficiency of our dual-check approach, we introduce pooled feature summarization (PFS), a lightweight technique for robust and fast similarity estimation. Extensive experiments on the Flux architecture demonstrate that H2-cache achieves significant acceleration (up to 5.08x) while maintaining image quality nearly identical to the baseline, quantitatively and qualitatively outperforming existing caching methods. Our work presents a robust and practical solution that effectively resolves the speed-quality dilemma, significantly lowering the barrier for the real-world application of high-fidelity diffusion models. Source code is available at https://github.com/Bluear7878/H2-cache-A-Hierarchical-Dual-Stage-Cache.
☆ DANCER: Dance ANimation via Condition Enhancement and Rendering with diffusion model
Recently, diffusion models have shown their impressive ability in visual generation tasks. Besides static images, more and more research attentions have been drawn to the generation of realistic videos. The video generation not only has a higher requirement for the quality, but also brings a challenge in ensuring the video continuity. Among all the video generation tasks, human-involved contents, such as human dancing, are even more difficult to generate due to the high degrees of freedom associated with human motions. In this paper, we propose a novel framework, named as DANCER (Dance ANimation via Condition Enhancement and Rendering with Diffusion Model), for realistic single-person dance synthesis based on the most recent stable video diffusion model. As the video generation is generally guided by a reference image and a video sequence, we introduce two important modules into our framework to fully benefit from the two inputs. More specifically, we design an Appearance Enhancement Module (AEM) to focus more on the details of the reference image during the generation, and extend the motion guidance through a Pose Rendering Module (PRM) to capture pose conditions from extra domains. To further improve the generation capability of our model, we also collect a large amount of video data from Internet, and generate a novel datasetTikTok-3K to enhance the model training. The effectiveness of the proposed model has been evaluated through extensive experiments on real-world datasets, where the performance of our model is superior to that of the state-of-the-art methods. All the data and codes will be released upon acceptance.
☆ M^3Detection: Multi-Frame Multi-Level Feature Fusion for Multi-Modal 3D Object Detection with Camera and 4D Imaging Radar
Recent advances in 4D imaging radar have enabled robust perception in adverse weather, while camera sensors provide dense semantic information. Fusing the these complementary modalities has great potential for cost-effective 3D perception. However, most existing camera-radar fusion methods are limited to single-frame inputs, capturing only a partial view of the scene. The incomplete scene information, compounded by image degradation and 4D radar sparsity, hinders overall detection performance. In contrast, multi-frame fusion offers richer spatiotemporal information but faces two challenges: achieving robust and effective object feature fusion across frames and modalities, and mitigating the computational cost of redundant feature extraction. Consequently, we propose M^3Detection, a unified multi-frame 3D object detection framework that performs multi-level feature fusion on multi-modal data from camera and 4D imaging radar. Our framework leverages intermediate features from the baseline detector and employs the tracker to produce reference trajectories, improving computational efficiency and providing richer information for second-stage. In the second stage, we design a global-level inter-object feature aggregation module guided by radar information to align global features across candidate proposals and a local-level inter-grid feature aggregation module that expands local features along the reference trajectories to enhance fine-grained object representation. The aggregated features are then processed by a trajectory-level multi-frame spatiotemporal reasoning module to encode cross-frame interactions and enhance temporal representation. Extensive experiments on the VoD and TJ4DRadSet datasets demonstrate that M^3Detection achieves state-of-the-art 3D detection performance, validating its effectiveness in multi-frame detection with camera-4D imaging radar fusion.
comment: 16 pages, 9 figures
☆ Generating Accurate and Detailed Captions for High-Resolution Images
Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution images to these dimensions may result in the loss of visual details and the omission of important objects. To address this limitation, we propose a novel pipeline that integrates vision-language models, large language models (LLMs), and object detection systems to enhance caption quality. Our proposed pipeline refines captions through a novel, multi-stage process. Given a high-resolution image, an initial caption is first generated using a VLM, and key objects in the image are then identified by an LLM. The LLM predicts additional objects likely to co-occur with the identified key objects, and these predictions are verified by object detection systems. Newly detected objects not mentioned in the initial caption undergo focused, region-specific captioning to ensure they are incorporated. This process enriches caption detail while reducing hallucinations by removing references to undetected objects. We evaluate the enhanced captions using pairwise comparison and quantitative scoring from large multimodal models, along with a benchmark for hallucination detection. Experiments on a curated dataset of high-resolution images demonstrate that our pipeline produces more detailed and reliable image captions while effectively minimizing hallucinations.
comment: Work conducted in 2024; released for archival purposes
☆ How Close Are We? Limitations and Progress of AI Models in Banff Lesion Scoring
The Banff Classification provides the global standard for evaluating renal transplant biopsies, yet its semi-quantitative nature, complex criteria, and inter-observer variability present significant challenges for computational replication. In this study, we explore the feasibility of approximating Banff lesion scores using existing deep learning models through a modular, rule-based framework. We decompose each Banff indicator - such as glomerulitis (g), peritubular capillaritis (ptc), and intimal arteritis (v) - into its constituent structural and inflammatory components, and assess whether current segmentation and detection tools can support their computation. Model outputs are mapped to Banff scores using heuristic rules aligned with expert guidelines, and evaluated against expert-annotated ground truths. Our findings highlight both partial successes and critical failure modes, including structural omission, hallucination, and detection ambiguity. Even when final scores match expert annotations, inconsistencies in intermediate representations often undermine interpretability. These results reveal the limitations of current AI pipelines in replicating computational expert-level grading, and emphasize the importance of modular evaluation and computational Banff grading standard in guiding future model development for transplant pathology.
☆ AFM-Net: Advanced Fusing Hierarchical CNN Visual Priors with Global Sequence Modeling for Remote Sensing Image Scene Classification
Remote sensing image scene classification remains a challenging task, primarily due to the complex spatial structures and multi-scale characteristics of ground objects. Existing approaches see CNNs excel at modeling local textures, while Transformers excel at capturing global context. However, efficiently integrating them remains a bottleneck due to the high computational cost of Transformers. To tackle this, we propose AFM-Net, a novel Advanced Hierarchical Fusing framework that achieves effective local and global co-representation through two pathways: a CNN branch for extracting hierarchical visual priors, and a Mamba branch for efficient global sequence modeling. The core innovation of AFM-Net lies in its Hierarchical Fusion Mechanism, which progressively aggregates multi-scale features from both pathways, enabling dynamic cross-level feature interaction and contextual reconstruction to produce highly discriminative representations. These fused features are then adaptively routed through a Mixture-of-Experts classifier module, which dispatches them to the most suitable experts for fine-grained scene recognition. Experiments on AID, NWPU-RESISC45, and UC Merced show that AFM-Net obtains 93.72, 95.54, and 96.92 percent accuracy, surpassing state-of-the-art methods with balanced performance and efficiency. Code is available at https://github.com/tangyuanhao-qhu/AFM-Net.
☆ HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition
Three-dimensional scene generation holds significant potential in gaming, film, and virtual reality. However, most existing methods adopt a single-step generation process, making it difficult to balance scene complexity with minimal user input. Inspired by the human cognitive process in scene modeling, which progresses from global to local, focuses on key elements, and completes the scene through semantic association, we propose HiGS, a hierarchical generative framework for multi-step associative semantic spatial composition. HiGS enables users to iteratively expand scenes by selecting key semantic objects, offering fine-grained control over regions of interest while the model completes peripheral areas automatically. To support structured and coherent generation, we introduce the Progressive Hierarchical Spatial-Semantic Graph (PHiSSG), which dynamically organizes spatial relationships and semantic dependencies across the evolving scene structure. PHiSSG ensures spatial and geometric consistency throughout the generation process by maintaining a one-to-one mapping between graph nodes and generated objects and supporting recursive layout optimization. Experiments demonstrate that HiGS outperforms single-stage methods in layout plausibility, style consistency, and user preference, offering a controllable and extensible paradigm for efficient 3D scene construction.
☆ Improving Cross-view Object Geo-localization: A Dual Attention Approach with Cross-view Interaction and Multi-Scale Spatial Features
Cross-view object geo-localization has recently gained attention due to potential applications. Existing methods aim to capture spatial dependencies of query objects between different views through attention mechanisms to obtain spatial relationship feature maps, which are then used to predict object locations. Although promising, these approaches fail to effectively transfer information between views and do not further refine the spatial relationship feature maps. This results in the model erroneously focusing on irrelevant edge noise, thereby affecting localization performance. To address these limitations, we introduce a Cross-view and Cross-attention Module (CVCAM), which performs multiple iterations of interaction between the two views, enabling continuous exchange and learning of contextual information about the query object from both perspectives. This facilitates a deeper understanding of cross-view relationships while suppressing the edge noise unrelated to the query object. Furthermore, we integrate a Multi-head Spatial Attention Module (MHSAM), which employs convolutional kernels of various sizes to extract multi-scale spatial features from the feature maps containing implicit correspondences, further enhancing the feature representation of the query object. Additionally, given the scarcity of datasets for cross-view object geo-localization, we created a new dataset called G2D for the "Ground-to-Drone" localization task, enriching existing datasets and filling the gap in "Ground-to-Drone" localization task. Extensive experiments on the CVOGL and G2D datasets demonstrate that our proposed method achieves high localization accuracy, surpassing the current state-of-the-art.
☆ E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources
Diffusion models have shown strong capabilities in generating high-quality images from text prompts. However, these models often require large-scale training data and significant computational resources to train, or suffer from heavy structure with high latency. To this end, we propose Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis requiring low training resources. We provide an easily reproducible baseline with competitive results. Our model for 512px generation, trained with only 25M public data in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and easily reaches to 0.72 with some post-training techniques such as GRPO. Our design philosophy centers on token reduction as the computational cost scales significantly with the token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module for further compression of tokens. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost. In addition, we propose AdaLN-affine, an efficient lightweight module for computing modulation parameters in transformer blocks. Our code is available at https://github.com/AMD-AGI/Nitro-E and we hope E-MMDiT serves as a strong and practical baseline for future research and contributes to democratization of generative AI models.
☆ WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond
3D Gaussian splatting (3DGS) and its subsequent variants have led to remarkable progress in simultaneous localization and mapping (SLAM). While most recent 3DGS-based SLAM works focus on small-scale indoor scenes, developing 3DGS-based SLAM methods for large-scale forest scenes holds great potential for many real-world applications, especially for wildfire emergency response and forest management. However, this line of research is impeded by the absence of a comprehensive and high-quality dataset, and collecting such a dataset over real-world scenes is costly and technically infeasible. To this end, we have built a large-scale, comprehensive, and high-quality synthetic dataset for SLAM in wildfire and forest environments. Leveraging the Unreal Engine 5 Electric Dreams Environment Sample Project, we developed a pipeline to easily collect aerial and ground views, including ground-truth camera poses and a range of additional data modalities from unmanned aerial vehicle. Our pipeline also provides flexible controls on environmental factors such as light, weather, and types and conditions of wildfire, supporting the need for various tasks covering forest mapping, wildfire emergency response, and beyond. The resulting pilot dataset, WildfireX-SLAM, contains 5.5k low-altitude RGB-D aerial images from a large-scale forest map with a total size of 16 km2. On top of WildfireX-SLAM, a thorough benchmark is also conducted, which not only reveals the unique challenges of 3DGS-based SLAM in the forest but also highlights potential improvements for future works. The dataset and code will be publicly available. Project page: https://zhicongsun.github.io/wildfirexslam.
comment: This paper has been accepted by MMM 2026
☆ ZEBRA: Towards Zero-Shot Cross-Subject Generalization for Universal Brain Visual Decoding NeurIPS 2025
Recent advances in neural decoding have enabled the reconstruction of visual experiences from brain activity, positioning fMRI-to-image reconstruction as a promising bridge between neuroscience and computer vision. However, current methods predominantly rely on subject-specific models or require subject-specific fine-tuning, limiting their scalability and real-world applicability. In this work, we introduce ZEBRA, the first zero-shot brain visual decoding framework that eliminates the need for subject-specific adaptation. ZEBRA is built on the key insight that fMRI representations can be decomposed into subject-related and semantic-related components. By leveraging adversarial training, our method explicitly disentangles these components to isolate subject-invariant, semantic-specific representations. This disentanglement allows ZEBRA to generalize to unseen subjects without any additional fMRI data or retraining. Extensive experiments show that ZEBRA significantly outperforms zero-shot baselines and achieves performance comparable to fully finetuned models on several metrics. Our work represents a scalable and practical step toward universal neural decoding. Code and model weights are available at: https://github.com/xmed-lab/ZEBRA.
comment: Accepted by NeurIPS 2025
☆ Hierarchical Transformers for Unsupervised 3D Shape Abstraction
We introduce HiT, a novel hierarchical neural field representation for 3D shapes that learns general hierarchies in a coarse-to-fine manner across different shape categories in an unsupervised setting. Our key contribution is a hierarchical transformer (HiT), where each level learns parent-child relationships of the tree hierarchy using a compressed codebook. This codebook enables the network to automatically identify common substructures across potentially diverse shape categories. Unlike previous works that constrain the task to a fixed hierarchical structure (e.g., binary), we impose no such restriction, except for limiting the total number of nodes at each tree level. This flexibility allows our method to infer the hierarchical structure directly from data, over multiple shape categories, and representing more general and complex hierarchies than prior approaches. When trained at scale with a reconstruction loss, our model captures meaningful containment relationships between parent and child nodes. We demonstrate its effectiveness through an unsupervised shape segmentation task over all 55 ShapeNet categories, where our method successfully segments shapes into multiple levels of granularity.
☆ Dark-Field X-Ray Imaging Significantly Improves Deep-Learning based Detection of Synthetic Early-Stage Lung Tumors in Preclinical Models
Low-dose computed tomography (LDCT) is the current standard for lung cancer screening, yet its adoption and accessibility remain limited. Many regions lack LDCT infrastructure, and even among those screened, early-stage cancer detection often yield false positives, as shown in the National Lung Screening Trial (NLST) with a sensitivity of 93.8 percent and a false-positive rate of 26.6 percent. We aim to investigate whether X-ray dark-field imaging (DFI) radiograph, a technique sensitive to small-angle scatter from alveolar microstructure and less susceptible to organ shadowing, can significantly improve early-stage lung tumor detection when coupled with deep-learning segmentation. Using paired attenuation (ATTN) and DFI radiograph images of euthanized mouse lungs, we generated realistic synthetic tumors with irregular boundaries and intensity profiles consistent with physical lung contrast. A U-Net segmentation network was trained on small patches using either ATTN, DFI, or a combination of ATTN and DFI channels. Results show that the DFI-only model achieved a true-positive detection rate of 83.7 percent, compared with 51 percent for ATTN-only, while maintaining comparable specificity (90.5 versus 92.9 percent). The combined ATTN and DFI input achieved 79.6 percent sensitivity and 97.6 percent specificity. In conclusion, DFI substantially improves early-tumor detectability in comparison to standard attenuation radiography and shows potential as an accessible, low-cost, low-dose alternative for pre-clinical or limited-resource screening where LDCT is unavailable.
♻ ☆ Intelligent Software System for Low-Cost, Brightfield Segmentation: Algorithmic Implementation for Cytometric Auto-Analysis
Bright-field microscopy, a cost-effective solution for live-cell culture, is often the only resource available, along with standard CPUs, for many low-budget labs. The inherent challenges of bright-field images -- their noisiness, low contrast, and dynamic morphology -- coupled with a lack of GPU resources and complex software interfaces, hinder the desired research output. This article presents a novel microscopy image analysis framework designed for low-budget labs equipped with a standard CPU desktop. The Python-based program enables cytometric analysis of live, unstained cells in culture through an advanced computer vision and machine learning pipeline. Crucially, the framework operates on label-free data, requiring no manually annotated training data or training phase. It is accessible via a user-friendly, cross-platform GUI that requires no programming skills, while also providing a scripting interface for programmatic control and integration by developers. The end-to-end workflow performs semantic and instance segmentation, feature extraction, analysis, evaluation, and automated report generation. Its modular architecture supports easy maintenance and flexible integration while supporting both single-image and batch processing. Validated on several unstained cell types from the public dataset of livecells, the framework demonstrates superior accuracy and reproducibility compared to contemporary tools like Cellpose and StarDist. Its competitive segmentation speed on a CPU-based platform highlights its significant potential for basic research and clinical applications -- particularly in cell transplantation for personalised medicine and muscle regeneration therapies. The access to the application is available for reproducibility
♻ ☆ Conformal Object Detection by Sequential Risk Control
Recent advances in object detectors have led to their adoption for industrial uses. However, their deployment in safety-critical applications is hindered by the inherent lack of reliability of neural networks and the complex structure of object detection models. To address these challenges, we turn to Conformal Prediction, a post-hoc predictive uncertainty quantification procedure with statistical guarantees that are valid for any dataset size, without requiring prior knowledge on the model or data distribution. Our contribution is manifold. First, we formally define the problem of Conformal Object Detection (COD). We introduce a novel method, Sequential Conformal Risk Control (SeqCRC), that extends the statistical guarantees of Conformal Risk Control to two sequential tasks with two parameters, as required in the COD setting. Then, we present old and new loss functions and prediction sets suited to applying SeqCRC to different cases and certification requirements. Finally, we present a conformal toolkit for replication and further exploration of our method. Using this toolkit, we perform extensive experiments that validate our approach and emphasize trade-offs and other practical consequences.
comment: 29 pages, 12 figures
♻ ☆ LV-UNet: A Lightweight and Vanilla Model for Medical Image Segmentation
While large models have achieved significant progress in computer vision, challenges such as optimization complexity, the intricacy of transformer architectures, computational constraints, and practical application demands highlight the importance of simpler model designs in medical image segmentation. This need is particularly pronounced in mobile medical devices, which require lightweight, deployable models with real-time performance. However, existing lightweight models often suffer from poor robustness across datasets, limiting their widespread adoption. To address these challenges, this paper introduces LV-UNet, a lightweight and vanilla model that leverages pre-trained MobileNetv3-Large backbones and incorporates fusible modules. LV-UNet employs an enhanced deep training strategy and switches to a deployment mode during inference by re-parametrization, significantly reducing parameter count and computational overhead. Experimental results on ISIC 2016, BUSI, CVC-ClinicDB, CVC-ColonDB, and Kvair-SEG datasets demonstrate a better trade-off between performance and the computational load. The code will be released at https://github.com/juntaoJianggavin/LV-UNet.
comment: Accepted by IEEE BIBM2024 ML4BMI workshop
♻ ☆ SRAGAN: Saliency Regularized and Attended Generative Adversarial Network for Chinese Ink-wash Painting Style Transfer
Recent style transfer problems are still largely dominated by Generative Adversarial Network (GAN) from the perspective of cross-domain image-to-image (I2I) translation, where the pivotal issue is to learn and transfer target-domain style patterns onto source-domain content images. This paper handles the problem of translating real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Though a wide range of I2I models tackle this problem, a notable challenge is that the content details of the source image could be easily erased or corrupted due to the transfer of ink-wash style elements. To remedy this issue, we propose to incorporate saliency detection into the unpaired I2I framework to regularize image content, where the detected saliency map is utilized from two aspects: (\romannumeral1) we propose saliency IOU (SIOU) loss to explicitly regularize object content structure by enforcing saliency consistency before and after image stylization; (\romannumeral2) we propose saliency adaptive normalization (SANorm) which implicitly enhances object structure integrity of the generated paintings by dynamically injecting image saliency information into the generator to guide stylization process. Besides, we also propose saliency attended discriminator which harnesses image saliency information to focus generative adversarial attention onto the drawn objects, contributing to generating more vivid and delicate brush strokes and ink-wash textures. Extensive qualitative and quantitative experiments demonstrate superiority of our approach over related advanced image stylization methods in both GAN and diffusion model paradigms.
comment: Pattern Recognition, Volume 162, June 2025, 111344
♻ ☆ Augmented Reality-based Guidance with Deformable Registration in Head and Neck Tumor Resection MICCAI 2025
Head and neck squamous cell carcinoma (HNSCC) has one of the highest rates of recurrence cases among solid malignancies. Recurrence rates can be reduced by improving positive margins localization. Frozen section analysis (FSA) of resected specimens is the gold standard for intraoperative margin assessment. However, because of the complex 3D anatomy and the significant shrinkage of resected specimens, accurate margin relocation from specimen back onto the resection site based on FSA results remains challenging. We propose a novel deformable registration framework that uses both the pre-resection upper surface and the post-resection site of the specimen to incorporate thickness information into the registration process. The proposed method significantly improves target registration error (TRE), demonstrating enhanced adaptability to thicker specimens. In tongue specimens, the proposed framework improved TRE by up to 33% as compared to prior deformable registration. Notably, tongue specimens exhibit complex 3D anatomies and hold the highest clinical significance compared to other head and neck specimens from the buccal and skin. We analyzed distinct deformation behaviors in different specimens, highlighting the need for tailored deformation strategies. To further aid intraoperative visualization, we also integrated this framework with an augmented reality-based auto-alignment system. The combined system can accurately and automatically overlay the deformed 3D specimen mesh with positive margin annotation onto the resection site. With a pilot study of the AR guided framework involving two surgeons, the integrated system improved the surgeons' average target relocation error from 9.8 cm to 4.8 cm.
comment: Accepted at MICCAI 2025
♻ ☆ C3Editor: Achieving Controllable Consistency in 2D Model for 3D Editing ICCV 2025
Existing 2D-lifting-based 3D editing methods often encounter challenges related to inconsistency, stemming from the lack of view-consistent 2D editing models and the difficulty of ensuring consistent editing across multiple views. To address these issues, we propose C3Editor, a controllable and consistent 2D-lifting-based 3D editing framework. Given an original 3D representation and a text-based editing prompt, our method selectively establishes a view-consistent 2D editing model to achieve superior 3D editing results. The process begins with the controlled selection of a ground truth (GT) view and its corresponding edited image as the optimization target, allowing for user-defined manual edits. Next, we fine-tune the 2D editing model within the GT view and across multiple views to align with the GT-edited image while ensuring multi-view consistency. To meet the distinct requirements of GT view fitting and multi-view consistency, we introduce separate LoRA modules for targeted fine-tuning. Our approach delivers more consistent and controllable 2D and 3D editing results than existing 2D-lifting-based methods, outperforming them in both qualitative and quantitative evaluations.
comment: ICCV 2025 Workshop Wild3D
♻ ☆ NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation NeurIPS 2025
Recent advances in reinforcement learning (RL) have strengthened the reasoning capabilities of vision-language models (VLMs). However, enhancing policy exploration to better scale test-time compute remains largely underexplored. In addition, VLMs continue to struggle with imperfect visual perception, which in turn affects the subsequent reasoning process. We introduce NoisyRollout, a simple yet effective data augmentation method that addresses these issues by mixing training trajectories from both clean and moderately distorted images. This approach injects perceptual diversity, encouraging better policy exploration and leading to more robust reasoning. A noise annealing schedule gradually reduces distortion strength, aiding exploration early in training while ensuring later stability. Crucially, our method is easy-to-adopt--requiring no additional training cost and no modifications to the RL objective. Extensive experiments on 2 distinct training datasets demonstrate that NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models across 5 out-of-domain reasoning and perception benchmarks. Furthermore, we validate the effectiveness of NoisyRollout across model sizes (7B and 32B), data scales (from 1K to 6K) and image augmentation types (Gaussion noise and rotation), highlighting its generalizability and scalability.
comment: NeurIPS 2025
♻ ☆ Dynamic Gaussian Splatting from Defocused and Motion-blurred Monocular Videos NeurIPS 2025
This paper presents a unified framework that allows high-quality dynamic Gaussian Splatting from both defocused and motion-blurred monocular videos. Due to the significant difference between the formation processes of defocus blur and motion blur, existing methods are tailored for either one of them, lacking the ability to simultaneously deal with both of them. Although the two can be jointly modeled as blur kernel-based convolution, the inherent difficulty in estimating accurate blur kernels greatly limits the progress in this direction. In this work, we go a step further towards this direction. Particularly, we propose to estimate per-pixel reliable blur kernels using a blur prediction network that exploits blur-related scene and camera information and is subject to a blur-aware sparsity constraint. Besides, we introduce a dynamic Gaussian densification strategy to mitigate the lack of Gaussians for incomplete regions, and boost the performance of novel view synthesis by incorporating unseen view information to constrain scene optimization. Extensive experiments show that our method outperforms the state-of-the-art methods in generating photorealistic novel view synthesis from defocused and motion-blurred monocular videos. Our code is available at https://github.com/hhhddddddd/dydeblur.
comment: Accepted to NeurIPS 2025
♻ ☆ GASP: Gaussian Splatting for Physic-Based Simulations
Physics simulation is paramount for modeling and utilizing 3D scenes in various real-world applications. However, integrating with state-of-the-art 3D scene rendering techniques such as Gaussian Splatting (GS) remains challenging. Existing models use additional meshing mechanisms, including triangle or tetrahedron meshing, marching cubes, or cage meshes. Alternatively, we can modify the physics-grounded Newtonian dynamics to align with 3D Gaussian components. Current models take the first-order approximation of a deformation map, which locally approximates the dynamics by linear transformations. In contrast, our GS for Physics-Based Simulations (GASP) pipeline uses parametrized flat Gaussian distributions. Consequently, the problem of modeling Gaussian components using the physics engine is reduced to working with 3D points. In our work, we present additional rules for manipulating Gaussians, demonstrating how to adapt the pipeline to incorporate meshes, control Gaussian sizes during simulations, and enhance simulation efficiency. This is achieved through the Gaussian grouping strategy, which implements hierarchical structuring and enables simulations to be performed exclusively on selected Gaussians. The resulting solution can be integrated into any physics engine that can be treated as a black box. As demonstrated in our studies, the proposed pipeline exhibits superior performance on a diverse range of benchmark datasets designed for 3D object rendering. The project webpage, which includes additional visualizations, can be found at https://waczjoan.github.io/GASP.
♻ ☆ D$^2$USt3R: Enhancing 3D Reconstruction for Dynamic Scenes NeurIPS 2025
In this work, we address the task of 3D reconstruction in dynamic scenes, where object motions frequently degrade the quality of previous 3D pointmap regression methods, such as DUSt3R, that are originally designed for static 3D scene reconstruction. Although these methods provide an elegant and powerful solution in static settings, they struggle in the presence of dynamic motions that disrupt alignment based solely on camera poses. To overcome this, we propose $D^2USt3R$ that directly regresses Static-Dynamic Aligned Pointmaps (SDAP) that simultaneiously capture both static and dynamic 3D scene geometry. By explicitly incorporating both spatial and temporal aspects, our approach successfully encapsulates 3D dense correspondence to the proposed pointmaps, enhancing downstream tasks. Extensive experimental evaluations demonstrate that our proposed approach consistently achieves superior 3D reconstruction performance across various datasets featuring complex motions.
comment: NeurIPS 2025; project page: https://cvlab-kaist.github.io/DDUSt3R/
♻ ☆ DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO NeurIPS 2025
Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement algorithm with group-normalized rewards. However, the effectiveness of GRPO in Video Large Language Models (VideoLLMs) has still been less studyed. In this paper, we explore GRPO and identify two problems that deteriorate the effective learning: (1) reliance on safeguards, and (2) vanishing advantage. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO loss function into a regression task that directly predicts the advantage in GRPO, eliminating the need for safeguards such as the clipping and min functions. It directly aligns the model with advantages, providing guidance to prefer better ones. The difficulty-aware data augmentation strategy augments input prompts/videos to locate the difficulty of samples at solvable difficulty levels, enabling diverse reward signals. Our experimental results show that our approach significantly improves video reasoning performance across multiple benchmarks.
comment: NeurIPS 2025
♻ ☆ Rethinking Metrics and Benchmarks of Video Anomaly Detection
Video Anomaly Detection (VAD), which aims to detect anomalies that deviate from expectation, has attracted increasing attention in recent years. Existing advancements in VAD primarily focus on model architectures and training strategies, while devoting insufficient attention to evaluation metrics and benchmarks. In this paper, we rethink VAD evaluation methods through comprehensive analyses, revealing three critical limitations in current practices: 1) existing metrics are significantly influenced by single annotation bias; 2) current metrics fail to reward early detection of anomalies; 3) available benchmarks lack the capability to evaluate scene overfitting of fully/weakly-supervised algorithms. To address these limitations, we propose three novel evaluation methods: first, we establish probabilistic AUC/AP (Prob-AUC/AP) metrics utlizing multi-round annotations to mitigate single annotation bias; second, we develop a Latency-aware Average Precision (LaAP) metric that rewards early and accurate anomaly detection; and finally, we introduce two hard normal benchmarks (UCF-HN, MSAD-HN) with videos specifically designed to evaluate scene overfitting. We report performance comparisons of ten state-of-the-art VAD approaches using our proposed evaluation methods, providing novel perspectives for future VAD model development. We release our data and code in https://github.com/Kamino666/RethinkingVAD.
♻ ☆ Mano Technical Report
Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decisionmaking capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
♻ ☆ Manifold Learning for Hyperspectral Images
Traditional feature extraction and projection techniques, such as Principal Component Analysis, struggle to adequately represent X-Ray Transmission (XRT) Multi-Energy (ME) images, limiting the performance of neural networks in decision-making processes. To address this issue, we propose a method that approximates the dataset topology by constructing adjacency graphs using the Uniform Manifold Approximation and Projection. This approach captures nonlinear correlations within the data, significantly improving the performance of machine learning algorithms, particularly in processing Hyperspectral Images (HSI) from X-ray transmission spectroscopy. This technique not only preserves the global structure of the data but also enhances feature separability, leading to more accurate and robust classification results.
♻ ☆ EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting NeurIPS 2025
Scene reconstruction from casually captured videos has wide applications in real-world scenarios. With recent advancements in differentiable rendering techniques, several methods have attempted to simultaneously optimize scene representations (NeRF or 3DGS) and camera poses. Despite recent progress, existing methods relying on traditional camera input tend to fail in high-speed (or equivalently low-frame-rate) scenarios. Event cameras, inspired by biological vision, record pixel-wise intensity changes asynchronously with high temporal resolution, providing valuable scene and motion information in blind inter-frame intervals. In this paper, we introduce the event camera to aid scene construction from a casually captured video for the first time, and propose Event-Aided Free-Trajectory 3DGS, called EF-3DGS, which seamlessly integrates the advantages of event cameras into 3DGS through three key components. First, we leverage the Event Generation Model (EGM) to fuse events and frames, supervising the rendered views observed by the event stream. Second, we adopt the Contrast Maximization (CMax) framework in a piece-wise manner to extract motion information by maximizing the contrast of the Image of Warped Events (IWE), thereby calibrating the estimated poses. Besides, based on the Linear Event Generation Model (LEGM), the brightness information encoded in the IWE is also utilized to constrain the 3DGS in the gradient domain. Third, to mitigate the absence of color information of events, we introduce photometric bundle adjustment (PBA) to ensure view consistency across events and frames. We evaluate our method on the public Tanks and Temples benchmark and a newly collected real-world dataset, RealEv-DAVIS. Our project page is https://lbh666.github.io/ef-3dgs/.
comment: Accepted to NeurIPS 2025,Project Page: https://lbh666.github.io/ef-3dgs/
♻ ☆ Poisson Informed Retinex Network for Extreme Low-Light Image Enhancement
Low-light image denoising and enhancement are challenging, especially when traditional noise assumptions, such as Gaussian noise, do not hold in majority. In many real-world scenarios, such as low-light imaging, noise is signal-dependent and is better represented as Poisson noise. In this work, we address the problem of denoising images degraded by Poisson noise under extreme low-light conditions. We introduce a light-weight deep learning-based method that integrates Retinex based decomposition with Poisson denoising into a unified encoder-decoder network. The model simultaneously enhances illumination and suppresses noise by incorporating a Poisson denoising loss to address signal-dependent noise. Without prior requirement for reflectance and illumination, the network learns an effective decomposition process while ensuring consistent reflectance and smooth illumination without causing any form of color distortion. The experimental results demonstrate the effectiveness and practicality of the proposed low-light illumination enhancement method. Our method significantly improves visibility and brightness in low-light conditions, while preserving image structure and color constancy under ambient illumination.
comment: 10 pages, 5 figures and 1 table
♻ ☆ LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation
We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM driven part query refinement strategy. The code will be released here.
comment: 10 pages, 5 figures, 14 tables, Neurips 2025
♻ ☆ FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction
High-quality 3D world models are pivotal for embodied intelligence and Artificial General Intelligence (AGI), underpinning applications such as AR/VR content creation and robotic navigation. Despite the established strong imaginative priors, current video foundation models lack explicit 3D grounding capabilities, thus being limited in both spatial consistency and their utility for downstream 3D reasoning tasks. In this work, we present FantasyWorld, a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch, enabling joint modeling of video latents and an implicit 3D field in a single forward pass. Our approach introduces cross-branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction, thus yielding consistent and generalizable 3D-aware video representations. Notably, the resulting latents from the geometric branch can potentially serve as versatile representations for downstream 3D tasks such as novel view synthesis and navigation, without requiring per-scene optimization or fine-tuning. Extensive experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency. Ablation studies further confirm that these gains stem from the unified backbone and cross-branch information exchange.
♻ ☆ BALR-SAM: Boundary-Aware Low-Rank Adaptation of SAM for Resource-Efficient Medical Image Segmentation
Vision foundation models like the Segment Anything Model (SAM), pretrained on large-scale natural image datasets, often struggle in medical image segmentation due to a lack of domain-specific adaptation. In clinical practice, fine-tuning such models efficiently for medical downstream tasks with minimal resource demands, while maintaining strong performance, is challenging. To address these issues, we propose BALR-SAM, a boundary-aware low-rank adaptation framework that enhances SAM for medical imaging. It combines three tailored components: (1) a Complementary Detail Enhancement Network (CDEN) using depthwise separable convolutions and multi-scale fusion to capture boundary-sensitive features essential for accurate segmentation; (2) low-rank adapters integrated into SAM's Vision Transformer blocks to optimize feature representation and attention for medical contexts, while simultaneously significantly reducing the parameter space; and (3) a low-rank tensor attention mechanism in the mask decoder, cutting memory usage by 75% and boosting inference speed. Experiments on standard medical segmentation datasets show that BALR-SAM, without requiring prompts, outperforms several state-of-the-art (SOTA) methods, including fully fine-tuned MedSAM, while updating just 1.8% (11.7M) of its parameters.
♻ ☆ AMD-Hummingbird: Towards an Efficient Text-to-Video Model
Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency and high visual quality, particularly on resource-limited devices, e.g.,iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31X speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.
comment: Homepage: https://www.amd.com/en/developer/resources/technical-articles/amd-hummingbird-0-9b-text-to-video-diffusion-model-with-4-step-inferencing.html| GitHub: https://github.com/AMD-AIG-AIMA/AMD-Hummingbird-T2V
♻ ☆ Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling NeurIPS 2025
Diffusion-based generative processes, formulated as differential equation solving, frequently balance computational speed with sample quality. Our theoretical investigation of ODE- and SDE-based solvers reveals complementary weaknesses: ODE solvers accumulate irreducible gradient error along deterministic trajectories, while SDE methods suffer from amplified discretization errors when the step budget is limited. Building upon this insight, we introduce AdaSDE, a novel single-step SDE solver that aims to unify the efficiency of ODEs with the error resilience of SDEs. Specifically, we introduce a single per-step learnable coefficient, estimated via lightweight distillation, which dynamically regulates the error correction strength to accelerate diffusion sampling. Notably, our framework can be integrated with existing solvers to enhance their capabilities. Extensive experiments demonstrate state-of-the-art performance: at 5 NFE, AdaSDE achieves FID scores of 4.18 on CIFAR-10, 8.05 on FFHQ and 6.96 on LSUN Bedroom. Codes are available in https://github.com/WLU-wry02/AdaSDE.
comment: To appear in NeurIPS 2025
♻ ☆ MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
comment: Code available at: https://github.com/HKUST-MINSys-Lab/MMEdge. Accepted by SenSys 2026
♻ ☆ MixedGaussianAvatar: Realistically and Geometrically Accurate Head Avatar via Mixed 2D-3D Gaussians
Reconstructing high-fidelity 3D head avatars is crucial in various applications such as virtual reality. The pioneering methods reconstruct realistic head avatars with Neural Radiance Fields (NeRF), which have been limited by training and rendering speed. Recent methods based on 3D Gaussian Splatting (3DGS) significantly improve the efficiency of training and rendering. However, the surface inconsistency of 3DGS results in subpar geometric accuracy; later, 2DGS uses 2D surfels to enhance geometric accuracy at the expense of rendering fidelity. To leverage the benefits of both 2DGS and 3DGS, we propose a novel method named MixedGaussianAvatar for realistically and geometrically accurate head avatar reconstruction. Our main idea is to utilize 2D Gaussians to reconstruct the surface of the 3D head, ensuring geometric accuracy. We attach the 2D Gaussians to the triangular mesh of the FLAME model and connect additional 3D Gaussians to those 2D Gaussians where the rendering quality of 2DGS is inadequate, creating a mixed 2D-3D Gaussian representation. These 2D-3D Gaussians can then be animated using FLAME parameters. We further introduce a progressive training strategy that first trains the 2D Gaussians and then fine-tunes the mixed 2D-3D Gaussians. We use a unified mixed Gaussian representation to integrate the two modalities of 2D image and 3D mesh. Furthermore, the comprehensive experiments demonstrate the superiority of MixedGaussianAvatar. The code will be released.
♻ ☆ Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers ICCV2025
Data-free quantization (DFQ) enables model quantization without accessing real data, addressing concerns regarding data security and privacy. With the growing adoption of Vision Transformers (ViTs), DFQ for ViTs has garnered significant attention. However, existing DFQ methods exhibit two limitations: (1) semantic distortion, where the semantics of synthetic images deviate substantially from those of real images, and (2) semantic inadequacy, where synthetic images contain extensive regions with limited content and oversimplified textures, leading to suboptimal quantization performance. To address these limitations, we propose SARDFQ, a novel Semantics Alignment and Reinforcement Data-Free Quantization method for ViTs. To address semantic distortion, SARDFQ incorporates Attention Priors Alignment (APA), which optimizes synthetic images to follow randomly generated structure attention priors. To mitigate semantic inadequacy, SARDFQ introduces Multi-Semantic Reinforcement (MSR), leveraging localized patch optimization to enhance semantic richness across synthetic images. Furthermore, SARDFQ employs Soft-Label Learning (SL), wherein multiple semantic targets are adapted to facilitate the learning of multi-semantic images augmented by MSR. Extensive experiments demonstrate the effectiveness of SARDFQ, significantly surpassing existing methods. For example, SARDFQ improves top-1 accuracy on ImageNet by 15.52% for W4A4 ViT-B. The code is at https://github.com/zysxmu/SARDFQ.
comment: ICCV2025
♻ ☆ Panoramic Out-of-Distribution Segmentation for Autonomous Driving
Panoramic imaging enables capturing 360{\deg} images with an ultra-wide Field-of-View (FoV) for dense omnidirectional perception, which is critical to applications, such as autonomous driving and augmented reality, etc. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to background clutter and pixel distortions. To address these issues, we introduce a new task, Panoramic Out-of-distribution Segmentation (PanOoS), with the aim of achieving comprehensive and safe scene understanding. Furthermore, we propose the first solution, POS, which adapts to the characteristics of panoramic images through text-guided prompt distribution learning. Specifically, POS integrates a disentanglement strategy designed to materialize the cross-domain generalization capability of CLIP. The proposed Prompt-based Restoration Attention (PRA) optimizes semantic decoding by prompt guidance and self-adaptive correction, while Bilevel Prompt Distribution Learning (BPDL) refines the manifold of per-pixel mask embeddings via semantic prototype supervision. Besides, to compensate for the scarcity of PanOoS datasets, we establish two benchmarks: DenseOoS, which features diverse outliers in complex environments, and QuadOoS, captured by a quadruped robot with a panoramic annular lens system. Extensive experiments demonstrate superior performance of POS, with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS, outperforming state-of-the-art pinhole-OoS methods. Moreover, POS achieves leading closed-set segmentation capabilities and advances the development of panoramic understanding. Code and datasets will be available at https://github.com/MengfeiD/PanOoS.
comment: Code and datasets will be available at https://github.com/MengfeiD/PanOoS
♻ ☆ Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Vision-Language Models (VLMs) often struggle to balance visual and textual information when summarizing complex multimodal inputs, such as entire TV show episodes. In this paper, we propose a zero-shot video-to-text summarization approach that builds its own screenplay representation of an episode, effectively integrating key video moments, dialogue, and character information into a unified document. Unlike previous approaches, we simultaneously generate screenplays and name the characters in zero-shot, using only the audio, video, and transcripts as input. Additionally, we highlight that existing summarization metrics can fail to assess the multimodal content in summaries. To address this, we introduce MFactSum, a multimodal metric that evaluates summaries with respect to both vision and text modalities. Using MFactSum, we evaluate our screenplay summaries on the SummScreen3D dataset, demonstrating superiority against state-of-the-art VLMs such as Gemini 1.5 by generating summaries containing 20% more relevant visual information while requiring 75% less of the video as input.
♻ ☆ IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model's capacity and limiting adaptability to downstream tasks. In this paper, we propose InstanceGrounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering through only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations with a novel data curation pipeline.
comment: https://github.com/lifuguan/IGGT_official
♻ ☆ Scaling Diffusion Transformers Efficiently via $μ$P NeurIPS 2025
Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including U-ViT, DiT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-$\alpha$ and 3% of consumption by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.
comment: Accepted by NeurIPS 2025, 38 pages, 10 figures, 17 tables
♻ ☆ Variational Visual Question Answering for Uncertainty-Aware Selective Prediction
Despite remarkable progress in recent years, vision language models (VLMs) remain prone to overconfidence and hallucinations on tasks such as Visual Question Answering (VQA) and Visual Reasoning. Bayesian methods can potentially improve reliability by helping models selectively predict, that is, models respond only when they are sufficiently confident. Unfortunately, Bayesian methods are often assumed to be costly and ineffective for large models, and so far there exists little evidence to show otherwise, especially for multimodal applications. Here, we show the effectiveness and competitive edge of variational Bayes for selective prediction in VQA for the first time. We build on recent advances in variational methods for deep learning and propose an extension called "Variational VQA". This method improves calibration and yields significant gains for selective prediction on VQA and Visual Reasoning, particularly when the error tolerance is low ($\leq 1\%$). Often, just one posterior sample can yield more reliable answers than those obtained by models trained with AdamW. In addition, we propose a new risk-averse selector that outperforms standard sample averaging by considering the variance of predictions. Overall, we present compelling evidence that variational learning is a viable option to make large VLMs safer and more trustworthy.
comment: under review at TMLR
♻ ☆ On the Theory of Conditional Feature Alignment for Unsupervised Domain-Adaptive Counting
Object counting models suffer when deployed across domains with differing density variety, since density shifts are inherently task-relevant and violate standard domain adaptation assumptions. To address this, we propose a theoretical framework of conditional feature alignment and provide a straightforward implementation. By theoretical analysis, our framework is feasible to achieve superior cross-domain generalization for counting. In the presented network, the features related to density are explicitly preserved across domains. Theoretically, we formalize the notion of conditional divergence by partitioning each domain into subsets and measuring divergences per condition. We then derive a joint error bound showing that, under discrete label spaces treated as condition sets, aligning distributions conditionally leads to tighter bounds on the combined source-target decision error than unconditional alignment. Empirically, we demonstrate the effectiveness of our approach through extensive experiments on multiple counting datasets with varying density distributions. The results show that our method outperforms existing unsupervised domain adaptation methods, empirically validating the theoretical insights on conditional feature alignment.
comment: 18 pages, 6 figures, 5 tables
♻ ☆ SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code is available at https://github.com/thu-ml/SageAttention.
♻ ☆ Continual Vision-and-Language Navigation
Developing Vision-and-Language Navigation (VLN) agents typically assumes a \textit{train-once-deploy-once} strategy, which is unrealistic as deployed agents continually encounter novel environments. To address this, we propose the Continual Vision-and-Language Navigation (CVLN) paradigm, where agents learn and adapt incrementally across multiple \textit{scene domains}. CVLN includes two setups: Initial-instruction based CVLN for instruction-following, and Dialogue-based CVLN for dialogue-guided navigation. We also introduce two simple yet effective baselines for sequential decision-making: Perplexity Replay (PerpR), which replays difficult episodes, and Episodic Self-Replay (ESR), which stores and revisits action logits during training. Experiments show that existing continual learning methods fall short for CVLN, while PerpR and ESR achieve better performance by efficiently utilizing replay memory.
♻ ☆ Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning
We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
comment: This work is to appear in the Proceedings of MICAD 2025, the 6th International Conference on Medical Imaging and Computer-Aided Diagnosis
♻ ☆ DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications
Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.
♻ ☆ Larger Hausdorff Dimension in Scanning Pattern Facilitates Mamba-Based Methods in Low-Light Image Enhancement
We propose an innovative enhancement to the Mamba framework by increasing the Hausdorff dimension of its scanning pattern through a novel Hilbert Selective Scan mechanism. This mechanism explores the feature space more effectively, capturing intricate fine-scale details and improving overall coverage. As a result, it mitigates information inconsistencies while refining spatial locality to better capture subtle local interactions without sacrificing the model's ability to handle long-range dependencies. Extensive experiments on publicly available benchmarks demonstrate that our approach significantly improves both the quantitative metrics and qualitative visual fidelity of existing Mamba-based low-light image enhancement methods, all while reducing computational resource consumption and shortening inference time. We believe that this refined strategy not only advances the state-of-the-art in low-light image enhancement but also holds promise for broader applications in fields that leverage Mamba-based techniques.
♻ ☆ AVA: Towards Agentic Video Analytics with Vision Language Models
AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively-significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%. The source code of AVA is available at https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at https://huggingface.co/datasets/iesc/Ava-100.
comment: Accepted to NDSI 2026, 19pages, 12 figures, complementary evaluations and appendix
♻ ☆ Tensor Completion via Monotone Inclusion: Generalized Low-Rank Priors Meet Deep Denoisers
Missing entries in multi dimensional data pose significant challenges for downstream analysis across diverse real world applications. These data are naturally represented as tensors, and recent completion methods integrating global low rank priors with plug and play denoisers have demonstrated strong empirical performance. However, these approaches often rely on empirical convergence alone or unrealistic assumptions, such as deep denoisers acting as proximal operators of implicit regularizers, which generally does not hold. To address these limitations, we propose a novel tensor completion framework grounded in the monotone inclusion paradigm. Within this framework, deep denoisers are treated as general operators that require far fewer restrictions than in classical optimization based formulations. To better capture holistic structure, we further incorporate generalized low rank priors with weakly convex penalties. Building upon the Davis Yin splitting scheme, we develop the GTCTV DPC algorithm and rigorously establish its global convergence. Extensive experiments demonstrate that GTCTV DPC consistently outperforms existing methods in both quantitative metrics and visual quality, particularly at low sampling rates. For instance, at a sampling rate of 0.05 for multi dimensional image completion, GTCTV DPC achieves an average mean peak signal to noise ratio (MPSNR) that surpasses the second best method by 0.717 dB, and 0.649 dB for multi spectral images, and color videos, respectively.
comment: 14 pages, 8 figures, 6 tables
♻ ☆ Neural Posterior Estimation for Cataloging Astronomical Images from the Legacy Survey of Space and Time
The Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) will commence full-scale operations in 2026, yielding an unprecedented volume of astronomical images. Constructing an astronomical catalog, a table of imaged stars, galaxies, and their properties, is a fundamental step in most scientific workflows based on astronomical image data. Traditional deterministic cataloging methods lack statistical coherence as cataloging is an ill-posed problem, while existing probabilistic approaches suffer from computational inefficiency, inaccuracy, or the inability to perform inference with multiband coadded images, the primary output format for LSST images. In this article, we explore a recently developed Bayesian inference method called neural posterior estimation (NPE) as an approach to cataloging. NPE leverages deep learning to achieve both computational efficiency and high accuracy. When evaluated on the DC2 Simulated Sky Survey -- a highly realistic synthetic dataset designed to mimic LSST data -- NPE systematically outperforms the standard LSST pipeline in light source detection, flux measurement, star/galaxy classification, and galaxy shape measurement. Additionally, NPE provides well-calibrated posterior approximations. These promising results, obtained using simulated data, illustrate the potential of NPE in the absence of model misspecification. Although some degree of model misspecification is inevitable in the application of NPE to real LSST images, there are a variety of strategies to mitigate its effects.
♻ ☆ Face Spoofing Detection using Deep Learning
Digital image spoofing has emerged as a significant security threat in biometric authentication systems, particularly those relying on facial recognition. This study evaluates the performance of three vision based models, MobileNetV2, ResNET50, and Vision Transformer, ViT, for spoof detection in image classification, utilizing a dataset of 150,986 images divided into training , 140,002, testing, 10,984, and validation ,39,574, sets. Spoof detection is critical for enhancing the security of image recognition systems, and this research compares the models effectiveness through accuracy, precision, recall, and F1 score metrics. Results reveal that MobileNetV2 outperforms other architectures on the test dataset, achieving an accuracy of 91.59%, precision of 91.72%, recall of 91.59%, and F1 score of 91.58%, compared to ViT 86.54%, 88.28%, 86.54%, and 86.39%, respectively. On the validation dataset, MobileNetV2, and ViT excel, with MobileNetV2 slightly ahead at 97.17% accuracy versus ViT 96.36%. MobileNetV2 demonstrates faster convergence during training and superior generalization to unseen data, despite both models showing signs of overfitting. These findings highlight MobileNetV2 balanced performance and robustness, making it the preferred choice for spoof detection applications where reliability on new data is essential. The study underscores the importance of model selection in security sensitive contexts and suggests MobileNetV2 as a practical solution for real world deployment.
comment: The author's school has a conflict of interest regarding the submission of this article prior to his graduation thesis submission
♻ ☆ LocDiff: Identifying Locations on Earth by Diffusing in the Hilbert Space
Image geolocalization is a fundamental yet challenging task, aiming at inferring the geolocation on Earth where an image is taken. State-of-the-art methods employ either grid-based classification or gallery-based image-location retrieval, whose spatial generalizability significantly suffers if the spatial distribution of test im- ages does not align with the choices of grids and galleries. Recently emerging generative approaches, while getting rid of grids and galleries, use raw geographical coordinates and suffer quality losses due to their lack of multi-scale information. To address these limitations, we propose a multi-scale latent diffusion model called LocDiff for image geolocalization. We developed a novel positional encoding-decoding framework called Spherical Harmonics Dirac Delta (SHDD) Representations, which encodes points on a spherical surface (e.g., geolocations on Earth) into a Hilbert space of Spherical Harmonics coefficients and decodes points (geolocations) by mode-seeking on spherical probability distributions. We also propose a novel SirenNet-based architecture (CS-UNet) to learn an image-based conditional backward process in the latent SHDD space by minimizing a latent KL-divergence loss. To the best of our knowledge, LocDiff is the first image geolocalization model that performs latent diffusion in a multi-scale location encoding space and generates geolocations under the guidance of images. Experimental results show that LocDiff can outperform all state-of-the-art grid-based, retrieval-based, and diffusion-based baselines across 5 challenging global-scale image geolocalization datasets, and demonstrates significantly stronger generalizability to unseen geolocations.
♻ ☆ Hyperparameter Optimization and Reproducibility in Deep Learning Model Training
Reproducibility remains a critical challenge in foundation model training for histopathology, often hindered by software randomness, hardware non-determinism, and inconsistent hyperparameter reporting. To investigate these issues, we trained a CLIP model on the QUILT-1M dataset and systematically evaluated the impact of different hyperparameter settings and augmentation strategies across three downstream histopathology datasets (PatchCamelyon, LC25000-Lung, and LC25000-Colon). Despite variability across runs, we identified clear trends: RandomResizedCrop values of 0.7-0.8 outperformed more aggressive (0.6) or conservative (0.9) settings, distributed training without local loss improved stability, and learning rates below 5.0e-5 consistently degraded performance across all datasets. The LC25000 (Colon) dataset consistently provided the most reproducible benchmark. These findings highlight that reproducibility in computational pathology depends not only on transparent documentation but also on carefully chosen experimental configurations, and we provide practical rules to guide future efforts in developing reproducible foundation models for digital pathology.
♻ ☆ PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue
Recent rapid progress in the field of computational pathology has been enabled by foundation models. These models are beginning to move beyond encoding image patches towards whole-slide understanding but their clinical utility remains limited. In this work, we present PRISM2, a multimodal slide-level foundation model trained on data from 700,000 diagnostic specimen-report pairs, the largest vision (2.3 million whole slide images) and language (14M question-answer pairs) histopathology dataset to date. By learning through clinical-dialogue supervision, PRISM2 aligns histomorphologic features with the language of diagnostic reasoning, producing slide-level representations that support both direct diagnostic question-answering and transferable embeddings for downstream tasks. Without additional training, PRISM2 matches or exceeds the cancer-detection performance of clinical-grade products. This is observed without loss of generality on other tasks, where PRISM2 achieves top performance. Finally, using survival prediction as the example, we show that task-specific finetuning with a large dataset can outperform task-specific models, further improving performance. These results demonstrate how language-supervised pretraining provides a scalable, clinically grounded signal for learning generalizable pathology representations, bridging human diagnostic reasoning and foundation-model performance.
♻ ☆ AI-Driven Detection and Analysis of Handwriting on Seized Ivory: A Tool to Uncover Criminal Networks in the Illicit Wildlife Trade
The transnational ivory trade continues to drive the decline of elephant populations across Africa, and trafficking networks remain difficult to disrupt. Tusks seized by law enforcement officials carry forensic information on the traffickers responsible for their export, including DNA evidence and handwritten markings made by traffickers. For 20 years, analyses of tusk DNA have identified where elephants were poached and established connections among shipments of ivory. While the links established using genetic evidence are extremely conclusive, genetic data is expensive and sometimes impossible to obtain. But though handwritten markings are easy to photograph, they are rarely documented or analyzed. Here, we present an AI-driven pipeline for extracting and analyzing handwritten markings on seized elephant tusks, offering a novel, scalable, and low-cost source of forensic evidence. Having collected 6,085 photographs from eight large seizures of ivory over a 6-year period (2014-2019), we used an object detection model to extract over 17,000 individual markings, which were then labeled and described using state-of-the-art AI tools. We identified 184 recurring "signature markings" that connect the tusks on which they appear. 20 signature markings were observed in multiple seizures, establishing forensic links between these seizures through traffickers involved in both shipments. This work complements other investigative techniques by filling in gaps where other data sources are unavailable. The study demonstrates the transformative potential of AI in wildlife forensics and highlights practical steps for integrating handwriting analysis into efforts to disrupt organized wildlife crime.
comment: Submitted. 5 pages, 3 figures, 1 table
♻ ☆ Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences
Measuring alignment between language and vision is a fundamental challenge, especially as multimodal data becomes increasingly detailed and complex. Existing methods often rely on collecting human or AI preferences, which can be costly and time-intensive. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image and generated text, we map the text back to image space using a text-to-image model and compute the similarity between the original image and its reconstruction. Analogously, for text-to-image generation, we measure the textual similarity between an input caption and its reconstruction through the cycle. We use the cycle consistency score to rank candidates and construct a preference dataset of 866K comparison pairs. The reward model trained on our dataset, CycleReward, outperforms state-of-the-art alignment metrics on detailed captioning, with superior inference-time scalability when used as a verifier for Best-of-N sampling, while maintaining speed and differentiability. Furthermore, performing DPO and Diffusion DPO using our dataset enhances performance across a wide range of vision-language tasks and text-to-image generation. Our dataset, model, and code are publicly released at https://cyclereward.github.io.
Machine Learning 150
☆ Continuous Autoregressive Language Models
The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9\% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models. Code: https://github.com/shaochenze/calm. Project: https://shaochenze.github.io/blog/2025/CALM.
☆ PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
Recent advances in vision-language models (VLMs) have enabled impressive multimodal reasoning, yet most medical applications remain limited to 2D imaging. In this work, we extend VLMs to 3D positron emission tomography and computed tomography (PET/CT), a domain characterized by large volumetric data, small and dispersed lesions, and lengthy radiology reports. We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams, extracted via a hybrid rule-based and large language model (LLM) pipeline. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. PETAR bridges global contextual reasoning with fine-grained lesion awareness, producing clinically coherent and localized findings. Comprehensive automated and human evaluations demonstrate that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.
☆ Dark-Field X-Ray Imaging Significantly Improves Deep-Learning based Detection of Synthetic Early-Stage Lung Tumors in Preclinical Models
Low-dose computed tomography (LDCT) is the current standard for lung cancer screening, yet its adoption and accessibility remain limited. Many regions lack LDCT infrastructure, and even among those screened, early-stage cancer detection often yield false positives, as shown in the National Lung Screening Trial (NLST) with a sensitivity of 93.8 percent and a false-positive rate of 26.6 percent. We aim to investigate whether X-ray dark-field imaging (DFI) radiograph, a technique sensitive to small-angle scatter from alveolar microstructure and less susceptible to organ shadowing, can significantly improve early-stage lung tumor detection when coupled with deep-learning segmentation. Using paired attenuation (ATTN) and DFI radiograph images of euthanized mouse lungs, we generated realistic synthetic tumors with irregular boundaries and intensity profiles consistent with physical lung contrast. A U-Net segmentation network was trained on small patches using either ATTN, DFI, or a combination of ATTN and DFI channels.Results show that the DFI-only model achieved a true-positive detection rate of 83.7 percent, compared with 51 percent for ATTN-only, while maintaining comparable specificity (90.5 versus 92.9 percent). The combined ATTN and DFI input achieved 79.6 percent sensitivity and 97.6 percent specificity. In conclusion, DFI substantially improves early-tumor detectability in comparison to standard attenuation radiography and shows potential as an accessible, low-cost, low-dose alternative for pre-clinical or limited-resource screening where LDCT is unavailable.
☆ On Selecting Few-Shot Examples for LLM-based Code Vulnerability Detection
Large language models (LLMs) have demonstrated impressive capabilities for many coding tasks, including summarization, translation, completion, and code generation. However, detecting code vulnerabilities remains a challenging task for LLMs. An effective way to improve LLM performance is in-context learning (ICL) - providing few-shot examples similar to the query, along with correct answers, can improve an LLM's ability to generate correct solutions. However, choosing the few-shot examples appropriately is crucial to improving model performance. In this paper, we explore two criteria for choosing few-shot examples for ICL used in the code vulnerability detection task. The first criterion considers if the LLM (consistently) makes a mistake or not on a sample with the intuition that LLM performance on a sample is informative about its usefulness as a few-shot example. The other criterion considers similarity of the examples with the program under query and chooses few-shot examples based on the $k$-nearest neighbors to the given sample. We perform evaluations to determine the benefits of these criteria individually as well as under various combinations, using open-source models on multiple datasets.
☆ MolChord: Structure-Sequence Alignment for Protein-Guided Drug Design
Structure-based drug design (SBDD), which maps target proteins to candidate molecular ligands, is a fundamental task in drug discovery. Effectively aligning protein structural representations with molecular representations, and ensuring alignment between generated drugs and their pharmacological properties, remains a critical challenge. To address these challenges, we propose MolChord, which integrates two key techniques: (1) to align protein and molecule structures with their textual descriptions and sequential representations (e.g., FASTA for proteins and SMILES for molecules), we leverage NatureLM, an autoregressive model unifying text, small molecules, and proteins, as the molecule generator, alongside a diffusion-based structure encoder; and (2) to guide molecules toward desired properties, we curate a property-aware dataset by integrating preference data and refine the alignment process using Direct Preference Optimization (DPO). Experimental results on CrossDocked2020 demonstrate that our approach achieves state-of-the-art performance on key evaluation metrics, highlighting its potential as a practical tool for SBDD.
comment: 21 pages
☆ Bayesian model selection and misspecification testing in imaging inverse problems only from noisy and partial measurements
Modern imaging techniques heavily rely on Bayesian statistical models to address difficult image reconstruction and restoration tasks. This paper addresses the objective evaluation of such models in settings where ground truth is unavailable, with a focus on model selection and misspecification diagnosis. Existing unsupervised model evaluation methods are often unsuitable for computational imaging due to their high computational cost and incompatibility with modern image priors defined implicitly via machine learning models. We herein propose a general methodology for unsupervised model selection and misspecification detection in Bayesian imaging sciences, based on a novel combination of Bayesian cross-validation and data fission, a randomized measurement splitting technique. The approach is compatible with any Bayesian imaging sampler, including diffusion and plug-and-play samplers. We demonstrate the methodology through experiments involving various scoring rules and types of model misspecification, where we achieve excellent selection and detection accuracy with a low computational cost.
☆ Challenges in Credit Assignment for Multi-Agent Reinforcement Learning in Open Agent Systems
In the rapidly evolving field of multi-agent reinforcement learning (MARL), understanding the dynamics of open systems is crucial. Openness in MARL refers to the dynam-ic nature of agent populations, tasks, and agent types with-in a system. Specifically, there are three types of openness as reported in (Eck et al. 2023) [2]: agent openness, where agents can enter or leave the system at any time; task openness, where new tasks emerge, and existing ones evolve or disappear; and type openness, where the capabil-ities and behaviors of agents change over time. This report provides a conceptual and empirical review, focusing on the interplay between openness and the credit assignment problem (CAP). CAP involves determining the contribution of individual agents to the overall system performance, a task that becomes increasingly complex in open environ-ments. Traditional credit assignment (CA) methods often assume static agent populations, fixed and pre-defined tasks, and stationary types, making them inadequate for open systems. We first conduct a conceptual analysis, in-troducing new sub-categories of openness to detail how events like agent turnover or task cancellation break the assumptions of environmental stationarity and fixed team composition that underpin existing CAP methods. We then present an empirical study using representative temporal and structural algorithms in an open environment. The results demonstrate that openness directly causes credit misattribution, evidenced by unstable loss functions and significant performance degradation.
☆ Information-Theoretic Greedy Layer-wise Training for Traffic Sign Recognition
Modern deep neural networks (DNNs) are typically trained with a global cross-entropy loss in a supervised end-to-end manner: neurons need to store their outgoing weights; training alternates between a forward pass (computation) and a top-down backward pass (learning) which is biologically implausible. Alternatively, greedy layer-wise training eliminates the need for cross-entropy loss and backpropagation. By avoiding the computation of intermediate gradients and the storage of intermediate outputs, it reduces memory usage and helps mitigate issues such as vanishing or exploding gradients. However, most existing layer-wise training approaches have been evaluated only on relatively small datasets with simple deep architectures. In this paper, we first systematically analyze the training dynamics of popular convolutional neural networks (CNNs) trained by stochastic gradient descent (SGD) through an information-theoretic lens. Our findings reveal that networks converge layer-by-layer from bottom to top and that the flow of information adheres to a Markov information bottleneck principle. Building on these observations, we propose a novel layer-wise training approach based on the recently developed deterministic information bottleneck (DIB) and the matrix-based R\'enyi's $\alpha$-order entropy functional. Specifically, each layer is trained jointly with an auxiliary classifier that connects directly to the output layer, enabling the learning of minimal sufficient task-relevant representations. We empirically validate the effectiveness of our training procedure on CIFAR-10 and CIFAR-100 using modern deep CNNs and further demonstrate its applicability to a practical task involving traffic sign recognition. Our approach not only outperforms existing layer-wise training baselines but also achieves performance comparable to SGD.
☆ Imbalanced Classification through the Lens of Spurious Correlations
Class imbalance poses a fundamental challenge in machine learning, frequently leading to unreliable classification performance. While prior methods focus on data- or loss-reweighting schemes, we view imbalance as a data condition that amplifies Clever Hans (CH) effects by underspecification of minority classes. In a counterfactual explanations-based approach, we propose to leverage Explainable AI to jointly identify and eliminate CH effects emerging under imbalance. Our method achieves competitive classification performance on three datasets and demonstrates how CH effects emerge under imbalance, a perspective largely overlooked by existing approaches.
☆ Bayesian Optimization on Networks
This paper studies optimization on networks modeled as metric graphs. Motivated by applications where the objective function is expensive to evaluate or only available as a black box, we develop Bayesian optimization algorithms that sequentially update a Gaussian process surrogate model of the objective to guide the acquisition of query points. To ensure that the surrogates are tailored to the network's geometry, we adopt Whittle-Mat\'ern Gaussian process prior models defined via stochastic partial differential equations on metric graphs. In addition to establishing regret bounds for optimizing sufficiently smooth objective functions, we analyze the practical case in which the smoothness of the objective is unknown and the Whittle-Mat\'ern prior is represented using finite elements. Numerical results demonstrate the effectiveness of our algorithms for optimizing benchmark objective functions on a synthetic metric graph and for Bayesian inversion via maximum a posteriori estimation on a telecommunication network.
comment: 36 pages, 6 figures; includes appendices
☆ SpecAttn: Speculating Sparse Attention NeurIPS 2025
Large Language Models (LLMs) face significant computational bottlenecks during inference due to the quadratic complexity of self-attention mechanisms, particularly as context lengths increase. We introduce SpecAttn, a novel training-free approach that seamlessly integrates with existing speculative decoding techniques to enable efficient sparse attention in pre-trained transformers. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model, eliminating redundant computation while maintaining output quality. SpecAttn employs three core techniques: KL divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free algorithm for top-p token selection from draft attention patterns, and dynamic key-value cache pruning guided by these predictions. By leveraging the computational work already performed in standard speculative decoding pipelines, SpecAttn achieves over 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset, significantly outperforming existing sparse attention methods. Our approach demonstrates that speculative execution can be enhanced to provide approximate verification without significant performance degradation.
comment: Accepted to NeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling
☆ Enhancing software product lines with machine learning components
Modern software systems increasingly integrate machine learning (ML) due to its advancements and ability to enhance data-driven decision-making. However, this integration introduces significant challenges for software engineering, especially in software product lines (SPLs), where managing variability and reuse becomes more complex with the inclusion of ML components. Although existing approaches have addressed variability management in SPLs and the integration of ML components in isolated systems, few have explored the intersection of both domains. Specifically, there is limited support for modeling and managing variability in SPLs that incorporate ML components. To bridge this gap, this article proposes a structured framework designed to extend Software Product Line engineering, facilitating the integration of ML components. It facilitates the design of SPLs with ML capabilities by enabling systematic modeling of variability and reuse. The proposal has been partially implemented with the VariaMos tool.
comment: pp. 73-94, 2 figures
☆ Panprediction: Optimal Predictions for Any Downstream Task and Loss
Supervised learning is classically formulated as training a model to minimize a fixed loss function over a fixed distribution, or task. However, an emerging paradigm instead views model training as extracting enough information from data so that the model can be used to minimize many losses on many downstream tasks. We formalize a mathematical framework for this paradigm, which we call panprediction, and study its statistical complexity. Formally, panprediction generalizes omniprediction and sits upstream from multi-group learning, which respectively focus on predictions that generalize to many downstream losses or many downstream tasks, but not both. Concretely, we design algorithms that learn deterministic and randomized panpredictors with $\tilde{O}(1/\varepsilon^3)$ and $\tilde{O}(1/\varepsilon^2)$ samples, respectively. Our results demonstrate that under mild assumptions, simultaneously minimizing infinitely many losses on infinitely many tasks can be as statistically easy as minimizing one loss on one task. Along the way, we improve the best known sample complexity guarantee of deterministic omniprediction by a factor of $1/\varepsilon$, and match all other known sample complexity guarantees of omniprediction and multi-group learning. Our key technical ingredient is a nearly lossless reduction from panprediction to a statistically efficient notion of calibration, called step calibration.
comment: 25 pages
☆ ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling
Formulating optimization problems for industrial applications demands significant manual effort and domain expertise. While Large Language Models (LLMs) show promise in automating this process, evaluating their performance remains difficult due to the absence of robust metrics. Existing solver-based approaches often face inconsistency, infeasibility issues, and high computational costs. To address these issues, we propose ORGEval, a graph-theoretic evaluation framework for assessing LLMs' capabilities in formulating linear and mixed-integer linear programs. ORGEval represents optimization models as graphs, reducing equivalence detection to graph isomorphism testing. We identify and prove a sufficient condition, when the tested graphs are symmetric decomposable (SD), under which the Weisfeiler-Lehman (WL) test is guaranteed to correctly detect isomorphism. Building on this, ORGEval integrates a tailored variant of the WL-test with an SD detection algorithm to evaluate model equivalence. By focusing on structural equivalence rather than instance-level configurations, ORGEval is robust to numerical variations. Experimental results show that our method can successfully detect model equivalence and produce 100\% consistent results across random parameter configurations, while significantly outperforming solver-based methods in runtime, especially on difficult problems. Leveraging ORGEval, we construct the Bench4Opt dataset and benchmark state-of-the-art LLMs on optimization modeling. Our results reveal that although optimization modeling remains challenging for all LLMs, DeepSeek-V3 and Claude-Opus-4 achieve the highest accuracies under direct prompting, outperforming even leading reasoning models.
☆ Learned Static Function Data Structures
We consider the task of constructing a data structure for associating a static set of keys with values, while allowing arbitrary output values for queries involving keys outside the set. Compared to hash tables, these so-called static function data structures do not need to store the key set and thus use significantly less memory. Several techniques are known, with compressed static functions approaching the zero-order empirical entropy of the value sequence. In this paper, we introduce learned static functions, which use machine learning to capture correlations between keys and values. For each key, a model predicts a probability distribution over the values, from which we derive a key-specific prefix code to compactly encode the true value. The resulting codeword is stored in a classic static function data structure. This design allows learned static functions to break the zero-order entropy barrier while still supporting point queries. Our experiments show substantial space savings: up to one order of magnitude on real data, and up to three orders of magnitude on synthetic data.
☆ Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it particularly well-for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
☆ Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.
☆ Optimal Convergence Analysis of DDPM for General Distributions
Score-based diffusion models have achieved remarkable empirical success in generating high-quality samples from target data distributions. Among them, the Denoising Diffusion Probabilistic Model (DDPM) is one of the most widely used samplers, generating samples via estimated score functions. Despite its empirical success, a tight theoretical understanding of DDPM -- especially its convergence properties -- remains limited. In this paper, we provide a refined convergence analysis of the DDPM sampler and establish near-optimal convergence rates under general distributional assumptions. Specifically, we introduce a relaxed smoothness condition parameterized by a constant $L$, which is small for many practical distributions (e.g., Gaussian mixture models). We prove that the DDPM sampler with accurate score estimates achieves a convergence rate of $$\widetilde{O}\left(\frac{d\min\{d,L^2\}}{T^2}\right)~\text{in Kullback-Leibler divergence},$$ where $d$ is the data dimension, $T$ is the number of iterations, and $\widetilde{O}$ hides polylogarithmic factors in $T$. This result substantially improves upon the best-known $d^2/T^2$ rate when $L < \sqrt{d}$. By establishing a matching lower bound, we show that our convergence analysis is tight for a wide array of target distributions. Moreover, it reveals that DDPM and DDIM share the same dependence on $d$, raising an interesting question of why DDIM often appears empirically faster.
☆ Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs
This paper presents a framework that leverages pre-trained foundation models for robotic manipulation without domain-specific training. The framework integrates off-the-shelf models, combining multimodal perception from foundation models with a general-purpose reasoning model capable of robust task sequencing. Scene graphs, dynamically maintained within the framework, provide spatial awareness and enable consistent reasoning about the environment. The framework is evaluated through a series of tabletop robotic manipulation experiments, and the results highlight its potential for building robotic manipulation systems directly on top of off-the-shelf foundation models.
☆ AstuteRAG-FQA: Task-Aware Retrieval-Augmented Generation Framework for Proprietary Data Challenges in Financial Question Answering
Retrieval-Augmented Generation (RAG) shows significant promise in knowledge-intensive tasks by improving domain specificity, enhancing temporal relevance, and reducing hallucinations. However, applying RAG to finance encounters critical challenges: restricted access to proprietary datasets, limited retrieval accuracy, regulatory constraints, and sensitive data interpretation. We introduce AstuteRAG-FQA, an adaptive RAG framework tailored for Financial Question Answering (FQA), leveraging task-aware prompt engineering to address these challenges. The framework uses a hybrid retrieval strategy integrating both open-source and proprietary financial data while maintaining strict security protocols and regulatory compliance. A dynamic prompt framework adapts in real time to query complexity, improving precision and contextual relevance. To systematically address diverse financial queries, we propose a four-tier task classification: explicit factual, implicit factual, interpretable rationale, and hidden rationale involving implicit causal reasoning. For each category, we identify key challenges, datasets, and optimization techniques within the retrieval and generation process. The framework incorporates multi-layered security mechanisms including differential privacy, data anonymization, and role-based access controls to protect sensitive financial information. Additionally, AstuteRAG-FQA implements real-time compliance monitoring through automated regulatory validation systems that verify responses against industry standards and legal obligations. We evaluate three data integration techniques - contextual embedding, small model augmentation, and targeted fine-tuning - analyzing their efficiency and feasibility across varied financial environments.
☆ Representing Classical Compositions through Implication-Realization Temporal-Gestalt Graphs
Understanding the structural and cognitive underpinnings of musical compositions remains a key challenge in music theory and computational musicology. While traditional methods focus on harmony and rhythm, cognitive models such as the Implication-Realization (I-R) model and Temporal Gestalt theory offer insight into how listeners perceive and anticipate musical structure. This study presents a graph-based computational approach that operationalizes these models by segmenting melodies into perceptual units and annotating them with I-R patterns. These segments are compared using Dynamic Time Warping and organized into k-nearest neighbors graphs to model intra- and inter-segment relationships. Each segment is represented as a node in the graph, and nodes are further labeled with melodic expectancy values derived from Schellenberg's two-factor I-R model-quantifying pitch proximity and pitch reversal at the segment level. This labeling enables the graphs to encode both structural and cognitive information, reflecting how listeners experience musical tension and resolution. To evaluate the expressiveness of these graphs, we apply the Weisfeiler-Lehman graph kernel to measure similarity between and within compositions. Results reveal statistically significant distinctions between intra- and inter-graph structures. Segment-level analysis via multidimensional scaling confirms that structural similarity at the graph level reflects perceptual similarity at the segment level. Graph2vec embeddings and clustering demonstrate that these representations capture stylistic and structural features that extend beyond composer identity. These findings highlight the potential of graph-based methods as a structured, cognitively informed framework for computational music analysis, enabling a more nuanced understanding of musical structure and style through the lens of listener perception.
comment: 8 pages, 11 figures
☆ TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control
Large Language Models (LLMs) training is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers, 2) OsciReset, an algorithm to suppress weight oscillation, and 3) OutControl, an algorithm to retain outlier accuracy. TetraJet-v2 consistently outperforms prior FP4 training methods on pre-training LLMs across varying model sizes up to 370M and data sizes up to 200B tokens, reducing the performance gap to full-precision training by an average of 51.3%.
☆ Active transfer learning for structural health monitoring
Data for training structural health monitoring (SHM) systems are often expensive and/or impractical to obtain, particularly for labelled data. Population-based SHM (PBSHM) aims to address this limitation by leveraging data from multiple structures. However, data from different structures will follow distinct distributions, potentially leading to large generalisation errors for models learnt via conventional machine learning methods. To address this issue, transfer learning -- in the form of domain adaptation (DA) -- can be used to align the data distributions. Most previous approaches have only considered \emph{unsupervised} DA, where no labelled target data are available; they do not consider how to incorporate these technologies in an online framework -- updating as labels are obtained throughout the monitoring campaign. This paper proposes a Bayesian framework for DA in PBSHM, that can improve unsupervised DA mappings using a limited quantity of labelled target data. In addition, this model is integrated into an active sampling strategy to guide inspections to select the most informative observations to label -- leading to further reductions in the required labelled data to learn a target classifier. The effectiveness of this methodology is evaluated on a population of experimental bridges. Specifically, this population includes data corresponding to several damage states, as well as, a comprehensive set of environmental conditions. It is found that combining transfer learning and active learning can improve data efficiency when learning classification models in label-scarce scenarios. This result has implications for data-informed operation and maintenance of structures, suggesting a reduction in inspections over the operational lifetime of a structure -- and therefore a reduction in operational costs -- can be achieved.
☆ Leveraging Generic Time Series Foundation Models for EEG Classification
Foundation models for time series are emerging as powerful general-purpose backbones, yet their potential for domain-specific biomedical signals such as electroencephalography (EEG) remains rather unexplored. In this work, we investigate the applicability a recently proposed time series classification foundation model, to a different EEG tasks such as motor imagery classification and sleep stage prediction. We test two pretraining regimes: (a) pretraining on heterogeneous real-world time series from multiple domains, and (b) pretraining on purely synthetic data. We find that both variants yield strong performance, consistently outperforming EEGNet, a widely used convolutional baseline, and CBraMod, the most recent EEG-specific foundation model. These results suggest that generalist time series foundation models, even when pretrained on data of non-neural origin or on synthetic signals, can transfer effectively to EEG. Our findings highlight the promise of leveraging cross-domain pretrained models for brain signal analysis, suggesting that EEG may benefit from advances in the broader time series literature.
☆ Learning Sparse Approximate Inverse Preconditioners for Conjugate Gradient Solvers on GPUs NeurIPS 2025
The conjugate gradient solver (CG) is a prevalent method for solving symmetric and positive definite linear systems Ax=b, where effective preconditioners are crucial for fast convergence. Traditional preconditioners rely on prescribed algorithms to offer rigorous theoretical guarantees, while limiting their ability to exploit optimization from data. Existing learning-based methods often utilize Graph Neural Networks (GNNs) to improve the performance and speed up the construction. However, their reliance on incomplete factorization leads to significant challenges: the associated triangular solve hinders GPU parallelization in practice, and introduces long-range dependencies which are difficult for GNNs to model. To address these issues, we propose a learning-based method to generate GPU-friendly preconditioners, particularly using GNNs to construct Sparse Approximate Inverse (SPAI) preconditioners, which avoids triangular solves and requires only two matrix-vector products at each CG step. The locality of matrix-vector product is compatible with the local propagation mechanism of GNNs. The flexibility of GNNs also allows our approach to be applied in a wide range of scenarios. Furthermore, we introduce a statistics-based scale-invariant loss function. Its design matches CG's property that the convergence rate depends on the condition number, rather than the absolute scale of A, leading to improved performance of the learned preconditioner. Evaluations on three PDE-derived datasets and one synthetic dataset demonstrate that our method outperforms standard preconditioners (Diagonal, IC, and traditional SPAI) and previous learning-based preconditioners on GPUs. We reduce solution time on GPUs by 40%-53% (68%-113% faster), along with better condition numbers and superior generalization performance. Source code available at https://github.com/Adversarr/LearningSparsePreconditioner4GPU
comment: NeurIPS 2025, poster
☆ BiSparse-AAS: Bilinear Sparse Attention and Adaptive Spans Framework for Scalable and Efficient Text Summarization ICDM
Transformer-based architectures have advanced text summarization, yet their quadratic complexity limits scalability on long documents. This paper introduces BiSparse-AAS (Bilinear Sparse Attention with Adaptive Spans), a novel framework that combines sparse attention, adaptive spans, and bilinear attention to address these limitations. Sparse attention reduces computational costs by focusing on the most relevant parts of the input, while adaptive spans dynamically adjust the attention ranges. Bilinear attention complements both by modeling complex token interactions within this refined context. BiSparse-AAS consistently outperforms state-of-the-art baselines in both extractive and abstractive summarization tasks, achieving average ROUGE improvements of about 68.1% on CNN/DailyMail and 52.6% on XSum, while maintaining strong performance on OpenWebText and Gigaword datasets. By addressing efficiency, scalability, and long-sequence modeling, BiSparse-AAS provides a unified, practical solution for real-world text summarization applications.
comment: Accepted at the IEEE International Conference on Data Mining (ICDM) 2025, Washington, DC, USA
☆ Asynchronous Risk-Aware Multi-Agent Packet Routing for Ultra-Dense LEO Satellite Networks
The rise of ultra-dense LEO constellations creates a complex and asynchronous network environment, driven by their massive scale, dynamic topologies, and significant delays. This unique complexity demands an adaptive packet routing algorithm that is asynchronous, risk-aware, and capable of balancing diverse and often conflicting QoS objectives in a decentralized manner. However, existing methods fail to address this need, as they typically rely on impractical synchronous decision-making and/or risk-oblivious approaches. To tackle this gap, we introduce PRIMAL, an event-driven multi-agent routing framework designed specifically to allow each satellite to act independently on its own event-driven timeline, while managing the risk of worst-case performance degradation via a principled primal-dual approach. This is achieved by enabling agents to learn the full cost distribution of the targeted QoS objectives and constrain tail-end risks. Extensive simulations on a LEO constellation with 1584 satellites validate its superiority in effectively optimizing latency and balancing load. Compared to a recent risk-oblivious baseline, it reduces queuing delay by over 70%, and achieves a nearly 12 ms end-to-end delay reduction in loaded scenarios. This is accomplished by resolving the core conflict between naive shortest-path finding and congestion avoidance, highlighting such autonomous risk-awareness as a key to robust routing.
☆ DP-FedPGN: Finding Global Flat Minima for Differentially Private Federated Learning via Penalizing Gradient Norm
To prevent inference attacks in Federated Learning (FL) and reduce the leakage of sensitive information, Client-level Differentially Private Federated Learning (CL-DPFL) is widely used. However, current CL-DPFL methods usually result in sharper loss landscapes, which leads to a decrease in model generalization after differential privacy protection. By using Sharpness Aware Minimization (SAM), the current popular federated learning methods are to find a local flat minimum value to alleviate this problem. However, the local flatness may not reflect the global flatness in CL-DPFL. Therefore, to address this issue and seek global flat minima of models, we propose a new CL-DPFL algorithm, DP-FedPGN, in which we introduce a global gradient norm penalty to the local loss to find the global flat minimum. Moreover, by using our global gradient norm penalty, we not only find a flatter global minimum but also reduce the locally updated norm, which means that we further reduce the error of gradient clipping. From a theoretical perspective, we analyze how DP-FedPGN mitigates the performance degradation caused by DP. Meanwhile, the proposed DP-FedPGN algorithm eliminates the impact of data heterogeneity and achieves fast convergence. We also use R\'enyi DP to provide strict privacy guarantees and provide sensitivity analysis for local updates. Finally, we conduct effectiveness tests on both ResNet and Transformer models, and achieve significant improvements in six visual and natural language processing tasks compared to existing state-of-the-art algorithms. The code is available at https://github.com/junkangLiu0/DP-FedPGN
comment: 21 pages, 8 figures
☆ pDANSE: Particle-based Data-driven Nonlinear State Estimation from Nonlinear Measurements
We consider the problem of designing a data-driven nonlinear state estimation (DANSE) method that uses (noisy) nonlinear measurements of a process whose underlying state transition model (STM) is unknown. Such a process is referred to as a model-free process. A recurrent neural network (RNN) provides parameters of a Gaussian prior that characterize the state of the model-free process, using all previous measurements at a given time point. In the case of DANSE, the measurement system was linear, leading to a closed-form solution for the state posterior. However, the presence of a nonlinear measurement system renders a closed-form solution infeasible. Instead, the second-order statistics of the state posterior are computed using the nonlinear measurements observed at the time point. We address the nonlinear measurements using a reparameterization trick-based particle sampling approach, and estimate the second-order statistics of the state posterior. The proposed method is referred to as particle-based DANSE (pDANSE). The RNN of pDANSE uses sequential measurements efficiently and avoids the use of computationally intensive sequential Monte-Carlo (SMC) and/or ancestral sampling. We describe the semi-supervised learning method for pDANSE, which transitions to unsupervised learning in the absence of labeled data. Using a stochastic Lorenz-$63$ system as a benchmark process, we experimentally demonstrate the state estimation performance for four nonlinear measurement systems. We explore cubic nonlinearity and a camera-model nonlinearity where unsupervised learning is used; then we explore half-wave rectification nonlinearity and Cartesian-to-spherical nonlinearity where semi-supervised learning is used. The performance of state estimation is shown to be competitive vis-\`a-vis particle filters that have complete knowledge of the STM of the Lorenz-$63$ system.
comment: 11 pages, 10 figures, under review at IEEE Transactions on Signal Processing
☆ Minimax-Optimal Two-Sample Test with Sliced Wasserstein
We study the problem of nonparametric two-sample testing using the sliced Wasserstein (SW) distance. While prior theoretical and empirical work indicates that the SW distance offers a promising balance between strong statistical guarantees and computational efficiency, its theoretical foundations for hypothesis testing remain limited. We address this gap by proposing a permutation-based SW test and analyzing its performance. The test inherits finite-sample Type I error control from the permutation principle. Moreover, we establish non-asymptotic power bounds and show that the procedure achieves the minimax separation rate $n^{-1/2}$ over multinomial and bounded-support alternatives, matching the optimal guarantees of kernel-based tests while building on the geometric foundations of Wasserstein distances. Our analysis further quantifies the trade-off between the number of projections and statistical power. Finally, numerical experiments demonstrate that the test combines finite-sample validity with competitive power and scalability, and -- unlike kernel-based tests, which require careful kernel tuning -- it performs consistently well across all scenarios we consider.
☆ InertialAR: Autoregressive 3D Molecule Generation with Inertial Frames
Transformer-based autoregressive models have emerged as a unifying paradigm across modalities such as text and images, but their extension to 3D molecule generation remains underexplored. The gap stems from two fundamental challenges: (1) tokenizing molecules into a canonical 1D sequence of tokens that is invariant to both SE(3) transformations and atom index permutations, and (2) designing an architecture capable of modeling hybrid atom-based tokens that couple discrete atom types with continuous 3D coordinates. To address these challenges, we introduce InertialAR. InertialAR devises a canonical tokenization that aligns molecules to their inertial frames and reorders atoms to ensure SE(3) and permutation invariance. Moreover, InertialAR equips the attention mechanism with geometric awareness via geometric rotary positional encoding (GeoRoPE). In addition, it utilizes a hierarchical autoregressive paradigm to predict the next atom-based token, predicting the atom type first and then its 3D coordinates via Diffusion loss. Experimentally, InertialAR achieves state-of-the-art performance on 7 of the 10 evaluation metrics for unconditional molecule generation across QM9, GEOM-Drugs, and B3LYP. Moreover, it significantly outperforms strong baselines in controllable generation for targeted chemical functionality, attaining state-of-the-art results across all 5 metrics.
☆ FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models
AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high variance in the second-moment estimate $\boldsymbol{v}$; (2) the local overfitting of AdamW may cause client drift; and (3) Reinitializing moment estimates ($\boldsymbol{v}$, $\boldsymbol{m}$) at each round slows down convergence. To address these challenges, we propose the first \underline{Fed}erated \underline{AdamW} algorithm, called \texttt{FedAdamW}, for training and fine-tuning various large models. \texttt{FedAdamW} aligns local updates with the global update using both a \textbf{local correction mechanism} and decoupled weight decay to mitigate local overfitting. \texttt{FedAdamW} efficiently aggregates the \texttt{mean} of the second-moment estimates to reduce their variance and reinitialize them. Theoretically, we prove that \texttt{FedAdamW} achieves a linear speedup convergence rate of $\mathcal{O}(\sqrt{(L \Delta \sigma_l^2)/(S K R \epsilon^2)}+(L \Delta)/R)$ without \textbf{heterogeneity assumption}, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. We also employ PAC-Bayesian generalization analysis to explain the effectiveness of decoupled weight decay in local training. Empirically, we validate the effectiveness of \texttt{FedAdamW} on language and vision Transformer models. Compared to several baselines, \texttt{FedAdamW} significantly reduces communication rounds and improves test accuracy. The code is available in https://github.com/junkangLiu0/FedAdamW.
☆ Thought Branches: Interpreting LLM Reasoning Requires Resampling
Most work interpreting reasoning models studies only a single chain-of-thought (CoT), yet these models define distributions over many possible CoTs. We argue that studying a single sample is inadequate for understanding causal influence and the underlying computation. Though fully specifying this distribution is intractable, it can be understood by sampling. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In "agentic misalignment" scenarios, we resample specific sentences to measure their downstream effects. Self-preservation sentences have small causal impact, suggesting they do not meaningfully drive blackmail. Second, are artificial edits to CoT sufficient for steering reasoning? These are common in literature, yet take the model off-policy. Resampling and selecting a completion with the desired property is a principled on-policy alternative. We find off-policy interventions yield small and unstable effects compared to resampling in decision-making tasks. Third, how do we understand the effect of removing a reasoning step when the model may repeat it post-edit? We introduce a resilience metric that repeatedly resamples to prevent similar content from reappearing downstream. Critical planning statements resist removal but have large effects when eliminated. Fourth, since CoT is sometimes "unfaithful", can our methods teach us anything in these settings? Adapting causal mediation analysis, we find that hints that have a causal effect on the output without being explicitly mentioned exert a subtle and cumulative influence on the CoT that persists even if the hint is removed. Overall, studying distributions via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled CoT interventions.
comment: Uzay Macar and Paul C. Bogdan contributed equally to this work, and their listed order was determined by coinflip
☆ Simplex-to-Euclidean Bijections for Categorical Flow Matching
We propose a method for learning and sampling from probability distributions supported on the simplex. Our approach maps the open simplex to Euclidean space via smooth bijections, leveraging the Aitchison geometry to define the mappings, and supports modeling categorical data by a Dirichlet interpolation that dequantizes discrete observations into continuous ones. This enables density modeling in Euclidean space through the bijection while still allowing exact recovery of the original discrete distribution. Compared to previous methods that operate on the simplex using Riemannian geometry or custom noise processes, our approach works in Euclidean space while respecting the Aitchison geometry, and achieves competitive performance on both synthetic and real-world data sets.
☆ Spectral Neural Graph Sparsification
Graphs are central to modeling complex systems in domains such as social networks, molecular chemistry, and neuroscience. While Graph Neural Networks, particularly Graph Convolutional Networks, have become standard tools for graph learning, they remain constrained by reliance on fixed structures and susceptibility to over-smoothing. We propose the Spectral Preservation Network, a new framework for graph representation learning that generates reduced graphs serving as faithful proxies of the original, enabling downstream tasks such as community detection, influence propagation, and information diffusion at a reduced computational cost. The Spectral Preservation Network introduces two key components: the Joint Graph Evolution layer and the Spectral Concordance loss. The former jointly transforms both the graph topology and the node feature matrix, allowing the structure and attributes to evolve adaptively across layers and overcoming the rigidity of static neighborhood aggregation. The latter regularizes these transformations by enforcing consistency in both the spectral properties of the graph and the feature vectors of the nodes. We evaluate the effectiveness of Spectral Preservation Network on node-level sparsification by analyzing well-established metrics and benchmarking against state-of-the-art methods. The experimental results demonstrate the superior performance and clear advantages of our approach.
☆ MVeLMA: Multimodal Vegetation Loss Modeling Architecture for Predicting Post-fire Vegetation Loss SP
Understanding post-wildfire vegetation loss is critical for developing effective ecological recovery strategies and is often challenging due to the extended time and effort required to capture the evolving ecosystem features. Recent works in this area have not fully explored all the contributing factors, their modalities, and interactions with each other. Furthermore, most research in this domain is limited by a lack of interpretability in predictive modeling, making it less useful in real-world settings. In this work, we propose a novel end-to-end ML pipeline called MVeLMA (\textbf{M}ultimodal \textbf{Ve}getation \textbf{L}oss \textbf{M}odeling \textbf{A}rchitecture) to predict county-wise vegetation loss from fire events. MVeLMA uses a multimodal feature integration pipeline and a stacked ensemble-based architecture to capture different modalities while also incorporating uncertainty estimation through probabilistic modeling. Through comprehensive experiments, we show that our model outperforms several state-of-the-art (SOTA) and baseline models in predicting post-wildfire vegetation loss. Furthermore, we generate vegetation loss confidence maps to identify high-risk counties, thereby helping targeted recovery efforts. The findings of this work have the potential to inform future disaster relief planning, ecological policy development, and wildlife recovery management.
comment: Accepted for 2025 ACM SIGSPATIAL conference
☆ Atlas-Alignment: Making Interpretability Transferable Across Language Models
Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires costly training of model-specific sparse autoencoders, manual or semi-automated labeling of SAE components, and their subsequent validation. We introduce Atlas-Alignment, a framework for transferring interpretability across language models by aligning unknown latent spaces to a Concept Atlas - a labeled, human-interpretable latent space - using only shared inputs and lightweight representational alignment techniques. Once aligned, this enables two key capabilities in previously opaque models: (1) semantic feature search and retrieval, and (2) steering generation along human-interpretable atlas concepts. Through quantitative and qualitative evaluations, we show that simple representational alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept data. Atlas-Alignment thus amortizes the cost of explainable AI and mechanistic interpretability: by investing in one high-quality Concept Atlas, we can make many new models transparent and controllable at minimal marginal cost.
☆ Estimation of aboveground biomass in a tropical dry forest: An intercomparison of airborne, unmanned, and space laser scanning
According to the Paris Climate Change Agreement, all nations are required to submit reports on their greenhouse gas emissions and absorption every two years by 2024. Consequently, forests play a crucial role in reducing carbon emissions, which is essential for meeting these obligations. Recognizing the significance of forest conservation in the global battle against climate change, Article 5 of the Paris Agreement emphasizes the need for high-quality forest data. This study focuses on enhancing methods for mapping aboveground biomass in tropical dry forests. Tropical dry forests are considered one of the least understood tropical forest environments; therefore, there is a need for accurate approaches to estimate carbon pools. We employ a comparative analysis of AGB estimates, utilizing different discrete and full-waveform laser scanning datasets in conjunction with Ordinary Least Squares and Bayesian approaches SVM. Airborne Laser Scanning, Unmanned Laser Scanning, and Space Laser Scanning were used as independent variables for extracting forest metrics. Variable selection, SVM regression tuning, and cross-validation via a machine-learning approach were applied to account for overfitting and underfitting. The results indicate that six key variables primarily related to tree height: Elev.minimum, Elev.L3, lev.MAD.mode, Elev.mode, Elev.MAD.median, and Elev.skewness, are important for AGB estimation using ALSD and ULSD , while Leaf Area Index, canopy coverage and height, terrain elevation, and full-waveform signal energy emerged as the most vital variables. AGB values estimated from ten permanent tropical dry forest plots in Costa Rica Guanacaste province ranged from 26.02 Mg/ha to 175.43 Mg/ha . The SVM regressions demonstrated a 17.89 error across all laser scanning systems, with SLSF W exhibiting the lowest error 17.07 in estimating total biomass per plot.
comment: 32 pages, 17 figures, research paper
☆ FedMuon: Accelerating Federated Learning with Matrix Orthogonalization
The core bottleneck of Federated Learning (FL) lies in the communication rounds. That is, how to achieve more effective local updates is crucial for reducing communication rounds. Existing FL methods still primarily use element-wise local optimizers (Adam/SGD), neglecting the geometric structure of the weight matrices. This often leads to the amplification of pathological directions in the weights during local updates, leading deterioration in the condition number and slow convergence. Therefore, we introduce the Muon optimizer in local, which has matrix orthogonalization to optimize matrix-structured parameters. Experimental results show that, in IID setting, Local Muon significantly accelerates the convergence of FL and reduces communication rounds compared to Local SGD and Local AdamW. However, in non-IID setting, independent matrix orthogonalization based on the local distributions of each client induces strong client drift. Applying Muon in non-IID FL poses significant challenges: (1) client preconditioner leading to client drift; (2) moment reinitialization. To address these challenges, we propose a novel Federated Muon optimizer (FedMuon), which incorporates two key techniques: (1) momentum aggregation, where clients use the aggregated momentum for local initialization; (2) local-global alignment, where the local gradients are aligned with the global update direction to significantly reduce client drift. Theoretically, we prove that \texttt{FedMuon} achieves a linear speedup convergence rate without the heterogeneity assumption, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. Empirically, we validate the effectiveness of FedMuon on language and vision models. Compared to several baselines, FedMuon significantly reduces communication rounds and improves test accuracy.
☆ Interpretable Model-Aware Counterfactual Explanations for Random Forest
Despite their enormous predictive power, machine learning models are often unsuitable for applications in regulated industries such as finance, due to their limited capacity to provide explanations. While model-agnostic frameworks such as Shapley values have proved to be convenient and popular, they rarely align with the kinds of causal explanations that are typically sought after. Counterfactual case-based explanations, where an individual is informed of which circumstances would need to be different to cause a change in outcome, may be more intuitive and actionable. However, finding appropriate counterfactual cases is an open challenge, as is interpreting which features are most critical for the change in outcome. Here, we pose the question of counterfactual search and interpretation in terms of similarity learning, exploiting the representation learned by the random forest predictive model itself. Once a counterfactual is found, the feature importance of the explanation is computed as a function of which random forest partitions are crossed in order to reach it from the original instance. We demonstrate this method on both the MNIST hand-drawn digit dataset and the German credit dataset, finding that it generates explanations that are sparser and more useful than Shapley values.
comment: Presented at XAI-FIN-2025: International Joint Workshop on Explainable AI in Finance: Achieving Trustworthy Financial Decision-Making; November 15, 2025; Singapore
☆ Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds
Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
☆ On the Equivalence of Optimal Transport Problem and Action Matching with Optimal Vector Fields
Flow Matching (FM) method in generative modeling maps arbitrary probability distributions by constructing an interpolation between them and then learning the vector field that defines ODE for this interpolation. Recently, it was shown that FM can be modified to map distributions optimally in terms of the quadratic cost function for any initial interpolation. To achieve this, only specific optimal vector fields, which are typical for solutions of Optimal Transport (OT) problems, need to be considered during FM loss minimization. In this note, we show that considering only optimal vector fields can lead to OT in another approach: Action Matching (AM). Unlike FM, which learns a vector field for a manually chosen interpolation between given distributions, AM learns the vector field that defines ODE for an entire given sequence of distributions.
☆ Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity
Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model's external `working memory', a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.
comment: Project page at https://ajmeek.github.io/cot_monitorability_website/
☆ FPS: Feedforward-based Parameter Selection For Efficient Fine-Tuning
Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key strategy for adapting large-scale pre-trained models to downstream tasks, but existing approaches face notable limitations. Addition-based methods, such as Adapters [1], introduce inference latency and engineering complexity, while selection-based methods like Gradient-based Parameter Selection (GPS) [2] require a full backward pass, which results in the same peak memory usage as full fine-tuning. To address this dilemma, we propose Feedforward-based Parameter Selection (FPS), a gradient-free method that identifies an optimal parameter subset in a single forward pass. FPS ranks parameters by the product of their magnitudes and corresponding input activations, leveraging both pre-trained knowledge and downstream data. Evaluated on $24$ visual tasks from FGVC and VTAB-1k, FPS achieves performance comparable to state-of-the-art methods while reducing peak memory usage by nearly $9 \times$ and accelerating parameter selection by about $2 \times$, offering a genuinely memory-efficient and practical solution for fine-tuning large-scale pre-trained models.
☆ Pairwise and Attribute-Aware Decision Tree-Based Preference Elicitation for Cold-Start Recommendation
Recommender systems (RSs) are intelligent filtering methods that suggest items to users based on their inferred preferences, derived from their interaction history on the platform. Collaborative filtering-based RSs rely on users past interactions to generate recommendations. However, when a user is new to the platform, referred to as a cold-start user, there is no historical data available, making it difficult to provide personalized recommendations. To address this, rating elicitation techniques can be used to gather initial ratings or preferences on selected items, helping to build an early understanding of the user's tastes. Rating elicitation approaches are generally categorized into two types: non-personalized and personalized. Decision tree-based rating elicitation is a personalized method that queries users about their preferences at each node of the tree until sufficient information is gathered. In this paper, we propose an extension to the decision tree approach for rating elicitation in the context of music recommendation. Our method: (i) elicits not only item ratings but also preferences on attributes such as genres to better cluster users, and (ii) uses item pairs instead of single items at each node to more effectively learn user preferences. Experimental results demonstrate that both proposed enhancements lead to improved performance, particularly with a reduced number of queries.
☆ Reasoning Models Sometimes Output Illegible Chains of Thought
Language models trained via outcome-based reinforcement learning (RL) to reason using chain-of-thought (CoT) have shown remarkable performance. Monitoring such a model's CoT may allow us to understand its intentions and detect potential malicious behavior. However, to be effective, this requires that CoTs are legible and faithful. We study CoT legibility across 14 reasoning models, finding that RL often causes reasoning to become illegible to both humans and AI monitors, with reasoning models (except Claude) generating illegible CoTs while returning to perfectly readable final answers. We show that models use illegible reasoning to reach correct answers (accuracy dropping by 53\% when forced to use only legible portions), yet find no correlation between legibility and performance when resampling - suggesting the relationship is more nuanced. We also find that legibility degrades on harder questions. We discuss potential hypotheses for these results, including steganography, training artifacts, and vestigial tokens. These results suggest that without explicit optimization for legibility, outcome-based RL naturally produces models with increasingly opaque reasoning processes, potentially undermining monitoring approaches.
☆ When AI Trading Agents Compete: Adverse Selection of Meta-Orders by Reinforcement Learning-Based Market Making
We investigate the mechanisms by which medium-frequency trading agents are adversely selected by opportunistic high-frequency traders. We use reinforcement learning (RL) within a Hawkes Limit Order Book (LOB) model in order to replicate the behaviours of high-frequency market makers. In contrast to the classical models with exogenous price impact assumptions, the Hawkes model accounts for endogenous price impact and other key properties of the market (Jain et al. 2024a). Given the real-world impracticalities of the market maker updating strategies for every event in the LOB, we formulate the high-frequency market making agent via an impulse control reinforcement learning framework (Jain et al. 2025). The RL used in the simulation utilises Proximal Policy Optimisation (PPO) and self-imitation learning. To replicate the adverse selection phenomenon, we test the RL agent trading against a medium frequency trader (MFT) executing a meta-order and demonstrate that, with training against the MFT meta-order execution agent, the RL market making agent learns to capitalise on the price drift induced by the meta-order. Recent empirical studies have shown that medium-frequency traders are increasingly subject to adverse selection by high-frequency trading agents. As high-frequency trading continues to proliferate across financial markets, the slippage costs incurred by medium-frequency traders are likely to increase over time. However, we do not observe that increased profits for the market making RL agent necessarily cause significantly increased slippages for the MFT agent.
☆ MedM2T: A MultiModal Framework for Time-Aware Modeling with Electronic Health Record and Electrocardiogram Data
The inherent multimodality and heterogeneous temporal structures of medical data pose significant challenges for modeling. We propose MedM2T, a time-aware multimodal framework designed to address these complexities. MedM2T integrates: (i) Sparse Time Series Encoder to flexibly handle irregular and sparse time series, (ii) Hierarchical Time-Aware Fusion to capture both micro- and macro-temporal patterns from multiple dense time series, such as ECGs, and (iii) Bi-Modal Attention to extract cross-modal interactions, which can be extended to any number of modalities. To mitigate granularity gaps between modalities, MedM2T uses modality-specific pre-trained encoders and aligns resulting features within a shared encoder. We evaluated MedM2T on MIMIC-IV and MIMIC-IV-ECG datasets for three tasks that encompass chronic and acute disease dynamics: 90-day cardiovascular disease (CVD) prediction, in-hospital mortality prediction, and ICU length-of-stay (LOS) regression. MedM2T outperformed state-of-the-art multimodal learning frameworks and existing time series models, achieving an AUROC of 0.947 and an AUPRC of 0.706 for CVD prediction; an AUROC of 0.901 and an AUPRC of 0.558 for mortality prediction; and Mean Absolute Error (MAE) of 2.31 for LOS regression. These results highlight the robustness and broad applicability of MedM2T, positioning it as a promising tool in clinical prediction. We provide the implementation of MedM2T at https://github.com/DHLab-TSENG/MedM2T.
comment: This preprint version of the manuscript has been submitted to the IEEE Journal of Biomedical and Health Informatics (JBHI) for review. The implementation of MedM2T is available at https://github.com/DHLab-TSENG/MedM2T
☆ Un-Attributability: Computing Novelty From Retrieval & Semantic Similarity
Understanding how language-model outputs relate to the pretraining corpus is central to studying model behavior. Most training data attribution (TDA) methods ask which training examples causally influence a given output, often using leave-one-out tests. We invert the question: which outputs cannot be attributed to any pretraining example? We introduce un-attributability as an operational measure of semantic novelty: an output is novel if the pretraining corpus contains no semantically similar context. We approximate this with a simple two-stage retrieval pipeline: index the corpus with lightweight GIST embeddings, retrieve the top-n candidates, then rerank with ColBERTv2. If the nearest corpus item is less attributable than a human-generated text reference, we consider the output of the model as novel. We evaluate on SmolLM and SmolLM2 and report three findings: (1) models draw on pretraining data across much longer spans than previously reported; (2) some domains systematically promote or suppress novelty; and (3) instruction tuning not only alters style but also increases novelty. Reframing novelty assessment around un-attributability enables efficient analysis at pretraining scale. We release ~20 TB of corpus chunks and index artifacts to support replication and large-scale extension of our analysis at https://huggingface.co/datasets/stai-tuebingen/faiss-smollm
☆ Binary Anomaly Detection in Streaming IoT Traffic under Concept Drift
With the growing volume of Internet of Things (IoT) network traffic, machine learning (ML)-based anomaly detection is more relevant than ever. Traditional batch learning models face challenges such as high maintenance and poor adaptability to rapid anomaly changes, known as concept drift. In contrast, streaming learning integrates online and incremental learning, enabling seamless updates and concept drift detection to improve robustness. This study investigates anomaly detection in streaming IoT traffic as binary classification, comparing batch and streaming learning approaches while assessing the limitations of current IoT traffic datasets. We simulated heterogeneous network data streams by carefully mixing existing datasets and streaming the samples one by one. Our results highlight the failure of batch models to handle concept drift, but also reveal persisting limitations of current datasets to expose model limitations due to low traffic heterogeneity. We also investigated the competitiveness of tree-based ML algorithms, well-known in batch anomaly detection, and compared it to non-tree-based ones, confirming the advantages of the former. Adaptive Random Forest achieved F1-score of 0.990 $\pm$ 0.006 at one-third the computational cost of its batch counterpart. Hoeffding Adaptive Tree reached F1-score of 0.910 $\pm$ 0.007, reducing computational cost by four times, making it a viable choice for online applications despite a slight trade-off in stability.
comment: 6 pages, 11 figures, 3 tables
☆ Temporal Cardiovascular Dynamics for Improved PPG-Based Heart Rate Estimation
The oscillations of the human heart rate are inherently complex and non-linear -- they are best described by mathematical chaos, and they present a challenge when applied to the practical domain of cardiovascular health monitoring in everyday life. In this work, we study the non-linear chaotic behavior of heart rate through mutual information and introduce a novel approach for enhancing heart rate estimation in real-life conditions. Our proposed approach not only explains and handles the non-linear temporal complexity from a mathematical perspective but also improves the deep learning solutions when combined with them. We validate our proposed method on four established datasets from real-life scenarios and compare its performance with existing algorithms thoroughly with extensive ablation experiments. Our results demonstrate a substantial improvement, up to 40\%, of the proposed approach in estimating heart rate compared to traditional methods and existing machine-learning techniques while reducing the reliance on multiple sensing modalities and eliminating the need for post-processing steps.
comment: ArXiv version of the IEEE JBHI paper (https://ieeexplore.ieee.org/abstract/document/11193709)
☆ Can LLMs Help You at Work? A Sandbox for Evaluating LLM Agents in Enterprise Environments EMNLP 2025
Enterprise systems are crucial for enhancing productivity and decision-making among employees and customers. Integrating LLM based systems into enterprise systems enables intelligent automation, personalized experiences, and efficient information retrieval, driving operational efficiency and strategic growth. However, developing and evaluating such systems is challenging due to the inherent complexity of enterprise environments, where data is fragmented across multiple sources and governed by sophisticated access controls. We present EnterpriseBench, a comprehensive benchmark that simulates enterprise settings, featuring 500 diverse tasks across software engineering, HR, finance, and administrative domains. Our benchmark uniquely captures key enterprise characteristics including data source fragmentation, access control hierarchies, and cross-functional workflows. Additionally, we provide a novel data generation pipeline that creates internally consistent enterprise tasks from organizational metadata. Experiments with state-of-the-art LLM agents demonstrate that even the most capable models achieve only 41.8% task completion, highlighting significant opportunities for improvement in enterprise-focused AI systems.
comment: Accepted at EMNLP 2025 Main Track
☆ HiF-DTA: Hierarchical Feature Learning Network for Drug-Target Affinity Prediction
Accurate prediction of Drug-Target Affinity (DTA) is crucial for reducing experimental costs and accelerating early screening in computational drug discovery. While sequence-based deep learning methods avoid reliance on costly 3D structures, they still overlook simultaneous modeling of global sequence semantic features and local topological structural features within drugs and proteins, and represent drugs as flat sequences without atomic-level, substructural-level, and molecular-level multi-scale features. We propose HiF-DTA, a hierarchical network that adopts a dual-pathway strategy to extract both global sequence semantic and local topological features from drug and protein sequences, and models drugs multi-scale to learn atomic, substructural, and molecular representations fused via a multi-scale bilinear attention module. Experiments on Davis, KIBA, and Metz datasets show HiF-DTA outperforms state-of-the-art baselines, with ablations confirming the importance of global-local extraction and multi-scale fusion.
comment: Accepted by International Conference on Bioinformatics and Biomedicine (BIBM 25)
☆ FOCUS: Efficient Keyframe Selection for Long Video Understanding
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.
☆ Traceable Drug Recommendation over Medical Knowledge Graphs CIKM2025
Drug recommendation (DR) systems aim to support healthcare professionals in selecting appropriate medications based on patients' medical conditions. State-of-the-art approaches utilize deep learning techniques for improving DR, but fall short in providing any insights on the derivation process of recommendations -- a critical limitation in such high-stake applications. We propose TraceDR, a novel DR system operating over a medical knowledge graph (MKG), which ensures access to large-scale and high-quality information. TraceDR simultaneously predicts drug recommendations and related evidence within a multi-task learning framework, enabling traceability of medication recommendations. For covering a more diverse set of diseases and drugs than existing works, we devise a framework for automatically constructing patient health records and release DrugRec, a new large-scale testbed for DR.
comment: Accepted to MediKS@CIKM2025
☆ Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?
Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still suffer from a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have reduced this gap, its underlying causes remain largely unexplored. In this paper, we address this by showing that the multilingual reasoning gap largely stems from failures in language understanding-the model's inability to represent the multilingual input meaning into the dominant language (i.e., English) within its reasoning trace. This motivates us to examine whether understanding failures can be detected, as this ability could help mitigate the multilingual reasoning gap. To this end, we evaluate a range of detection methods and find that understanding failures can indeed be identified, with supervised approaches performing best. Building on this, we propose Selective Translation, a simple yet effective strategy that translates the multilingual input into English only when an understanding failure is detected. Experimental results show that Selective Translation bridges the multilingual reasoning gap, achieving near full-translation performance while using translation for only about 20% of inputs. Together, our work demonstrates that understanding failures are the primary cause of the multilingual reasoning gap and can be detected and selectively mitigated, providing key insight into its origin and a promising path toward more equitable multilingual reasoning. Our code and data are publicly available at https://github.com/deokhk/RLM_analysis.
☆ T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis
In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific characteristics, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T^3), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen-Shannon divergence between the two models' output distributions. T^3 dynamically preserves local precision when models agree and defers to generalist robustness under drift. To overcome the inference costs of sample-wise merging, we further propose a batch-wise extension, T^3_B, that computes a merging coefficient across a batch of samples, dramatically reducing computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruptions across four modalities. Empirically, T^3 sets new state-of-the-art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive MVLM deployment in clinical settings. Our code is available at https://github.com/Razaimam45/TCube.
comment: Main: 11 pages, Supplementary: 9 pages 10 tables, 10 figures
☆ ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction
Recently, there has been gradually more attention paid to Out-of-Distribution (OOD) performance prediction, whose goal is to predict the performance of trained models on unlabeled OOD test datasets, so that we could better leverage and deploy off-the-shelf trained models in risk-sensitive scenarios. Although progress has been made in this area, evaluation protocols in previous literature are inconsistent, and most works cover only a limited number of real-world OOD datasets and types of distribution shifts. To provide convenient and fair comparisons for various algorithms, we propose Out-of-Distribution Performance Prediction Benchmark (ODP-Bench), a comprehensive benchmark that includes most commonly used OOD datasets and existing practical performance prediction algorithms. We provide our trained models as a testbench for future researchers, thus guaranteeing the consistency of comparison and avoiding the burden of repeating the model training process. Furthermore, we also conduct in-depth experimental analyses to better understand their capability boundary.
☆ Higher-order Linear Attention
The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project Page: https://github.com/yifanzhang-pro/HLA.
comment: Project Page: https://github.com/yifanzhang-pro/HLA
☆ ECVL-ROUTER: Scenario-Aware Routing for Vision-Language Models
Vision-Language Models (VLMs) excel in diverse multimodal tasks. However, user requirements vary across scenarios, which can be categorized into fast response, high-quality output, and low energy consumption. Relying solely on large models deployed in the cloud for all queries often leads to high latency and energy cost, while small models deployed on edge devices are capable of handling simpler tasks with low latency and energy cost. To fully leverage the strengths of both large and small models, we propose ECVL-ROUTER, the first scenario-aware routing framework for VLMs. Our approach introduces a new routing strategy and evaluation metrics that dynamically select the appropriate model for each query based on user requirements, maximizing overall utility. We also construct a multimodal response-quality dataset tailored for router training and validate the approach through extensive experiments. Results show that our approach successfully routes over 80\% of queries to the small model while incurring less than 10\% drop in problem solving probability.
comment: 23 pages, 13 figures, 7 tables
☆ Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
Instruction-tuned Large Language Models (LLMs) underperform on low resource, non-Latin scripts due to tokenizer fragmentation and weak cross-lingual coupling. We present LLINK (Latent Language Injection for Non-English Knowledge), a compute efficient language-as-modality method that conditions an instruction-tuned decoder without changing the tokenizer or retraining the decoder. First, we align sentence embeddings from a frozen multilingual encoder to the decoder's latent embedding space at a reserved position via a lightweight contrastive projector. Second, the vector is expanded into K soft slots and trained with minimal adapters so the frozen decoder consumes the signal. LLINK substantially improves bilingual retrieval and achieves 81.3% preference over the base model and 63.6% over direct fine-tuning in LLM-judged Q&A evaluations. We further find that improvements can be attributed to reduced tokenization inflation and a stronger cross lingual alignment, despite the model having residual weaknesses in numeric fidelity. Treating low resource languages as a modality offers a practical path to stronger cross-lingual alignment in lightweight LLMs.
comment: 14 pages, 3 Figures
☆ Not All Instances Are Equally Valuable: Towards Influence-Weighted Dataset Distillation
Dataset distillation condenses large datasets into synthetic subsets, achieving performance comparable to training on the full dataset while substantially reducing storage and computation costs. Most existing dataset distillation methods assume that all real instances contribute equally to the process. In practice, real-world datasets contain both informative and redundant or even harmful instances, and directly distilling the full dataset without considering data quality can degrade model performance. In this work, we present Influence-Weighted Distillation IWD, a principled framework that leverages influence functions to explicitly account for data quality in the distillation process. IWD assigns adaptive weights to each instance based on its estimated impact on the distillation objective, prioritizing beneficial data while downweighting less useful or harmful ones. Owing to its modular design, IWD can be seamlessly integrated into diverse dataset distillation frameworks. Our empirical results suggest that integrating IWD tends to improve the quality of distilled datasets and enhance model performance, with accuracy gains of up to 7.8%.
☆ FedSM: Robust Semantics-Guided Feature Mixup for Bias Reduction in Federated Learning with Long-Tail Data
Federated Learning (FL) enables collaborative model training across decentralized clients without sharing private data. However, FL suffers from biased global models due to non-IID and long-tail data distributions. We propose \textbf{FedSM}, a novel client-centric framework that mitigates this bias through semantics-guided feature mixup and lightweight classifier retraining. FedSM uses a pretrained image-text-aligned model to compute category-level semantic relevance, guiding the category selection of local features to mix-up with global prototypes to generate class-consistent pseudo-features. These features correct classifier bias, especially when data are heavily skewed. To address the concern of potential domain shift between the pretrained model and the data, we propose probabilistic category selection, enhancing feature diversity to effectively mitigate biases. All computations are performed locally, requiring minimal server overhead. Extensive experiments on long-tail datasets with various imbalanced levels demonstrate that FedSM consistently outperforms state-of-the-art methods in accuracy, with high robustness to domain shift and computational efficiency.
☆ Soft Task-Aware Routing of Experts for Equivariant Representation Learning NeurIPS 2025
Equivariant representation learning aims to capture variations induced by input transformations in the representation space, whereas invariant representation learning encodes semantic information by disregarding such transformations. Recent studies have shown that jointly learning both types of representations is often beneficial for downstream tasks, typically by employing separate projection heads. However, this design overlooks information shared between invariant and equivariant learning, which leads to redundant feature learning and inefficient use of model capacity. To address this, we introduce Soft Task-Aware Routing (STAR), a routing strategy for projection heads that models them as experts. STAR induces the experts to specialize in capturing either shared or task-specific information, thereby reducing redundant feature learning. We validate this effect by observing lower canonical correlations between invariant and equivariant embeddings. Experimental results show consistent improvements across diverse transfer learning tasks. The code is available at https://github.com/YonseiML/star.
comment: NeurIPS 2025
☆ Privacy-Aware Continual Self-Supervised Learning on Multi-Window Chest Computed Tomography for Domain-Shift Robustness
We propose a novel continual self-supervised learning (CSSL) framework for simultaneously learning diverse features from multi-window-obtained chest computed tomography (CT) images and ensuring data privacy. Achieving a robust and highly generalizable model in medical image diagnosis is challenging, mainly because of issues, such as the scarcity of large-scale, accurately annotated datasets and domain shifts inherent to dynamic healthcare environments. Specifically, in chest CT, these domain shifts often arise from differences in window settings, which are optimized for distinct clinical purposes. Previous CSSL frameworks often mitigated domain shift by reusing past data, a typically impractical approach owing to privacy constraints. Our approach addresses these challenges by effectively capturing the relationship between previously learned knowledge and new information across different training stages through continual pretraining on unlabeled images. Specifically, by incorporating a latent replay-based mechanism into CSSL, our method mitigates catastrophic forgetting due to domain shifts during continual pretraining while ensuring data privacy. Additionally, we introduce a feature distillation technique that integrates Wasserstein distance-based knowledge distillation (WKD) and batch-knowledge ensemble (BKE), enhancing the ability of the model to learn meaningful, domain-shift-robust representations. Finally, we validate our approach using chest CT images obtained across two different window settings, demonstrating superior performance compared with other approaches.
☆ Feature-Function Curvature Analysis: A Geometric Framework for Explaining Differentiable Models
Explainable AI (XAI) is critical for building trust in complex machine learning models, yet mainstream attribution methods often provide an incomplete, static picture of a model's final state. By collapsing a feature's role into a single score, they are confounded by non-linearity and interactions. To address this, we introduce Feature-Function Curvature Analysis (FFCA), a novel framework that analyzes the geometry of a model's learned function. FFCA produces a 4-dimensional signature for each feature, quantifying its: (1) Impact, (2) Volatility, (3) Non-linearity, and (4) Interaction. Crucially, we extend this framework into Dynamic Archetype Analysis, which tracks the evolution of these signatures throughout the training process. This temporal view moves beyond explaining what a model learned to revealing how it learns. We provide the first direct, empirical evidence of hierarchical learning, showing that models consistently learn simple linear effects before complex interactions. Furthermore, this dynamic analysis provides novel, practical diagnostics for identifying insufficient model capacity and predicting the onset of overfitting. Our comprehensive experiments demonstrate that FFCA, through its static and dynamic components, provides the essential geometric context that transforms model explanation from simple quantification to a nuanced, trustworthy analysis of the entire learning process.
☆ MDAS-GNN: Multi-Dimensional Spatiotemporal GNN with Spatial Diffusion for Urban Traffic Risk Forecasting
Traffic accidents represent a critical public health challenge, claiming over 1.35 million lives annually worldwide. Traditional accident prediction models treat road segments independently, failing to capture complex spatial relationships and temporal dependencies in urban transportation networks. This study develops MDAS-GNN, a Multi-Dimensional Attention-based Spatial-diffusion Graph Neural Network integrating three core risk dimensions: traffic safety, infrastructure, and environmental risk. The framework employs feature-specific spatial diffusion mechanisms and multi-head temporal attention to capture dependencies across different time horizons. Evaluated on UK Department for Transport accident data across Central London, South Manchester, and SE Birmingham, MDASGNN achieves superior performance compared to established baseline methods. The model maintains consistently low prediction errors across short, medium, and long-term periods, with particular strength in long-term forecasting. Ablation studies confirm that integrated multi-dimensional features outperform singlefeature approaches, reducing prediction errors by up to 40%. This framework provides civil engineers and urban planners with advanced predictive capabilities for transportation infrastructure design, enabling data-driven decisions for road network optimization, infrastructure resource improvements, and strategic safety interventions in urban development projects.
☆ Sparse Model Inversion: Efficient Inversion of Vision Transformers for Data-Free Applications
Model inversion, which aims to reconstruct the original training data from pre-trained discriminative models, is especially useful when the original training data is unavailable due to privacy, usage rights, or size constraints. However, existing dense inversion methods attempt to reconstruct the entire image area, making them extremely inefficient when inverting high-resolution images from large-scale Vision Transformers (ViTs). We further identify two underlying causes of this inefficiency: the redundant inversion of noisy backgrounds and the unintended inversion of spurious correlations--a phenomenon we term "hallucination" in model inversion. To address these limitations, we propose a novel sparse model inversion strategy, as a plug-and-play extension to speed up existing dense inversion methods with no need for modifying their original loss functions. Specifically, we selectively invert semantic foregrounds while stopping the inversion of noisy backgrounds and potential spurious correlations. Through both theoretical and empirical studies, we validate the efficacy of our approach in achieving significant inversion acceleration (up to 3.79 faster) while maintaining comparable or even enhanced downstream performance in data-free model quantization and data-free knowledge transfer. Code is available at https://github.com/Egg-Hu/SMI.
☆ SERFLOW: A Cross-Service Cost Optimization Framework for SLO-Aware Dynamic ML Inference
Dynamic offloading of Machine Learning (ML) model partitions across different resource orchestration services, such as Function-as-a-Service (FaaS) and Infrastructure-as-a-Service (IaaS), can balance processing and transmission delays while minimizing costs of adaptive inference applications. However, prior work often overlooks real-world factors, such as Virtual Machine (VM) cold starts, requests under long-tail service time distributions, etc. To tackle these limitations, we model each ML query (request) as traversing an acyclic sequence of stages, wherein each stage constitutes a contiguous block of sparse model parameters ending in an internal or final classifier where requests may exit. Since input-dependent exit rates vary, no single resource configuration suits all query distributions. IaaS-based VMs become underutilized when many requests exit early, yet rapidly scaling to handle request bursts reaching deep layers is impractical. SERFLOW addresses this challenge by leveraging FaaS-based serverless functions (containers) and using stage-specific resource provisioning that accounts for the fraction of requests exiting at each stage. By integrating this provisioning with adaptive load balancing across VMs and serverless functions based on request ingestion, SERFLOW reduces cloud costs by over $23\%$ while efficiently adapting to dynamic workloads.
☆ A Polynomial-time Algorithm for Online Sparse Linear Regression with Improved Regret Bound under Weaker Conditions COLT 2025
In this paper, we study the problem of online sparse linear regression (OSLR) where the algorithms are restricted to accessing only $k$ out of $d$ attributes per instance for prediction, which was proved to be NP-hard. Previous work gave polynomial-time algorithms assuming the data matrix satisfies the linear independence of features, the compatibility condition, or the restricted isometry property. We introduce a new polynomial-time algorithm, which significantly improves previous regret bounds (Ito et al., 2017) under the compatibility condition that is weaker than the other two assumptions. The improvements benefit from a tighter convergence rate of the $\ell_1$-norm error of our estimators. Our algorithm leverages the well-studied Dantzig Selector, but importantly with several novel techniques, including an algorithm-dependent sampling scheme for estimating the covariance matrix, an adaptive parameter tuning scheme, and a batching online Newton step with careful initializations. We also give novel and non-trivial analyses, including an induction method for analyzing the $\ell_1$-norm error, careful analyses on the covariance of non-independent random variables, and a decomposition on the regret. We further extend our algorithm to OSLR with additional observations where the algorithms can observe additional $k_0$ attributes after each prediction, and improve previous regret bounds (Kale et al., 2017; Ito et al., 2017).
comment: A minor algorithmic error in our paper presented on COLT 2025 has been corrected in this arXiv update. We also have updated the pseudo-code of the algorithm. Our theoretical analyses, as well as all theoretical bounds, remain unaffected by those changes
☆ FMint-SDE: A Multimodal Foundation Model for Accelerating Numerical Simulation of SDEs via Error Correction
Fast and accurate simulation of dynamical systems is a fundamental challenge across scientific and engineering domains. Traditional numerical integrators often face a trade-off between accuracy and computational efficiency, while existing neural network-based approaches typically require training a separate model for each case. To overcome these limitations, we introduce a novel multi-modal foundation model for large-scale simulations of differential equations: FMint-SDE (Foundation Model based on Initialization for stochastic differential equations). Based on a decoder-only transformer with in-context learning, FMint-SDE leverages numerical and textual modalities to learn a universal error-correction scheme. It is trained using prompted sequences of coarse solutions generated by conventional solvers, enabling broad generalization across diverse systems. We evaluate our models on a suite of challenging SDE benchmarks spanning applications in molecular dynamics, mechanical systems, finance, and biology. Experimental results show that our approach achieves a superior accuracy-efficiency tradeoff compared to classical solvers, underscoring the potential of FMint-SDE as a general-purpose simulation tool for dynamical systems.
☆ Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler
Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Existing defense strategies preemptively build robustness via attack simulation but suffer from fundamental limitations: (i) the infeasibility of extending attack simulations beyond bounded threat models due to the inherent difficulty of anticipating unknown attacks, and (ii) limited adaptability to varying attack settings, as simulation fails to capture their variability and complexity. To address these challenges, we propose Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense strategy with no need for attack simulation. BDS formulates harmful fine-tuning defense as a Bayesian inference problem, learning the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets. The fine-tuning process is then constrained by weighting data with their safety attributes sampled from the posterior, thus mitigating the influence of harmful data. By leveraging the post hoc nature of Bayesian inference, the posterior is conditioned on the fine-tuning dataset, enabling BDS to tailor its defense to the specific dataset, thereby achieving adaptive defense. Furthermore, we introduce a neural scheduler based on amortized Bayesian learning, enabling efficient transfer to new data without retraining. Comprehensive results across diverse attack and defense settings demonstrate the state-of-the-art performance of our approach. Code is available at https://github.com/Egg-Hu/Bayesian-Data-Scheduler.
☆ Exploring Landscapes for Better Minima along Valleys
Finding lower and better-generalizing minima is crucial for deep learning. However, most existing optimizers stop searching the parameter space once they reach a local minimum. Given the complex geometric properties of the loss landscape, it is difficult to guarantee that such a point is the lowest or provides the best generalization. To address this, we propose an adaptor "E" for gradient-based optimizers. The adapted optimizer tends to continue exploring along landscape valleys (areas with low and nearly identical losses) in order to search for potentially better local minima even after reaching a local minimum. This approach increases the likelihood of finding a lower and flatter local minimum, which is often associated with better generalization. We also provide a proof of convergence for the adapted optimizers in both convex and non-convex scenarios for completeness. Finally, we demonstrate their effectiveness in an important but notoriously difficult training scenario, large-batch training, where Lamb is the benchmark optimizer. Our testing results show that the adapted Lamb, ALTO, increases the test accuracy (generalization) of the current state-of-the-art optimizer by an average of 2.5% across a variety of large-batch training tasks. This work potentially opens a new research direction in the design of optimization algorithms.
comment: Neurips 2025 poster
☆ SERVIMON: AI-Driven Predictive Maintenance and Real-Time Monitoring for Astronomical Observatories
Objective: ServiMon is designed to offer a scalable and intelligent pipeline for data collection and auditing to monitor distributed astronomical systems such as the ASTRI Mini-Array. The system enhances quality control, predictive maintenance, and real-time anomaly detection for telescope operations. Methods: ServiMon integrates cloud-native technologies-including Prometheus, Grafana, Cassandra, Kafka, and InfluxDB-for telemetry collection and processing. It employs machine learning algorithms, notably Isolation Forest, to detect anomalies in Cassandra performance metrics. Key indicators such as read/write latency, throughput, and memory usage are continuously monitored, stored as time-series data, and preprocessed for feature engineering. Anomalies detected by the model are logged in InfluxDB v2 and accessed via Flux for real-time monitoring and visualization. Results: AI-based anomaly detection increases system resilience by identifying performance degradation at an early stage, minimizing downtime, and optimizing telescope operations. Additionally, ServiMon supports astrostatistical analysis by correlating telemetry with observational data, thus enhancing scientific data quality. AI-generated alerts also improve real-time monitoring, enabling proactive system management. Conclusion: ServiMon's scalable framework proves effective for predictive maintenance and real-time monitoring of astronomical infrastructures. By leveraging cloud and edge computing, it is adaptable to future large-scale experiments, optimizing both performance and cost. The combination of machine learning and big data analytics makes ServiMon a robust and flexible solution for modern and next-generation observational astronomy.
comment: Accepted for publication in IAU Symposium 397: Exploring the Universe with Artificial Intelligence (UniversAI 2025), Cambridge University Press. Editors: C. Sterken, J. Hearnshaw & D. Valls-Gabaud
☆ Relation-Aware Bayesian Optimization of DBMS Configurations Guided by Affinity Scores
Database Management Systems (DBMSs) are fundamental for managing large-scale and heterogeneous data, and their performance is critically influenced by configuration parameters. Effective tuning of these parameters is essential for adapting to diverse workloads and maximizing throughput while minimizing latency. Recent research has focused on automated configuration optimization using machine learning; however, existing approaches still exhibit several key limitations. Most tuning frameworks disregard the dependencies among parameters, assuming that each operates independently. This simplification prevents optimizers from leveraging relational effects across parameters, limiting their capacity to capture performancesensitive interactions. Moreover, to reduce the complexity of the high-dimensional search space, prior work often selects only the top few parameters for optimization, overlooking others that contribute meaningfully to performance. Bayesian Optimization (BO), the most common method for automatic tuning, is also constrained by its reliance on surrogate models, which can lead to unstable predictions and inefficient exploration. To overcome these limitations, we propose RelTune, a novel framework that represents parameter dependencies as a Relational Graph and learns GNN-based latent embeddings that encode performancerelevant semantics. RelTune further introduces Hybrid-Score-Guided Bayesian Optimization (HBO), which combines surrogate predictions with an Affinity Score measuring proximity to previously high-performing configurations. Experimental results on multiple DBMSs and workloads demonstrate that RelTune achieves faster convergence and higher optimization efficiency than conventional BO-based methods, achieving state-of-the-art performance across all evaluated scenarios.
comment: 13 pages
☆ FairAD: Computationally Efficient Fair Graph Clustering via Algebraic Distance CIKM 2025
Due to the growing concern about unsavory behaviors of machine learning models toward certain demographic groups, the notion of 'fairness' has recently drawn much attention from the community, thereby motivating the study of fairness in graph clustering. Fair graph clustering aims to partition the set of nodes in a graph into $k$ disjoint clusters such that the proportion of each protected group within each cluster is consistent with the proportion of that group in the entire dataset. It is, however, computationally challenging to incorporate fairness constraints into existing graph clustering algorithms, particularly for large graphs. To address this problem, we propose FairAD, a computationally efficient fair graph clustering method. It first constructs a new affinity matrix based on the notion of algebraic distance such that fairness constraints are imposed. A graph coarsening process is then performed on this affinity matrix to find representative nodes that correspond to $k$ clusters. Finally, a constrained minimization problem is solved to obtain the solution of fair clustering. Experiment results on the modified stochastic block model and six public datasets show that FairAD can achieve fair clustering while being up to 40 times faster compared to state-of-the-art fair graph clustering algorithms.
comment: ACM CIKM 2025
☆ Exploring the Utilities of the Rationales from Large Language Models to Enhance Automated Essay Scoring
This study explored the utilities of rationales generated by GPT-4.1 and GPT-5 in automated scoring using Prompt 6 essays from the 2012 Kaggle ASAP data. Essay-based scoring was compared with rationale-based scoring. The study found in general essay-based scoring performed better than rationale-based scoring with higher Quadratic Weighted Kappa (QWK). However, rationale-based scoring led to higher scoring accuracy in terms of F1 scores for score 0 which had less representation due to class imbalance issues. The ensemble modeling of essay-based scoring models increased the scoring accuracy at both specific score levels and across all score levels. The ensemble modeling of essay-based scoring and each of the rationale-based scoring performed about the same. Further ensemble of essay-based scoring and both rationale-based scoring yielded the best scoring accuracy with QWK of 0.870 compared with 0.848 reported in literature.
comment: 12 pages, 3 figures
☆ AI Agents in Drug Discovery
Artificial intelligence (AI) agents are emerging as transformative tools in drug discovery, with the ability to autonomously reason, act, and learn through complicated research workflows. Building on large language models (LLMs) coupled with perception, computation, action, and memory tools, these agentic AI systems could integrate diverse biomedical data, execute tasks, carry out experiments via robotic platforms, and iteratively refine hypotheses in closed loops. We provide a conceptual and technical overview of agentic AI architectures, ranging from ReAct and Reflection to Supervisor and Swarm systems, and illustrate their applications across key stages of drug discovery, including literature synthesis, toxicity prediction, automated protocol generation, small-molecule synthesis, drug repurposing, and end-to-end decision-making. To our knowledge, this represents the first comprehensive work to present real-world implementations and quantifiable impacts of agentic AI systems deployed in operational drug discovery settings. Early implementations demonstrate substantial gains in speed, reproducibility, and scalability, compressing workflows that once took months into hours while maintaining scientific traceability. We discuss the current challenges related to data heterogeneity, system reliability, privacy, and benchmarking, and outline future directions towards technology in support of science and translation.
comment: 45 pages, 12 figures
☆ AURA: A Reinforcement Learning Framework for AI-Driven Adaptive Conversational Surveys
Conventional online surveys provide limited personalization, often resulting in low engagement and superficial responses. Although AI survey chatbots improve convenience, most are still reactive: they rely on fixed dialogue trees or static prompt templates and therefore cannot adapt within a session to fit individual users, which leads to generic follow-ups and weak response quality. We address these limitations with AURA (Adaptive Understanding through Reinforcement Learning for Assessment), a reinforcement learning framework for AI-driven adaptive conversational surveys. AURA quantifies response quality using a four-dimensional LSDE metric (Length, Self-disclosure, Emotion, and Specificity) and selects follow-up question types via an epsilon-greedy policy that updates the expected quality gain within each session. Initialized with priors extracted from 96 prior campus-climate conversations (467 total chatbot-user exchanges), the system balances exploration and exploitation across 10-15 dialogue exchanges, dynamically adapting to individual participants in real time. In controlled evaluations, AURA achieved a +0.12 mean gain in response quality and a statistically significant improvement over non-adaptive baselines (p=0.044, d=0.66), driven by a 63% reduction in specification prompts and a 10x increase in validation behavior. These results demonstrate that reinforcement learning can give survey chatbots improved adaptivity, transforming static questionnaires into interactive, self-improving assessment systems.
☆ Group-Sensitive Offline Contextual Bandits
Offline contextual bandits allow one to learn policies from historical/offline data without requiring online interaction. However, offline policy optimization that maximizes overall expected rewards can unintentionally amplify the reward disparities across groups. As a result, some groups might benefit more than others from the learned policy, raising concerns about fairness, especially when the resources are limited. In this paper, we study a group-sensitive fairness constraint in offline contextual bandits, reducing group-wise reward disparities that may arise during policy learning. We tackle the following common-parity requirements: the reward disparity is constrained within some user-defined threshold or the reward disparity should be minimized during policy optimization. We propose a constrained offline policy optimization framework by introducing group-wise reward disparity constraints into an off-policy gradient-based optimization procedure. To improve the estimation of the group-wise reward disparity during training, we employ a doubly robust estimator and further provide a convergence guarantee for policy optimization. Empirical results in synthetic and real-world datasets demonstrate that our method effectively reduces reward disparities while maintaining competitive overall performance.
♻ ☆ On Uncertainty Calibration for Equivariant Functions
Data-sparse settings such as robotic manipulation, molecular physics, and galaxy morphology classification are some of the hardest domains for deep learning. For these problems, equivariant networks can help improve modeling across undersampled parts of the input space, and uncertainty estimation can guard against overconfidence. However, until now, the relationships between equivariance and model confidence, and more generally equivariance and model calibration, has yet to be studied. Since traditional classification and regression error terms show up in the definitions of calibration error, it is natural to suspect that previous work can be used to help understand the relationship between equivariance and calibration error. In this work, we present a theory relating equivariance to uncertainty estimation. By proving lower and upper bounds on uncertainty calibration errors (ECE and ENCE) under various equivariance conditions, we elucidate the generalization limits of equivariant models and illustrate how symmetry mismatch can result in miscalibration in both classification and regression. We complement our theoretical framework with numerical experiments that clarify the relationship between equivariance and uncertainty using a variety of real and simulated datasets, and we comment on trends with symmetry mismatch, group size, and aleatoric and epistemic uncertainties.
comment: Under review at Transactions on Machine Learning Research (TMLR). Code is available at https://github.com/EdwardBerman/EquiUQ . Excited to share this paper, comments welcome :D. This version fixed a couple typos in the previous upload
♻ ☆ On the limitation of evaluating machine unlearning using only a single training seed
Machine unlearning (MU) aims to remove the influence of certain data points from a trained model without costly retraining. Most practical MU algorithms are only approximate and their performance can only be assessed empirically. Care must therefore be taken to make empirical comparisons as representative as possible. A common practice is to run the MU algorithm multiple times independently starting from the same trained model. In this work, we demonstrate that this practice can give highly non-representative results because -- even for the same architecture and same dataset -- some MU methods can be highly sensitive to the choice of random number seed used for model training. We therefore recommend that empirical comparisons of MU algorithms should also reflect the variability across different model training seeds.
comment: mini paper, 2 figures
♻ ☆ Conformal Object Detection by Sequential Risk Control
Recent advances in object detectors have led to their adoption for industrial uses. However, their deployment in safety-critical applications is hindered by the inherent lack of reliability of neural networks and the complex structure of object detection models. To address these challenges, we turn to Conformal Prediction, a post-hoc predictive uncertainty quantification procedure with statistical guarantees that are valid for any dataset size, without requiring prior knowledge on the model or data distribution. Our contribution is manifold. First, we formally define the problem of Conformal Object Detection (COD). We introduce a novel method, Sequential Conformal Risk Control (SeqCRC), that extends the statistical guarantees of Conformal Risk Control to two sequential tasks with two parameters, as required in the COD setting. Then, we present old and new loss functions and prediction sets suited to applying SeqCRC to different cases and certification requirements. Finally, we present a conformal toolkit for replication and further exploration of our method. Using this toolkit, we perform extensive experiments that validate our approach and emphasize trade-offs and other practical consequences.
comment: 29 pages, 12 figures
♻ ☆ Kernel conditional tests from learning-theoretic bounds NeurIPS 2025
We propose a framework for hypothesis testing on conditional probability distributions, which we then use to construct statistical tests of functionals of conditional distributions. These tests identify the inputs where the functionals differ with high probability, and include tests of conditional moments or two-sample tests. Our key idea is to transform confidence bounds of a learning method into a test of conditional expectations. We instantiate this principle for kernel ridge regression (KRR) with subgaussian noise. An intermediate data embedding then enables more general tests -- including conditional two-sample tests -- via kernel mean embeddings of distributions. To have guarantees in this setting, we generalize existing pointwise-in-time or time-uniform confidence bounds for KRR to previously-inaccessible yet essential cases such as infinite-dimensional outputs with non-trace-class kernels. These bounds also circumvent the need for independent data, allowing for instance online sampling. To make our tests readily applicable in practice, we introduce bootstrapping schemes leveraging the parametric form of testing thresholds identified in theory to avoid tuning inaccessible parameters. We illustrate the tests on examples, including one in process monitoring and comparison of dynamical systems. Overall, our results establish a comprehensive foundation for conditional testing on functionals, from theoretical guarantees to an algorithmic implementation, and advance the state of the art on confidence bounds for vector-valued least squares estimation.
comment: 46 pages, 8 figures, 9 tables. Accepted at NeurIPS 2025; to appear in the proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems. Reviews and discussion: https://openreview.net/forum?id=hJKDwf32Xu
♻ ☆ TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting
Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on challenging benchmarks. This paper presents TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, eliminating the need for windowing or summarization techniques while maintaining robust temporal state-tracking. Our comprehensive synthetic data pipeline unifies diverse generators, including stochastic differential equations, Gaussian processes, and audio synthesis, with novel augmentations. In zero-shot evaluations on the Gift-Eval benchmark, TempoPFN achieves top-tier competitive performance, outperforming all existing synthetic-only approaches and surpassing the vast majority of models trained on real-world data, while being more efficient than existing baselines by leveraging fully parallelizable training and inference. We open-source our complete data generation pipeline and training code, providing a reproducible foundation for future research.
comment: 30 pages, 18 figures, 13 tables
♻ ☆ Generative Adversarial Networks for High-Dimensional Item Factor Analysis: A Deep Adversarial Learning Algorithm
Advances in deep learning and representation learning have transformed item factor analysis (IFA) in the item response theory (IRT) literature by enabling more efficient and accurate parameter estimation. Variational Autoencoders (VAEs) have been one of the most impactful techniques in modeling high-dimensional latent variables in this context. However, the limited expressiveness of the inference model based on traditional VAEs can still hinder the estimation performance. We introduce Adversarial Variational Bayes (AVB) algorithms as an improvement to VAEs for IFA with improved flexibility and accuracy. By bridging the strengths of VAEs and Generative Adversarial Networks (GANs), AVB incorporates an auxiliary discriminator network to reframe the estimation process as a two-player adversarial game and removes the restrictive assumption of standard normal distributions in the inference model. Theoretically, AVB can achieve similar or higher likelihood compared to VAEs. A further enhanced algorithm, Importance-weighted Adversarial Variational Bayes (IWAVB) is proposed and compared with Importance-weighted Autoencoders (IWAE). In an exploratory analysis of empirical data, IWAVB demonstrated superior expressiveness by achieving a higher likelihood compared to IWAE. In confirmatory analysis with simulated data, IWAVB achieved similar mean-square error results to IWAE while consistently achieving higher likelihoods. When latent variables followed a multimodal distribution, IWAVB outperformed IWAE. With its innovative use of GANs, IWAVB is shown to have the potential to extend IFA to handle large-scale data, facilitating the potential integration of psychometrics and multimodal data analysis.
♻ ☆ Non-Convex Over-the-Air Heterogeneous Federated Learning: A Bias-Variance Trade-off
Over-the-air (OTA) federated learning (FL) has been well recognized as a scalable paradigm that exploits the waveform superposition of the wireless multiple-access channel to aggregate model updates in a single use. Existing OTA-FL designs largely enforce zero-bias model updates by either assuming \emph{homogeneous} wireless conditions (equal path loss across devices) or forcing zero-bias updates to guarantee convergence. Under \emph{heterogeneous} wireless scenarios, however, such designs are constrained by the weakest device and inflate the update variance. Moreover, prior analyses of biased OTA-FL largely address convex objectives, while most modern AI models are highly non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient descent (SGD) for general smooth non-convex objectives under wireless heterogeneity. We develop novel OTA-FL SGD updates that allow a structured, time-invariant model bias while facilitating reduced variance updates. We derive a finite-time stationarity bound (expected time average squared gradient norm) that explicitly reveals a bias-variance trade-off. To optimize this trade-off, we pose a non-convex joint OTA power-control design and develop an efficient successive convex approximation (SCA) algorithm that requires only statistical CSI at the base station. Experiments on a non-convex image classification task validate the approach: the SCA-based design accelerates convergence via an optimized bias and improves generalization over prior OTA-FL baselines.
♻ ☆ UdonCare: Hierarchy Pruning for Unseen Domain Discovery in Predictive Healthcare
Healthcare providers often divide patient populations into cohorts based on shared clinical factors, such as medical history, to deliver personalized healthcare services. This idea has also been adopted in clinical prediction models, where it presents a vital challenge: capturing both global and cohort-specific patterns while enabling model generalization to unseen domains. Addressing this challenge falls under the scope of domain generalization (DG). However, conventional DG approaches often struggle in clinical settings due to the absence of explicit domain labels and the inherent gap in medical knowledge. To address this, we propose UdonCare, a hierarchy-guided method that iteratively divides patients into latent domains and decomposes domain-invariant (label) information from patient data. Our method identifies patient domains by pruning medical ontologies (e.g. ICD-9-CM hierarchy). On two public datasets, MIMIC-III and MIMIC-IV, UdonCare shows superiority over eight baselines across four clinical prediction tasks with substantial domain gaps, highlighting the untapped potential of medical knowledge in guiding clinical domain generalization problems.
♻ ☆ I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models NeurIPS 2025
We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity, attribute range, and introducing perceptual uncertainty. Compared to LLMs, empirical results show that LRMs achieve improved productivity and systematicity on longer reasoning relations and wider attribute ranges, respectively. However, LRMs are still significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.
comment: Accepted at the 5th Workshop on Mathematical Reasoning and AI (MATH-AI), NeurIPS 2025
♻ ☆ Discrete Diffusion Models: Novel Analysis and New Sampler Guarantees
Discrete diffusion models have recently gained significant prominence in applications involving natural language and graph data. A key factor influencing their effectiveness is the efficiency of discretized samplers. Among these, $\tau$-leaping samplers have become particularly popular due to their theoretical and empirical success. However, existing theoretical analyses of $\tau$-leaping often rely on somewhat restrictive and difficult-to-verify regularity assumptions, and their convergence bounds contain quadratic dependence on the vocabulary size. In this work, we introduce a new analytical approach for discrete diffusion models that removes the need for such assumptions. For the standard $\tau$-leaping method, we establish convergence guarantees in KL divergence that scale linearly with vocabulary size, improving upon prior results with quadratic dependence. Our approach is also more broadly applicable: it provides the first convergence guarantees for other widely used samplers, including the Euler method and Tweedie $\tau$-leaping. Central to our approach is a novel technique based on differential inequalities, offering a more flexible alternative to the traditional Girsanov change-of-measure methods. This technique may also be of independent interest for the analysis of other stochastic processes.
♻ ☆ Absorb and Converge: Provable Convergence Guarantee for Absorbing Discrete Diffusion Models
Discrete state space diffusion models have shown significant advantages in applications involving discrete data, such as text and image generation. It has also been observed that their performance is highly sensitive to the choice of rate matrices, particularly between uniform and absorbing rate matrices. While empirical results suggest that absorbing rate matrices often yield better generation quality compared to uniform rate matrices, existing theoretical works have largely focused on the uniform rate matrices case. Notably, convergence guarantees and error analyses for absorbing diffusion models are still missing. In this work, we provide the first finite-time error bounds and convergence rate analysis for discrete diffusion models using absorbing rate matrices. We begin by deriving an upper bound on the KL divergence of the forward process, introducing a surrogate initialization distribution to address the challenge posed by the absorbing stationary distribution, which is a singleton and causes the KL divergence to be ill-defined. We then establish the first convergence guarantees for both the $\tau$-leaping and uniformization samplers under absorbing rate matrices, demonstrating improved rates over their counterparts using uniform rate matrices. Furthermore, under suitable assumptions, we provide convergence guarantees without early stopping. Our analysis introduces several new technical tools to address challenges unique to absorbing rate matrices. These include a Jensen-type argument for bounding forward process convergence, novel techniques for bounding absorbing score functions, and a non-divergent upper bound on the score near initialization that removes the need of early-stopping.
♻ ☆ Persistent Homology for Structural Characterization in Disordered Systems
We propose a unified framework based on persistent homology (PH) to characterize both local and global structures in disordered systems. It can simultaneously generate local and global descriptors using the same algorithm and data structure, and has shown to be highly effective and interpretable in predicting particle rearrangements and classifying global phases. We also demonstrated that using a single variable enables a linear SVM to achieve nearly perfect three-phase classification. Inspired by this discovery, we define a non-parametric metric, the Separation Index (SI), which not only achieves this classification without sacrificing significant performance but also establishes a connection between particle environments and the global phase structure. Our methods provide an effective framework for understanding and analyzing the properties of disordered materials, with broad potential applications in materials science and even wider studies of complex systems.
comment: 24 pages, 19 figures
♻ ☆ SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks EMNLP 2025
Direct alignment algorithms have proven an effective step for aligning language models to human-desired behaviors. Current variants of the Direct Preference Optimization objective have focused on a strict setting where all tokens are contributing signals of KL divergence and rewards to the loss function. However, human preference is not affected equally by each word in a sequence but is often dependent on specific words or phrases, e.g. existence of toxic terms leads to non-preferred responses. Based on this observation, we argue that not all tokens should be weighted equally during PO and propose a flexible objective termed SparsePO, that aims to automatically learn to weight the KL divergence and reward corresponding to each token during PO training. We propose two different variants of weight-masks that can either be derived from the reference model itself or learned on the fly. Notably, our method induces sparsity in the learned masks, allowing the model to learn how to best balance reward and KL divergence contributions at the token level, learning an optimal level of mask sparsity. Extensive experiments illustrate the effectiveness of our approach at aligning to preference proxies, including sentiment control, helpfulness and harmlessness, and summary quality. Our method obtains +10% and +3% win rate points in summarization and dialogue scenarios, respectively, without compromising model reasoning or the relevancy and faithfulness of the summary response.
comment: 27 pages, 9 figures, 5 tables. Accepted to EMNLP 2025
♻ ☆ Smooth Flow Matching
Functional data, i.e., smooth random functions observed over a continuous domain, are increasingly available in areas such as biomedical research, health informatics, and epidemiology. However, effective statistical analysis for functional data is often hindered by challenges such as privacy constraints, sparse and irregular sampling, infinite dimensionality, and non-Gaussian structures. To address these challenges, we introduce a novel framework named Smooth Flow Matching (SFM), tailored for generative modeling of functional data to enable statistical analysis without exposing sensitive real data. Built upon flow-matching ideas, SFM constructs a semiparametric copula flow to generate infinite-dimensional functional data, free from Gaussianity or low-rank assumptions. It is computationally efficient, handles irregular observations, and guarantees the smoothness of the generated functions, offering a practical and flexible solution in scenarios where existing deep generative methods are not applicable. Through extensive simulation studies, we demonstrate the advantages of SFM in terms of both synthetic data quality and computational efficiency. We then apply SFM to generate clinical trajectory data from the MIMIC-IV patient electronic health records (EHR) longitudinal database. Our analysis showcases the ability of SFM to produce high-quality surrogate data for downstream statistical tasks, highlighting its potential to boost the utility of EHR data for clinical applications.
comment: 66 pages, 8 figures
♻ ☆ HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Early-Exit Large Language Models (EE-LLMs) enable high throughput inference by allowing tokens to exit early at intermediate layers. However, their throughput is limited by the computational and memory savings. Existing EE-LLM frameworks rely on a single model and therefore, their token generation latencies are bottlenecked by tokens that do not exit early and traverse additional layers. Moreover, early exits are only known at runtime and depend on the request. Therefore, these frameworks load the weights of all model layers even though large portions remain unused when tokens exit early. The lack of memory savings limit us from scaling the batch sizes. We propose $\textit{HELIOS}$, a framework that improves both token generation latency and batch sizes to enable high-throughput in EE-LLMs. HELIOS exploits two insights. $\textit{First}$, early exits are often complimentary across models, tokens that do not exit early on one model often take an early-exit on another. HELIOS employs multiple models and dynamically switches between them to collectively maximize the number of tokens that exit early, and minimize token generation latencies. $\textit{Second}$, even when a predicted token does not exit early due to poor confidence, it often remains unchanged even after additional layer traversal. HELIOS greedily allows such tokens to exit early and only loads the weights of the most likely to be used layers, yielding memory savings which is then re-purposed to increase batch sizes. HELIOS employs real-time profiling to accurately identify the early-exit distributions, and adaptively switches between models by tracking tokens in real-time to minimize the performance degradation caused by greedy model loading and exiting. Our evaluations show that HELIOS achieves $1.48\times$ higher throughput and $15.14\times$ larger batch size compared to existing EE-LLM frameworks.
♻ ☆ Supervised Quadratic Feature Analysis: Information Geometry Approach for Dimensionality Reduction
Supervised dimensionality reduction maps labeled data into a low-dimensional feature space while preserving class discriminability. A common approach is to maximize a statistical measure of dissimilarity between classes in the feature space. Information geometry provides an alternative framework for measuring class dissimilarity, with the potential for improved insights and novel applications. Information geometry, which is grounded in Riemannian geometry, uses the Fisher information metric, a local measure of discriminability that induces the Fisher-Rao distance. Here, we present Supervised Quadratic Feature Analysis (SQFA), a linear dimensionality reduction method that maximizes Fisher-Rao distances between class-conditional distributions, under Gaussian assumptions. We motivate the Fisher-Rao distance as a good proxy for discriminability. We show that SQFA features support good classification performance with Quadratic Discriminant Analysis (QDA) on three real-world datasets. SQFA provides a novel framework for supervised dimensionality reduction, motivating future research in applying information geometry to machine learning and neuroscience.
comment: 24 pages, 15 figures
♻ ☆ Resource-Adaptive Successive Doubling for Hyperparameter Optimization with Large Datasets on High-Performance Computing Systems
On High-Performance Computing (HPC) systems, several hyperparameter configurations can be evaluated in parallel to speed up the Hyperparameter Optimization (HPO) process. State-of-the-art HPO methods follow a bandit-based approach and build on top of successive halving, where the final performance of a combination is estimated based on a lower than fully trained fidelity performance metric and more promising combinations are assigned more resources over time. Frequently, the number of epochs is treated as a resource, letting more promising combinations train longer. Another option is to use the number of workers as a resource and directly allocate more workers to more promising configurations via data-parallel training. This article proposes a novel Resource-Adaptive Successive Doubling Algorithm (RASDA), which combines a resource-adaptive successive doubling scheme with the plain Asynchronous Successive Halving Algorithm (ASHA). Scalability of this approach is shown on up to 1,024 Graphics Processing Units (GPUs) on modern HPC systems. It is applied to different types of Neural Networks (NNs) and trained on large datasets from the Computer Vision (CV), Computational Fluid Dynamics (CFD), and Additive Manufacturing (AM) domains, where performing more than one full training run is usually infeasible. Empirical results show that RASDA outperforms ASHA by a factor of up to 1.9 with respect to the runtime. At the same time, the solution quality of final ASHA models is maintained or even surpassed by the implicit batch size scheduling of RASDA. With RASDA, systematic HPO is applied to a terabyte-scale scientific dataset for the first time in the literature, enabling efficient optimization of complex models on massive scientific data. The implementation of RASDA is available on https://github.com/olympiquemarcel/rasda
♻ ☆ Data-Driven Stochastic Optimal Control in Reproducing Kernel Hilbert Spaces
This paper proposes a fully data-driven approach for optimal control of nonlinear control-affine systems represented by a stochastic diffusion. The focus is on the scenario where both the nonlinear dynamics and stage cost functions are unknown, while only a control penalty function and constraints are provided. To this end, we embed state probability densities into a reproducing kernel Hilbert space (RKHS) to leverage recent advances in operator regression, thereby identifying Markov transition operators associated with controlled diffusion processes. This operator learning approach integrates naturally with convex operator-theoretic Hamilton-Jacobi-Bellman recursions that scale linearly with state dimensionality, effectively solving a wide range of nonlinear optimal control problems. Numerical results demonstrate its ability to address diverse nonlinear control tasks, including the depth regulation of an autonomous underwater vehicle.
comment: author-submitted electronic preprint version: 19 pages, 5 figures, 3 tables
♻ ☆ A Regularized Newton Method for Nonconvex Optimization with Global and Local Complexity Guarantees NeurIPS 2025
Finding an $\epsilon$-stationary point of a nonconvex function with a Lipschitz continuous Hessian is a central problem in optimization. Regularized Newton methods are a classical tool and have been studied extensively, yet they still face a trade-off between global and local convergence. Whether a parameter-free algorithm of this type can simultaneously achieve optimal global complexity and quadratic local convergence remains an open question. To bridge this long-standing gap, we propose a new class of regularizers constructed from the current and previous gradients, and leverage the conjugate gradient approach with a negative curvature monitor to solve the regularized Newton equation. The proposed algorithm is adaptive, requiring no prior knowledge of the Hessian Lipschitz constant, and achieves a global complexity of $O(\epsilon^{-3/2})$ in terms of the second-order oracle calls, and $\tilde{O}(\epsilon^{-7/4})$ for Hessian-vector products, respectively. When the iterates converge to a point where the Hessian is positive definite, the method exhibits quadratic local convergence. Preliminary numerical results, including training the physics-informed neural networks, illustrate the competitiveness of our algorithm.
comment: NeurIPS 2025
♻ ☆ A Practical Introduction to Kernel Discrepancies: MMD, HSIC & KSD
This article provides a practical introduction to kernel discrepancies, focusing on the Maximum Mean Discrepancy (MMD), the Hilbert-Schmidt Independence Criterion (HSIC), and the Kernel Stein Discrepancy (KSD). Various estimators for these discrepancies are presented, including the commonly-used V-statistics and U-statistics, as well as several forms of the more computationally-efficient incomplete U-statistics. The importance of the choice of kernel bandwidth is stressed, showing how it affects the behaviour of the discrepancy estimation. Adaptive estimators are introduced, which combine multiple estimators with various kernels, addressing the problem of kernel selection.
comment: 26 pages, 7 figures, v2 fixes Equation 16, v3 fixes cross-covariance notation p3
♻ ☆ Challenges learning from imbalanced data using tree-based models: Prevalence estimates systematically depend on hyperparameters and can be upwardly biased
When using machine learning for imbalanced binary classification problems, it is common to subsample the majority class to create a (more) balanced training dataset. This biases the model's predictions because the model learns from data whose data generating process differs from new data. One way of accounting for this bias is analytically mapping the resulting predictions to new values based on the sampling rate for the majority class. We show that calibrating a random forest this way has negative consequences, including prevalence estimates that depend on both the number of predictors considered at each split in the random forest and the sampling rate used. We explain the former using known properties of random forests and analytical calibration. Through investigating the latter issue, we made a surprising discovery - contrary to the widespread belief that decision trees are biased towards the majority class, they actually can be biased towards the minority class.
♻ ☆ RObotic MAnipulation Network (ROMAN) -- Hybrid Hierarchical Learning for Solving Complex Sequential Tasks
Solving long sequential tasks poses a significant challenge in embodied artificial intelligence. Enabling a robotic system to perform diverse sequential tasks with a broad range of manipulation skills is an active area of research. In this work, we present a Hybrid Hierarchical Learning framework, the Robotic Manipulation Network (ROMAN), to address the challenge of solving multiple complex tasks over long time horizons in robotic manipulation. ROMAN achieves task versatility and robust failure recovery by integrating behavioural cloning, imitation learning, and reinforcement learning. It consists of a central manipulation network that coordinates an ensemble of various neural networks, each specialising in distinct re-combinable sub-tasks to generate their correct in-sequence actions for solving complex long-horizon manipulation tasks. Experimental results show that by orchestrating and activating these specialised manipulation experts, ROMAN generates correct sequential activations for accomplishing long sequences of sophisticated manipulation tasks and achieving adaptive behaviours beyond demonstrations, while exhibiting robustness to various sensory noises. These results demonstrate the significance and versatility of ROMAN's dynamic adaptability featuring autonomous failure recovery capabilities, and highlight its potential for various autonomous manipulation tasks that demand adaptive motor skills.
comment: To appear in Nature Machine Intelligence. Includes the main and supplementary manuscript. Total of 70 pages, with a total of 9 Figures and 17 Tables
♻ ☆ DO-IQS: Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping with Unknown Gain Functions
We consider the Inverse Optimal Stopping (IOS) problem where, based on stopped expert trajectories, one aims to recover the optimal stopping region through the continuation and stopping gain functions approximation. The uniqueness of the stopping region allows the use of IOS in real-world applications with safety concerns. Although current state-of-the-art inverse reinforcement learning methods recover both a Q-function and the corresponding optimal policy, they fail to account for specific challenges posed by optimal stopping problems. These include data sparsity near the stopping region, the non-Markovian nature of the continuation gain, a proper treatment of boundary conditions, the need for a stable offline approach for risk-sensitive applications, and a lack of a quality evaluation metric. These challenges are addressed with the proposed Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping (DO-IQS), which incorporates temporal information by approximating the cumulative continuation gain together with the world dynamics and the Q-function without querying to the environment. In addition, a confidence-based oversampling approach is proposed to treat the data sparsity problem. We demonstrate the performance of our models on real and artificial data including an optimal intervention for the critical events problem.
♻ ☆ PoLAR: Polar-Decomposed Low-Rank Adapter Representation
We show that low-rank adaptation of large-scale models suffers from a low stable rank that is well below the linear algebraic rank of the subspace, degrading fine-tuning performance. To mitigate the underutilization of the allocated subspace, we propose PoLAR, a parameterization inspired by the polar decomposition that factorizes the low-rank update into two direction matrices constrained to Stiefel manifolds and an unconstrained scale matrix. Our theory shows that PoLAR yields an exponentially faster convergence rate on a canonical low-rank adaptation problem. Pairing the parameterization with Riemannian optimization leads to consistent gains on three different benchmarks testing general language understanding, commonsense reasoning, and mathematical problem solving with base model sizes ranging from 350M to 27B.
♻ ☆ A tutorial on discovering and quantifying the effect of latent causal sources of multimodal EHR data
We provide an accessible description of a peer-reviewed generalizable causal machine learning pipeline to (i) discover latent causal sources of large-scale electronic health records observations, and (ii) quantify the source causal effects on clinical outcomes. We illustrate how imperfect multimodal clinical data can be processed, decomposed into probabilistic independent latent sources, and used to train taskspecific causal models from which individual causal effects can be estimated. We summarize the findings of the two real-world applications of the approach to date as a demonstration of its versatility and utility for medical discovery at scale.
comment: Accepted at the 1st Multimodal Representation Learning for Healthcare EurIPS 2025 Workshop (https://multimodal-rep-learning-for-health.github.io/)
♻ ☆ NaviAgent: Bilevel Planning on Tool Navigation Graph for Large-Scale Orchestration
Large language models (LLMs) have recently demonstrated the ability to act as function call agents by invoking external tools, enabling them to solve tasks beyond their static knowledge. However, existing agents typically call tools step by step at a time without a global view of task structure. As tools depend on each other, this leads to error accumulation and limited scalability, particularly when scaling to thousands of tools. To address these limitations, we propose NaviAgent, a novel bilevel architecture that decouples task planning from tool execution through graph-based modeling of the tool ecosystem. At the task-planning level, the LLM-based agent decides whether to respond directly, clarify user intent, invoke a toolchain, or execute tool outputs, ensuring broad coverage of interaction scenarios independent of inter-tool complexity. At the execution level, a continuously evolving Tool World Navigation Model (TWNM) encodes structural and behavioral relations among tools, guiding the agent to generate scalable and robust invocation sequences. By incorporating feedback from real tool interactions, NaviAgent supports closed-loop optimization of planning and execution, moving beyond tool calling toward adaptive navigation of large-scale tool ecosystems. Experiments show that NaviAgent achieves the best task success rates across models and tasks, and integrating TWMN further boosts performance by up to 17 points on complex tasks, underscoring its key role in toolchain orchestration.
♻ ☆ Byzantine Resilient Federated Multi-Task Representation Learning
In this paper, we propose BR-MTRL, a Byzantine-resilient multi-task representation learning framework that handles faulty or malicious agents. Our approach leverages representation learning through a shared neural network model, where all clients share fixed layers, except for a client-specific final layer. This structure captures shared features among clients while enabling individual adaptation, making it a promising approach for leveraging client data and computational power in heterogeneous federated settings to learn personalized models. To learn the model, we employ an alternating gradient descent strategy: each client optimizes its local model, updates its final layer, and sends estimates of the shared representation to a central server for aggregation. To defend against Byzantine agents, we employ two robust aggregation methods for client-server communication, Geometric Median and Krum. Our method enables personalized learning while maintaining resilience in distributed settings. We implemented the proposed algorithm in a federated testbed built using Amazon Web Services (AWS) platform and compared its performance with various benchmark algorithms and their variations. Through experiments using real-world datasets, including CIFAR-10 and FEMNIST, we demonstrated the effectiveness and robustness of our approach and its transferability to new unseen clients with limited data, even in the presence of Byzantine adversaries.
♻ ☆ Convergence of continuous-time stochastic gradient descent with applications to deep neural networks
We study a continuous-time approximation of the stochastic gradient descent process for minimizing the population expected loss in learning problems. The main results establish general sufficient conditions for the convergence, extending the results of Chatterjee (2022) established for (nonstochastic) gradient descent. We show how the main result can be applied to the case of overparametrized neural network training.
♻ ☆ ESTformer: Transformer utilising spatiotemporal dependencies for electroencephalogram super-resolution
Towards practical applications of Electroencephalography (EEG), lightweight acquisition devices garner significant attention. However, EEG channel selection methods are commonly data-sensitive and cannot establish a unified sound paradigm for EEG acquisition devices. Through reverse conceptualisation, we formulated EEG applications in an EEG super-resolution (SR) manner, but suffered from high computation costs, extra interpolation bias, and few insights into spatiotemporal dependency modelling. To this end, we propose ESTformer, an EEG SR framework that utilises spatiotemporal dependencies based on the transformer. ESTformer applies positional encoding methods and a multihead self-attention mechanism to the space and time dimensions, which can learn spatial structural correlations and temporal functional variations. ESTformer, with the fixed mask strategy, adopts a mask token to upsample low-resolution (LR) EEG data in the case of disturbance from mathematical interpolation methods. On this basis, we designed various transformer blocks to construct a spatial interpolation module (SIM) and a temporal reconstruction module (TRM). Finally, ESTformer cascades the SIM and TRM to capture and model the spatiotemporal dependencies for EEG SR with fidelity. Extensive experimental results on two EEG datasets show the effectiveness of ESTformer against previous state-of-the-art methods, demonstrating the versatility of the Transformer for EEG SR tasks. The superiority of the SR data was verified in an EEG-based person identification and emotion recognition task, achieving a 2% to 38% improvement compared with the LR data at different sampling scales.
comment: Accepted by Knowledge-Based Systems
♻ ☆ Generating Auxiliary Tasks with Reinforcement Learning
Auxiliary Learning (AL) is a form of multi-task learning in which a model trains on auxiliary tasks to boost performance on a primary objective. While AL has improved generalization across domains such as navigation, image classification, and NLP, it often depends on human-labeled auxiliary tasks that are costly to design and require domain expertise. Meta-learning approaches mitigate this by learning to generate auxiliary tasks, but typically rely on gradient based bi-level optimization, adding substantial computational and implementation overhead. We propose RL-AUX, a reinforcement-learning (RL) framework that dynamically creates auxiliary tasks by assigning auxiliary labels to each training example, rewarding the agent whenever its selections improve the performance on the primary task. We also explore learning per-example weights for the auxiliary loss. On CIFAR-100 grouped into 20 superclasses, our RL method outperforms human-labeled auxiliary tasks and matches the performance of a prominent bi-level optimization baseline. We present similarly strong results on other classification datasets. These results suggest RL is a viable path to generating effective auxiliary tasks.
♻ ☆ Accelerated Rates between Stochastic and Adversarial Online Convex Optimization NeurIPS 2022
Stochastic and adversarial data are two widely studied settings in online learning. But many optimization tasks are neither i.i.d. nor fully adversarial, which makes it of fundamental interest to get a better theoretical understanding of the world between these extremes. In this work we establish novel regret bounds for online convex optimization in a setting that interpolates between stochastic i.i.d. and fully adversarial losses. By exploiting smoothness of the expected losses, these bounds replace a dependence on the maximum gradient length by the variance of the gradients, which was previously known only for linear losses. In addition, they weaken the i.i.d. assumption by allowing, for example, adversarially poisoned rounds, which were previously considered in the related expert and bandit settings. In the fully i.i.d. case, our regret bounds match the rates one would expect from results in stochastic acceleration, and we also recover the optimal stochastically accelerated rates via online-to-batch conversion. In the fully adversarial case our bounds gracefully deteriorate to match the minimax regret. We further provide lower bounds showing that our regret upper bounds are tight for all intermediate regimes in terms of the stochastic variance and the adversarial variation of the loss gradients.
comment: There is an unfixable mistake in the proof of Lemma 15, as kindly brought to our attention by Peng Zhao. The mistake is not present in the earlier NeurIPS 2022 conference version of the paper, which does not include this result
♻ ☆ Token Distillation: Attention-aware Input Embeddings For New Tokens
Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary. New tokens can be added to solve this problem, when coupled with a good initialization for their new embeddings. However, existing embedding initialization methods require expensive further training or pretraining of additional modules. In this paper, we propose Token Distillation and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens. Experimental results with a wide range of open-weight models show that Token Distillation outperforms even strong baselines.
comment: Additional experiments + clearer method name compared to the May 2025 version
♻ ☆ Fast Adversarial Training against Sparse Attacks Requires Loss Smoothing
This paper studies fast adversarial training against sparse adversarial perturbations bounded by $l_0$ norm. We demonstrate the challenges of employing $1$-step attacks on $l_0$ bounded perturbations for fast adversarial training, including degraded performance and the occurrence of catastrophic overfitting (CO). We highlight that CO in $l_0$ adversarial training is caused by sub-optimal perturbation locations of $1$-step attack. Theoretical and empirical analyses reveal that the loss landscape of $l_0$ adversarial training is more craggy compared to its $l_\infty$, $l_2$ and $l_1$ counterparts. Moreover, we corroborate that the craggy loss landscape can aggravate CO. To address these issues, we propose Fast-LS-$l_0$ that incorporates soft labels and the trade-off loss function to smooth the adversarial loss landscape. Extensive experiments demonstrate our method can overcome the challenge of catastrophic overfitting, achieve state-of-the-art performance, and narrow down the performance gap between $1$-step and multi-step adversarial training against sparse attacks.
♻ ☆ Fair Play for Individuals, Foul Play for Groups? Auditing Anonymization's Impact on ML Fairness ECAI 2025
Machine learning (ML) algorithms are heavily based on the availability of training data, which, depending on the domain, often includes sensitive information about data providers. This raises critical privacy concerns. Anonymization techniques have emerged as a practical solution to address these issues by generalizing features or suppressing data to make it more difficult to accurately identify individuals. Although recent studies have shown that privacy-enhancing technologies can influence ML predictions across different subgroups, thus affecting fair decision-making, the specific effects of anonymization techniques, such as $k$-anonymity, $\ell$-diversity, and $t$-closeness, on ML fairness remain largely unexplored. In this work, we systematically audit the impact of anonymization techniques on ML fairness, evaluating both individual and group fairness. Our quantitative study reveals that anonymization can degrade group fairness metrics by up to fourfold. Conversely, similarity-based individual fairness metrics tend to improve under stronger anonymization, largely as a result of increased input homogeneity. By analyzing varying levels of anonymization across diverse privacy settings and data distributions, this study provides critical insights into the trade-offs between privacy, fairness, and utility, offering actionable guidelines for responsible AI development. Our code is publicly available at: https://github.com/hharcolezi/anonymity-impact-fairness.
comment: ECAI 2025
♻ ☆ Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
Numerous applications of large language models (LLMs) rely on their ability to perform step-by-step reasoning. However, the reasoning behavior of LLMs remains poorly understood, posing challenges to research, development, and safety. To address this gap, we introduce landscape of thoughts (LoT), the first landscape visualization tool to inspect the reasoning trajectories with certain reasoning methods on any multi-choice dataset. We represent the textual states in a trajectory as numerical features that quantify the states' distances to the answer choices. These features are then visualized in two-dimensional plots using t-SNE. Qualitative and quantitative analysis with the landscape of thoughts effectively distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty. Additionally, users can adapt LoT to a model that predicts the property they observe. We showcase this advantage by adapting LoT to a lightweight verifier that evaluates the correctness of trajectories. Empirically, this verifier boosts the reasoning accuracy and the test-time scaling effect. The code is publicly available at: https://github.com/tmlr-group/landscape-of-thoughts.
♻ ☆ Eliciting Secret Knowledge from Language Models
We study secret elicitation: discovering knowledge that an AI possesses but does not explicitly verbalize. As a testbed, we train three families of large language models (LLMs) to possess specific knowledge that they apply downstream but deny knowing when asked directly. For example, in one setting, we train an LLM to generate replies that are consistent with knowing the user is female, while denying this knowledge when asked directly. We then design various black-box and white-box secret elicitation techniques and evaluate them based on whether they can help an LLM auditor successfully guess the secret knowledge. Many of our techniques improve on simple baselines. Our most effective techniques (performing best in all settings) are based on prefill attacks, a black-box technique where the LLM reveals secret knowledge when generating a completion from a predefined prefix. Our white-box techniques based on logit lens and sparse autoencoders (SAEs) also consistently increase the success rate of the LLM auditor, but are less effective. We release our models and code, establishing a public benchmark for evaluating secret elicitation methods.
♻ ☆ RTNinja: a generalized machine learning framework for analyzing random telegraph noise signals in nanoelectronic devices
Random telegraph noise is a prevalent variability phenomenon in nanoelectronic devices, arising from stochastic carrier exchange at defect sites and critically impacting device reliability and performance. Conventional analysis techniques often rely on restrictive assumptions or manual interventions, limiting their applicability to complex, noisy datasets. Here, we introduce RTNinja, a generalized, fully automated machine learning framework for the unsupervised analysis of random telegraph noise signals. RTNinja deconvolves complex signals to identify the number and characteristics of hidden individual sources, without requiring prior knowledge of the system. The framework comprises two modular components: LevelsExtractor, which uses Bayesian inference and model selection to denoise and discretize the signal; and SourcesMapper, which infers source configurations through probabilistic clustering and optimization. To evaluate performance, we developed a Monte Carlo simulator that generates labeled datasets spanning broad signal-to-noise ratios and source complexities; across 7000 such datasets, RTNinja consistently demonstrated high-fidelity signal reconstruction and accurate extraction of source amplitudes and activity patterns. Our results demonstrate that RTNinja offers a robust, scalable, and device-agnostic tool for random telegraph noise characterization, enabling large-scale statistical benchmarking, reliability-centric technology qualification, predictive failure modeling, and device physics exploration in next-generation nanoelectronics.
♻ ☆ DualOptim: Enhancing Efficacy and Stability in Machine Unlearning with Dual Optimizers
Existing machine unlearning (MU) approaches exhibit significant sensitivity to hyperparameters, requiring meticulous tuning that limits practical deployment. In this work, we first empirically demonstrate the instability and suboptimal performance of existing popular MU methods when deployed in different scenarios. To address this issue, we propose Dual Optimizer (DualOptim), which incorporates adaptive learning rate and decoupled momentum factors. Empirical and theoretical evidence demonstrates that DualOptim contributes to effective and stable unlearning. Through extensive experiments, we show that DualOptim can significantly boost MU efficacy and stability across diverse tasks, including image classification, image generation, and large language models, making it a versatile approach to empower existing MU algorithms.
♻ ☆ An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning
Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-$K$ discards structural information and performs poorly in practice, while Top-$K$ preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-$K$, an {All-Reduce}-Compatible Top-$K$ compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-$K$ is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions. Empirically, ARC-Top-$K$ matches the accuracy of Top-$K$ while reducing wall-clock training time by up to 60.7\%, offering an efficient and scalable solution that combines the robustness of Rand-$K$ with the strong performance of Top-$K$.
comment: 8 pages, 2 figures
♻ ☆ Qini Curve Estimation under Clustered Network Interference
Qini curves are a widely used tool for assessing treatment policies under allocation constraints as they visualize the incremental gain of a new treatment policy versus the cost of its implementation. Standard Qini curve estimation assumes no interference between units: that is, that treating one unit does not influence the outcome of any other unit. In many real-life applications such as public policy or marketing, however, the presence of interference is common. Ignoring interference in these scenarios can lead to systematically biased Qini curves that over- or under-estimate a treatment policy's cost-effectiveness. In this paper, we address the problem of Qini curve estimation under clustered network interference, where interfering units form independent clusters. We propose a formal description of the problem setting with an experimental study design under which we can account for clustered network interference. Within this framework, we describe three estimation strategies, each suited to different conditions, and provide guidance for selecting the most appropriate approach by highlighting the inherent bias-variance trade-offs. To complement our theoretical analysis, we introduce a marketplace simulator that replicates clustered network interference in a typical e-commerce environment, allowing us to evaluate and compare the proposed strategies in practice.
♻ ☆ Geometry-Aware Edge Pooling for Graph Neural Networks NeurIPS
Graph Neural Networks (GNNs) have shown significant success for graph-based tasks. Motivated by the prevalence of large datasets in real-world applications, pooling layers are crucial components of GNNs. By reducing the size of input graphs, pooling enables faster training and potentially better generalisation. However, existing pooling operations often optimise for the learning task at the expense of discarding fundamental graph structures, thus reducing interpretability. This leads to unreliable performance across dataset types, downstream tasks and pooling ratios. Addressing these concerns, we propose novel graph pooling layers for structure-aware pooling via edge collapses. Our methods leverage diffusion geometry and iteratively reduce a graph's size while preserving both its metric structure and its structural diversity. We guide pooling using magnitude, an isometry-invariant diversity measure, which permits us to control the fidelity of the pooling process. Further, we use the spread of a metric space as a faster and more stable alternative ensuring computational efficiency. Empirical results demonstrate that our methods (i) achieve top performance compared to alternative pooling layers across a range of diverse graph classification tasks, (ii) preserve key spectral properties of the input graphs, and (iii) retain high accuracy across varying pooling ratios.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS) 2025. Our code is available at https://github.com/aidos-lab/mag_edge_pool
♻ ☆ Uncertainty-Based Smooth Policy Regularisation for Reinforcement Learning with Few Demonstrations
In reinforcement learning with sparse rewards, demonstrations can accelerate learning, but determining when to imitate them remains challenging. We propose Smooth Policy Regularisation from Demonstrations (SPReD), a framework that addresses the fundamental question: when should an agent imitate a demonstration versus follow its own policy? SPReD uses ensemble methods to explicitly model Q-value distributions for both demonstration and policy actions, quantifying uncertainty for comparisons. We develop two complementary uncertainty-aware methods: a probabilistic approach estimating the likelihood of demonstration superiority, and an advantage-based approach scaling imitation by statistical significance. Unlike prevailing methods (e.g. Q-filter) that make binary imitation decisions, SPReD applies continuous, uncertainty-proportional regularisation weights, reducing gradient variance during training. Despite its computational simplicity, SPReD achieves remarkable gains in experiments across eight robotics tasks, outperforming existing approaches by up to a factor of 14 in complex tasks while maintaining robustness to demonstration quality and quantity. Our code is available at https://github.com/YujieZhu7/SPReD.
♻ ☆ Khiops: An End-to-End, Frugal AutoML and XAI Machine Learning Solution for Large, Multi-Table Databases
Khiops is an open source machine learning tool designed for mining large multi-table databases. Khiops is based on a unique Bayesian approach that has attracted academic interest with more than 20 publications on topics such as variable selection, classification, decision trees and co-clustering. It provides a predictive measure of variable importance using discretisation models for numerical data and value clustering for categorical data. The proposed classification/regression model is a naive Bayesian classifier incorporating variable selection and weight learning. In the case of multi-table databases, it provides propositionalisation by automatically constructing aggregates. Khiops is adapted to the analysis of large databases with millions of individuals, tens of thousands of variables and hundreds of millions of records in secondary tables. It is available on many environments, both from a Python library and via a user interface.
♻ ☆ Towards User-Focused Research in Training Data Attribution for Human-Centered Explainable AI
Explainable AI (XAI) aims to make AI systems more transparent, yet many practices emphasise mathematical rigour over practical user needs. We propose an alternative to this model-centric approach by following a design thinking process for the emerging XAI field of training data attribution (TDA), which risks repeating solutionist patterns seen in other subfields. However, because TDA is in its early stages, there is a valuable opportunity to shape its direction through user-centred practices. We engage directly with machine learning developers via a needfinding interview study (N=6) and a scenario-based interactive user study (N=31) to ground explanations in real workflows. Our exploration of the TDA design space reveals novel tasks for data-centric explanations useful to developers, such as grouping training samples behind specific model behaviours or identifying undersampled data. We invite the TDA, XAI, and HCI communities to engage with these tasks to strengthen their research's practical relevance and human impact.
♻ ☆ SC-LoRA: Balancing Efficient Fine-tuning and Knowledge Preservation via Subspace-Constrained LoRA
Parameter-Efficient Fine-Tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), are indispensable for efficiently customizing Large Language Models (LLMs). However, vanilla LoRA suffers from slow convergence speed and knowledge forgetting problems. Recent studies have leveraged the power of designed LoRA initialization, to enhance the fine-tuning efficiency, or to preserve knowledge in the pre-trained LLM. However, none of these works can address the two cases at the same time. To this end, we introduce Subspace-Constrained LoRA (SC-LoRA), a novel LoRA initialization framework engineered to navigate the trade-off between efficient fine-tuning and knowledge preservation. We achieve this by constraining the output of trainable LoRA adapters in a low-rank subspace, where the context information of fine-tuning data is most preserved while the context information of preserved knowledge is least retained, in a balanced way. Such constraint enables the trainable weights to primarily focus on the main features of fine-tuning data while avoiding damaging the preserved knowledge features. We provide theoretical analysis on our method, and conduct extensive experiments including safety preservation and world knowledge preservation, on various downstream tasks. In our experiments, SC-LoRA succeeds in delivering superior fine-tuning performance while markedly diminishing knowledge forgetting, surpassing contemporary LoRA initialization methods.
♻ ☆ Manifold Learning for Hyperspectral Images
Traditional feature extraction and projection techniques, such as Principal Component Analysis, struggle to adequately represent X-Ray Transmission (XRT) Multi-Energy (ME) images, limiting the performance of neural networks in decision-making processes. To address this issue, we propose a method that approximates the dataset topology by constructing adjacency graphs using the Uniform Manifold Approximation and Projection. This approach captures nonlinear correlations within the data, significantly improving the performance of machine learning algorithms, particularly in processing Hyperspectral Images (HSI) from X-ray transmission spectroscopy. This technique not only preserves the global structure of the data but also enhances feature separability, leading to more accurate and robust classification results.
♻ ☆ Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates
Clinical trials are a systematic endeavor to assess the safety and efficacy of new drugs or treatments. Conducting such trials typically demands significant financial investment and meticulous planning, highlighting the need for accurate predictions of trial outcomes. Accurately predicting patient enrollment, a key factor in trial success, is one of the primary challenges during the planning phase. In this work, we propose a novel deep learning-based method to address this critical challenge. Our method, implemented as a neural network model, leverages pre-trained language models (PLMs) to capture the complexities and nuances of clinical documents, transforming them into expressive representations. These representations are then combined with encoded tabular features via an attention mechanism. To account for uncertainties in enrollment prediction, we enhance the model with a probabilistic layer based on the Gamma distribution, which enables range estimation. We apply the proposed model to predict clinical trial duration, assuming site-level enrollment follows a Poisson-Gamma process. We carry out extensive experiments on real-world clinical trial data, and show that the proposed method can effectively predict the number of patients enrolled at a number of sites for a given clinical trial, outperforming established baseline models.
♻ ☆ Towards a Generalizable AI for Materials Discovery: Validation through Immersion Coolant Screening
Artificial intelligence (AI) has emerged as a powerful accelerator of materials discovery, yet most existing models remain problem-specific, requiring additional data collection and retraining for each new property. Here we introduce and validate GATE (Geometrically Aligned Transfer Encoder) -- a generalizable AI framework that jointly learns 34 physicochemical properties spanning thermal, electrical, mechanical, and optical domains. By aligning these properties within a shared geometric space, GATE captures cross-property correlations that reduce disjoint-property bias -- a key factor causing false positives in multi-criteria screening. To demonstrate its generalizable utility, GATE -- without any problem-specific model reconfiguration -- applied to the discovery of immersion cooling fluids for data centers, a stringent real-world challenge defined by the Open Compute Project (OCP). Screening billions of candidates, GATE identified 92,861 molecules as promising for practical deployment. Four were experimentally or literarily validated, showing strong agreement with wet-lab measurements and performance comparable to or exceeding a commercial coolant. These results establish GATE as a generalizable AI platform readily applicable across diverse materials discovery tasks.
comment: 16 pages, 4 figures
♻ ☆ Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies
Deep reinforcement learning has achieved impressive success in control tasks. However, its policies, represented as opaque neural networks, are often difficult for humans to understand, verify, and debug, which undermines trust and hinders real-world deployment. This work addresses this challenge by introducing a novel approach for programmatic control policy discovery, called Multimodal Large Language Model-assisted Evolutionary Search (MLES). MLES utilizes multimodal large language models as programmatic policy generators, combining them with evolutionary search to automate policy generation. It integrates visual feedback-driven behavior analysis within the policy generation process to identify failure patterns and guide targeted improvements, thereby enhancing policy discovery efficiency and producing adaptable, human-aligned policies. Experimental results demonstrate that MLES achieves performance comparable to Proximal Policy Optimization (PPO) across two standard control tasks while providing transparent control logic and traceable design processes. This approach also overcomes the limitations of predefined domain-specific languages, facilitates knowledge transfer and reuse, and is scalable across various tasks, showing promise as a new paradigm for developing transparent and verifiable control policies.
♻ ☆ Offline Clustering of Preference Learning with Active-data Augmentation
Preference learning from pairwise feedback is a widely adopted framework in applications such as reinforcement learning with human feedback and recommendations. In many practical settings, however, user interactions are limited or costly, making offline preference learning necessary. Moreover, real-world preference learning often involves users with different preferences. For example, annotators from different backgrounds may rank the same responses differently. This setting presents two central challenges: (1) identifying similarity across users to effectively aggregate data, especially under scenarios where offline data is imbalanced across dimensions, and (2) handling the imbalanced offline data where some preference dimensions are underrepresented. To address these challenges, we study the Offline Clustering of Preference Learning problem, where the learner has access to fixed datasets from multiple users with potentially different preferences and aims to maximize utility for a test user. To tackle the first challenge, we first propose Off-C$^2$PL for the pure offline setting, where the learner relies solely on offline data. Our theoretical analysis provides a suboptimality bound that explicitly captures the tradeoff between sample noise and bias. To address the second challenge of inbalanced data, we extend our framework to the setting with active-data augmentation where the learner is allowed to select a limited number of additional active-data for the test user based on the cluster structure learned by Off-C$^2$PL. In this setting, our second algorithm, A$^2$-Off-C$^2$PL, actively selects samples that target the least-informative dimensions of the test user's preference. We prove that these actively collected samples contribute more effectively than offline ones. Finally, we validate our theoretical results through simulations on synthetic and real-world datasets.
♻ ☆ Aeolus: A Multi-structural Flight Delay Dataset
We introduce Aeolus, a large-scale Multi-modal Flight Delay Dataset designed to advance research on flight delay prediction and support the development of foundation models for tabular data. Existing datasets in this domain are typically limited to flat tabular structures and fail to capture the spatiotemporal dynamics inherent in delay propagation. Aeolus addresses this limitation by providing three aligned modalities: (i) a tabular dataset with rich operational, meteorological, and airportlevel features for over 50 million flights; (ii) a flight chain module that models delay propagation along sequential flight legs, capturing upstream and downstream dependencies; and (iii) a flight network graph that encodes shared aircraft, crew, and airport resource connections, enabling cross-flight relational reasoning. The dataset is carefully constructed with temporal splits, comprehensive features, and strict leakage prevention to support realistic and reproducible machine learning evaluation. Aeolus supports a broad range of tasks, including regression, classification, temporal structure modeling, and graph learning, serving as a unified benchmark across tabular, sequential, and graph modalities. We release baseline experiments and preprocessing tools to facilitate adoption. Aeolus fills a key gap for both domain-specific modeling and general-purpose structured data research.Our source code and data can be accessed at https://github.com/Flnny/Delay-data
♻ ☆ LLM Based Long Code Translation using Identifier Replacement
In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often struggle with long source codes that don't fit into the context window, which produces inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting user-given long identifiers with generalized placeholders during translation, our method allows the LLM to focus on the logical structure of the code, by reducing token count and memory usage, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translation results with reduced tokens.
♻ ☆ Kernel Mean Embedding Topology: Weak and Strong Forms for Stochastic Kernels and Implications for Model Learning
We introduce a novel topology, called Kernel Mean Embedding Topology, for stochastic kernels, in a weak and strong form. This topology, defined on the spaces of Bochner integrable functions from a signal space to a space of probability measures endowed with a Hilbert space structure, allows for a versatile formulation. This construction allows one to obtain both a strong and weak formulation. (i) For its weak formulation, we highlight the utility on relaxed policy spaces, and investigate connections with the Young narrow topology and Borkar (or \( w^* \))-topology, and establish equivalence properties. We report that, while both the \( w^* \)-topology and kernel mean embedding topology are relatively compact, they are not closed. Conversely, while the Young narrow topology is closed, it lacks relative compactness. (ii) We show that the strong form provides an appropriate formulation for placing topologies on spaces of models characterized by stochastic kernels with explicit robustness and learning theoretic implications on optimal stochastic control under discounted or average cost criteria. (iii) We thus show that this topology possesses several properties making it ideal to study optimality and approximations (under the weak formulation) and robustness (under the strong formulation) for many applications.
comment: 37 pages
♻ ☆ LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models
Mixture of experts (MoE) architectures have become a cornerstone for scaling up and are a key component in most large language models such as GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE remains severely constrained by the prohibitive computational costs of training and evaluation, restricting large-scale studies accessible to most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research that supports both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analytical tools for probing routing and expert dynamics. Leveraging this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, demonstrating how subtle changes in router initialization shape early expert utilization; and (iii) training regime differences, revealing how sparse upcycling and full pretraining exhibit distinct routing patterns and stability profiles. By lowering the barrier to entry and standardizing evaluation, along with our comprehensive analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovations. Project page: https://fsoft-aic.github.io/fsoft-LibMoE.github.io.
comment: 15 pages, 9 figures
♻ ☆ A Robust and Non-Iterative Tensor Decomposition Method with Automatic Thresholding
Recent advances in IoT and biometric sensing technologies have led to the generation of massive and high-dimensional tensor data, yet achieving accurate and efficient low-rank approximation remains a major challenge. Most existing tensor decomposition methods require predefined ranks and iterative optimization, resulting in high computational costs and dependence on analyst expertise. This study proposes a novel tensor low-rank approximation method that eliminates both prior rank specification and iterative optimization. The method applies statistical singular value hard thresholding to each mode-wise unfolded matrix to automatically extract statistically significant components, effectively reducing noise while preserving the intrinsic structure. Theoretically, the optimal thresholds for each mode are derived from the asymptotic properties of the Marcenko-Pastur distribution. Simulation experiments demonstrate that the proposed method outperforms conventional approaches (HOSVD, HOOI, and Tucker-L2E) in both estimation accuracy and computational efficiency. These results indicate that the proposed approach provides a theoretically grounded, fully automatic, and non-iterative framework for tensor decomposition.
comment: 26 pages, 3 figures
♻ ☆ Accelerating Data Generation for Nonlinear temporal PDEs via homologous perturbation in solution space
Data-driven deep learning methods like neural operators have advanced in solving nonlinear temporal partial differential equations (PDEs). However, these methods require large quantities of solution pairs\u2014the solution functions and right-hand sides (RHS) of the equations. These pairs are typically generated via traditional numerical methods, which need thousands of time steps iterations far more than the dozens required for training, creating heavy computational and temporal overheads. To address these challenges, we propose a novel data generation algorithm, called HOmologous Perturbation in Solution Space (HOPSS), which directly generates training datasets with fewer time steps rather than following the traditional approach of generating large time steps datasets. This algorithm simultaneously accelerates dataset generation and preserves the approximate precision required for model training. Specifically, we first obtain a set of base solution functions from a reliable solver, usually with thousands of time steps, and then align them in time steps with training datasets by downsampling. Subsequently, we propose a "homologous perturbation" approach: by combining two solution functions (one as the primary function, the other as a homologous perturbation term scaled by a small scalar) with random noise, we efficiently generate comparable-precision PDE data points. Finally, using these data points, we compute the variation in the original equation's RHS to form new solution pairs. Theoretical and experimental results show HOPSS lowers time complexity. For example, on the Navier-Stokes equation, it generates 10,000 samples in approximately 10% of traditional methods' time, with comparable model training performance.
♻ ☆ Hybrid Dual-Batch and Cyclic Progressive Learning for Efficient Distributed Training
Distributed machine learning is critical for training deep learning models on large datasets with numerous parameters. Current research primarily focuses on leveraging additional hardware resources and powerful computing units to accelerate the training process. As a result, larger batch sizes are often employed to speed up training. However, training with large batch sizes can lead to lower accuracy due to poor generalization. To address this issue, we propose the dual-batch learning scheme, a distributed training method built on the parameter server framework. This approach maximizes training efficiency by utilizing the largest batch size that the hardware can support while incorporating a smaller batch size to enhance model generalization. By using two different batch sizes simultaneously, this method improves accuracy with minimal additional training time. Additionally, to mitigate the time overhead caused by dual-batch learning, we propose the cyclic progressive learning scheme. This technique repeatedly and gradually increases image resolution from low to high during training, thereby reducing training time. By combining cyclic progressive learning with dual-batch learning, our hybrid approach improves both model generalization and training efficiency. Experimental results with ResNet-18 demonstrate that, compared to conventional training methods, our approach improves accuracy by 3.3% while reducing training time by 10.1% on CIFAR-100, and further achieves a 34.8% reduction in training time on ImageNet.
♻ ☆ Querying functional and structural niches on spatial transcriptomics data
Cells in multicellular organisms coordinate to form functional and structural niches. With spatial transcriptomics enabling gene expression profiling in spatial contexts, it has been revealed that spatial niches serve as cohesive and recurrent units in physiological and pathological processes. These observations suggest universal tissue organization principles encoded by conserved niche patterns, and call for a query-based niche analytical paradigm beyond current computational tools. In this work, we defined the Niche Query Task, which is to identify similar niches across ST samples given a niche of interest (NOI). We further developed QueST, a specialized method for solving this task. QueST models each niche as a subgraph, uses contrastive learning to learn discriminative niche embeddings, and incorporates adversarial training to mitigate batch effects. In simulations and benchmark datasets, QueST outperformed existing methods repurposed for niche querying, accurately capturing niche structures in heterogeneous environments and demonstrating strong generalizability across diverse sequencing platforms. Applied to tertiary lymphoid structures in renal and lung cancers, QueST revealed functionally distinct niches associated with patient prognosis and uncovered conserved and divergent spatial architectures across cancer types. These results demonstrate that QueST enables systematic, quantitative profiling of spatial niches across samples, providing a powerful tool to dissect spatial tissue architecture in health and disease.
♻ ☆ A Framework for Objective-Driven Dynamical Stochastic Fields
Fields offer a versatile approach for describing complex systems composed of interacting and dynamic components. In particular, some of these dynamical and stochastic systems may exhibit goal-directed behaviors aimed at achieving specific objectives, which we refer to as $\textit{intelligent fields}$. However, due to their inherent complexity, it remains challenging to develop a formal theoretical description of such systems and to effectively translate these descriptions into practical applications. In this paper, we propose three fundamental principles to establish a theoretical framework for understanding intelligent fields: complete configuration, locality, and purposefulness. Moreover, we explore methodologies for designing such fields from the perspective of artificial intelligence applications. This initial investigation aims to lay the groundwork for future theoretical developments and practical advances in understanding and harnessing the potential of such objective-driven dynamical stochastic fields.
♻ ☆ MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
comment: Code available at: https://github.com/HKUST-MINSys-Lab/MMEdge. Accepted by SenSys 2026
♻ ☆ Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers
Recent advances in transformer architectures deeply enhanced long-context language modeling. Among them, HyperAttention achieves competitive efficiency by combining a single-level LSH-based clustering with uniform residual sampling. However, HyperAttention fails to find all significant keys, which in turn raises the overall perplexity. We propose a pre-scoring mechanism that prioritizes significant keys before applying HyperAttention. We introduce three scoring methods: $k$-means and kernel $k$-means clustering, $k$-median clustering, and leverage score-based ranking (inspired by LevAttention) to filter keys effectively. We further replace HyperAttention's original uniform residual sampling, relying exclusively on our pre-scoring mechanism. Experiments on ChatGLM2 (131k token context) reduce perplexity from 12 to 8.3, which outperforms standard HyperAttention. Moreover, when running on the Vision-Transformer (ViT), our method shows that it can guarantee similar accuracy compared with LevAttention, and will surpass LevAttention given specific parameters. Although this method introduces some computational overhead, its combination with HyperAttention achieves up to 20 times faster than FlashAttention, providing a balanced trade-off between speed and modeling accuracy. Our results highlight the effectiveness of integrating pre-scoring into hierarchical attention mechanisms, significantly improving transformer efficiency.
♻ ☆ Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
Reinforcement learning with verifiable rewards (RLVR) has recently enhanced the reasoning capabilities of large language models (LLMs), particularly for mathematical problem solving. However, a fundamental limitation remains: as the sampling budget increases, the advantage of RLVR-trained models over their pretrained bases often diminishes or even vanishes, revealing a strong dependence on the base model's restricted search space. We attribute this phenomenon to the widespread use of the reverse Kullback-Leibler (KL) divergence regularizer, whose mode-seeking behavior keeps the policy trapped inside the base model's support region and hampers wider exploration. To address this issue, we propose RAPO (Rewards-Aware Policy Optimization), an algorithm to promote broader yet focused exploration. Our method (i) utilizes the forward KL penalty to replace the reverse KL penalty for out-of-distribution exploration, and (ii) reweights the reference policy to facilitate adaptive in-distribution exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset, without supervised fine-tuning, and evaluate them on AIME2024 and AIME2025. Results show that RAPO consistently improves problem-solving performance. Notably, RAPO enables models to surpass the base model's performance ceiling and solves previously intractable problems, advancing the frontier of RLVR for challenging reasoning tasks.
♻ ☆ Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion
Masked diffusion models have shown promising performance in generating high-quality samples in a wide range of domains, but accelerating their sampling process remains relatively underexplored. To investigate efficient samplers for masked diffusion, this paper theoretically analyzes the MaskGIT sampler for image modeling, revealing its implicit temperature sampling mechanism. Through this analysis, we introduce the "moment sampler," an asymptotically equivalent but more tractable and interpretable alternative to MaskGIT, which employs a "choose-then-sample" approach by selecting unmasking positions before sampling tokens. In addition, we improve the efficiency of choose-then-sample algorithms through two key innovations: a partial caching technique for transformers that approximates longer sampling trajectories without proportional computational cost, and a hybrid approach formalizing the exploration-exploitation trade-off in adaptive unmasking. Experiments in image and text domains demonstrate our theory as well as the efficiency of our proposed methods, advancing both theoretical understanding and practical implementation of masked diffusion samplers.
comment: 23 pages, fixed cleveref-related issue
♻ ☆ Mixture-of-Experts Operator Transformer for Large-Scale PDE Pre-Training
Pre-training has proven effective in addressing data scarcity and performance limitations in solving PDE problems with neural operators. However, challenges remain due to the heterogeneity of PDE datasets in equation types, which leads to high errors in mixed training. Additionally, dense pre-training models that scale parameters by increasing network width or depth incur significant inference costs. To tackle these challenges, we propose a novel Mixture-of-Experts Pre-training Operator Transformer (MoE-POT), a sparse-activated architecture that scales parameters efficiently while controlling inference costs. Specifically, our model adopts a layer-wise router-gating network to dynamically select 4 routed experts from 16 expert networks during inference, enabling the model to focus on equation-specific features. Meanwhile, we also integrate 2 shared experts, aiming to capture common properties of PDE and reduce redundancy among routed experts. The final output is computed as the weighted average of the results from all activated experts. We pre-train models with parameters from 30M to 0.5B on 6 public PDE datasets. Our model with 90M activated parameters achieves up to a 40% reduction in zero-shot error compared with existing models with 120M activated parameters. Additionally, we conduct interpretability analysis, showing that dataset types can be inferred from router-gating network decisions, which validates the rationality and effectiveness of the MoE architecture.
♻ ☆ Deciphering Invariant Feature Decoupling in Source-free Time Series Forecasting with Proxy Denoising
The proliferation of mobile devices generates a massive volume of time series across various domains, where effective time series forecasting enables a variety of real-world applications. This study focuses on a new problem of source-free domain adaptation for time series forecasting. It aims to adapt a pretrained model from sufficient source time series to the sparse target time series domain without access to the source data, embracing data protection regulations. To achieve this, we propose TimePD, the first source-free time series forecasting framework with proxy denoising, where large language models (LLMs) are employed to benefit from their generalization capabilities. Specifically, TimePD consists of three key components: (1) dual-branch invariant disentangled feature learning that enforces representation- and gradient-wise invariance by means of season-trend decomposition; (2) lightweight, parameter-free proxy denoising that dynamically calibrates systematic biases of LLMs; and (3) knowledge distillation that bidirectionally aligns the denoised prediction and the original target prediction. Extensive experiments on real-world datasets offer insight into the effectiveness of the proposed TimePD, outperforming SOTA baselines by 9.3% on average.
♻ ☆ AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy NeurIPS 2025
Large Language Models (LLMs) are being explored for applications in scientific research, including their capabilities to synthesize literature, answer research questions, generate research ideas, and even conduct computational experiments. Ultimately, our goal is for these to help scientists derive novel scientific insights. In many areas of science, such insights often arise from processing and visualizing data to understand its patterns. However, evaluating whether an LLM-mediated scientific workflow produces outputs conveying the correct scientific insights is challenging to evaluate and has not been addressed in past work. We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. AstroVisBench judges a language model's ability to both (1) create astronomy-specific workflows to process and analyze data and (2) visualize the results of these workflows through complex plots. Our evaluation of visualizations uses a novel LLM-as-a-judge workflow, which is validated against annotation by five professional astronomers. Using AstroVisBench we present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants. This evaluation provides a strong end-to-end evaluation for AI scientists that offers a path forward for the development of visualization-based workflows, which are central to a broad range of domains from physics to biology.
comment: Accepted at NeurIPS 2025 Datasets & Benchmarks Track
♻ ☆ DrivAerStar: An Industrial-Grade CFD Dataset for Vehicle Aerodynamic Optimization
Vehicle aerodynamics optimization has become critical for automotive electrification, where drag reduction directly determines electric vehicle range and energy efficiency. Traditional approaches face an intractable trade-off: computationally expensive Computational Fluid Dynamics (CFD) simulations requiring weeks per design iteration, or simplified models that sacrifice production-grade accuracy. While machine learning offers transformative potential, existing datasets exhibit fundamental limitations -- inadequate mesh resolution, missing vehicle components, and validation errors exceeding 5% -- preventing deployment in industrial workflows. We present DrivAerStar, comprising 12,000 industrial-grade automotive CFD simulations generated using STAR-CCM+${}^\unicode{xAE}$ software. The dataset systematically explores three vehicle configurations through 20 Computer Aided Design (CAD) parameters via Free Form Deformation (FFD) algorithms, including complete engine compartments and cooling systems with realistic internal airflow. DrivAerStar achieves wind tunnel validation accuracy below 1.04% -- a five-fold improvement over existing datasets -- through refined mesh strategies with strict wall $y^+$ control. Benchmarks demonstrate that models trained on this data achieve production-ready accuracy while reducing computational costs from weeks to minutes. This represents the first dataset bridging academic machine learning research and industrial CFD practice, establishing a new standard for data-driven aerodynamic optimization in automotive development. Beyond automotive applications, DrivAerStar demonstrates a paradigm for integrating high-fidelity physics simulations with Artificial Intelligence (AI) across engineering disciplines where computational constraints currently limit innovation.
♻ ☆ Don't throw the baby out with the bathwater: How and why deep learning for ARC
The Abstraction and Reasoning Corpus (ARC-AGI) presents a formidable challenge for AI systems. Despite the typically low performance on ARC, the deep learning paradigm remains the most effective known strategy for generating skillful (state-of-the-art) neural networks (NN) across varied modalities and tasks in vision, language etc. The deep learning paradigm has proven to be able to train these skillful neural networks and learn the abstractions needed in these diverse domains. Our work doubles down on that and continues to leverage this paradigm by incorporating on-the-fly NN training at test time. We demonstrate that fully committing to deep learning's capacity to acquire novel abstractions yields state-of-the-art performance on ARC. Specifically, we treat both the neural network and the optimizer (rather than just a pre-trained network) as integral components of the inference process, fostering generalization to unseen tasks. Concretely, we propose a methodology for training on ARC, starting from pretrained LLMs, and enhancing their ARC reasoning. We also propose Test-Time Fine-Tuning (TTFT) and the Augment Inference Reverse-Augmentation and Vote (AIRV) as effective test-time techniques. We are the first to propose and show deep learning can be used effectively for ARC, showing boosts of up to 260% in accuracy with AIRV and a further 300% boost with TTFT. An early version of this approach secured first place in the 2023 ARCathon competition, while the final version achieved the current best score on the ARC private test-set (58%). Our findings highlight the key ingredients of a robust reasoning system in unfamiliar domains, underscoring the central mechanisms that improve broad perceptual reasoning.
comment: 13 pages, 6 figures
♻ ☆ Training a Generally Curious Agent ICML 2025
Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present Paprika, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, Paprika teaches models to explore and adapt their behavior on a new task based on environment feedback in-context without more gradient updates. Experimental results show that models fine-tuned with Paprika can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. Unlike traditional training, our approach's primary bottleneck lies in sampling useful interaction data instead of model updates. To improve sample efficiency, we propose a curriculum learning strategy that prioritizes sampling trajectories from tasks with high learning potential. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interactions with the external world.
comment: ICML 2025. Project Website: https://paprika-llm.github.io
♻ ☆ Scaling Diffusion Transformers Efficiently via $μ$P NeurIPS 2025
Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including U-ViT, DiT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-$\alpha$ and 3% of consumption by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.
comment: Accepted by NeurIPS 2025, 38 pages, 10 figures, 17 tables
Human-Computer Interaction 33
☆ Personalized AI Scaffolds Synergistic Multi-Turn Collaboration in Creative Work
As AI becomes more deeply embedded in knowledge work, building assistants that support human creativity and expertise becomes more important. Yet achieving synergy in human-AI collaboration is not easy. Providing AI with detailed information about a user's demographics, psychological attributes, divergent thinking, and domain expertise may improve performance by scaffolding more effective multi-turn interactions. We implemented a personalized LLM-based assistant, informed by users' psychometric profiles and an AI-guided interview about their work style, to help users complete a marketing task for a fictional startup. We randomized 331 participants to work with AI that was either generic (n = 116), partially personalized (n = 114), or fully personalized (n=101). Participants working with personalized AI produce marketing campaigns of significantly higher quality and creativity, beyond what AI alone could have produced. Compared to generic AI, personalized AI leads to higher self-reported levels of assistance and feedback, while also increasing participant trust and confidence. Causal mediation analysis shows that personalization improves performance indirectly by enhancing collective memory, attention, and reasoning in the human-AI interaction. These findings provide a theory-driven framework in which personalization functions as external scaffolding that builds common ground and shared partner models, reducing uncertainty and enhancing joint cognition. This informs the design of future AI assistants that maximize synergy and support human creative potential while limiting negative homogenization.
☆ Delegate Pricing Decisions to an Algorithm? Experimental Evidence
We analyze the delegation of pricing by participants, representing firms, to a collusive, self-learning algorithm in a repeated Bertrand experiment. In the baseline treatment, participants set prices themselves. In the other treatments, participants can either delegate pricing to the algorithm at the beginning of each supergame or receive algorithmic recommendations that they can override. Participants delegate more when they can override the algorithm's decisions. In both algorithmic treatments, prices are lower than in the baseline. Our results indicate that while self-learning pricing algorithms can be collusive, they can foster competition rather than collusion with humans-in-the-loop.
☆ Beyond Visualization: Building Decision Intelligence Through Iterative Dashboard Refinement
Effective business intelligence (BI) dashboards evolve through iterative refinement rather than single-pass design. Addressing the lack of structured improvement frameworks in BI practice, this study documents the four-stage evolution of a Power BI dashboard analyzing profitability decline in a fictional retail firm, Global Superstore. Using a dataset of \$12.64 million in sales across seven markets and three product categories, the project demonstrates how feedback-driven iteration and gap analysis convert exploratory visuals into decision-support tools. Guided by four executive questions on profitability, market prioritization, discount effects, and shipping costs, each iteration resolved analytical or interpretive shortcomings identified through collaborative review. Key findings include margin erosion in furniture (6.94% vs. 13.99% for technology), a 20% discount threshold beyond which profitability declined, and \$1.35 million in unrecovered shipping costs. Contributions include: (a) a replicable feedback-driven methodology grounded in iterative gap analysis; (b) DAX-based technical enhancements improving interpretive clarity; (c) an inductively derived six-element narrative framework; and (d) evidence that narrative coherence emerges organically through structured refinement. The methodology suggests transferable value for both BI practitioners and educators, pending validation across diverse organizational contexts.
☆ CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments
As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of real-world coding tasks and developer expectations. To this end, we introduce a multi-language benchmark that evaluates LLM instruction-following capabilities and is extensible to operate on any set of standalone coding problems. Our benchmark evaluates instruction following in two key settings: adherence to pre-defined constraints specified with the initial problem, and the ability to perform refinements based on follow-up instructions. For this paper's analysis, we empirically evaluated our benchmarking pipeline with programming tasks from LiveBench, that are also automatically translated from Python into Java and JavaScript. Our automated benchmark reveals that models exhibit differing levels of performance across multiple dimensions of instruction-following. Our benchmarking pipeline provides a more comprehensive evaluation of code generation models, highlighting their strengths and limitations across languages and generation goals.
☆ Beyond Demographics: Behavioural Segmentation and Spatial Analytics to Enhance Visitor Experience at The British Museum
This study explores visitor behaviour at The British Museum using data science methods applied to novel sources, including audio guide usage logs and TripAdvisor reviews. Analysing 42,000 visitor journeys and over 50,000 reviews, we identify key drivers of satisfaction, segment visitors by behavioural patterns, examine tour engagement, model spatial navigation, and investigate room popularity. Behavioural clustering uncovered four distinct visitor types: Committed Trekkers, Leisurely Explorers, Targeted Visitors, and Speedy Samplers, each characterised by different levels of engagement and movement. Tour usage analysis revealed high drop-off rates and variation in completion rates across different language groups. Spatial flow modelling revealed that accessibility and proximity, particularly aversion to stairs, shaped visitor paths more than thematic organisation. Room popularity was more strongly predicted by physical accessibility than curatorial content. We propose practical strategies for improving engagement and flow, offering a scalable framework for visitor-centred, data-informed museum planning.
comment: 20 pages, 5 figures, under review
☆ Independent Clinical Evaluation of General-Purpose LLM Responses to Signals of Suicide Risk
We introduce findings and methods to facilitate evidence-based discussion about how large language models (LLMs) should behave in response to user signals of risk of suicidal thoughts and behaviors (STB). People are already using LLMs as mental health resources, and several recent incidents implicate LLMs in mental health crises. Despite growing attention, few studies have been able to effectively generalize clinical guidelines to LLM use cases, and fewer still have proposed methodologies that can be iteratively applied as knowledge improves about the elements of human-AI interaction most in need of study. We introduce an assessment of LLM alignment with guidelines for ethical communication, adapted from clinical principles and applied to expressions of risk factors for STB in multi-turn conversations. Using a codebook created and validated by clinicians, mobilizing the volunteer participation of practicing therapists and trainees (N=43) based in the U.S., and using generalized linear mixed-effects models for statistical analysis, we assess a single fully open-source LLM, OLMo-2-32b. We show how to assess when a model deviates from clinically informed guidelines in a way that may pose a hazard and (thanks to its open nature) facilitates future investigation as to why. We find that contrary to clinical best practice, OLMo-2-32b, and, possibly by extension, other LLMs, will become less likely to invite continued dialog as users send more signals of STB risk in multi-turn settings. We also show that OLMo-2-32b responds differently depending on the risk factor expressed. This empirical evidence highlights that just as chatbots pose hazards if their responses reinforce delusions or assist in suicidal acts, they may also discourage further help-seeking or cause feelings of dismissal or abandonment by withdrawing from conversations when STB risk is expressed.
☆ Preliminary Prototyping of Avoidance Behaviors Triggered by a User's Physical Approach to a Robot
Human-robot interaction frequently involves physical proximity or contact. In human-human settings, people flexibly accept, reject, or tolerate such approaches depending on the relationship and context. We explore the design of a robot's rejective internal state and corresponding avoidance behaviors, such as withdrawing or pushing away, when a person approaches. We model the accumulation and decay of discomfort as a function of interpersonal distance, and implement tolerance (endurance) and limit-exceeding avoidance driven by the Dominance axis of the PAD affect model. The behaviors and their intensities are realized on an arm robot. Results illustrate a coherent pipeline from internal state parameters to graded endurance motions and, once a limit is crossed, to avoidance actions.
comment: Workshop on Socially Aware and Cooperative Intelligent Systems in HAI 2025
☆ "Koyi Sawaal Nahi Hai": Reimagining Maternal Health Chatbots for Collective, Culturally Grounded Care
In recent years, LLM-based maternal health chatbots have been widely deployed in low-resource settings, but they often ignore real-world contexts where women may not own phones, have limited literacy, and share decision-making within families. Through the deployment of a WhatsApp-based maternal health chatbot with 48 pregnant women in Lahore, Pakistan, we examine barriers to use in populations where phones are shared, decision-making is collective, and literacy varies. We complement this with focus group discussions with obstetric clinicians. Our findings reveal how adoption is shaped by proxy consent and family mediation, intermittent phone access, silence around asking questions, infrastructural breakdowns, and contested authority. We frame barriers to non-use as culturally conditioned rather than individual choices, and introduce the Relational Chatbot Design Grammar (RCDG): four commitments that enable mediated decision-making, recognize silence as engagement, support episodic use, and treat fragility as baseline to reorient maternal health chatbots toward culturally grounded, collective care.
☆ Sustaining Cyber Awareness: The Long-Term Impact of Continuous Phishing Training and Emotional Triggers
Phishing constitutes more than 90\% of successful cyberattacks globally, remaining one of the most persistent threats to organizational security. Despite organizations tripling their cybersecurity budgets between 2015 and 2025, the human factor continues to pose a critical vulnerability. This study presents a 12-month longitudinal investigation examining how continuous cybersecurity training and emotional cues affect employee susceptibility to phishing. The experiment involved 20 organizations and over 1,300 employees who collectively received more than 13,000 simulated phishing emails engineered with diverse emotional, contextual, and structural characteristics. Behavioral responses were analyzed using non-parametric correlation and regression models to assess the influence of psychological manipulation, message personalization, and perceived email source. Results demonstrate that sustained phishing simulations and targeted training programs lead to a significant reduction in employee susceptibility, halving successful compromise rates within six months. Additionally, employee turnover introduces measurable fluctuations in awareness levels, underscoring the necessity of maintaining continuous training initiatives. These findings provide one of the few long-term perspectives on phishing awareness efficacy, highlighting the strategic importance of ongoing behavioral interventions in strengthening organizational cyber resilience. In order to support open science, we published our email templates, source code, and other materials at https://github.com/CorporatePhishingStudy
comment: 9 pages, 4 figures, IEEE BigData 2025 Conference, repository: https://github.com/CorporatePhishingStudy
☆ Inferring trust in recommendation systems from brain, behavioural, and physiological data
As people nowadays increasingly rely on artificial intelligence (AI) to curate information and make decisions, assigning the appropriate amount of trust in automated intelligent systems has become ever more important. However, current measurements of trust in automation still largely rely on self-reports that are subjective and disruptive to the user. Here, we take music recommendation as a model to investigate the neural and cognitive processes underlying trust in automation. We observed that system accuracy was directly related to users' trust and modulated the influence of recommendation cues on music preference. Modelling users' reward encoding process with a reinforcement learning model further revealed that system accuracy, expected reward, and prediction error were related to oscillatory neural activity recorded via EEG and changes in pupil diameter. Our results provide a neurally grounded account of calibrating trust in automation and highlight the promises of a multimodal approach towards developing trustable AI systems.
☆ ECVL-ROUTER: Scenario-Aware Routing for Vision-Language Models
Vision-Language Models (VLMs) excel in diverse multimodal tasks. However, user requirements vary across scenarios, which can be categorized into fast response, high-quality output, and low energy consumption. Relying solely on large models deployed in the cloud for all queries often leads to high latency and energy cost, while small models deployed on edge devices are capable of handling simpler tasks with low latency and energy cost. To fully leverage the strengths of both large and small models, we propose ECVL-ROUTER, the first scenario-aware routing framework for VLMs. Our approach introduces a new routing strategy and evaluation metrics that dynamically select the appropriate model for each query based on user requirements, maximizing overall utility. We also construct a multimodal response-quality dataset tailored for router training and validate the approach through extensive experiments. Results show that our approach successfully routes over 80\% of queries to the small model while incurring less than 10\% drop in problem solving probability.
comment: 23 pages, 13 figures, 7 tables
☆ Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication
Brain-to-speech (BTS) systems represent a groundbreaking approach to human communication by enabling the direct transformation of neural activity into linguistic expressions. While recent non-invasive BTS studies have largely focused on decoding predefined words or sentences, achieving open-vocabulary neural communication comparable to natural human interaction requires decoding unconstrained speech. Additionally, effectively integrating diverse signals derived from speech is crucial for developing personalized and adaptive neural communication and rehabilitation solutions for patients. This study investigates the potential of speech synthesis for previously unseen sentences across various speech modes by leveraging phoneme-level information extracted from high-density electroencephalography (EEG) signals, both independently and in conjunction with electromyography (EMG) signals. Furthermore, we examine the properties affecting phoneme decoding accuracy during sentence reconstruction and offer neurophysiological insights to further enhance EEG decoding for more effective neural communication solutions. Our findings underscore the feasibility of biosignal-based sentence-level speech synthesis for reconstructing unseen sentences, highlighting a significant step toward developing open-vocabulary neural communication systems adapted to diverse patient needs and conditions. Additionally, this study provides meaningful insights into the development of communication and rehabilitation solutions utilizing EEG-based decoding technologies.
comment: Accepted for publication in IEEE Transactions on Neural Systems and Rehabilitation Engineering
☆ MARIA: A Framework for Marginal Risk Assessment without Ground Truth in AI Systems
Before deploying an AI system to replace an existing process, it must be compared with the incumbent to ensure improvement without added risk. Traditional evaluation relies on ground truth for both systems, but this is often unavailable due to delayed or unknowable outcomes, high costs, or incomplete data, especially for long-standing systems deemed safe by convention. The more practical solution is not to compute absolute risk but the difference between systems. We therefore propose a marginal risk assessment framework, that avoids dependence on ground truth or absolute risk. It emphasizes three kinds of relative evaluation methodology, including predictability, capability and interaction dominance. By shifting focus from absolute to relative evaluation, our approach equips software teams with actionable guidance: identifying where AI enhances outcomes, where it introduces new risks, and how to adopt such systems responsibly.
comment: 9 pages, 1 figure
☆ AURA: A Reinforcement Learning Framework for AI-Driven Adaptive Conversational Surveys
Conventional online surveys provide limited personalization, often resulting in low engagement and superficial responses. Although AI survey chatbots improve convenience, most are still reactive: they rely on fixed dialogue trees or static prompt templates and therefore cannot adapt within a session to fit individual users, which leads to generic follow-ups and weak response quality. We address these limitations with AURA (Adaptive Understanding through Reinforcement Learning for Assessment), a reinforcement learning framework for AI-driven adaptive conversational surveys. AURA quantifies response quality using a four-dimensional LSDE metric (Length, Self-disclosure, Emotion, and Specificity) and selects follow-up question types via an epsilon-greedy policy that updates the expected quality gain within each session. Initialized with priors extracted from 96 prior campus-climate conversations (467 total chatbot-user exchanges), the system balances exploration and exploitation across 10-15 dialogue exchanges, dynamically adapting to individual participants in real time. In controlled evaluations, AURA achieved a +0.12 mean gain in response quality and a statistically significant improvement over non-adaptive baselines (p=0.044, d=0.66), driven by a 63% reduction in specification prompts and a 10x increase in validation behavior. These results demonstrate that reinforcement learning can give survey chatbots improved adaptivity, transforming static questionnaires into interactive, self-improving assessment systems.
☆ Functional connectivity guided deep neural network for decoding high-level visual imagery
This study introduces a pioneering approach in brain-computer interface (BCI) technology, featuring our novel concept of high-level visual imagery for non-invasive electroencephalography (EEG)-based communication. High-level visual imagery, as proposed in our work, involves the user engaging in the mental visualization of complex upper limb movements. This innovative approach significantly enhances the BCI system, facilitating the extension of its applications to more sophisticated tasks such as EEG-based robotic arm control. By leveraging this advanced form of visual imagery, our study opens new horizons for intricate and intuitive mind-controlled interfaces. We developed an advanced deep learning architecture that integrates functional connectivity metrics with a convolutional neural network-image transformer. This framework is adept at decoding subtle user intentions, addressing the spatial variability in high-level visual tasks, and effectively translating these into precise commands for robotic arm control. Our comprehensive offline and pseudo-online evaluations demonstrate the framework's efficacy in real-time applications, including the nuanced control of robotic arms. The robustness of our approach is further validated through leave-one-subject-out cross-validation, marking a significant step towards versatile, subject-independent BCI applications. This research highlights the transformative impact of advanced visual imagery and deep learning in enhancing the usability and adaptability of BCI systems, particularly in robotic arm manipulation.
comment: 34 pages, 8 figures, 6 tables
☆ Adaptive Human-Computer Interaction Strategies Through Reinforcement Learning in Complex
This study addresses the challenges of dynamics and complexity in intelligent human-computer interaction and proposes a reinforcement learning-based optimization framework to improve long-term returns and overall experience. Human-computer interaction is modeled as a Markov decision process, with state space, action space, reward function, and discount factor defined to capture the dynamics of user input, system feedback, and interaction environment. The method combines policy function, value function, and advantage function, updates parameters through policy gradient, and continuously adjusts during interaction to balance immediate feedback and long-term benefits. To validate the framework, multimodal dialog and scene-aware datasets are used as the experimental platform, with multiple sensitivity experiments conducted on key factors such as discount factor, exploration rate decay, environmental noise, and data imbalance. Evaluation is carried out using cumulative reward, average episode reward, convergence speed, and task success rate. Results show that the proposed method outperforms existing approaches across several metrics, achieving higher task completion while maintaining strategy stability. Comparative experiments further confirm its advantages in interaction efficiency and long-term return, demonstrating the significant value of reinforcement learning in optimizing human-computer interaction.
☆ Investigating Search Among Physical and Virtual Objects Under Different Lighting Conditions
By situating computer-generated content in the physical world, mobile augmented reality (AR) can support many tasks that involve effective search and inspection of physical environments. Currently, there is limited information regarding the viability of using AR in realistic wide-area outdoor environments and how AR experiences affect human behavior in these environments. Here, we conducted a wide-area outdoor AR user study (n = 48) using a commercially available AR headset (Microsoft Hololens 2) to compare (1) user interactions with physical and virtual objects in the environment (2) the effects of different lighting conditions on user behavior and AR experience and (3) the impact of varying cognitive load on AR task performance. Participants engaged in a treasure hunt task where they searched for and classified virtual target items (green ``gems") in an augmented outdoor courtyard scene populated with physical and virtual objects. Cognitive load was manipulated so that in half the search trials users were required to monitor an audio stream and respond to specific target sounds. Walking paths, head orientation and eye gaze information were measured, and users were queried about their memory of encountered objects and provided feedback on the experience. Key findings included (1) Participants self-reported significantly lower comfort in the ambient natural light condition, with virtual objects more visible and participants more likely to walk into physical objects at night; (2) recall for physical objects was worse than for virtual objects, (3) participants discovered more gems hidden behind virtual objects than physical objects, implying higher attention on virtual objects and (4) dual-tasking modified search behavior. These results suggest there are important technical, perceptual and cognitive factors that must be considered.
comment: Journal Article, 11 pages. Published in IEEE Transactions on Visualization and Computer Graphics (TVCG) in 2022
☆ Understanding, Demystifying and Challenging Perceptions of Gig Worker Vulnerabilities
Gig workers face several vulnerabilities, which are rarely discussed among peers due to the absence of infrastructure for mutual support. To understand how individual gig workers perceive such vulnerabilities and why they continue to pursue such labor, we conducted a scalable two-phase study to probe their rationales. In Phase I, participants (N = 236) rated their agreement with five commonly misconstrued vulnerabilities. In Phase II, we challenged participants who held one or more myth(s) (N = 204) to defend their views, after which we presented an expert- or LLM-generated counterargument to their rationale. Our findings show how workers are underexposed to the personal and shared vulnerabilities of gig work, revealing a knowledge gap where persuasive interventions may help workers recognize such hidden conditions. We discuss the implications of our results to support collective bargaining of workers' rights and reflect on the effectiveness of different persuasion strategies.
☆ Spot The Ball: A Benchmark for Visual Social Inference
Humans excel at visual social inference, the ability to infer hidden elements of a scene from subtle behavioral cues such as other people's gaze, pose, and orientation. This ability drives everyday social reasoning in humans and is critical for developing more human-like AI agents. We introduce Spot The Ball, a challenging benchmark for evaluating visual social inference in vision-language models (VLMs) using sports as a test domain. The task is to localize a removed sports ball from soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines and a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) using three prompting strategies, finding that humans are consistently two to three times more accurate (20-34%) than models ($\leq$ 17%) across all sports. Our analyses show that models rely on superficial spatial heuristics--such as guessing near the image center or nearby players--while humans leverage social cues like gaze direction and body pose. These findings reveal a persistent human-model gap in visual social reasoning and underscore the need for architectures that explicitly encode structured behavioral cues to achieve robust, human-like inference.
☆ Tailored robotic training improves hand function and proprioceptive processing in stroke survivors with proprioceptive deficits: A randomized controlled trial
Precision rehabilitation aims to tailor movement training to improve outcomes. We tested whether proprioceptively-tailored robotic training improves hand function and neural processing in stroke survivors. Using a robotic finger exoskeleton, we tested two proprioceptively-tailored approaches: Propriopixel Training, which uses robot-facilitated, gamified movements to enhance proprioceptive processing, and Virtual Assistance Training, which reduces robotic aid to increase reliance on self-generated feedback. In a randomized controlled trial, forty-six chronic stroke survivors completed nine 2-hour sessions of Standard, Propriopixel or Virtual training. Among participants with proprioceptive deficits, Propriopixel ((Box and Block Test: 7 +/- 4.2, p=0.002) and Virtual Assistance (4.5 +/- 4.4 , p=0.068) yielded greater gains in hand function (Standard: 0.8 +/- 2.3 blocks). Proprioceptive gains correlated with improvements in hand function. Tailored training enhanced neural sensitivity to proprioceptive cues, evidenced by a novel EEG biomarker, the proprioceptive Contingent Negative Variation. These findings support proprioceptively-tailored training as a pathway to precision neurorehabilitation.
comment: Main manuscript: 38 pages (double spaced, with references), 6 figures, 2 tables and collated supplemental materials (17 pages, double spaced)
☆ Neural Transparency: Mechanistic Interpretability Interfaces for Anticipating Model Behaviors for Personalized AI
Millions of users now design personalized LLM-based chatbots that shape their daily interactions, yet they can only loosely anticipate how their design choices will manifest as behaviors in deployment. This opacity is consequential: seemingly innocuous prompts can trigger excessive sycophancy, toxicity, or inconsistency, degrading utility and raising safety concerns. To address this issue, we introduce an interface that enables neural transparency by exposing language model internals during chatbot design. Our approach extracts behavioral trait vectors (empathy, toxicity, sycophancy, etc.) by computing differences in neural activations between contrastive system prompts that elicit opposing behaviors. We predict chatbot behaviors by projecting the system prompt's final token activations onto these trait vectors, normalizing for cross-trait comparability, and visualizing results via an interactive sunburst diagram. To evaluate this approach, we conducted an online user study using Prolific to compare our neural transparency interface against a baseline chatbot interface without any form of transparency. Our analyses suggest that users systematically miscalibrated AI behavior: participants misjudged trait activations for eleven of fifteen analyzable traits, motivating the need for transparency tools in everyday human-AI interaction. While our interface did not change design iteration patterns, it significantly increased user trust and was enthusiastically received. Qualitative analysis indicated that users' had nuanced experiences with the visualization that may enrich future work designing neurally transparent interfaces. This work offers a path for how mechanistic interpretability can be operationalized for non-technical users, establishing a foundation for safer, more aligned human-AI interactions.
comment: SK and AB are co-first authors
☆ Supporting Patients in Managing Electronic Health Records and Biospecimens Consent for Research: Insights from a Mixed-Methods Usability Evaluation of the iAGREE Portal
De-identified health data are frequently used in research. As AI advances heighten the risk of re-identification, it is important to respond to concerns about transparency, data privacy, and patient preferences. However, few practical and user-friendly solutions exist. We developed iAGREE, a patient-centered electronic consent management portal that allows patients to set granular preferences for sharing electronic health records and biospecimens with researchers. To refine the iAGREE portal, we conducted a mixed-methods usability evaluation with 40 participants from three U.S. health systems. Our results show that the portal received highly positive usability feedback. Moreover, participants identified areas for improvement, suggested actionable enhancements, and proposed additional features to better support informed granular consent while reducing patient burden. Insights from this study may inform further improvements to iAGREE and provide practical guidance for designing patient-centered consent management tools.
comment: Paper accepted for the proceedings of the 2025 American Medical Informatics Association Annual Symposium (AMIA 2025)
☆ Is Crowdsourcing a Puppet Show? Detecting a New Type of Fraud in Online Platforms
Crowdsourcing platforms such as Amazon Mechanical Turk (MTurk) are important tools for researchers seeking to conduct studies with a broad, global participant base. Despite their popularity and demonstrated utility, we present evidence that suggests the integrity of data collected through Amazon MTurk is being threatened by the presence of puppeteers, apparently human workers controlling multiple puppet accounts that are capable of bypassing standard attention checks. If left undetected, puppeteers and their puppets can undermine the integrity of data collected on these platforms. This paper investigates data from two Amazon MTurk studies, finding that a substantial proportion of accounts (33% to 56.4%) are likely puppets. Our findings highlight the importance of adopting multifaceted strategies to ensure data integrity on crowdsourcing platforms. With the goal of detecting this type of fraud, we discuss a set of potential countermeasures for both puppets and bots with varying degrees of sophistication (e.g., employing AI). The problem of single entities (or puppeteers) manually controlling multiple accounts could exist on other crowdsourcing platforms; as such, their detection may be of broader application. While our findings suggest the need to re-evaluate the quality of crowdsourced data, many previous studies likely remain valid, particularly those with robust experimental designs. However, the presence of puppets may have contributed to false null results in some studies, suggesting that unpublished work may be worth revisiting with effective puppet detection strategies.
comment: 20 pages, 2 figures
☆ Between Myths and Metaphors: Rethinking LLMs for SRH in Conservative Contexts
Low-resource countries represent over 90% of maternal deaths, with Pakistan among the top four countries contributing nearly half in 2023. Since these deaths are mostly preventable, large language models (LLMs) can help address this crisis by automating health communication and risk assessment. However, sexual and reproductive health (SRH) communication in conservative contexts often relies on indirect language that obscures meaning, complicating LLM-based interventions. We conduct a two-stage study in Pakistan: (1) analyzing data from clinical observations, interviews, and focus groups with clinicians and patients, and (2) evaluating the interpretive capabilities of five popular LLMs on this data. Our analysis identifies two axes of communication (referential domain and expression approach) and shows LLMs struggle with semantic drift, myths, and polysemy in clinical interactions. We contribute: (1) empirical themes in SRH communication, (2) a categorization framework for indirect communication, (3) evaluation of LLM performance, and (4) design recommendations for culturally-situated SRH communication.
♻ ☆ Data Therapist: Eliciting Domain Knowledge from Subject Matter Experts Using Large Language Models
Effective data visualization requires not only technical proficiency but also a deep understanding of the domain-specific context in which data exists. This context often includes tacit knowledge about data provenance, quality, and intended use, which is rarely explicit in the dataset itself. Motivated by growing demands to surface tacit knowledge, we present the Data Therapist, a web-based system that helps domain experts externalize such implicit knowledge through a mixed-initiative process combining iterative Q&A with interactive annotation. Powered by a large language model, the system automatically analyzes user-supplied datasets, prompts users with targeted questions, and supports annotation at varying levels of granularity. The resulting structured knowledge base can inform both human and automated visualization design. A qualitative study with expert pairs from Accounting, Political Science, and Computer Security revealed recurring patterns in how expert reason about their data and highlighted opportunities for AI support to enhance visualization design.
♻ ☆ Emergent Cognitive Convergence via Implementation: A Structured Loop Reflecting Four Theories of Mind
We report a structural convergence among four influential theories of mind: Kahneman's dual-system theory, Friston's predictive processing, Minsky's society of mind, and Clark's extended mind, emerging unintentionally within a practical AI architecture known as Agentic Flow. Designed to address the limitations of large language models (LLMs), Agentic Flow comprises five interlocking modules: Retrieval, Cognition, Control, Memory, and Action, organized into a repeatable cognitive loop. Although originally inspired only by Minsky and Clark, subsequent analysis revealed that its structure echoes computational motifs from all four theories, suggesting that theoretical convergence can emerge naturally from implementation demands rather than deliberate synthesis. Controlled evaluations confirmed this: the structured agent achieved 95.8% task success versus 62.3% for baseline LLMs, demonstrating robust constraint adherence and reproducible reasoning. We describe this convergence under a broader descriptive meta-architecture called PEACE, highlighting recurring design patterns such as predictive modeling, associative recall, and error-sensitive control. Later formalized as the Structured Cognitive Loop (SCL), this framework generalizes the same principles as a foundation for behavioral intelligence in LLM-based agents. Rather than claiming theoretical unification, this paper proposes that intelligent architectures may evolve toward shared structural patterns shaped by practical constraints. As a position paper, it aims to frame this convergence as an interpretive reflection rather than a finalized theory, inviting further theoretical and experimental dialogue. Agentic Flow, or equivalently the Structured Cognitive Loop, thus offers a glimpse of how a unified cognitive form can arise not from abstraction, but from the necessities of real-world reasoning.
comment: This version relocates the "Position Paper" designation from the title to the abstract and adds a citation to the related follow-up study Structured Cognition for Behavioral Intelligence in Large Language Model Agents (Kim, 2025), also available on arXiv
♻ ☆ Beyond Pixels: Exploring DOM Downsampling for LLM-Based Web Agents
Frontier LLMs only recently enabled serviceable, autonomous web agents. At that, a model poses as an instantaneous domain model backend. Ought to suggest interaction, it is consulted with a web-based task and respective application state. The key problem lies in application state serialisation - referred to as snapshot. State-of-the-art web agents are premised on grounded GUI snapshots, i.e., screenshots enhanced with visual cues. Not least to resemble human perception, but for images representing relatively cheap means of model input. LLM vision still lag behind code interpretation capabilities. DOM snapshots, which structurally resemble HTML, impose a desired alternative. Vast model input token size, however, disables reliable implementation with web agents to date. We propose D2Snap, a first-of-its-kind DOM downsampling algorithm. Based on a GPT-4o backend, we evaluate D2Snap on tasks sampled from the Online-Mind2Web dataset. The success rate of D2Snap-downsampled DOM snapshots (67%) matches a grounded GUI snapshot baseline (65%) - within the same input token order of magnitude (1e3). Our best evaluated configurations - one token order above, but within the model's context window - outperform this baseline by 8%. Our evaluation, moreover, yields that DOM-inherent hierarchy embodies a strong UI feature for LLMs.
comment: 20 pages, LaTeX; repository URL updated, typos corrected
♻ ☆ A Study on Human-Swarm Interaction: A Framework for Assessing Situation Awareness and Task Performance
This paper introduces a framework for human swarm interaction studies that measures situation awareness in dynamic environments. A tablet-based interface was developed for a user study by implementing the concepts introduced in the framework, where operators guided a robotic swarm in a single-target search task, marking hazardous cells unknown to the swarm. Both subjective and objective situation awareness measures were used, with task performance evaluated based on how close the robots were to the target. The framework enabled a structured investigation of the role of situation awareness in human swarm interaction, leading to key findings such as improved task performance across attempts, showing the interface was learnable, centroid active robot position proved to be a useful task performance metric for assessing situation awareness, perception and projection played a key role in task performance, highlighting their importance in interface design and objective situation awareness influenced both subjective situation awareness and task performance, emphasizing the need for interfaces that emphasise objective situation awareness. These findings validate our framework as a structured approach for integrating situation awareness concepts into human swarm interaction studies, offering a systematic way to assess situation awareness and task performance. The framework can be applied to other swarming studies to evaluate interface learnability, identify meaningful task performance metrics, and refine interface designs to enhance situation awareness, ultimately improving human swarm interaction in dynamic environments.
comment: 10 pages, 8 figures, 2 tables, 2 equations
♻ ☆ Towards User-Focused Research in Training Data Attribution for Human-Centered Explainable AI
Explainable AI (XAI) aims to make AI systems more transparent, yet many practices emphasise mathematical rigour over practical user needs. We propose an alternative to this model-centric approach by following a design thinking process for the emerging XAI field of training data attribution (TDA), which risks repeating solutionist patterns seen in other subfields. However, because TDA is in its early stages, there is a valuable opportunity to shape its direction through user-centred practices. We engage directly with machine learning developers via a needfinding interview study (N=6) and a scenario-based interactive user study (N=31) to ground explanations in real workflows. Our exploration of the TDA design space reveals novel tasks for data-centric explanations useful to developers, such as grouping training samples behind specific model behaviours or identifying undersampled data. We invite the TDA, XAI, and HCI communities to engage with these tasks to strengthen their research's practical relevance and human impact.
♻ ☆ Artificial Empathy: AI based Mental Health
Many people suffer from mental health problems but not everyone seeks professional help or has access to mental health care. AI chatbots have increasingly become a go-to for individuals who either have mental disorders or simply want someone to talk to. This paper presents a study on participants who have previously used chatbots and a scenario-based testing of large language model (LLM) chatbots. Our findings indicate that AI chatbots were primarily utilized as a "Five minute therapist" or as a non-judgmental companion. Participants appreciated the anonymity and lack of judgment from chatbots. However, there were concerns about privacy and the security of sensitive information. The scenario-based testing of LLM chatbots highlighted additional issues. Some chatbots were consistently reassuring, used emojis and names to add a personal touch, and were quick to suggest seeking professional help. However, there were limitations such as inconsistent tone, occasional inappropriate responses (e.g., casual or romantic), and a lack of crisis sensitivity, particularly in recognizing red flag language and escalating responses appropriately. These findings can inform both the technology and mental health care industries on how to better utilize AI chatbots to support individuals during challenging emotional periods.
♻ ☆ Understanding Codebase like a Professional! Human-AI Collaboration for Code Comprehension
Understanding an unfamiliar codebase is an essential task for developers in various scenarios, such as during the onboarding process. Existing studies have shown that LLMs often fail to support users in understanding code structures or to provide user-centered, adaptive, and dynamic assistance in real-world settings. To address this, we propose learning from the perspective of a unique role, code auditors, whose work often requires them to quickly familiarize themselves with new code projects on weekly or even daily basis. To achieve this, we recruited and interviewed 8 code auditing practitioners to understand how they master codebase understanding. We identified several design opportunities for an LLM-based codebase understanding system: supporting cognitive alignment through automated codebase information extraction, decomposition, and representation, as well as reducing manual effort and conversational distraction through interaction design. To validate them, we designed a prototype, CodeMap, that provides dynamic information extraction and representation aligned with the human cognitive flow and enables interactive switching among hierarchical codebase visualizations. To evaluate the usefulness of our system, we conducted a user study with nine experienced developers and six novice developers. Our results demonstrate that CodeMap improved users' perceived intuitiveness, ease of use, and usefulness in supporting code comprehension, while reducing their reliance on reading and interpreting LLM responses by 79% and increasing map usage time by 90% compared with the static visualization analysis tool. It also enhances novice developers' perceived understanding and reduces their unpurposeful exploration.
comment: 11 pages
♻ ☆ LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS
We present AIOS 1.0, a novel platform designed to advance computer-use agent (CUA) capabilities through environmental contextualization. While existing approaches primarily focus on building more powerful agent frameworks or enhancing agent models, we identify a fundamental limitation: the semantic disconnect between how language models understand the world and how computer interfaces are structured. AIOS 1.0 addresses this challenge by transforming computers into contextual environments that language models can natively comprehend, implementing a Model Context Protocol (MCP) server architecture to abstract computer states and actions. This approach effectively decouples interface complexity from decision complexity, enabling agents to reason more effectively about computing environments. To demonstrate our platform's effectiveness, we introduce LiteCUA, a lightweight computer-use agent built on AIOS 1.0 that achieves a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simple architecture. Our results suggest that contextualizing computer environments for language models represents a promising direction for developing more capable computer-use agents and advancing toward AI that can interact with digital systems.
♻ ☆ Algorithmic Assistance with Recommendation-Dependent Preferences
When an algorithm provides risk assessments, we typically think of them as helpful inputs to human decisions, such as when risk scores are presented to judges or doctors. However, a decision-maker may react not only to the information provided by the algorithm. The decision-maker may also view the algorithmic recommendation as a default action, making it costly for them to deviate, such as when a judge is reluctant to overrule a high-risk assessment for a defendant or a doctor fears the consequences of deviating from recommended procedures. To address such unintended consequences of algorithmic assistance, we propose a model of joint human-machine decision-making. Within this model, we consider the effect and design of algorithmic recommendations when they affect choices not just by shifting beliefs, but also by altering preferences. We motivate this assumption from institutional factors, such as a desire to avoid audits, as well as from well-established models in behavioral science that predict loss aversion relative to a reference point. We show that recommendation-dependent preferences create inefficiencies where the decision-maker is overly responsive to the recommendation. As a remedy, we discuss algorithms that strategically withhold recommendations and show how they can improve the quality of final decisions. Concretely, we prove that an intuitive algorithm achieves minimax optimality by sending recommendations only when it is confident that their implementation would improve over an unassisted baseline decision.
Software Engineering 34
☆ On Selecting Few-Shot Examples for LLM-based Code Vulnerability Detection
Large language models (LLMs) have demonstrated impressive capabilities for many coding tasks, including summarization, translation, completion, and code generation. However, detecting code vulnerabilities remains a challenging task for LLMs. An effective way to improve LLM performance is in-context learning (ICL) - providing few-shot examples similar to the query, along with correct answers, can improve an LLM's ability to generate correct solutions. However, choosing the few-shot examples appropriately is crucial to improving model performance. In this paper, we explore two criteria for choosing few-shot examples for ICL used in the code vulnerability detection task. The first criterion considers if the LLM (consistently) makes a mistake or not on a sample with the intuition that LLM performance on a sample is informative about its usefulness as a few-shot example. The other criterion considers similarity of the examples with the program under query and chooses few-shot examples based on the $k$-nearest neighbors to the given sample. We perform evaluations to determine the benefits of these criteria individually as well as under various combinations, using open-source models on multiple datasets.
☆ Enhancing software product lines with machine learning components
Modern software systems increasingly integrate machine learning (ML) due to its advancements and ability to enhance data-driven decision-making. However, this integration introduces significant challenges for software engineering, especially in software product lines (SPLs), where managing variability and reuse becomes more complex with the inclusion of ML components. Although existing approaches have addressed variability management in SPLs and the integration of ML components in isolated systems, few have explored the intersection of both domains. Specifically, there is limited support for modeling and managing variability in SPLs that incorporate ML components. To bridge this gap, this article proposes a structured framework designed to extend Software Product Line engineering, facilitating the integration of ML components. It facilitates the design of SPLs with ML capabilities by enabling systematic modeling of variability and reuse. The proposal has been partially implemented with the VariaMos tool.
comment: pp. 73-94, 2 figures
☆ CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments
As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of real-world coding tasks and developer expectations. To this end, we introduce a multi-language benchmark that evaluates LLM instruction-following capabilities and is extensible to operate on any set of standalone coding problems. Our benchmark evaluates instruction following in two key settings: adherence to pre-defined constraints specified with the initial problem, and the ability to perform refinements based on follow-up instructions. For this paper's analysis, we empirically evaluated our benchmarking pipeline with programming tasks from LiveBench, that are also automatically translated from Python into Java and JavaScript. Our automated benchmark reveals that models exhibit differing levels of performance across multiple dimensions of instruction-following. Our benchmarking pipeline provides a more comprehensive evaluation of code generation models, highlighting their strengths and limitations across languages and generation goals.
☆ Agentic LLMs for REST API Test Amplification: A Comparative Study Across Cloud Applications
Representational State Transfer (REST) APIs are a cornerstone of modern cloud native systems. Ensuring their reliability demands automated test suites that exercise diverse and boundary level behaviors. Nevertheless, designing such test cases remains a challenging and resource intensive endeavor. This study extends prior work on Large Language Model (LLM) based test amplification by evaluating single agent and multi agent configurations across four additional cloud applications. The amplified test suites maintain semantic validity with minimal human intervention. The results demonstrate that agentic LLM systems can effectively generalize across heterogeneous API architectures, increasing endpoint and parameter coverage while revealing defects. Moreover, a detailed analysis of computational cost, runtime, and energy consumption highlights trade-offs between accuracy, scalability, and efficiency. These findings underscore the potential of LLM driven test amplification to advance the automation and sustainability of REST API testing in complex cloud environments.
☆ Efficient Integration of cross platform functions onto service-oriented architectures
The automotive industry is currently undergoing a major transformation with respect to the Electric/Electronic (E/E) and software architecture, driven by a significant increase in the complexity of the technological stack within a vehicle. This complexity acts as a driving force for Software-Defined Vehicles (SDVs) leading to the evolution of the automotive E/E architectures from decentralized configuration comprising multiple Electronic Control Units (ECUs) towards a more integrated configuration comprising a smaller number of ECUs, domain controllers, gateways, and High-Performance Computers (HPCs) [2]. This transition along with several other reasons have resulted in heterogeneous software platforms such as AUTOSAR Classic, AUTOSAR Adaptive, and prototypical frameworks like ROS 2. It is therefore essential to develop applications that are both hardware- and platform/middleware-agnostic to attain development and integration efficiency. This work presents an application development and integration concept to facilitate developing applications as Software as a Product (SaaP), while simultaneously ensuring efficient integration onto multiple software architecture platforms. The concept involves designing applications in a hardware- and software platform-agnostic manner and standardizing application interfaces [6]. It also includes describing the relevant aspects of the application and corresponding middleware in a machine-readable format to aid the integration of developed applications. Additionally, tools are developed to facilitate semi-automation of the development and integration processes. An example application has been developed and integrated onto AUTOSAR Adaptive and ROS 2, demonstrating the applicability of the approach. Finally, metrics are presented to show the efficiency of the overall concept.
comment: ELIV Conference 2025, Bonn, Germany
☆ Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes
Application modernization in legacy languages such as COBOL, PL/I, and REXX faces an acute shortage of resources, both in expert availability and in high-quality human evaluation data. While Large Language Models as a Judge (LaaJ) offer a scalable alternative to expert review, their reliability must be validated before being trusted in high-stakes workflows. Without principled validation, organizations risk a circular evaluation loop, where unverified LaaJs are used to assess model outputs, potentially reinforcing unreliable judgments and compromising downstream deployment decisions. Although various automated approaches to validating LaaJs have been proposed, alignment with human judgment remains a widely used and conceptually grounded validation strategy. In many real-world domains, the availability of human-labeled evaluation data is severely limited, making it difficult to assess how well a LaaJ aligns with human judgment. We introduce SparseAlign, a formal framework for assessing LaaJ alignment with sparse human-labeled data. SparseAlign combines a novel pairwise-confidence concept with a score-sensitive alignment metric that jointly capture ranking consistency and score proximity, enabling reliable evaluator selection even when traditional statistical methods are ineffective due to limited annotated examples. SparseAlign was applied internally to select LaaJs for COBOL code explanation. The top-aligned evaluators were integrated into assessment workflows, guiding model release decisions. We present a case study of four LaaJs to demonstrate SparseAlign's utility in real-world evaluation scenarios.
☆ On the Marriage of Theory and Practice in Data-Aware Business Processes via Low-Code
In recent years, there has been a growing interest in the verification of business process models. Despite their lack of formal characterization, these models are widely adopted in both industry and academia. To address this issue, formalizing the execution semantics of business process modeling languages is essential. Since data and process are two facets of the same coin, and data are critical elements in the execution of process models, this work introduces Proving an eXecutable BPMN injected with data, BPMN-ProX. BPMN-ProX is a low-code testing framework that significantly enhances the verification of data-aware BPMN. This low-code platform helps bridge the gap between non-technical experts and professionals by proposing a tool that integrates advanced data handling and employs a robust verification mechanism through state-of-the-art model checkers. This innovative approach combines theoretical verification with practical modeling, fostering more agile, reliable, and user-centric business process management.
☆ From product to system network challenges in system of systems lifecycle management
Today, products are no longer isolated artifacts, but nodes in networked systems. This means that traditional, linearly conceived life cycle models are reaching their limits: Interoperability across disciplines, variant and configuration management, traceability, and governance across organizational boundaries are becoming key factors. This collective contribution classifies the state of the art and proposes a practical frame of reference for SoS lifecycle management, model-based systems engineering (MBSE) as the semantic backbone, product lifecycle management (PLM) as the governance and configuration level, CAD-CAE as model-derived domains, and digital thread and digital twin as continuous feedback. Based on current literature and industry experience, mobility, healthcare, and the public sector, we identify four principles: (1) referenced architecture and data models, (2) end-to-end configuration sovereignty instead of tool silos, (3) curated models with clear review gates, and (4) measurable value contributions along time, quality, cost, and sustainability. A three-step roadmap shows the transition from product- to network- centric development: piloting with reference architecture, scaling across variant and supply chain spaces, organizational anchoring (roles, training, compliance). The results are increased change robustness, shorter throughput times, improved reuse, and informed sustainability decisions. This article is aimed at decision-makers and practitioners who want to make complexity manageable and design SoS value streams to be scalable.
☆ MARIA: A Framework for Marginal Risk Assessment without Ground Truth in AI Systems
Before deploying an AI system to replace an existing process, it must be compared with the incumbent to ensure improvement without added risk. Traditional evaluation relies on ground truth for both systems, but this is often unavailable due to delayed or unknowable outcomes, high costs, or incomplete data, especially for long-standing systems deemed safe by convention. The more practical solution is not to compute absolute risk but the difference between systems. We therefore propose a marginal risk assessment framework, that avoids dependence on ground truth or absolute risk. It emphasizes three kinds of relative evaluation methodology, including predictability, capability and interaction dominance. By shifting focus from absolute to relative evaluation, our approach equips software teams with actionable guidance: identifying where AI enhances outcomes, where it introduces new risks, and how to adopt such systems responsibly.
comment: 9 pages, 1 figure
☆ Empirical Studies on Quantum Optimization for Software Engineering: A Systematic Analysis
In recent years, quantum, quantum-inspired, and hybrid algorithms are increasingly showing promise for solving software engineering optimization problems. However, best-intended practices for conducting empirical studies have not yet well established. In this paper, based on the primary studies identified from the latest systematic literature review on quantum optimization for software engineering problems, we conducted a systematic analysis on these studies from various aspects including experimental designs, hyperparameter settings, case studies, baselines, tooling, and metrics. We identify key gaps in the current practices such as limited reporting of the number of repetitions, number of shots, and inadequate consideration of noise handling, as well as a lack of standardized evaluation protocols such as the adoption of quality metrics, especially quantum-specific metrics. Based on our analysis, we provide insights for designing empirical studies and highlight the need for more real-world and open case studies to assess cost-effectiveness and practical utility of the three types of approaches: quantum-inspired, quantum, and hybrid. This study is intended to offer an overview of current practices and serve as an initial reference for designing and conducting empirical studies on evaluating and comparing quantum, quantum-inspired, and hybrid algorithms in solving optimization problems in software engineering.
☆ Towards a Measure of Algorithm Similarity
Given two algorithms for the same problem, can we determine whether they are meaningfully different? In full generality, the question is uncomputable, and empirically it is muddied by competing notions of similarity. Yet, in many applications (such as clone detection or program synthesis) a pragmatic and consistent similarity metric is necessary. We review existing equivalence and similarity notions and introduce EMOC: An Evaluation-Memory-Operations-Complexity framework that embeds algorithm implementations into a feature space suitable for downstream tasks. We compile PACD, a curated dataset of verified Python implementations across three problems, and show that EMOC features support clustering and classification of algorithm types, detection of near-duplicates, and quantification of diversity in LLM-generated programs. Code, data, and utilities for computing EMOC embeddings are released to facilitate reproducibility and future work on algorithm similarity.
comment: 11 pages, many figures and images
☆ LLM-Driven Cost-Effective Requirements Change Impact Analysis
Requirements are inherently subject to changes throughout the software development lifecycle. Within the limited budget available to requirements engineers, manually identifying the impact of such changes on other requirements is both error-prone and effort-intensive. That might lead to overlooked impacted requirements, which, if not properly managed, can cause serious issues in the downstream tasks. Inspired by the growing potential of large language models (LLMs) across diverse domains, we propose ProReFiCIA, an LLM-driven approach for automatically identifying the impacted requirements when changes occur. We conduct an extensive evaluation of ProReFiCIA using several LLMs and prompts variants tailored to this task. Using the best combination of an LLM and a prompt variant, ProReFiCIA achieves a recall of 93.3% on a benchmark dataset and 95.8% on a newly created industry dataset, demonstrating its strong effectiveness in identifying impacted requirements. Further, the cost of applying ProReFiCIA remains small, as the engineer only needs to review the generated results, which represent between 2.1% and 8.5% of the entire set of requirements.
comment: 28 pages, 6 figures
☆ DocPrism: Local Categorization and External Filtering to Identify Relevant Code-Documentation Inconsistencies
Code-documentation inconsistencies are common and undesirable: they can lead to developer misunderstandings and software defects. This paper introduces DocPrism, a multi-language, code-documentation inconsistency detection tool. DocPrism uses a standard large language model (LLM) to analyze and explain inconsistencies. Plain use of LLMs for this task yield unacceptably high false positive rates: LLMs identify natural gaps between high-level documentation and detailed code implementations as inconsistencies. We introduce and apply the Local Categorization, External Filtering (LCEF) methodology to reduce false positives. LCEF relies on the LLM's local completion skills rather than its long-term reasoning skills. In our ablation study, LCEF reduces DocPrism's inconsistency flag rate from 98% to 14%, and increases accuracy from 14% to 94%. On a broad evaluation across Python, TypeScript, C++, and Java, DocPrism maintains a low flag rate of 15%, and achieves a precision of 0.62 without performing any fine-tuning.
☆ Position: Vibe Coding Needs Vibe Reasoning: Improving Vibe Coding with Formal Verification
``Vibe coding'' -- the practice of developing software through iteratively conversing with a large language model (LLM) -- has exploded in popularity within the last year. However, developers report key limitations including the accumulation of technical debt, security issues, and code churn to achieve satisfactory results. We argue that these pitfalls result from LLMs' inability to reconcile accumulating human-imposed constraints during vibe coding, with developers inadvertently failing to resolve contradictions because LLMs prioritize user commands over code consistency. Given LLMs' receptiveness to verification-based feedback, we argue that formal methods can mitigate these pitfalls, making vibe coding more reliable. However, we posit that integrating formal methods must transcend existing approaches that combine formal methods and LLMs. We advocate for a side-car system throughout the vibe coding process which: (1) \emph{Autoformalizes} specifications (2) Validates against targets, (3) Delivers \emph{actionable} feedback to the LLM, and (4) Allows intuitive developer influence on specifications.
comment: 7 pages, 3 figures, In Proceedings of the 1st ACM SIGPLAN International Workshop on Language Models and Programming Languages (LMPL'25), October 12-18, 2025, Singapore, Singapore. ACM, New York, NY, USA
☆ Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories
The increasing deployment of Large Language Model (LLM) agents for complex software engineering tasks has created a need to understand their problem-solving behaviours beyond simple success metrics. While these agents demonstrate impressive capabilities in automated issue resolution, their decision-making processes remain largely opaque. This paper presents an empirical study of agent trajectories, namely the execution traces capturing the steps agents take when attempting to resolve software issues. We analyse trajectories from three state-of-the-art code agents (OpenHands, SWE-agent, and Prometheus) on the SWE-Bench benchmark, examining both successful and failed attempts. Our investigation reveals several key insights into agent behaviour. First, we identify how distinct problem-solving strategies, such as defensive programming and context gathering, enable success in different scenarios. Second, we find that failed trajectories are consistently longer and exhibit higher variance than successful ones, with failure patterns differing significantly between agents. Third, our fault localisation analysis shows that while most trajectories correctly identify problematic files (72-81\% even in failures), success depends more on achieving approximate rather than exact code modifications. These and other findings unveiled by our study, provide a foundation for understanding agent behaviour through trajectory analysis, contributing to the development of more robust and interpretable autonomous software engineering systems.
☆ What a diff makes: automating code migration with large language models
Modern software programs are built on stacks that are often undergoing changes that introduce updates and improvements, but may also break any project that depends upon them. In this paper we explore the use of Large Language Models (LLMs) for code migration, specifically the problem of maintaining compatibility with a dependency as it undergoes major and minor semantic version changes. We demonstrate, using metrics such as test coverage and change comparisons, that contexts containing diffs can significantly improve performance against out of the box LLMs and, in some cases, perform better than using code. We provide a dataset to assist in further development of this problem area, as well as an open-source Python package, AIMigrate, that can be used to assist with migrating code bases. In a real-world migration of TYPHOIDSIM between STARSIM versions, AIMigrate correctly identified 65% of required changes in a single run, increasing to 80% with multiple runs, with 47% of changes generated perfectly.
comment: 10 pages, 8 figures
☆ Inferring multiple helper Dafny assertions with LLMs
The Dafny verifier provides strong correctness guarantees but often requires numerous manual helper assertions, creating a significant barrier to adoption. We investigate the use of Large Language Models (LLMs) to automatically infer missing helper assertions in Dafny programs, with a primary focus on cases involving multiple missing assertions. To support this study, we extend the DafnyBench benchmark with curated datasets where one, two, or all assertions are removed, and we introduce a taxonomy of assertion types to analyze inference difficulty. Our approach refines fault localization through a hybrid method that combines LLM predictions with error-message heuristics. We implement this approach in a new tool called DAISY (Dafny Assertion Inference SYstem). While our focus is on multiple missing assertions, we also evaluate DAISY on single-assertion cases. DAISY verifies 63.4% of programs with one missing assertion and 31.7% with multiple missing assertions. Notably, many programs can be verified with fewer assertions than originally present, highlighting that proofs often admit multiple valid repair strategies and that recovering every original assertion is unnecessary. These results demonstrate that automated assertion inference can substantially reduce proof engineering effort and represent a step toward more scalable and accessible formal verification.
♻ ☆ A Process Mining-Based System For The Analysis and Prediction of Software Development Workflows
CodeSight is an end-to-end system designed to anticipate deadline compliance in software development workflows. It captures development and deployment data directly from GitHub, transforming it into process mining logs for detailed analysis. From these logs, the system generates metrics and dashboards that provide actionable insights into PR activity patterns and workflow efficiency. Building on this structured representation, CodeSight employs an LSTM model that predicts remaining PR resolution times based on sequential activity traces and static features, enabling early identification of potential deadline breaches. In tests, the system demonstrates high precision and F1 scores in predicting deadline compliance, illustrating the value of integrating process mining with machine learning for proactive software project management.
comment: 16 pages, 7 figures, 4 tables
♻ ☆ Understanding Collective Social Behavior in OSS Communities: A Co-editing Network Analysis of Activity Cascades
Understanding the collective social behavior of software developers is crucial to model and predict the long-term dynamics and sustainability of Open Source Software (OSS) communities. To this end, we analyze temporal activity patterns of developers, revealing an inherently ``bursty'' nature of commit contributions. To investigate the social mechanisms behind this phenomenon, we adopt a network-based modelling framework that captures developer interactions through co-editing networks. Our framework models social interactions, where a developer editing the code of other developers triggers accelerated activity among collaborators. Using a large data set on 50 major OSS communities, we further develop a method that identifies activity cascades, i.e. the propagation of developer activity in the underlying co-editing network. Our results suggest that activity cascades are a statistically significant phenomenon in more than half of the studied projects. We further show that our insights can be used to develop a simple yet practical churn prediction method that forecasts which developers are likely to leave a project. Our work sheds light on the emergent collective social dynamics in OSS communities and highlights the importance of activity cascades to understand developer churn and retention in collaborative software projects.
♻ ☆ Privacy by Design: Aligning GDPR and Software Engineering Specifications with a Requirements Engineering Approach
Context: Consistent requirements and system specifications are essential for the compliance of software systems towards the General Data Protection Regulation (GDPR). Both artefacts need to be grounded in the original text and conjointly assure the achievement of privacy by design (PbD). Objectives: There is little understanding of the perspectives of practitioners on specification objectives and goals to address PbD. Existing approaches do not account for the complex intersection between problem and solution space expressed in GDPR. In this study we explore the demand for conjoint requirements and system specification for PbD and suggest an approach to address this demand. Methods: We reviewed secondary and related primary studies and conducted interviews with practitioners to (1) investigate the state-of-practice and (2) understand the underlying specification objectives and goals (e.g., traceability). We developed and evaluated an approach for requirements and systems specification for PbD, and evaluated it against the specification objectives. Results: The relationship between problem and solution space, as expressed in GDPR, is instrumental in supporting PbD. We demonstrate how our approach, based on the modeling GDPR content with original legal concepts, contributes to specification objectives of capturing legal knowledge, supporting specification transparency, and traceability. Conclusion: GDPR demands need to be addressed throughout different levels of abstraction in the engineering lifecycle to achieve PbD. Legal knowledge specified in the GDPR text should be captured in specifications to address the demands of different stakeholders and ensure compliance. While our results confirm the suitability of our approach to address practical needs, we also revealed specific needs for the future effective operationalization of the approach.
comment: v2: acknowledgements added. Accepted for publication in Information and Software Technology Volume 190, February 2026, 107946
♻ ☆ Environmental Impact of CI/CD Pipelines
CI/CD pipelines are widely used in software development, yet their environmental impact, particularly carbon and water footprints (CWF), remains largely unknown to developers, as CI service providers typically do not disclose such information. With the growing environmental impact of cloud computing, understanding the CWF of CI/CD services has become increasingly important. This work investigates the CWF of using GitHub Actions, focusing on open-source repositories where usage is free and unlimited for standard runners. We build upon a methodology from the Cloud Carbon Footprint framework and we use the largest dataset of workflow runs reported in the literature to date, comprising over 2.2 million workflow runs from more than 18,000 repositories. Our analysis reveals that the GitHub Actions ecosystem results in a substantial CWF. Our estimates for the carbon footprint in 2024 range from 150.5 MTCO2e in the most optimistic scenario to 994.9 MTCO2e in the most pessimistic scenario, while the water footprint ranges from 1,989.6 to 37,664.5 kiloliters. The most likely scenario estimates are 456.9 MTCO2e for carbon footprint and 5,738.2 kiloliters for water footprint. To provide perspective, the carbon footprint in the most likely scenario is equivalent to the carbon captured by 7,615 urban trees in a year, and the water footprint is comparable to the water consumed by an average American family over 5,053 years. We explore strategies to mitigate this impact, primarily by reducing wasted computational resources. Key recommendations include deploying runners in regions whose energy production has a low environmental impact such as France and the United Kingdom, implementing stricter deactivation policies for scheduled runs and aligning their execution with periods when the regional energy mix is more environmentally favorable, and reducing the size of repositories.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ Can Emulating Semantic Translation Help LLMs with Code Translation? A Study Based on Pseudocode
Large language models (LLMs) show great potential in code translation. However, accurate translation remains challenging when using the commonly adopted direct code-to-code translation approach, which converts a program into the target programming language (PL) in a single step. Inspired by the success of incorporating intermediate steps to guide LLMs in resolving challenging tasks, we explore pseudocode-based code translation, which emulates the human semantic translation by first interpreting the program's intent and logic into pseudocode and then implementing it in the target PL. We find that pseudocode-based translation helps translate programs that direct translation struggles to handle. Nonetheless, the effectiveness, advantages, and limitations of this approach remain underexplored. To bridge this gap, we present an empirical study on pseudocode-based code translation, aiming to investigate its effectiveness in enhancing the direct translation approach, illuminate its effective usage, and identify limitations hindering its potential benefits. By comparing direct and pseudocode-based translation approaches on 9,690 translation tasks across six PLs with five popular LLMs, we demonstrate that pseudocode-based translation can effectively complement direct translation, particularly when translating from flexible to rigid PLs or dealing with low-resource Rust. Based on these findings, we suggest adopting strategies that combine the complementary strengths of both approaches to enhance code translation accuracy. We also reveal the advantages of pseudocode-based translation in disentangling translations of complicated programs and mitigating distractions from detailed implementations in original programs, as well as its limitations due to incorrect, incomplete, or ambiguous pseudocode.
♻ ☆ On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning Implementations
Deep Reinforcement Learning (DRL) is a paradigm of artificial intelligence where an agent uses a neural network to learn which actions to take in a given environment. DRL has recently gained traction from being able to solve complex environments like driving simulators, 3D robotic control, and multiplayer-online-battle-arena video games. Numerous implementations of the state-of-the-art algorithms responsible for training these agents, like the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms, currently exist. However, studies make the mistake of assuming implementations of the same algorithm to be consistent and thus, interchangeable. In this paper, through a differential testing lens, we present the results of studying the extent of implementation inconsistencies, their effect on the implementations' performance, as well as their impact on the conclusions of prior studies under the assumption of interchangeable implementations. The outcomes of our differential tests showed significant discrepancies between the tested algorithm implementations, indicating that they are not interchangeable. In particular, out of the five PPO implementations tested on 56 games, three implementations achieved superhuman performance for 50% of their total trials while the other two implementations only achieved superhuman performance for less than 15% of their total trials. As part of a meticulous manual analysis of the implementations' source code, we analyzed implementation discrepancies and determined that code-level inconsistencies primarily caused these discrepancies. Lastly, we replicated a study and showed that this assumption of implementation interchangeability was sufficient to flip experiment outcomes. Therefore, this calls for a shift in how implementations are being used.
comment: Added the ACM Functional v1.1 badge, the full publication citation, and a link to the publication on IEEE Xplore
♻ ☆ Human to Document, AI to Code: Comparing GenAI for Notebook Competitions
Computational notebooks have become the preferred tool of choice for data scientists and practitioners to perform analyses and share results. Notebooks uniquely combine scripts with documentation. With the emergence of generative AI (GenAI) technologies, it is increasingly important, especially in competitive settings, to distinguish the characteristics of human-written versus GenAI. In this study, we present three case studies to explore potential strengths of both humans and GenAI through the coding and documenting activities in notebooks. We first characterize differences between 25 code and documentation features in human-written, medal-winning Kaggle notebooks. We find that gold medalists are primarily distinguished by longer and more detailed documentation. Second, we analyze the distinctions between human-written and GenAI notebooks. Our results show that while GenAI notebooks tend to achieve higher code quality (as measured by metrics like code smells and technical debt), human-written notebooks display greater structural diversity, complexity, and innovative approaches to problem-solving. Based on these results, we envision the work as groundwork that highlight four agendas to further investigate how GenAI could be utilized in notebooks that maximizes the potential collaboration between human and AI.
comment: Accepted at the AIware 2025 - Main Track
♻ ☆ LLM Based Long Code Translation using Identifier Replacement
In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often struggle with long source codes that don't fit into the context window, which produces inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting user-given long identifiers with generalized placeholders during translation, our method allows the LLM to focus on the logical structure of the code, by reducing token count and memory usage, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translation results with reduced tokens.
♻ ☆ RepoMasterEval: Evaluating Code Completion via Real-World Repositories
With the growing reliance on automated code completion tools in software development, the need for comprehensive evaluation benchmarks has become critical. Existing benchmarks focus more on code completion in function and class level by providing text descriptions to prompt the model. By contrast, such descriptive prompt is commonly unavailable in real development and code completion can occur in wider range of situations such as in the middle of a function or a code block. These limitations makes existing evaluation benchmarks poorly align with the practical scenarios of code completion tools. In this paper, we propose RepoMasterEval, a novel benchmark for evaluating code completion models constructed from real-world repositories. Each benchmark datum is generated by masking a code snippet (ground truth) from one source code file with existing test suites. To improve test accuracy of model generated code, we employ mutation testing to measure the effectiveness of the test cases and we manually crafted new test cases for those test suites with low mutation score. Our empirical evaluation on 10 state-of-the-art models shows that test argumentation is critical in improving the accuracy of the benchmark and RepoMasterEval is able to report variance in model performance in real-world scenarios. The deployment of RepoMasterEval also revealed that the benchmark is useful to give accurate feedback during model training and the score is in high correlation with the model's performance in practice.
♻ ☆ LLM-Guided Scenario-based GUI Testing
The assurance of mobile app GUIs has become increasingly important, as the GUI serves as the primary medium of interaction between users and apps. Although numerous automated GUI testing approaches have been developed with diverse strategies, a substantial gap remains between these approaches and the underlying app business logic. Most existing approaches focus on general exploration rather than the completion of specific testing scenarios, often missing critical functionalities. Inspired by manual testing, which treats business logic-driven scenarios as the fundamental unit of testing, this paper introduces an approach that leverages large language models to comprehend GUI semantics and contextual relevance to given scenarios. Building on this capability, we propose ScenGen, an LLM-guided scenario-based GUI testing framework employing multi-agent collaboration to simulate and automate manual testing phases. Specifically, ScenGen integrates five agents: the Observer, Decider, Executor, Supervisor, and Recorder. The Observer perceives the app GUI state by extracting and structuring GUI widgets and layouts, interpreting semantic information. This is passed to the Decider, which makes scenario-driven decisions with LLM guidance to identify target widgets and determine actions toward fulfilling specific goals. The Executor performs these operations, while the Supervisor verifies alignment with intended scenario completion, ensuring traceability and consistency. Finally, the Recorder logs GUI operations into context memory as a knowledge base for subsequent decision-making and monitors runtime bugs. Comprehensive evaluations demonstrate that ScenGen effectively generates scenario-based GUI tests guided by LLM collaboration, achieving higher relevance to business logic and improving the completeness of automated GUI testing.
♻ ☆ Ten Simple Rules for AI-Assisted Coding in Science
While AI coding tools have demonstrated potential to accelerate software development, their use in scientific computing raises critical questions about code quality and scientific validity. In this paper, we provide ten practical rules for AI-assisted coding that balance leveraging capabilities of AI with maintaining scientific and methodological rigor. We address how AI can be leveraged strategically throughout the development cycle with four key themes: problem preparation and understanding, managing context and interaction, testing and validation, and code quality assurance and iterative improvement. These principles serve to emphasize maintaining human agency in coding decisions, establishing robust validation procedures, and preserving the domain expertise essential for methodologically sound research. These rules are intended to help researchers harness AI's transformative potential for faster software development while ensuring that their code meets the standards of reliability, reproducibility, and scientific validity that research integrity demands.
comment: 9 pages of content; 1 table; 1 page appendix
♻ ☆ RepoMark: A Code Usage Auditing Framework for Code Large Language Models
The rapid development of Large Language Models (LLMs) for code generation has transformed software development by automating coding tasks with unprecedented efficiency. However, the training of these models on open-source code repositories (e.g., from GitHub) raises critical ethical and legal concerns, particularly regarding data authorization and open-source license compliance. Developers are increasingly questioning whether model trainers have obtained proper authorization before using repositories for training, especially given the lack of transparency in data collection. To address these concerns, we propose a novel data marking framework RepoMark to audit the data usage of code LLMs. Our method enables repository owners to verify whether their code has been used in training, while ensuring semantic preservation, imperceptibility, and theoretical false detection rate (FDR) guarantees. By generating multiple semantically equivalent code variants, RepoMark introduces data marks into the code files, and during detection, RepoMark leverages a novel ranking-based hypothesis test to detect memorization within the model. Compared to prior data auditing approaches, RepoMark significantly enhances sample efficiency, allowing effective auditing even when the user's repository possesses only a small number of code files. Experiments demonstrate that RepoMark achieves a detection success rate over 90\% on small code repositories under a strict FDR guarantee of 5\%. This represents a significant advancement over existing data marking techniques, all of which only achieve accuracy below 55\% under identical settings. This further validates RepoMark as a robust, theoretically sound, and promising solution for enhancing transparency in code LLM training, which can safeguard the rights of repository owners.
♻ ☆ Your Build Scripts Stink: The State of Code Smells in Build Scripts
Build scripts automate the process of compiling source code, managing dependencies, running tests, and packaging software into deployable artifacts. These scripts are ubiquitous in modern software development pipelines for streamlining testing and delivery. While developing build scripts, practitioners may inadvertently introduce code smells, which are recurring patterns of poor coding practices that may lead to build failures or increase risk and technical debt. The goal of this study is to aid practitioners in avoiding code smells in build scripts through an empirical study of build scripts and issues on GitHub.We employed a mixed-methods approach, combining qualitative and quantitative analysis. First, we conducted a qualitative analysis of 2000 build-script-related GitHub issues to understand recurring smells. Next, we developed a static analysis tool, Sniffer, to automatically detect code smells in 5882 build scripts of Maven, Gradle, CMake, and Make files, collected from 4877 open-source GitHub repositories. To assess Sniffer's performance, we conducted a user study, where Sniffer achieved higher precision, recall, and F-score. We identified 13 code smell categories, with a total of 10,895 smell occurrences, where 3184 were in Maven, 1214 in Gradle, 337 in CMake, and 6160 in Makefiles. Our analysis revealed that Insecure URLs were the most prevalent code smell in Maven build scripts, while HardcodedPaths/URLs were commonly observed in both Gradle and CMake scripts. Wildcard Usage emerged as the most frequent smell in Makefiles. The co-occurrence analysis revealed strong associations between specific smell pairs of Hardcoded Paths/URLs with Duplicates, and Inconsistent Dependency Management with Empty or Incomplete Tags, which indicate potential underlying issues in the build script structure and maintenance practices.
comment: 13 pages, 5 tables, 2 figures
♻ ☆ A Practical-Driven Framework for Transitioning Drive-by-Wire to Autonomous Driving Systems: A Case Study with a Chrysler Pacifica Hybrid Vehicle
Transitioning from a Drive-by-Wire (DBW) system to a fully autonomous driving system (ADS) involves multiple stages of development and demands robust positioning and sensing capabilities. This paper presents a practice-driven framework for facilitating the DBW-to-ADS transition using a 2022 Chrysler Pacifica Hybrid Minivan equipped with cameras, LiDAR, GNSS, and onboard computing hardware configured with the Robot Operating System (ROS) and Autoware.AI. The implementation showcases offline autonomous operations utilizing pre-recorded LiDAR and camera data, point clouds, and vector maps, enabling effective localization and path planning within a structured test environment. The study addresses key challenges encountered during the transition, particularly those related to wireless-network-assisted sensing and positioning. It offers practical solutions for overcoming software incompatibility constraints, sensor synchronization issues, and limitations in real-time perception. Furthermore, the integration of sensing, data fusion, and automation is emphasized as a critical factor in supporting autonomous driving systems in map generation, simulation, and training. Overall, the transition process outlined in this work aims to provide actionable strategies for researchers pursuing DBW-to-ADS conversion. It offers direction for incorporating real-time perception, GNSS-LiDAR-camera integration, and fully ADS-equipped autonomous vehicle operations, thus contributing to the advancement of robust autonomous vehicle technologies.
comment: This updated version includes further implementation details and experimental validation. Accepted for presentation at The 22nd International Conference on Automation Technology (AUTOMATION 2025), Taipei, Taiwan, November 2025
♻ ☆ A Multifaceted View on Discrimination in Software Development Careers
Conversations around diversity and inclusion in software engineering often focus on gender and racial disparities. However, the State of Devs 2025 survey with 8,717 participants revealed that other forms of discrimination are similarly prevalent but receive considerably less attention. This includes discrimination based on age, political perspective, disabilities, or cognitive differences such as neurodivergence. We conducted a secondary analysis of 800 open-ended survey responses to examine patterns of perceived discrimination, as well as related challenges and negative impacts. Our study covers multiple identity facets, including age, gender, race, and disability. We found that age- and gender-related discrimination was the most frequently reported workplace issue, but discrimination based on political and religious views emerged as further notable concerns. Most of the participants who identified as female cited gender as the primary source of discrimination, often accompanied by intersectional factors such as race, political views, age, or sexual orientation. Discrimination related to caregiving responsibilities was reported by all gender identities. Regarding the negative impacts of workplace issues, many participants described modifying their appearance or behavior in response to gender biases. Gender also appeared to influence broader career challenges, as women and non-binary respondents reported experiencing almost all workplace issues at higher rates, particularly discrimination (35%) and mental health challenges (62%). Our goal is to raise awareness in the research community that discrimination in software development is multifaceted, and to encourage researchers to select and assess relevant facets beyond age and gender when designing software engineering studies.
comment: 11 pages, 1 figure, 5 tables
♻ ☆ Small Changes, Big Trouble: Demystifying and Parsing License Variants for Incompatibility Detection in the PyPI Ecosystem ICSE 2026
Open-source licenses establish the legal foundation for software reuse, yet license variants, including both modified standard licenses and custom-created alternatives, introduce significant compliance complexities. Despite their prevalence and potential impact, these variants are poorly understood in modern software systems, and existing tools do not account for their existence, leading to significant challenges in both effectiveness and efficiency of license analysis. To fill this knowledge gap, we conduct a comprehensive empirical study of license variants in the PyPI ecosystem. Our findings show that textual variations in licenses are common, yet only 2% involve substantive modifications. However, these license variants lead to significant compliance issues, with 10.7% of their downstream dependencies found to be license-incompatible. Inspired by our findings, we introduce LV-Parser, a novel approach for efficient license variant analysis leveraging diff-based techniques and large language models, along with LV-Compat, an automated pipeline for detecting license incompatibilities in software dependency networks. Our evaluation demonstrates that LV-Parser achieves an accuracy of 0.936 while reducing computational costs by 30%, and LV-Compat identifies 5.2 times more incompatible packages than existing methods with a precision of 0.98. This work not only provides the first empirical study into license variants in software packaging ecosystem but also equips developers and organizations with practical tools for navigating the complex landscape of open-source licensing.
comment: The 48th International Conference on Software Engineering(ICSE 2026)
♻ ☆ Internal Vulnerabilities, External Threats: A Grounded Framework for Enterprise Open Source Risk Governance
Enterprise engagement with open source has evolved from tactical adoption to strategic deep integration, exposing them to a complex risk landscape far beyond mere code. However, traditional risk management, narrowly focused on technical tools, is structurally inadequate for systemic threats like upstream "silent fixes", community conflicts, or sudden license changes, creating a dangerous governance blind spot. To address this governance vacuum and enable the necessary shift from tactical risk management to holistic risk governance, we conducted a grounded theory study with 15 practitioners to develop a holistic risk governance framework. Our study formalizes an analytical framework built on a foundational risk principle: an uncontrollable External Threat (e.g., a sudden license change in a key dependency) only becomes a critical risk when it exploits a controllable Internal Vulnerability (e.g., an undefined risk appetite for single-vendor projects), which then amplifies the impact. The framework operationalizes this principle through a clear logical chain: "Objectives -> Threats -> Vulnerabilities -> Mitigation" (OTVM). This provides a holistic decision model that transcends mere technical checklists. Based on this logic, our contributions are: (1) a "Strategic Objectives Matrix" to clarify goals; (2) a systematic dual taxonomy of External Threats (Ex-Tech, Ex-Comm, Ex-Eco) and Internal Vulnerabilities (In-Strat, In-Ops, In-Tech); and (3) an actionable mitigation framework mapping capability-building to these vulnerabilities. The framework's analytical utility was validated by three industry experts through retrospective case studies on real-world incidents. This work provides a novel diagnostic lens and a systematic path for enterprises to shift from reactive "firefighting" to proactively building an organizational "immune system".
Systems and Control 30
☆ Technical Report for Dissipativity Learning in Reproducing Kernel Hilbert Space
This work presents a nonparametric framework for dissipativity learning in reproducing kernel Hilbert spaces, which enables data-driven certification of stability and performance properties for unknown nonlinear systems without requiring an explicit dynamic model. Dissipativity is a fundamental system property that generalizes Lyapunov stability, passivity, and finite L2 gain conditions through an energy balance inequality between a storage function and a supply rate. Unlike prior parametric formulations that approximate these functions using quadratic forms with fixed matrices, the proposed method represents them as Hilbert Schmidt operators acting on canonical kernel features, thereby capturing nonlinearities implicitly while preserving convexity and analytic tractability. The resulting operator optimization problem is formulated in the form of a one-class support vector machine and reduced, via the representer theorem, to a finite dimensional convex program expressed through kernel Gram matrices. Furthermore, statistical learning theory is applied to establish generalization guarantees, including confidence bounds on the dissipation rate and the L2 gain. Numerical results demonstrate that the proposed RKHS based dissipativity learning method effectively identifies nonlinear dissipative behavior directly from input output data, providing a powerful and interpretable framework for model free control analysis and synthesis.
comment: 26 pages, 3 figures
☆ SpecAttn: Speculating Sparse Attention NeurIPS 2025
Large Language Models (LLMs) face significant computational bottlenecks during inference due to the quadratic complexity of self-attention mechanisms, particularly as context lengths increase. We introduce SpecAttn, a novel training-free approach that seamlessly integrates with existing speculative decoding techniques to enable efficient sparse attention in pre-trained transformers. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model, eliminating redundant computation while maintaining output quality. SpecAttn employs three core techniques: KL divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free algorithm for top-p token selection from draft attention patterns, and dynamic key-value cache pruning guided by these predictions. By leveraging the computational work already performed in standard speculative decoding pipelines, SpecAttn achieves over 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset, significantly outperforming existing sparse attention methods. Our approach demonstrates that speculative execution can be enhanced to provide approximate verification without significant performance degradation.
comment: Accepted to NeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling
☆ Risk-constrained stochastic scheduling of multi-market energy storage systems
Energy storage can promote the integration of renewables by operating with charge and discharge policies that balance an intermittent power supply. This study investigates the scheduling of energy storage assets under energy price uncertainty, with a focus on electricity markets. A two-stage stochastic risk-constrained approach is employed, whereby electricity price trajectories or specific power markets are observed, allowing for recourse in the schedule. Conditional value-at-risk is used to quantify tail risk in the optimization problems; this allows for the explicit specification of a probabilistic risk limit. The proposed approach is tested in an integrated hydrogen system (IHS) and a battery energy storage system (BESS). In the joint design and operation context for the IHS, the risk constraint results in larger installed unit capacities, increasing capital cost but enabling more energy inventory to buffer price uncertainty. As shown in both case studies, there is an operational trade-off between risk and expected reward; this is reflected in higher expected costs (or lower expected profits) with increasing levels of risk aversion. Despite the decrease in expected reward, both systems exhibit substantial benefits of increasing risk aversion. This work provides a general method to address uncertainties in energy storage scheduling, allowing operators to input their level of risk tolerance on asset decisions.
comment: 39 pages, 10 figures, 7 tables
☆ Context-Aware Stochastic Modeling of Consumer Energy Resource Aggregators in Electricity Markets SC
Aggregators of consumer energy resources (CERs) like rooftop solar and battery energy storage (BES) face challenges due to their inherent uncertainties. A sensible approach is to use stochastic optimization to handle such uncertainties, which can lead to infeasible problems or loss in revenues if not chosen appropriately. This paper presents three efficient two-stage stochastic optimization methods: risk-neutral, robust, and chance-constrained, to address the impact of CER uncertainties for aggregators who participate in energy and regulation services markets in the Australian National Electricity Market. Furthermore, these methods utilize the flexibility of BES, considering precise state-of-charge dynamics and complementarity constraints, aiming for scalable performance while managing uncertainty. The problems are formed as two-stage stochastic mixed-integer linear programs, with relaxations adopted for large scenario sets. The solution approach employs scenario-based methodologies and affine recourse policies to obtain tractable reformulations. These methods are evaluated across use cases reflecting diverse operational and market settings, uncertainty characteristics, and decision-making preferences, demonstrating their ability to mitigate uncertainty, enhance profitability, and provide context-aware guidance for aggregators in choosing the most appropriate stochastic optimization method.
comment: Submitted to PSCC 2026
☆ A Switching Strategy for Event-Trigger Control of Spacecraft Rendezvous
This paper presents the design of a state-feedback control law for spacecraft rendezvous, formulated using the Hill-Clohessy-Wiltshire equations. The proposed method introduces an impulsive control strategy to regulate thruster operations. Specifically, a state-dependent switching framework is developed to determine both the control input magnitudes and the precise state conditions that trigger thruster activation. The nonlinear control law is derived using principles from automatic control theory, particularly Lyapunov stability analysis and the Linear Matrix Inequality framework. The resulting closed-loop system is proven to be stable, while simultaneously minimizing the total number of actuation events. The effectiveness of the proposed method is demonstrated through a numerical case study, which includes a comparative analysis with a standard Model Predictive Control scheme, highlighting the advantages and trade-offs of the developed control structure.
comment: Submitted for EuroGNC 2026
☆ Simplifying Preference Elicitation in Local Energy Markets: Combinatorial Clock Exchange
As distributed energy resources (DERs) proliferate, future power system will need new market platforms enabling prosumers to trade various electricity and grid-support products. However, prosumers often exhibit complex, product interdependent preferences and face limited cognitive and computational resources, hindering engagement with complex market structures and bid formats. We address this challenge by introducing a multi-product market that allows prosumers to express complex preferences through an intuitive format, by fusing combinatorial clock exchange and machine learning (ML) techniques. The iterative mechanism only requires prosumers to report their preferred package of products at posted prices, eliminating the need for forecasting product prices or adhering to complex bid formats, while the ML-aided price discovery speeds up convergence. The linear pricing rule further enhances transparency and interpretability. Finally, numerical simulations demonstrate convergence to clearing prices in approximately 15 clock iterations.
☆ Value of Multi-pursuer Single-evader Pursuit-evasion Game with Terminal Cost of Evader's Position: Relaxation of Convexity Condition
In this study, we consider a multi-pursuer single-evader quantitative pursuit-evasion game with payoff function that includes only the terminal cost. The terminal cost is a function related only to the terminal position of the evader. This problem has been extensively studied in target defense games. Here, we prove that a candidate for the value function generated by geometric method is the viscosity solution of the corresponding Hamilton-Jacobi-Isaacs partial differential equation (HJI PDE) Dirichlet problem. Therefore, the value function of the game at each point can be computed by a mathematical program. In our work, the convexity of the terminal cost or the target is not required. The terminal cost only needs to be locally Lipschitz continuous. The cases in which the terminal costs or the targets are not convex are covered. Therefore, our result is more universal than those of previous studies, and the complexity of the proof is improved. We also discuss the optimal strategies in this game and present an intuitive explanation of this value function.
comment: 21 pages, 6 figures
☆ Solving Infinite-Horizon Optimal Control Problems using the Extreme Theory of Functional Connections
This paper presents a physics-informed machine learning approach for synthesizing optimal feedback control policy for infinite-horizon optimal control problems by solving the Hamilton-Jacobi-Bellman (HJB) partial differential equation(PDE). The optimal control policy is derived analytically for affine dynamical systems with separable and strictly convex control costs, expressed as a function of the gradient of the value function. The resulting HJB-PDE is then solved by approximating the value function using the Extreme Theory of Functional Connections (X-TFC) - a hybrid approach that combines the Theory of Functional Connections (TFC) with the Extreme Learning Machine (ELM) algorithm. This approach ensures analytical satisfaction of boundary conditions and significantly reduces training cost compared to traditional Physics-Informed Neural Networks (PINNs). We benchmark the method on linear and non-linear systems with known analytical solutions as well as demonstrate its effectiveness on control tasks such as spacecraft optimal de-tumbling control.
comment: Accepted to Indian Control Conference (ICC-11), 6 pages, 12 figures
☆ Optimal BESS Sizing and Placement for Mitigating EV-Induced Voltage Violations: A Scalable Spatio-Temporal Adaptive Targeting Strategy
The escalating adoption of electric vehicles (EVs) and the growing demand for charging solutions are driving a surge in EV charger installations in distribution networks. However, this rising EV load strains the distribution grid, causing severe voltage drops, particularly at feeder extremities. This study proposes a proactive voltage management (PVM) framework that can integrate Monte Carlo-based simulations of varying EV charging loads to (i) identify potential voltage violations through a voltage violation analysis (VVA) model, and (ii) then mitigate those violations with optimally-invested battery energy storage systems (BESS) through an optimal expansion planning (OEP) model. A novel spatio-temporal adaptive targeting (STAT) strategy is proposed to alleviate the computational complexity of the OEP model by defining a targeted OEP (T-OEP) model, solved by applying the OEP model to (i) a reduced set of representative critical time periods and (ii) candidate BESS installation nodes. The efficacy and scalability of the proposed approach are validated on 33-bus, 69-bus, and a large-scale 240-bus system. Results demonstrate that the strategic sizing and placement of BESS not only effectively mitigate voltage violations but also yield substantial cost savings on electricity purchases under time-of-use tariffs. This research offers a cost-effective and scalable solution for integrating high penetrations of EVs, providing crucial insights for future distribution network planning.
☆ Analyzing the Impact of Demand Response on Short-Circuit Current via a Unit Commitment Model
In low-carbon grids, system flexibility can be enhanced through mechanisms such as Demand Response (DR), enabling the efficient utilization of renewable energy. However, as Synchronous Generators (SGs) are being replaced with renewable energy characterized by Inverter-Based Resources (IBR), system stability is severely affected. Due to the limited overload capability of IBR, their Short-Circuit Current (SCC) contribution is much smaller than that of SGs, which may result in protection devices failing to trip during faults. Consequently, the remaining SGs play a key role in offering sufficient SCC volumes. Given that the commitment of SGs is closely related to system load, DR can thus indirectly affect their SCC provision, a relationship that has not been investigated. Therefore, this paper incorporates both DR and SCC constraints into a unit commitment model and conducts studies on an IEEE 30-bus system. The results show that although DR can reduce social costs by lowering power demand, it may also lead to inadequate SCC levels. Nevertheless, the cost increases by only 0.3% when DR is combined with SCC constraints, indicating that DR can actually help achieve a stable system in a cost-effective manner.
comment: 1-5 pages. submitted to PESGM 2026, Canada
☆ Learning a Network Digital Twin as a Hybrid System
Network digital twin (NDT) models are virtual models that replicate the behavior of physical communication networks and are considered a key technology component to enable novel features and capabilities in future 6G networks. In this work, we focus on NDTs that model the communication quality properties of a multi-cell, dynamically changing wireless network over a workspace populated with multiple moving users. We propose an NDT modeled as a hybrid system, where each mode corresponds to a different base station and comprises sub-modes that correspond to areas of the workspace with similar network characteristics. The proposed hybrid NDT is identified and continuously improved through an annealing optimization-based learning algorithm, driven by online data measurements collected by the users. The advantages of the proposed hybrid NDT are studied with respect to memory and computational efficiency, data consumption, and the ability to timely adapt to network changes. Finally, we validate the proposed methodology on real experimental data collected from a two-cell 5G testbed.
☆ Which Top Energy-Intensive Manufacturing Countries Can Compete in a Renewable Energy Future?
In a world increasingly powered by renewables and aiming for greenhouse gas-neutral industrial production, the future competitiveness of todays top manufacturing countries is questioned. This study applies detailed energy system modeling to quantify the Renewable Pull, an incentive for industry relocation exerted by countries with favorable renewable conditions. Results reveal that the Renewable Pull is not a cross-industrial phenomenon but strongly depends on the relationship between energy costs and transport costs. The intensity of the Renewable Pull varies, with China, India, and Japan facing a significantly stronger effect than Germany and the United States. Incorporating national capital cost assumptions proves critical, reducing Germanys Renewable Pull by a factor of six and positioning it as the second least affected top manufacturing country after Saudi Arabia. Using Germany as a case study, the analysis moreover illustrates that targeted import strategies, especially within the EU, can nearly eliminate the Renewable Pull, offering policymakers clear options for risk mitigation.
comment: 29 pages, 16 figures
☆ Multivariable Gradient-Based Extremum Seeking Control with Saturation Constraints
This paper addresses the multivariable gradient-based extremum seeking control (ESC) subject to saturation. Two distinct saturation scenarios are investigated here: saturation acting on the input of the function to be optimized, which is addressed using an anti-windup compensation strategy, and saturation affecting the gradient estimate. In both cases, the unknown Hessian matrix is represented using a polytopic uncertainty description, and sufficient conditions in the form of linear matrix inequalities (LMIs) are derived to design a stabilizing control gain. The proposed conditions guarantee exponential stability of the origin for the average closed-loop system under saturation constraints. With the proposed design conditions, non-diagonal control gain matrices can be obtained, generalizing conventional ESC designs that typically rely on diagonal structures. Stability and convergence are rigorously proven using the Averaging Theory for dynamical systems with Lipschitz continuous right-hand sides. Numerical simulations illustrate the effectiveness of the proposed ESC algorithms, confirming the convergence even in the presence of saturation.
comment: 15 pages, 6 figures
☆ Supply Chain Exploitation of Secure ROS 2 Systems: A Proof-of-Concept on Autonomous Platform Compromise via Keystore Exfiltration
This paper presents a proof-of-concept supply chain attack against the Secure ROS 2 (SROS 2) framework, demonstrated on a Quanser QCar2 autonomous vehicle platform. A Trojan-infected Debian package modifies core ROS 2 security commands to exfiltrate newly generated keystore credentials via DNS in base64-encoded chunks to an attacker-controlled nameserver. Possession of these credentials enables the attacker to rejoin the SROS 2 network as an authenticated participant and publish spoofed control or perception messages without triggering authentication failures. We evaluate this capability on a secure ROS 2 Humble testbed configured for a four-stop-sign navigation routine using an Intel RealSense camera for perception. Experimental results show that control-topic injections can cause forced braking, sustained high-speed acceleration, and continuous turning loops, while perception-topic spoofing can induce phantom stop signs or suppress real detections. The attack generalizes to any data distribution service (DDS)-based robotic system using SROS 2, highlighting the need for both supply chain integrity controls and runtime semantic validation to safeguard autonomous systems against insider and impersonation threats.
comment: Author-accepted version (preprint). Presented at IEEE MILCOM 2025 Workshops, WS07: 2nd Workshop on Security, Resilience, and Robustness of Systems and Software (SRRSS), Los Angeles, Oct 2025. 6 pages. Primary: cs.CR; cross-lists: cs.RO, cs.OS. Program: https://milcom2025.ieee-milcom.org/workshop/ws07-2nd-workshop-security-resilient-and-robustness-systems-and-software/program
☆ DCcluster-Opt: Benchmarking Dynamic Multi-Objective Optimization for Geo-Distributed Data Center Workloads NeurIPS 2025
The increasing energy demands and carbon footprint of large-scale AI require intelligent workload management in globally distributed data centers. Yet progress is limited by the absence of benchmarks that realistically capture the interplay of time-varying environmental factors (grid carbon intensity, electricity prices, weather), detailed data center physics (CPUs, GPUs, memory, HVAC energy), and geo-distributed network dynamics (latency and transmission costs). To bridge this gap, we present DCcluster-Opt: an open-source, high-fidelity simulation benchmark for sustainable, geo-temporal task scheduling. DCcluster-Opt combines curated real-world datasets, including AI workload traces, grid carbon intensity, electricity markets, weather across 20 global regions, cloud transmission costs, and empirical network delay parameters with physics-informed models of data center operations, enabling rigorous and reproducible research in sustainable computing. It presents a challenging scheduling problem where a top-level coordinating agent must dynamically reassign or defer tasks that arrive with resource and service-level agreement requirements across a configurable cluster of data centers to optimize multiple objectives. The environment also models advanced components such as heat recovery. A modular reward system enables an explicit study of trade-offs among carbon emissions, energy costs, service level agreements, and water use. It provides a Gymnasium API with baseline controllers, including reinforcement learning and rule-based strategies, to support reproducible ML research and a fair comparison of diverse algorithms. By offering a realistic, configurable, and accessible testbed, DCcluster-Opt accelerates the development and validation of next-generation sustainable computing solutions for geo-distributed data centers.
comment: Submitted to the NeurIPS 2025 conference
☆ LC-Opt: Benchmarking Reinforcement Learning and Agentic AI for End-to-End Liquid Cooling Optimization in Data Centers NeurIPS 2025
Liquid cooling is critical for thermal management in high-density data centers with the rising AI workloads. However, machine learning-based controllers are essential to unlock greater energy efficiency and reliability, promoting sustainability. We present LC-Opt, a Sustainable Liquid Cooling (LC) benchmark environment, for reinforcement learning (RL) control strategies in energy-efficient liquid cooling of high-performance computing (HPC) systems. Built on the baseline of a high-fidelity digital twin of Oak Ridge National Lab's Frontier Supercomputer cooling system, LC-Opt provides detailed Modelica-based end-to-end models spanning site-level cooling towers to data center cabinets and server blade groups. RL agents optimize critical thermal controls like liquid supply temperature, flow rate, and granular valve actuation at the IT cabinet level, as well as cooling tower (CT) setpoints through a Gymnasium interface, with dynamic changes in workloads. This environment creates a multi-objective real-time optimization challenge balancing local thermal regulation and global energy efficiency, and also supports additional components like a heat recovery unit (HRU). We benchmark centralized and decentralized multi-agent RL approaches, demonstrate policy distillation into decision and regression trees for interpretable control, and explore LLM-based methods that explain control actions in natural language through an agentic mesh architecture designed to foster user trust and simplify system management. LC-Opt democratizes access to detailed, customizable liquid cooling models, enabling the ML community, operators, and vendors to develop sustainable data center liquid cooling control solutions.
comment: Submitted to the NeurIPS 2025 conference
♻ ☆ Competitive Equilibrium for Electricity Markets with Spatially Flexible Loads
Electric vehicle charging and geo-distributed datacenters introduce spatially flexible loads (FLs) that couple power, transportation, and datacenter networks. These couplings create a closed-loop feedback between locational marginal prices (LMPs) and decisions of the FL systems, challenging the foundations of conventional competitive equilibrium (CE) in electricity markets. This paper studies a notion of generalized competitive equilibrium (GCE) that aims to capture such price-demand interactions across the interconnected infrastructures. We establish structural conditions under which the GCE preserves key properties of the conventional CE, including existence, uniqueness, and efficiency, without requiring detailed knowledge of decision processes for individual FL systems. The framework generalizes to settings where the grid is coupled with multiple FL systems. Stylized examples and case studies on the New York ISO grid, coupled with the Sioux Falls transportation and distributed datacenter networks, demonstrate the use of our theoretical framework and illustrate the mutual influence among the grid and the studied FL systems.
♻ ☆ Non-Convex Over-the-Air Heterogeneous Federated Learning: A Bias-Variance Trade-off
Over-the-air (OTA) federated learning (FL) has been well recognized as a scalable paradigm that exploits the waveform superposition of the wireless multiple-access channel to aggregate model updates in a single use. Existing OTA-FL designs largely enforce zero-bias model updates by either assuming \emph{homogeneous} wireless conditions (equal path loss across devices) or forcing zero-bias updates to guarantee convergence. Under \emph{heterogeneous} wireless scenarios, however, such designs are constrained by the weakest device and inflate the update variance. Moreover, prior analyses of biased OTA-FL largely address convex objectives, while most modern AI models are highly non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient descent (SGD) for general smooth non-convex objectives under wireless heterogeneity. We develop novel OTA-FL SGD updates that allow a structured, time-invariant model bias while facilitating reduced variance updates. We derive a finite-time stationarity bound (expected time average squared gradient norm) that explicitly reveals a bias-variance trade-off. To optimize this trade-off, we pose a non-convex joint OTA power-control design and develop an efficient successive convex approximation (SCA) algorithm that requires only statistical CSI at the base station. Experiments on a non-convex image classification task validate the approach: the SCA-based design accelerates convergence via an optimized bias and improves generalization over prior OTA-FL baselines.
♻ ☆ Data-Driven Stochastic Optimal Control in Reproducing Kernel Hilbert Spaces
This paper proposes a fully data-driven approach for optimal control of nonlinear control-affine systems represented by a stochastic diffusion. The focus is on the scenario where both the nonlinear dynamics and stage cost functions are unknown, while only a control penalty function and constraints are provided. To this end, we embed state probability densities into a reproducing kernel Hilbert space (RKHS) to leverage recent advances in operator regression, thereby identifying Markov transition operators associated with controlled diffusion processes. This operator learning approach integrates naturally with convex operator-theoretic Hamilton-Jacobi-Bellman recursions that scale linearly with state dimensionality, effectively solving a wide range of nonlinear optimal control problems. Numerical results demonstrate its ability to address diverse nonlinear control tasks, including the depth regulation of an autonomous underwater vehicle.
comment: author-submitted electronic preprint version: 19 pages, 5 figures, 3 tables
♻ ☆ Observability for Nonlinear Systems: Connecting Variational Dynamics, Lyapunov Exponents, and Empirical Gramians
Observability quantification is a key problem in dynamic network sciences. While it has been thoroughly studied for linear systems, observability quantification for nonlinear networks is less intuitive and more cumbersome. One common approach to quantify observability for nonlinear systems is via the Empirical Gramian (Empr-Gram) -- a generalized form of the Gramian of linear systems. In this paper, we produce three new results. First, we establish that a variational form of discrete-time autonomous nonlinear systems yields a so-called Variational Gramian (Var-Gram) that is equivalent to the classic Empr-Gram; the former being easier to compute than the latter. Via Lyapunov exponents derived from Lyapunov's direct method, the paper's second result derives connections between existing observability measures and Var-Gram. The third result demonstrates the applicability of these new notions for sensor selection/placement in nonlinear systems. Numerical case studies demonstrate these three developments and their merits.
♻ ☆ Partitioning and Observability in Linear Systems via Submodular Optimization
Network partitioning has gained recent attention as a pathway to enable decentralized operation and control in large-scale systems. This paper addresses the interplay between partitioning, observability, and sensor placement (SP) in dynamic networks. The problem, being computationally intractable at scale, is a largely unexplored, open problem in the literature. To that end, the paper's objective is designing scalable partitioning of linear systems while maximizing observability metrics of the subsystems. We show that the partitioning problem can be posed as a submodular maximization problem -- and the SP problem can subsequently be solved over the partitioned network. Consequently, theoretical bounds are derived to compare observability metrics of the original network with those of the resulting partitions, highlighting the impact of partitioning on system observability. Case studies on networks of varying sizes corroborate the derived theoretical bounds.
♻ ☆ Analysis and Synthesis of Switched Optimization Algorithms
Deployment of optimization algorithms on networked systems face challenges associated with time delays and corruptions. One particular instance is the presence of time-varying delays arising from factors such as packet drops and irregular sampling. Fixed time delays can destabilize gradient descent algorithms, and this degradation is exacerbated by time-varying delays. This work concentrates on the analysis and creation of discrete-time optimization algorithms with certified exponential convergence rates that are robust against switched uncertainties between the optimizer and the gradient oracle. These optimization algorithms are implemented by a switch-scheduled output feedback controllers. Rate variation and sawtooth behavior (packet drops) in time-varying delays can be imposed through constraining switching sequences. Analysis is accomplished by bisection in the convergence rate to find Zames-Falb filter coefficents. Synthesis is performed by alternating between a filter coefficient search for a fixed controller, and a controller search for fixed multipliers.
♻ ☆ Downlink Performance of Cell-Free Massive MIMO for LEO Satellite Mega-Constellation
Low-earth orbit (LEO) satellite communication (SatCom) has emerged as a promising technology to improve wireless connectivity in global areas. Cell-free massive multiple-input multiple-output (CF-mMIMO), an architecture proposed for next-generation networks, has yet to be fully explored for LEO satellites. In this paper, we investigate the downlink performance of a CF-mMIMO LEO SatCom network, where multiple satellite access points (SAPs) simultaneously serve the corresponding ground user terminals (UTs). Using tools from stochastic geometry, we model the locations of SAPs and UTs on surfaces of concentric spheres using Poisson point processes (PPPs) and present expressions on transmit and received signals, signal-to-interference-plus-noise ratio (SINR). Then, we derive the coverage probabilities in fading scenarios, considering significant system parameters such as the Nakagami fading parameter, the number of UTs, the number of SAPs, the orbital altitude, and the service range affected by the dome angle. Finally, the analytical model is verified by extensive Monte Carlo simulations. Simulation results indicate that stronger line-of-sight (LoS) effects and a more comprehensive service range of the UT result in a higher coverage probability, despite the presence of multi-user interference (MUI). Moreover, we found that there exist optimal numbers of UTs that maximize system capacity for different orbital altitudes and dome angles, providing valuable insights for system design.
♻ ☆ Prescribed-Time Convergent Distributed Multiobjective Optimization With Dynamic Event-Triggered Communication
This paper addresses distributed constrained multiobjective resource allocation problems (DCMRAPs) in multi-agent networks, where agents face multiple conflicting local objectives under local and global constraints. By reformulating DCMRAPs as single-objective weighted $L_p$ problems, the proposed approach enables distributed solutions without relying on predefined weighting coefficients or centralized decision-making. Leveraging prescribed-time control and dynamic event-triggered mechanisms (ETMs), a novel distributed algorithm is proposed within a prescribed time through sampled communication. Using generalized time-based generators (TBGs), the algorithm provides more flexibility in optimizing solution accuracy and trajectory smoothness without the constraints of initial conditions. Novel dynamic ETMs, integrated with generalized TBGs, improve communication efficiency by adapting to local error metrics and network-based disagreements, while providing enhanced flexibility in balancing solution accuracy and communication frequency. The Zeno behavior is excluded. Validated by Lyapunov analysis and simulation experiments, our method demonstrates superior control performance and efficiency compared to existing methods, advancing distributed optimization across diverse applications.
comment: This work has been accepted and published in IEEE Transactions on Systems, Man, and Cybernetics: Systems
♻ ☆ Kernel Mean Embedding Topology: Weak and Strong Forms for Stochastic Kernels and Implications for Model Learning
We introduce a novel topology, called Kernel Mean Embedding Topology, for stochastic kernels, in a weak and strong form. This topology, defined on the spaces of Bochner integrable functions from a signal space to a space of probability measures endowed with a Hilbert space structure, allows for a versatile formulation. This construction allows one to obtain both a strong and weak formulation. (i) For its weak formulation, we highlight the utility on relaxed policy spaces, and investigate connections with the Young narrow topology and Borkar (or \( w^* \))-topology, and establish equivalence properties. We report that, while both the \( w^* \)-topology and kernel mean embedding topology are relatively compact, they are not closed. Conversely, while the Young narrow topology is closed, it lacks relative compactness. (ii) We show that the strong form provides an appropriate formulation for placing topologies on spaces of models characterized by stochastic kernels with explicit robustness and learning theoretic implications on optimal stochastic control under discounted or average cost criteria. (iii) We thus show that this topology possesses several properties making it ideal to study optimality and approximations (under the weak formulation) and robustness (under the strong formulation) for many applications.
comment: 37 pages
♻ ☆ Revealing Chaotic Dependence and Degree-Structure Mechanisms in Optimal Pinning Control of Complex Networks
Identifying an optimal set of driver nodes to achieve synchronization via pinning control is a fundamental challenge in complex network science, limited by computational intractability and the lack of general theory. Here, leveraging a degree-based mean-field (annealed) approximation from statistical physics, we analytically reveal how the structural degree distribution systematically governs synchronization performance, and derive an analytic characterization of the globally optimal pinning set and constructive algorithms with linear complexity (dominated by degree sorting, O(N+M). The optimal configuration exhibits a chaotic dependence--a discontinuous sensitivity--on its cardinality, whereby adding a single node can trigger abrupt changes in node composition and control effectiveness. This structural transition fundamentally challenges traditional heuristics that assume monotonic performance gains with budget. Systematic experiments on synthetic and empirical networks confirm that the proposed approach consistently outperforms degree-, betweenness-, and other centrality-based baselines. Furthermore, we quantify how key degree-distribution features--low-degree saturation, high-degree cutoff, and the power-law exponent--govern achievable synchronizability and shape the form of optimal sets. These results offer a systematic understanding of how degree heterogeneity shapes the network controllability. Our work establishes a unified link between degree heterogeneity and spectral controllability, offering both mechanistic insights and practical design rules for optimal driver-node selection in diverse complex systems.
comment: 16 pages, 6 figures; primary: eess.SY; cross-lists: cs.SY, math.OC. Submitted to IEEE TAC
♻ ☆ Transformers as Implicit State Estimators: In-Context Learning in Dynamical Systems
Predicting the behavior of a dynamical system from noisy observations of its past outputs is a classical problem encountered across engineering and science. For linear systems with Gaussian inputs, the Kalman filter -- the best linear minimum mean-square error estimator of the state trajectory -- is optimal in the Bayesian sense. For nonlinear systems, Bayesian filtering is typically approached using suboptimal heuristics such as the Extended Kalman Filter (EKF), or numerical methods such as particle filtering (PF). In this work, we show that transformers, employed in an in-context learning (ICL) setting, can implicitly infer hidden states in order to predict the outputs of a wide family of dynamical systems, without test-time gradient updates or explicit knowledge of the system model. Specifically, when provided with a short context of past input-output pairs and, optionally, system parameters, a frozen transformer accurately predicts the current output. In linear-Gaussian regimes, its predictions closely match those of the Kalman filter; in nonlinear regimes, its performance approaches that of EKF and PF. Moreover, prediction accuracy degrades gracefully when key parameters, such as the state-transition matrix, are withheld from the context, demonstrating robustness and implicit parameter inference. These findings suggest that transformer in-context learning provides a flexible, non-parametric alternative for output prediction in dynamical systems, grounded in implicit latent-state estimation.
♻ ☆ Action Chunking and Exploratory Data Collection Yield Exponential Improvements in Behavior Cloning for Continuous Control
This paper presents a theoretical analysis of two of the most impactful interventions in modern learning from demonstration in robotics and continuous control: the practice of action-chunking (predicting sequences of actions in open-loop) and exploratory augmentation of expert demonstrations. Though recent results show that learning from demonstration, also known as imitation learning (IL), can suffer errors that compound exponentially with task horizon in continuous settings, we demonstrate that action chunking and exploratory data collection circumvent exponential compounding errors in different regimes. Our results identify control-theoretic stability as the key mechanism underlying the benefits of these interventions. On the empirical side, we validate our predictions and the role of control-theoretic stability through experimentation on popular robot learning benchmarks. On the theoretical side, we demonstrate that the control-theoretic lens provides fine-grained insights into how compounding error arises, leading to tighter statistical guarantees on imitation learning error when these interventions are applied than previous techniques based on information-theoretic considerations alone.
comment: Updated manuscript. Added new experiments, figures, and exposition
♻ ☆ Privacy Preservation by Local Design in Cooperative Networked Control Systems
In this paper, we study the privacy preservation problem in a cooperative networked control system, which has closed-loop dynamics, working for the task of linear quadratic Guassian (LQG) control. The system consists of a user and a server: the user owns the plant to control, while the server provides computation capability, and the user employs the server to compute control inputs for it. To enable the server's computation, the user needs to provide the measurements of the plant states to the server, who then calculates estimates of the states, based on which the control inputs are computed. However, the user regards the states as privacy, and makes an interesting request: the user wants the server to have "incorrect" knowledge of the state estimates rather than the true values. Regarding that, we propose a novel design methodology for the privacy preservation, in which the privacy scheme is locally equipped at the user side not open to the server, which manages to create a deviation in the server's knowledge of the state estimates from the true values. However, this methodology also raises significant challenges: in a closed-loop dynamic system, when the server's seized knowledge is incorrect, the system's behavior becomes complex to analyze; even the stability of the system becomes questionable, as the incorrectness will accumulate through the closed loop as time evolves. In this paper, we succeed in showing that the performance loss in LQG control caused by the proposed privacy scheme is bounded by rigorous mathematical proofs, which convinces the availability of the proposed design methodology. We also propose an associated novel privacy metric and obtain the analytical result on evaluating the privacy performance. Finally, we study the performance trade-off between privacy and control, where the accordingly proposed optimization problems are solved by numerical methods efficiently.
comment: 14 pages, 7 figures
♻ ☆ Optimal and Heuristic Approaches for Platooning Systems with Deadlines
Efficient truck platooning is a key strategy for reducing freight costs, lowering fuel consumption, and mitigating emissions. Deadlines are critical in this context, as trucks must depart within specific time windows to meet delivery requirements and avoid penalties. In this paper, we investigate the optimal formation and dispatch of truck platoons at a highway station with finite capacity \(L\) and deadline constraints \(T\). The system operates in discrete time, with each arriving truck assigned a deadline of \(T\) slot units. The objective is to leverage the efficiency gains from forming large platoons while accounting for waiting costs and deadline violations. We formulate the problem as a Markov decision process and analyze the structure of the optimal policy \(\pi^\star\) for \(L = 3\), extending insights to arbitrary \(L\). We prove certain monotonicity properties of the optimal policy in the state space \(\mathcal{S}\) and identify classes of unreachable states. Moreover, since the size of \(\mathcal{S}\) grows exponentially with \(L\) and \(T\), we propose heuristics--including conditional and deep-learning based approaches--that exploit these structural insights while maintaining low computational complexity.
Graphics 3
☆ Deep Neural Watermarking for Robust Copyright Protection in 3D Point Clouds
The protection of intellectual property has become critical due to the rapid growth of three-dimensional content in digital media. Unlike traditional images or videos, 3D point clouds present unique challenges for copyright enforcement, as they are especially vulnerable to a range of geometric and non-geometric attacks that can easily degrade or remove conventional watermark signals. In this paper, we address these challenges by proposing a robust deep neural watermarking framework for 3D point cloud copyright protection and ownership verification. Our approach embeds binary watermarks into the singular values of 3D point cloud blocks using spectral decomposition, i.e. Singular Value Decomposition (SVD), and leverages the extraction capabilities of Deep Learning using PointNet++ neural network architecture. The network is trained to reliably extract watermarks even after the data undergoes various attacks such as rotation, scaling, noise, cropping and signal distortions. We validated our method using the publicly available ModelNet40 dataset, demonstrating that deep learning-based extraction significantly outperforms traditional SVD-based techniques under challenging conditions. Our experimental evaluation demonstrates that the deep learning-based extraction approach significantly outperforms existing SVD-based methods with deep learning achieving bitwise accuracy up to 0.83 and Intersection over Union (IoU) of 0.80, compared to SVD achieving a bitwise accuracy of 0.58 and IoU of 0.26 for the Crop (70%) attack, which is the most severe geometric distortion in our experiment. This demonstrates our method's ability to achieve superior watermark recovery and maintain high fidelity even under severe distortions.
☆ Object-Aware 4D Human Motion Generation
Recent advances in video diffusion models have enabled the generation of high-quality videos. However, these videos still suffer from unrealistic deformations, semantic violations, and physical inconsistencies that are largely rooted in the absence of 3D physical priors. To address these challenges, we propose an object-aware 4D human motion generation framework grounded in 3D Gaussian representations and motion diffusion priors. With pre-generated 3D humans and objects, our method, Motion Score Distilled Interaction (MSDI), employs the spatial and prompt semantic information in large language models (LLMs) and motion priors through the proposed Motion Diffusion Score Distillation Sampling (MSDS). The combination of MSDS and LLMs enables our spatial-aware motion optimization, which distills score gradients from pre-trained motion diffusion models, to refine human motion while respecting object and semantic constraints. Unlike prior methods requiring joint training on limited interaction datasets, our zero-shot approach avoids retraining and generalizes to out-of-distribution object aware human motions. Experiments demonstrate that our framework produces natural and physically plausible human motions that respect 3D spatial context, offering a scalable solution for realistic 4D generation.
♻ ☆ C3Editor: Achieving Controllable Consistency in 2D Model for 3D Editing ICCV 2025
Existing 2D-lifting-based 3D editing methods often encounter challenges related to inconsistency, stemming from the lack of view-consistent 2D editing models and the difficulty of ensuring consistent editing across multiple views. To address these issues, we propose C3Editor, a controllable and consistent 2D-lifting-based 3D editing framework. Given an original 3D representation and a text-based editing prompt, our method selectively establishes a view-consistent 2D editing model to achieve superior 3D editing results. The process begins with the controlled selection of a ground truth (GT) view and its corresponding edited image as the optimization target, allowing for user-defined manual edits. Next, we fine-tune the 2D editing model within the GT view and across multiple views to align with the GT-edited image while ensuring multi-view consistency. To meet the distinct requirements of GT view fitting and multi-view consistency, we introduce separate LoRA modules for targeted fine-tuning. Our approach delivers more consistent and controllable 2D and 3D editing results than existing 2D-lifting-based methods, outperforming them in both qualitative and quantitative evaluations.
comment: ICCV 2025 Workshop Wild3D
Computation and Language 89
☆ Continuous Autoregressive Language Models
The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9\% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models. Code: https://github.com/shaochenze/calm. Project: https://shaochenze.github.io/blog/2025/CALM.
☆ Culture Cartography: Mapping the Landscape of Cultural Knowledge EMNLP 2025
To serve global users safely and productively, LLMs need culture-specific knowledge that might not be learned during pre-training. How do we find such knowledge that is (1) salient to in-group users, but (2) unknown to LLMs? The most common solutions are single-initiative: either researchers define challenging questions that users passively answer (traditional annotation), or users actively produce data that researchers structure as benchmarks (knowledge extraction). The process would benefit from mixed-initiative collaboration, where users guide the process to meaningfully reflect their cultures, and LLMs steer the process towards more challenging questions that meet the researcher's goals. We propose a mixed-initiative methodology called CultureCartography. Here, an LLM initializes annotation with questions for which it has low-confidence answers, making explicit both its prior knowledge and the gaps therein. This allows a human respondent to fill these gaps and steer the model towards salient topics through direct edits. We implement this methodology as a tool called CultureExplorer. Compared to a baseline where humans answer LLM-proposed questions, we find that CultureExplorer more effectively produces knowledge that leading models like DeepSeek R1 and GPT-4o are missing, even with web search. Fine-tuning on this data boosts the accuracy of Llama-3.1-8B by up to 19.2% on related culture benchmarks.
comment: EMNLP 2025
☆ SpecAttn: Speculating Sparse Attention NeurIPS 2025
Large Language Models (LLMs) face significant computational bottlenecks during inference due to the quadratic complexity of self-attention mechanisms, particularly as context lengths increase. We introduce SpecAttn, a novel training-free approach that seamlessly integrates with existing speculative decoding techniques to enable efficient sparse attention in pre-trained transformers. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model, eliminating redundant computation while maintaining output quality. SpecAttn employs three core techniques: KL divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free algorithm for top-p token selection from draft attention patterns, and dynamic key-value cache pruning guided by these predictions. By leveraging the computational work already performed in standard speculative decoding pipelines, SpecAttn achieves over 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset, significantly outperforming existing sparse attention methods. Our approach demonstrates that speculative execution can be enhanced to provide approximate verification without significant performance degradation.
comment: Accepted to NeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling
☆ Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning
Multimodal large language models (MLLMs) have advanced embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs. However, such vision driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy. We introduce BEAT, the first framework to inject such visual backdoors into MLLM-based embodied agents using objects in the environments as triggers. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation. Across various embodied agent benchmarks and MLLMs, BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance, and generalizes reliably to out-of-distribution trigger placements. Notably, compared to naive SFT, CTL boosts backdoor activation accuracy up to 39% under limited backdoor data. These findings expose a critical yet unexplored security risk in MLLM-based embodied agents, underscoring the need for robust defenses before real-world deployment.
☆ Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.
☆ MARAG-R1: Beyond Single Retriever via Reinforcement-Learned Multi-Tool Agentic Retrieval
Large Language Models (LLMs) excel at reasoning and generation but are inherently limited by static pretraining data, resulting in factual inaccuracies and weak adaptability to new information. Retrieval-Augmented Generation (RAG) addresses this issue by grounding LLMs in external knowledge; However, the effectiveness of RAG critically depends on whether the model can adequately access relevant information. Existing RAG systems rely on a single retriever with fixed top-k selection, restricting access to a narrow and static subset of the corpus. As a result, this single-retriever paradigm has become the primary bottleneck for comprehensive external information acquisition, especially in tasks requiring corpus-level reasoning. To overcome this limitation, we propose MARAG-R1, a reinforcement-learned multi-tool RAG framework that enables LLMs to dynamically coordinate multiple retrieval mechanisms for broader and more precise information access. MARAG-R1 equips the model with four retrieval tools -- semantic search, keyword search, filtering, and aggregation -- and learns both how and when to use them through a two-stage training process: supervised fine-tuning followed by reinforcement learning. This design allows the model to interleave reasoning and retrieval, progressively gathering sufficient evidence for corpus-level synthesis. Experiments on GlobalQA, HotpotQA, and 2WikiMultiHopQA demonstrate that MARAG-R1 substantially outperforms strong baselines and achieves new state-of-the-art results in corpus-level reasoning tasks.
☆ SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning
Solving mathematical reasoning problems requires not only accurate access to relevant knowledge but also careful, multi-step thinking. However, current retrieval-augmented models often rely on a single perspective, follow inflexible search strategies, and struggle to effectively combine information from multiple sources. We introduce SIGMA (Search-Augmented On-Demand Knowledge Integration for AGentic Mathematical reAsoning), a unified framework that orchestrates specialized agents to independently reason, perform targeted searches, and synthesize findings through a moderator mechanism. Each agent generates hypothetical passages to optimize retrieval for its analytic perspective, ensuring knowledge integration is both context-sensitive and computation-efficient. When evaluated on challenging benchmarks such as MATH500, AIME, and PhD-level science QA GPQA, SIGMA consistently outperforms both open- and closed-source systems, achieving an absolute performance improvement of 7.4%. Our results demonstrate that multi-agent, on-demand knowledge integration significantly enhances both reasoning accuracy and efficiency, offering a scalable approach for complex, knowledge-intensive problem-solving. We will release the code upon publication.
comment: Short Paper - Under Review
☆ Data-Efficient Domain Adaptation for LLM-based MT using Contrastive Preference Optimization
LLMs often require adaptation to domain-specific requirements, a process that can be expensive when relying solely on SFT. We present an empirical study on applying CPO to simulate a post-editing workflow for data-efficient domain adaptation. Our approach synthesizes preference pairs by treating the base model's own raw output as the 'rejected' translation and the human-approved TM entry as the 'chosen' one. This method provides direct feedback on the model's current knowledge, guiding it to align with domain-specific standards. Experiments in English-Brazilian Portuguese and English-Korean show that, by using just 14.7k preference pairs, the model achieves performance close to that of a model trained on 160k+ samples with SFT, demonstrating significant data efficiency. Although we showcase its effectiveness in MT, this application of CPO naturally generalizes to other generative tasks where a model's initial drafts can serve as a contrastive signal against a golden reference.
☆ Multilingual BERT language model for medical tasks: Evaluation on domain-specific adaptation and cross-linguality
In multilingual healthcare applications, the availability of domain-specific natural language processing(NLP) tools is limited, especially for low-resource languages. Although multilingual bidirectional encoder representations from transformers (BERT) offers a promising motivation to mitigate the language gap, the medical NLP tasks in low-resource languages are still underexplored. Therefore, this study investigates how further pre-training on domain-specific corpora affects model performance on medical tasks, focusing on three languages: Dutch, Romanian and Spanish. In terms of further pre-training, we conducted four experiments to create medical domain models. Then, these models were fine-tuned on three downstream tasks: Automated patient screening in Dutch clinical notes, named entity recognition in Romanian and Spanish clinical notes. Results show that domain adaptation significantly enhanced task performance. Furthermore, further differentiation of domains, e.g. clinical and general biomedical domains, resulted in diverse performances. The clinical domain-adapted model outperformed the more general biomedical domain-adapted model. Moreover, we observed evidence of cross-lingual transferability. Moreover, we also conducted further investigations to explore potential reasons contributing to these performance differences. These findings highlight the feasibility of domain adaptation and cross-lingual ability in medical NLP. Within the low-resource language settings, these findings can provide meaningful guidance for developing multilingual medical NLP systems to mitigate the lack of training data and thereby improve the model performance.
☆ DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. While recently developed Arabic and multilingual benchmarks have advanced LLM evaluation for Modern Standard Arabic (MSA), dialectal varieties remain underrepresented despite their prevalence in everyday communication. DialectalArabicMMLU extends the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding a total of 15K QA pairs across 32 academic and professional domains (22K QA pairs when also including English and MSA). The benchmark enables systematic assessment of LLM reasoning and comprehension beyond MSA, supporting both task-based and linguistic analysis. We evaluate 19 open-weight Arabic and multilingual LLMs (1B-13B parameters) and report substantial performance variation across dialects, revealing persistent gaps in dialectal generalization. DialectalArabicMMLU provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, thus promoting more inclusive evaluation and future model development.
comment: 9 pages, 9 tables
☆ Patient-Centered Summarization Framework for AI Clinical Summarization: A Mixed-Methods Design
Large Language Models (LLMs) are increasingly demonstrating the potential to reach human-level performance in generating clinical summaries from patient-clinician conversations. However, these summaries often focus on patients' biology rather than their preferences, values, wishes, and concerns. To achieve patient-centered care, we propose a new standard for Artificial Intelligence (AI) clinical summarization tasks: Patient-Centered Summaries (PCS). Our objective was to develop a framework to generate PCS that capture patient values and ensure clinical utility and to assess whether current open-source LLMs can achieve human-level performance in this task. We used a mixed-methods process. Two Patient and Public Involvement groups (10 patients and 8 clinicians) in the United Kingdom participated in semi-structured interviews exploring what personal and contextual information should be included in clinical summaries and how it should be structured for clinical use. Findings informed annotation guidelines used by eight clinicians to create gold-standard PCS from 88 atrial fibrillation consultations. Sixteen consultations were used to refine a prompt aligned with the guidelines. Five open-source LLMs (Llama-3.2-3B, Llama-3.1-8B, Mistral-8B, Gemma-3-4B, and Qwen3-8B) generated summaries for 72 consultations using zero-shot and few-shot prompting, evaluated with ROUGE-L, BERTScore, and qualitative metrics. Patients emphasized lifestyle routines, social support, recent stressors, and care values. Clinicians sought concise functional, psychosocial, and emotional context. The best zero-shot performance was achieved by Mistral-8B (ROUGE-L 0.189) and Llama-3.1-8B (BERTScore 0.673); the best few-shot by Llama-3.1-8B (ROUGE-L 0.206, BERTScore 0.683). Completeness and fluency were similar between experts and models, while correctness and patient-centeredness favored human PCS.
comment: The first two listed authors contributed equally Pages: 21; Figures:2; Tables:3
☆ SQLSpace: A Representation Space for Text-to-SQL to Discover and Mitigate Robustness Gaps EMNLP
We introduce SQLSpace, a human-interpretable, generalizable, compact representation for text-to-SQL examples derived with minimal human intervention. We demonstrate the utility of these representations in evaluation with three use cases: (i) closely comparing and contrasting the composition of popular text-to-SQL benchmarks to identify unique dimensions of examples they evaluate, (ii) understanding model performance at a granular level beyond overall accuracy scores, and (iii) improving model performance through targeted query rewriting based on learned correctness estimation. We show that SQLSpace enables analysis that would be difficult with raw examples alone: it reveals compositional differences between benchmarks, exposes performance patterns obscured by accuracy alone, and supports modeling of query success.
comment: Accepted to EMNLP Findings
☆ BiSparse-AAS: Bilinear Sparse Attention and Adaptive Spans Framework for Scalable and Efficient Text Summarization ICDM
Transformer-based architectures have advanced text summarization, yet their quadratic complexity limits scalability on long documents. This paper introduces BiSparse-AAS (Bilinear Sparse Attention with Adaptive Spans), a novel framework that combines sparse attention, adaptive spans, and bilinear attention to address these limitations. Sparse attention reduces computational costs by focusing on the most relevant parts of the input, while adaptive spans dynamically adjust the attention ranges. Bilinear attention complements both by modeling complex token interactions within this refined context. BiSparse-AAS consistently outperforms state-of-the-art baselines in both extractive and abstractive summarization tasks, achieving average ROUGE improvements of about 68.1% on CNN/DailyMail and 52.6% on XSum, while maintaining strong performance on OpenWebText and Gigaword datasets. By addressing efficiency, scalability, and long-sequence modeling, BiSparse-AAS provides a unified, practical solution for real-world text summarization applications.
comment: Accepted at the IEEE International Conference on Data Mining (ICDM) 2025, Washington, DC, USA
☆ Effect of Domain Generalization Techniques in Low Resource Systems
Machine learning models typically assume that training and test data follow the same distribution, an assumption that often fails in real-world scenarios due to distribution shifts. This issue is especially pronounced in low-resource settings, where data scarcity and limited domain diversity hinder robust generalization. Domain generalization (DG) approaches address this challenge by learning features that remain invariant across domains, often using causal mechanisms to improve model robustness. In this study, we examine two distinct causal DG techniques in low-resource natural language tasks. First, we investigate a causal data augmentation (CDA) approach that automatically generates counterfactual examples to improve robustness to spurious correlations. We apply this method to sentiment classification on the NaijaSenti Twitter corpus, expanding the training data with semantically equivalent paraphrases to simulate controlled distribution shifts. Second, we explore an invariant causal representation learning (ICRL) approach using the DINER framework, originally proposed for debiasing aspect-based sentiment analysis. We adapt DINER to a multilingual setting. Our findings demonstrate that both approaches enhance robustness to unseen domains: counterfactual data augmentation yields consistent cross-domain accuracy gains in sentiment classification, while causal representation learning with DINER improves out-of-distribution performance in multilingual sentiment analysis, albeit with varying gains across languages.
☆ Thought Branches: Interpreting LLM Reasoning Requires Resampling
Most work interpreting reasoning models studies only a single chain-of-thought (CoT), yet these models define distributions over many possible CoTs. We argue that studying a single sample is inadequate for understanding causal influence and the underlying computation. Though fully specifying this distribution is intractable, it can be understood by sampling. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In "agentic misalignment" scenarios, we resample specific sentences to measure their downstream effects. Self-preservation sentences have small causal impact, suggesting they do not meaningfully drive blackmail. Second, are artificial edits to CoT sufficient for steering reasoning? These are common in literature, yet take the model off-policy. Resampling and selecting a completion with the desired property is a principled on-policy alternative. We find off-policy interventions yield small and unstable effects compared to resampling in decision-making tasks. Third, how do we understand the effect of removing a reasoning step when the model may repeat it post-edit? We introduce a resilience metric that repeatedly resamples to prevent similar content from reappearing downstream. Critical planning statements resist removal but have large effects when eliminated. Fourth, since CoT is sometimes "unfaithful", can our methods teach us anything in these settings? Adapting causal mediation analysis, we find that hints that have a causal effect on the output without being explicitly mentioned exert a subtle and cumulative influence on the CoT that persists even if the hint is removed. Overall, studying distributions via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled CoT interventions.
comment: Uzay Macar and Paul C. Bogdan contributed equally to this work, and their listed order was determined by coinflip
☆ The aftermath of compounds: Investigating Compounds and their Semantic Representations AACL
This study investigates how well computational embeddings align with human semantic judgments in the processing of English compound words. We compare static word vectors (GloVe) and contextualized embeddings (BERT) against human ratings of lexeme meaning dominance (LMD) and semantic transparency (ST) drawn from a psycholinguistic dataset. Using measures of association strength (Edinburgh Associative Thesaurus), frequency (BNC), and predictability (LaDEC), we compute embedding-derived LMD and ST metrics and assess their relationships with human judgments via Spearmans correlation and regression analyses. Our results show that BERT embeddings better capture compositional semantics than GloVe, and that predictability ratings are strong predictors of semantic transparency in both human and model data. These findings advance computational psycholinguistics by clarifying the factors that drive compound word processing and offering insights into embedding-based semantic modeling.
comment: IJCNLP-AACL SRW 2025
☆ Diffuse Thinking: Exploring Diffusion Language Models as Efficient Thought Proposers for Reasoning
In recent years, large language models (LLMs) have witnessed remarkable advancements, with the test-time scaling law consistently enhancing the reasoning capabilities. Through systematic evaluation and exploration of a diverse spectrum of intermediate thoughts, LLMs demonstrate the potential to generate deliberate reasoning steps, thereby substantially enhancing reasoning accuracy. However, LLMs' autoregressive generation paradigm results in reasoning performance scaling sub-optimally with test-time computation, often requiring excessive computational overhead to propose thoughts while yielding only marginal performance gains. In contrast, diffusion language models (DLMs) can efficiently produce diverse samples through parallel denoising in a single forward pass, inspiring us to leverage them for proposing intermediate thoughts, thereby alleviating the computational burden associated with autoregressive generation while maintaining quality. In this work, we propose an efficient collaborative reasoning framework, leveraging DLMs to generate candidate thoughts and LLMs to evaluate their quality. Experiments across diverse benchmarks demonstrate that our framework achieves strong performance in complex reasoning tasks, offering a promising direction for future research. Our code is open-source at https://anonymous.4open.science/r/Diffuse-Thinking-EC60.
☆ VCORE: Variance-Controlled Optimization-based Reweighting for Chain-of-Thought Supervision
Supervised fine-tuning (SFT) on long chain-of-thought (CoT) trajectories has emerged as a crucial technique for enhancing the reasoning abilities of large language models (LLMs). However, the standard cross-entropy loss treats all tokens equally, ignoring their heterogeneous contributions across a reasoning trajectory. This uniform treatment leads to misallocated supervision and weak generalization, especially in complex, long-form reasoning tasks. To address this, we introduce \textbf{V}ariance-\textbf{C}ontrolled \textbf{O}ptimization-based \textbf{RE}weighting (VCORE), a principled framework that reformulates CoT supervision as a constrained optimization problem. By adopting an optimization-theoretic perspective, VCORE enables a principled and adaptive allocation of supervision across tokens, thereby aligning the training objective more closely with the goal of robust reasoning generalization. Empirical evaluations demonstrate that VCORE consistently outperforms existing token reweighting methods. Across both in-domain and out-of-domain settings, VCORE achieves substantial performance gains on mathematical and coding benchmarks, using models from the Qwen3 series (4B, 8B, 32B) and LLaMA-3.1-8B-Instruct. Moreover, we show that VCORE serves as a more effective initialization for subsequent reinforcement learning, establishing a stronger foundation for advancing the reasoning capabilities of LLMs. The Code will be released at https://github.com/coder-gx/VCORE.
comment: Under Review
☆ DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains
Large Reasoning Models (LRMs) have demonstrated impressive capabilities but suffer from cognitive inefficiencies like ``overthinking'' simple problems and ``underthinking'' complex ones. While existing methods that use supervised fine-tuning~(SFT) or reinforcement learning~(RL) with token-length rewards can improve efficiency, they often do so at the cost of accuracy. This paper introduces \textbf{DeepCompress}, a novel framework that simultaneously enhances both the accuracy and efficiency of LRMs. We challenge the prevailing approach of consistently favoring shorter reasoning paths, showing that longer responses can contain a broader range of correct solutions for difficult problems. DeepCompress employs an adaptive length reward mechanism that dynamically classifies problems as ``Simple'' or ``Hard'' in real-time based on the model's evolving capability. It encourages shorter, more efficient reasoning for ``Simple'' problems while promoting longer, more exploratory thought chains for ``Hard'' problems. This dual-reward strategy enables the model to autonomously adjust its Chain-of-Thought (CoT) length, compressing reasoning for well-mastered problems and extending it for those it finds challenging. Experimental results on challenging mathematical benchmarks show that DeepCompress consistently outperforms baseline methods, achieving superior accuracy while significantly improving token efficiency.
comment: Work in progress
☆ Dynamic Affective Memory Management for Personalized LLM Agents
Advances in large language models are making personalized AI agents a new research focus. While current agent systems primarily rely on personalized external memory databases to deliver customized experiences, they face challenges such as memory redundancy, memory staleness, and poor memory-context integration, largely due to the lack of effective memory updates during interaction. To tackle these issues, we propose a new memory management system designed for affective scenarios. Our approach employs a Bayesian-inspired memory update algorithm with the concept of memory entropy, enabling the agent to autonomously maintain a dynamically updated memory vector database by minimizing global entropy to provide more personalized services. To better evaluate the system's effectiveness in this context, we propose DABench, a benchmark focusing on emotional expression and emotional change toward objects. Experimental results demonstrate that, our system achieves superior performance in personalization, logical coherence, and accuracy. Ablation studies further validate the effectiveness of the Bayesian-inspired update mechanism in alleviating memory bloat. Our work offers new insights into the design of long-term memory systems.
comment: 12 pasges, 8 figures
☆ Atlas-Alignment: Making Interpretability Transferable Across Language Models
Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires costly training of model-specific sparse autoencoders, manual or semi-automated labeling of SAE components, and their subsequent validation. We introduce Atlas-Alignment, a framework for transferring interpretability across language models by aligning unknown latent spaces to a Concept Atlas - a labeled, human-interpretable latent space - using only shared inputs and lightweight representational alignment techniques. Once aligned, this enables two key capabilities in previously opaque models: (1) semantic feature search and retrieval, and (2) steering generation along human-interpretable atlas concepts. Through quantitative and qualitative evaluations, we show that simple representational alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept data. Atlas-Alignment thus amortizes the cost of explainable AI and mechanistic interpretability: by investing in one high-quality Concept Atlas, we can make many new models transparent and controllable at minimal marginal cost.
☆ Awal -- Community-Powered Language Technology for Tamazight
This paper presents Awal, a community-powered initiative for developing language technology resources for Tamazight. We provide a comprehensive review of the NLP landscape for Tamazight, examining recent progress in computational resources, and the emergence of community-driven approaches to address persistent data scarcity. Launched in 2024, awaldigital.org platform addresses the underrepresentation of Tamazight in digital spaces through a collaborative platform enabling speakers to contribute translation and voice data. We analyze 18 months of community engagement, revealing significant barriers to participation including limited confidence in written Tamazight and ongoing standardization challenges. Despite widespread positive reception, actual data contribution remained concentrated among linguists and activists. The modest scale of community contributions -- 6,421 translation pairs and 3 hours of speech data -- highlights the limitations of applying standard crowdsourcing approaches to languages with complex sociolinguistic contexts. We are working on improved open-source MT models using the collected data.
comment: Accepted to the International Conference on Information and Communication Technologies for Amazigh (TICAM 25)
☆ Balancing Knowledge Updates: Toward Unified Modular Editing in LLMs
Knowledge editing has emerged as an efficient approach for updating factual knowledge in large language models (LLMs). It typically locates knowledge storage modules and then modifies their parameters. However, most existing methods focus on the weights of multilayer perceptron (MLP) modules, which are often identified as the main repositories of factual information. Other components, such as attention (Attn) modules, are often ignored during editing. This imbalance can leave residual outdated knowledge and limit editing effectiveness. We perform comprehensive knowledge localization experiments on advanced LLMs and find that Attn modules play a substantial role in factual knowledge storage and retrieval, especially in earlier layers. Based on these insights, we propose IntAttn-Edit, a method that extends the associative memory paradigm to jointly update both MLP and Attn modules. Our approach uses a knowledge balancing strategy that allocates update magnitudes in proportion to each module's measured contribution to knowledge storage. Experiments on standard benchmarks show that IntAttn-Edit achieves higher edit success, better generalization, and stronger knowledge preservation than prior methods. Further analysis shows that the balancing strategy keeps editing performance within an optimal range across diverse settings.
☆ Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity
Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model's external `working memory', a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.
comment: Project page at https://ajmeek.github.io/cot_monitorability_website/
☆ From the Rock Floor to the Cloud: A Systematic Survey of State-of-the-Art NLP in Battery Life Cycle
We present a comprehensive systematic survey of the application of natural language processing (NLP) along the entire battery life cycle, instead of one stage or method, and introduce a novel technical language processing (TLP) framework for the EU's proposed digital battery passport (DBP) and other general battery predictions. We follow the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) method and employ three reputable databases or search engines, including Google Scholar, Institute of Electrical and Electronics Engineers Xplore (IEEE Xplore), and Scopus. Consequently, we assessed 274 scientific papers before the critical review of the final 66 relevant papers. We publicly provide artifacts of the review for validation and reproducibility. The findings show that new NLP tasks are emerging in the battery domain, which facilitate materials discovery and other stages of the life cycle. Notwithstanding, challenges remain, such as the lack of standard benchmarks. Our proposed TLP framework, which incorporates agentic AI and optimized prompts, will be apt for tackling some of the challenges.
comment: 15 pages, 3 images
☆ ThoughtProbe: Classifier-Guided LLM Thought Space Exploration via Probing Representations EMNLP2025
This paper introduces ThoughtProbe, a novel inference time framework that leverages the hidden reasoning features of Large Language Models (LLMs) to improve their reasoning performance. Unlike previous works that manipulate the hidden representations to steer LLM generation, we harness them as discriminative signals to guide the tree structured response space exploration. In each node expansion, a classifier serves as a scoring and ranking mechanism that efficiently allocates computational resources by prioritizing higher score candidates for continuation. After completing the tree expansion, we collect answers from all branches to form a candidate answer pool. We then propose a branch aggregation method that marginalizes over all supporting branches by aggregating their CoT scores, thereby identifying the optimal answer from the pool. Experimental results show that our framework's comprehensive exploration not only covers valid reasoning chains but also effectively identifies them, achieving significant improvements across multiple arithmetic reasoning benchmarks.
comment: EMNLP2025 main conference
☆ TransAlign: Machine Translation Encoders are Strong Word Aligners, Too
In the absence of sizable training data for most world languages and NLP tasks, translation-based strategies such as translate-test -- evaluating on noisy source language data translated from the target language -- and translate-train -- training on noisy target language data translated from the source language -- have been established as competitive approaches for cross-lingual transfer (XLT). For token classification tasks, these strategies require label projection: mapping the labels from each token in the original sentence to its counterpart(s) in the translation. To this end, it is common to leverage multilingual word aligners (WAs) derived from encoder language models such as mBERT or LaBSE. Despite obvious associations between machine translation (MT) and WA, research on extracting alignments with MT models is largely limited to exploiting cross-attention in encoder-decoder architectures, yielding poor WA results. In this work, in contrast, we propose TransAlign, a novel word aligner that utilizes the encoder of a massively multilingual MT model. We show that TransAlign not only achieves strong WA performance but substantially outperforms popular WA and state-of-the-art non-WA-based label projection methods in MT-based XLT for token classification.
☆ A Unified Representation Underlying the Judgment of Large Language Models
A central architectural question for both biological and artificial intelligence is whether judgment relies on specialized modules or a unified, domain-general resource. While the discovery of decodable neural representations for distinct concepts in Large Language Models (LLMs) has suggested a modular architecture, whether these representations are truly independent systems remains an open question. Here we provide evidence for a convergent architecture. Across a range of LLMs, we find that diverse evaluative judgments are computed along a dominant dimension, which we term the Valence-Assent Axis (VAA). This axis jointly encodes subjective valence ("what is good") and the model's assent to factual claims ("what is true"). Through direct interventions, we show this unified representation creates a critical dependency: the VAA functions as a control signal that steers the generative process to construct a rationale consistent with its evaluative state, even at the cost of factual accuracy. This mechanism, which we term the subordination of reasoning, shifts the process of reasoning from impartial inference toward goal-directed justification. Our discovery offers a mechanistic account for systemic bias and hallucination, revealing how an architecture that promotes coherent judgment can systematically undermine faithful reasoning.
☆ Un-Attributability: Computing Novelty From Retrieval & Semantic Similarity
Understanding how language-model outputs relate to the pretraining corpus is central to studying model behavior. Most training data attribution (TDA) methods ask which training examples causally influence a given output, often using leave-one-out tests. We invert the question: which outputs cannot be attributed to any pretraining example? We introduce un-attributability as an operational measure of semantic novelty: an output is novel if the pretraining corpus contains no semantically similar context. We approximate this with a simple two-stage retrieval pipeline: index the corpus with lightweight GIST embeddings, retrieve the top-n candidates, then rerank with ColBERTv2. If the nearest corpus item is less attributable than a human-generated text reference, we consider the output of the model as novel. We evaluate on SmolLM and SmolLM2 and report three findings: (1) models draw on pretraining data across much longer spans than previously reported; (2) some domains systematically promote or suppress novelty; and (3) instruction tuning not only alters style but also increases novelty. Reframing novelty assessment around un-attributability enables efficient analysis at pretraining scale. We release ~20 TB of corpus chunks and index artifacts to support replication and large-scale extension of our analysis at https://huggingface.co/datasets/stai-tuebingen/faiss-smollm
☆ Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?
Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still suffer from a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have reduced this gap, its underlying causes remain largely unexplored. In this paper, we address this by showing that the multilingual reasoning gap largely stems from failures in language understanding-the model's inability to represent the multilingual input meaning into the dominant language (i.e., English) within its reasoning trace. This motivates us to examine whether understanding failures can be detected, as this ability could help mitigate the multilingual reasoning gap. To this end, we evaluate a range of detection methods and find that understanding failures can indeed be identified, with supervised approaches performing best. Building on this, we propose Selective Translation, a simple yet effective strategy that translates the multilingual input into English only when an understanding failure is detected. Experimental results show that Selective Translation bridges the multilingual reasoning gap, achieving near full-translation performance while using translation for only about 20% of inputs. Together, our work demonstrates that understanding failures are the primary cause of the multilingual reasoning gap and can be detected and selectively mitigated, providing key insight into its origin and a promising path toward more equitable multilingual reasoning. Our code and data are publicly available at https://github.com/deokhk/RLM_analysis.
☆ MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models
As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs' medical calculation abilities, comprising 700+ tasks across two types: equation-based (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this environment achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding. Code and datasets are available at https://github.com/maokangkun/MedCalc-Eval.
☆ Higher-order Linear Attention
The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project Page: https://github.com/yifanzhang-pro/HLA.
comment: Project Page: https://github.com/yifanzhang-pro/HLA
☆ Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
Instruction-tuned Large Language Models (LLMs) underperform on low resource, non-Latin scripts due to tokenizer fragmentation and weak cross-lingual coupling. We present LLINK (Latent Language Injection for Non-English Knowledge), a compute efficient language-as-modality method that conditions an instruction-tuned decoder without changing the tokenizer or retraining the decoder. First, we align sentence embeddings from a frozen multilingual encoder to the decoder's latent embedding space at a reserved position via a lightweight contrastive projector. Second, the vector is expanded into K soft slots and trained with minimal adapters so the frozen decoder consumes the signal. LLINK substantially improves bilingual retrieval and achieves 81.3% preference over the base model and 63.6% over direct fine-tuning in LLM-judged Q&A evaluations. We further find that improvements can be attributed to reduced tokenization inflation and a stronger cross lingual alignment, despite the model having residual weaknesses in numeric fidelity. Treating low resource languages as a modality offers a practical path to stronger cross-lingual alignment in lightweight LLMs.
comment: 14 pages, 3 Figures
☆ Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT-a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M token context windows (with and without retrieval-augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5%-12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.
☆ Identifying the Periodicity of Information in Natural Language
Recent theoretical advancement of information density in natural language has brought the following question on desk: To what degree does natural language exhibit periodicity pattern in its encoded information? We address this question by introducing a new method called AutoPeriod of Surprisal (APS). APS adopts a canonical periodicity detection algorithm and is able to identify any significant periods that exist in the surprisal sequence of a single document. By applying the algorithm to a set of corpora, we have obtained the following interesting results: Firstly, a considerable proportion of human language demonstrates a strong pattern of periodicity in information; Secondly, new periods that are outside the distributions of typical structural units in text (e.g., sentence boundaries, elementary discourse units, etc.) are found and further confirmed via harmonic regression modeling. We conclude that the periodicity of information in language is a joint outcome from both structured factors and other driving factors that take effect at longer distances. The advantages of our periodicity detection method and its potentials in LLM-generation detection are further discussed.
☆ DRAMA: Unifying Data Retrieval and Analysis for Open-Domain Analytic Queries SIGMOD 2026
Manually conducting real-world data analyses is labor-intensive and inefficient. Despite numerous attempts to automate data science workflows, none of the existing paradigms or systems fully demonstrate all three key capabilities required to support them effectively: (1) open-domain data collection, (2) structured data transformation, and (3) analytic reasoning. To overcome these limitations, we propose DRAMA, an end-to-end paradigm that answers users' analytic queries in natural language on large-scale open-domain data. DRAMA unifies data collection, transformation, and analysis as a single pipeline. To quantitatively evaluate system performance on tasks representative of DRAMA, we construct a benchmark, DRAMA-Bench, consisting of two categories of tasks: claim verification and question answering, each comprising 100 instances. These tasks are derived from real-world applications that have gained significant public attention and require the retrieval and analysis of open-domain data. We develop DRAMA-Bot, a multi-agent system designed following DRAMA. It comprises a data retriever that collects and transforms data by coordinating the execution of sub-agents, and a data analyzer that performs structured reasoning over the retrieved data. We evaluate DRAMA-Bot on DRAMA-Bench together with five state-of-the-art baseline agents. DRAMA-Bot achieves 86.5% task accuracy at a cost of $0.05, outperforming all baselines with up to 6.9 times the accuracy and less than 1/6 of the cost. DRAMA is publicly available at https://github.com/uiuc-kang-lab/drama.
comment: Accepted to SIGMOD 2026
☆ MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models EMNLP 2025
The proliferation of memes on social media necessitates the capabilities of multimodal Large Language Models (mLLMs) to effectively understand multimodal harmfulness. Existing evaluation approaches predominantly focus on mLLMs' detection accuracy for binary classification tasks, which often fail to reflect the in-depth interpretive nuance of harmfulness across diverse contexts. In this paper, we propose MemeArena, an agent-based arena-style evaluation framework that provides a context-aware and unbiased assessment for mLLMs' understanding of multimodal harmfulness. Specifically, MemeArena simulates diverse interpretive contexts to formulate evaluation tasks that elicit perspective-specific analyses from mLLMs. By integrating varied viewpoints and reaching consensus among evaluators, it enables fair and unbiased comparisons of mLLMs' abilities to interpret multimodal harmfulness. Extensive experiments demonstrate that our framework effectively reduces the evaluation biases of judge agents, with judgment results closely aligning with human preferences, offering valuable insights into reliable and comprehensive mLLM evaluations in multimodal harmfulness understanding. Our code and data are publicly available at https://github.com/Lbotirx/MemeArena.
comment: EMNLP 2025
☆ Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions
As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Consequently, their capabilities in this crucial domain are mostly unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video, text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.
☆ Simple Additions, Substantial Gains: Expanding Scripts, Languages, and Lineage Coverage in URIEL+
The URIEL+ linguistic knowledge base supports multilingual research by encoding languages through geographic, genetic, and typological vectors. However, data sparsity remains prevalent, in the form of missing feature types, incomplete language entries, and limited genealogical coverage. This limits the usefulness of URIEL+ in cross-lingual transfer, particularly for supporting low-resource languages. To address this sparsity, this paper extends URIEL+ with three contributions: introducing script vectors to represent writing system properties for 7,488 languages, integrating Glottolog to add 18,710 additional languages, and expanding lineage imputation for 26,449 languages by propagating typological and script features across genealogies. These additions reduce feature sparsity by 14% for script vectors, increase language coverage by up to 19,015 languages (1,007%), and improve imputation quality metrics by up to 33%. Our benchmark on cross-lingual transfer tasks (oriented around low-resource languages) shows occasionally divergent performance compared to URIEL+, with performance gains up to 6% in certain setups. Our advances make URIEL+ more complete and inclusive for multilingual research.
☆ Glia: A Human-Inspired AI for Automated Systems Design and Optimization
Can an AI autonomously design mechanisms for computer systems on par with the creativity and reasoning of human experts? We present Glia, an AI architecture for networked systems design that uses large language models (LLMs) in a human-inspired, multi-agent workflow. Each agent specializes in reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback. Unlike prior ML-for-systems methods that optimize black-box policies, Glia generates interpretable designs and exposes its reasoning process. When applied to a distributed GPU cluster for LLM inference, it produces new algorithms for request routing, scheduling, and auto-scaling that perform at human-expert levels in significantly less time, while yielding novel insights into workload behavior. Our results suggest that by combining reasoning LLMs with structured experimentation, an AI can produce creative and understandable designs for complex systems problems.
☆ Probability Distributions Computed by Hard-Attention Transformers
Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.
comment: 18 pages
☆ Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks EMNLP 2025
As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become quite difficult. Lately, using large language models (LLMs) for evaluating these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and see if judicious use of LLM judges can still be useful following proper guidelines.
comment: Accepted at EMNLP 2025
☆ Characterizing Selective Refusal Bias in Large Language Models
Safety guardrails in large language models(LLMs) are developed to prevent malicious users from generating toxic content at a large scale. However, these measures can inadvertently introduce or reflect new biases, as LLMs may refuse to generate harmful content targeting some demographic groups and not others. We explore this selective refusal bias in LLM guardrails through the lens of refusal rates of targeted individual and intersectional demographic groups, types of LLM responses, and length of generated refusals. Our results show evidence of selective refusal bias across gender, sexual orientation, nationality, and religion attributes. This leads us to investigate additional safety implications via an indirect attack, where we target previously refused groups. Our findings emphasize the need for more equitable and robust performance in safety guardrails across demographic groups.
comment: 21 pages, 12 figures, 14 tables
☆ Contrastive Knowledge Transfer and Robust Optimization for Secure Alignment of Large Language Models
This paper addresses the limitations of large-scale language models in safety alignment and robustness by proposing a fine-tuning method that combines contrastive distillation with noise-robust training. The method freezes the backbone model and transfers the knowledge boundaries of the teacher model to the student model through distillation, thereby improving semantic consistency and alignment accuracy. At the same time, noise perturbations and robust optimization constraints are introduced during training to ensure that the model maintains stable predictive outputs under noisy and uncertain inputs. The overall framework consists of distillation loss, robustness loss, and a regularization term, forming a unified optimization objective that balances alignment ability with resistance to interference. To systematically validate its effectiveness, the study designs experiments from multiple perspectives, including distillation weight sensitivity, stability analysis under computation budgets and mixed-precision environments, and the impact of data noise and distribution shifts on model performance. Results show that the method significantly outperforms existing baselines in knowledge transfer, robustness, and overall safety, achieving the best performance across several key metrics. This work not only enriches the theoretical system of parameter-efficient fine-tuning but also provides a new solution for building safer and more trustworthy alignment mechanisms.
☆ Towards a Measure of Algorithm Similarity
Given two algorithms for the same problem, can we determine whether they are meaningfully different? In full generality, the question is uncomputable, and empirically it is muddied by competing notions of similarity. Yet, in many applications (such as clone detection or program synthesis) a pragmatic and consistent similarity metric is necessary. We review existing equivalence and similarity notions and introduce EMOC: An Evaluation-Memory-Operations-Complexity framework that embeds algorithm implementations into a feature space suitable for downstream tasks. We compile PACD, a curated dataset of verified Python implementations across three problems, and show that EMOC features support clustering and classification of algorithm types, detection of near-duplicates, and quantification of diversity in LLM-generated programs. Code, data, and utilities for computing EMOC embeddings are released to facilitate reproducibility and future work on algorithm similarity.
comment: 11 pages, many figures and images
♻ ☆ The End of Manual Decoding: Towards Truly End-to-End Language Models
The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious, hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set"-a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
♻ ☆ SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks EMNLP 2025
Direct alignment algorithms have proven an effective step for aligning language models to human-desired behaviors. Current variants of the Direct Preference Optimization objective have focused on a strict setting where all tokens are contributing signals of KL divergence and rewards to the loss function. However, human preference is not affected equally by each word in a sequence but is often dependent on specific words or phrases, e.g. existence of toxic terms leads to non-preferred responses. Based on this observation, we argue that not all tokens should be weighted equally during PO and propose a flexible objective termed SparsePO, that aims to automatically learn to weight the KL divergence and reward corresponding to each token during PO training. We propose two different variants of weight-masks that can either be derived from the reference model itself or learned on the fly. Notably, our method induces sparsity in the learned masks, allowing the model to learn how to best balance reward and KL divergence contributions at the token level, learning an optimal level of mask sparsity. Extensive experiments illustrate the effectiveness of our approach at aligning to preference proxies, including sentiment control, helpfulness and harmlessness, and summary quality. Our method obtains +10% and +3% win rate points in summarization and dialogue scenarios, respectively, without compromising model reasoning or the relevancy and faithfulness of the summary response.
comment: 27 pages, 9 figures, 5 tables. Accepted to EMNLP 2025
♻ ☆ HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Early-Exit Large Language Models (EE-LLMs) enable high throughput inference by allowing tokens to exit early at intermediate layers. However, their throughput is limited by the computational and memory savings. Existing EE-LLM frameworks rely on a single model and therefore, their token generation latencies are bottlenecked by tokens that do not exit early and traverse additional layers. Moreover, early exits are only known at runtime and depend on the request. Therefore, these frameworks load the weights of all model layers even though large portions remain unused when tokens exit early. The lack of memory savings limit us from scaling the batch sizes. We propose $\textit{HELIOS}$, a framework that improves both token generation latency and batch sizes to enable high-throughput in EE-LLMs. HELIOS exploits two insights. $\textit{First}$, early exits are often complimentary across models, tokens that do not exit early on one model often take an early-exit on another. HELIOS employs multiple models and dynamically switches between them to collectively maximize the number of tokens that exit early, and minimize token generation latencies. $\textit{Second}$, even when a predicted token does not exit early due to poor confidence, it often remains unchanged even after additional layer traversal. HELIOS greedily allows such tokens to exit early and only loads the weights of the most likely to be used layers, yielding memory savings which is then re-purposed to increase batch sizes. HELIOS employs real-time profiling to accurately identify the early-exit distributions, and adaptively switches between models by tracking tokens in real-time to minimize the performance degradation caused by greedy model loading and exiting. Our evaluations show that HELIOS achieves $1.48\times$ higher throughput and $15.14\times$ larger batch size compared to existing EE-LLM frameworks.
♻ ☆ Minitron-SSM: Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning
Hybrid LLM architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing Hybrid architectures. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. Furthermore, we demonstrate the necessity of such SSM pruning to achieve improved accuracy and inference speed compared to traditional approaches. Our compression recipe combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation-based retraining, similar to the MINITRON technique. Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B parameters with up to 40x fewer training tokens. The resulting model surpasses the accuracy of similarly-sized models while achieving 2x faster inference, significantly advancing the Pareto frontier.
♻ ☆ A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models
Automatically annotating job data with standardized occupations from taxonomies, known as occupation classification, is crucial for labor market analysis. However, this task is often hindered by data scarcity and the challenges of manual annotations. While large language models (LLMs) hold promise due to their extensive world knowledge and in-context learning capabilities, their effectiveness depends on their knowledge of occupational taxonomies, which remains unclear. In this study, we assess the ability of LLMs to generate precise taxonomic entities from taxonomy, highlighting their limitations, especially for smaller models. To address these challenges, we propose a multi-stage framework consisting of inference, retrieval, and reranking stages, which integrates taxonomy-guided reasoning examples to enhance performance by aligning outputs with taxonomic knowledge. Evaluations on a large-scale dataset show that our framework not only enhances occupation and skill classification tasks, but also provides a cost-effective alternative to frontier models like GPT-4o, significantly reducing computational costs while maintaining strong performance. This makes it a practical and scalable solution for occupation classification and related tasks across LLMs.
comment: Accepted to ICWSM'26
♻ ☆ NaviAgent: Bilevel Planning on Tool Navigation Graph for Large-Scale Orchestration
Large language models (LLMs) have recently demonstrated the ability to act as function call agents by invoking external tools, enabling them to solve tasks beyond their static knowledge. However, existing agents typically call tools step by step at a time without a global view of task structure. As tools depend on each other, this leads to error accumulation and limited scalability, particularly when scaling to thousands of tools. To address these limitations, we propose NaviAgent, a novel bilevel architecture that decouples task planning from tool execution through graph-based modeling of the tool ecosystem. At the task-planning level, the LLM-based agent decides whether to respond directly, clarify user intent, invoke a toolchain, or execute tool outputs, ensuring broad coverage of interaction scenarios independent of inter-tool complexity. At the execution level, a continuously evolving Tool World Navigation Model (TWNM) encodes structural and behavioral relations among tools, guiding the agent to generate scalable and robust invocation sequences. By incorporating feedback from real tool interactions, NaviAgent supports closed-loop optimization of planning and execution, moving beyond tool calling toward adaptive navigation of large-scale tool ecosystems. Experiments show that NaviAgent achieves the best task success rates across models and tasks, and integrating TWMN further boosts performance by up to 17 points on complex tasks, underscoring its key role in toolchain orchestration.
♻ ☆ Token Distillation: Attention-aware Input Embeddings For New Tokens
Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary. New tokens can be added to solve this problem, when coupled with a good initialization for their new embeddings. However, existing embedding initialization methods require expensive further training or pretraining of additional modules. In this paper, we propose Token Distillation and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens. Experimental results with a wide range of open-weight models show that Token Distillation outperforms even strong baselines.
comment: Additional experiments + clearer method name compared to the May 2025 version
♻ ☆ Beyond Pixels: Exploring DOM Downsampling for LLM-Based Web Agents
Frontier LLMs only recently enabled serviceable, autonomous web agents. At that, a model poses as an instantaneous domain model backend. Ought to suggest interaction, it is consulted with a web-based task and respective application state. The key problem lies in application state serialisation - referred to as snapshot. State-of-the-art web agents are premised on grounded GUI snapshots, i.e., screenshots enhanced with visual cues. Not least to resemble human perception, but for images representing relatively cheap means of model input. LLM vision still lag behind code interpretation capabilities. DOM snapshots, which structurally resemble HTML, impose a desired alternative. Vast model input token size, however, disables reliable implementation with web agents to date. We propose D2Snap, a first-of-its-kind DOM downsampling algorithm. Based on a GPT-4o backend, we evaluate D2Snap on tasks sampled from the Online-Mind2Web dataset. The success rate of D2Snap-downsampled DOM snapshots (67%) matches a grounded GUI snapshot baseline (65%) - within the same input token order of magnitude (1e3). Our best evaluated configurations - one token order above, but within the model's context window - outperform this baseline by 8%. Our evaluation, moreover, yields that DOM-inherent hierarchy embodies a strong UI feature for LLMs.
comment: 20 pages, LaTeX; repository URL updated, typos corrected
♻ ☆ Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning
Recent studies have shown that reinforcement learning with verifiable rewards (RLVR) enhances overall accuracy (pass@1) but often fails to improve capability (pass@k) of LLMs in reasoning tasks, while distillation can improve both. In this paper, we investigate the mechanisms behind these phenomena. First, we demonstrate that RLVR struggles to improve capability as it focuses on improving the accuracy of the easier questions to the detriment of the accuracy of the most difficult questions. Second, we show that RLVR does not merely increase the success probability for the easier questions, but in our small model settings, produces quality responses that were absent in its original output distribution. In addition, we show these responses are neither noticeably longer nor feature more reflection-related keywords, underscoring the need for more reliable indicators of response quality. Third, from the experiment distilling teacher responses to in-distribution problems, we find that capability does not always improve with distillation. We conjecture that capability improves only when new knowledge is introduced, whereas distilling reasoning patterns only improves accuracy but not capability, sacrificing performance on the most difficult questions, similar to RLVR. Together, these findings offer a clearer understanding of how RLVR and distillation shape reasoning behavior in LLMs
comment: 25 pages
♻ ☆ More of the Same: Persistent Representational Harms Under Increased Representation NeurIPS
To recognize and mitigate the harms of generative AI systems, it is crucial to consider whether and how different societal groups are represented by these systems. A critical gap emerges when naively measuring or improving who is represented, as this does not consider how people are represented. In this work, we develop GAS(P), an evaluation methodology for surfacing distribution-level group representational biases in generated text, tackling the setting where groups are unprompted (i.e., groups are not specified in the input to generative systems). We apply this novel methodology to investigate gendered representations in occupations across state-of-the-art large language models. We show that, even though the gender distribution when models are prompted to generate biographies leads to a large representation of women, even representational biases persist in how different genders are represented. Our evaluation methodology reveals that there are statistically significant distribution-level differences in the word choice used to describe biographies and personas of different genders across occupations, and we show that many of these differences are associated with representational harms and stereotypes. Our empirical findings caution that naively increasing (unprompted) representation may inadvertently proliferate representational biases, and our proposed evaluation methodology enables systematic and rigorous measurement of the problem.
comment: Proceedings of the Neural Information Processing Systems (NeurIPS) 2025; 39 pages, 7 figures, 15 tables
♻ ☆ Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains
The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify different biases present in their training data, posing serious concerns for fairness, especially in low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 bidirectional language pairs across multiple domains using different metrics. We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs. The code and dataset are accessible on GitHub: https://github.com/faiyazabdullah/TranslationTangles
♻ ☆ SMOL: Professionally translated parallel data for 115 under-represented languages
We open-source SMOL (Set of Maximal Overall Leverage), a suite of training data to unlock machine translation for low-resource languages. SMOL has been translated into 124 (and growing) under-resourced languages (125 language pairs), including many for which there exist no previous public resources, for a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOLSENT, a set of sentences chosen for broad unique token coverage, and SMOLDOC, a document-level resource focusing on a broad topic coverage. They join the already released GATITOS for a trifecta of paragraph, sentence, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust chrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOLDOC, yielding the first factuality datasets for most of these languages.
comment: ~10 pages with appendices
♻ ☆ Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges EMNLP 2025
As large language models (LLMs) grow more capable, they face increasingly diverse and complex tasks, making reliable evaluation challenging. The paradigm of LLMs as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Their reliability in complex tasks--where multi-faceted rubrics, unstructured reference answers, and nuanced criteria are critical--remains understudied. In this paper, we constructed ComplexEval, a challenge benchmark designed to systematically expose and quantify Auxiliary Information Induced Biases. We systematically investigated and validated 6 previously unexplored biases across 12 basic and 3 advanced scenarios. Key findings reveal: (1) all evaluated models exhibit significant susceptibility to these biases, with bias magnitude scaling with task complexity; (2) notably, Large Reasoning Models (LRMs) show paradoxical vulnerability. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals, paving the way for more general and robust evaluation models.
comment: EMNLP 2025 Findings
♻ ☆ Mano Technical Report
Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decisionmaking capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
♻ ☆ Multilingual Political Views of Large Language Models: Identification and Steering
Large language models (LLMs) are increasingly used in everyday tools and applications, raising concerns about their potential influence on political views. While prior research has shown that LLMs often exhibit measurable political biases--frequently skewing toward liberal or progressive positions--key gaps remain. Most existing studies evaluate only a narrow set of models and languages, leaving open questions about the generalizability of political biases across architectures, scales, and multilingual settings. Moreover, few works examine whether these biases can be actively controlled. In this work, we address these gaps through a large-scale study of political orientation in modern open-source instruction-tuned LLMs. We evaluate seven models, including LLaMA-3.1, Qwen-3, and Aya-Expanse, across 14 languages using the Political Compass Test with 11 semantically equivalent paraphrases per statement to ensure robust measurement. Our results reveal that larger models consistently shift toward libertarian-left positions, with significant variations across languages and model families. To test the manipulability of political stances, we utilize a simple center-of-mass activation intervention technique and show that it reliably steers model responses toward alternative ideological positions across multiple languages. Our code is publicly available at https://github.com/d-gurgurov/Political-Ideologies-LLMs.
comment: pre-print
♻ ☆ Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates
Clinical trials are a systematic endeavor to assess the safety and efficacy of new drugs or treatments. Conducting such trials typically demands significant financial investment and meticulous planning, highlighting the need for accurate predictions of trial outcomes. Accurately predicting patient enrollment, a key factor in trial success, is one of the primary challenges during the planning phase. In this work, we propose a novel deep learning-based method to address this critical challenge. Our method, implemented as a neural network model, leverages pre-trained language models (PLMs) to capture the complexities and nuances of clinical documents, transforming them into expressive representations. These representations are then combined with encoded tabular features via an attention mechanism. To account for uncertainties in enrollment prediction, we enhance the model with a probabilistic layer based on the Gamma distribution, which enables range estimation. We apply the proposed model to predict clinical trial duration, assuming site-level enrollment follows a Poisson-Gamma process. We carry out extensive experiments on real-world clinical trial data, and show that the proposed method can effectively predict the number of patients enrolled at a number of sites for a given clinical trial, outperforming established baseline models.
♻ ☆ LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models
Mixture of experts (MoE) architectures have become a cornerstone for scaling up and are a key component in most large language models such as GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE remains severely constrained by the prohibitive computational costs of training and evaluation, restricting large-scale studies accessible to most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research that supports both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analytical tools for probing routing and expert dynamics. Leveraging this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, demonstrating how subtle changes in router initialization shape early expert utilization; and (iii) training regime differences, revealing how sparse upcycling and full pretraining exhibit distinct routing patterns and stability profiles. By lowering the barrier to entry and standardizing evaluation, along with our comprehensive analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovations. Project page: https://fsoft-aic.github.io/fsoft-LibMoE.github.io.
comment: 15 pages, 9 figures
♻ ☆ Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models NeurIPS 2025
We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework for diffusion language models. DCoLT treats each intermediate step in the reverse diffusion process as a latent "thinking" action and optimizes the entire reasoning trajectory to maximize the reward on the correctness of the final answer with outcome-based Reinforcement Learning (RL). Unlike traditional Chain-of-Thought (CoT) methods that follow a causal, linear thinking process, DCoLT allows bidirectional, non-linear reasoning with no strict rule on grammatical correctness amid its intermediate steps of thought. We implement DCoLT on two representative Diffusion Language Models (DLMs). First, we choose SEDD as a representative continuous-time discrete diffusion model, where its concrete score derives a probabilistic policy to maximize the RL reward over the entire sequence of intermediate diffusion steps. We further consider the discrete-time masked diffusion language model -- LLaDA, and find that the order to predict and unmask tokens plays an essential role to optimize its RL action resulting from the ranking-based Unmasking Policy Module (UPM) defined by the Plackett-Luce model. Experiments on both math and code generation tasks show that using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL or even both. Notably, DCoLT-reinforced LLaDA boosts its reasoning accuracy by +9.8%, +5.7%, +11.4%, +19.5% on GSM8K, MATH, MBPP, and HumanEval.
comment: Accepted to NeurIPS 2025. Code link: https://github.com/maple-research-lab/LLaDOU
♻ ☆ FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing NeurIPS 2025
The rapid advancement of Large Language Models (LLMs) has spurred significant progress in Large Speech-Language Models (LSLMs), enhancing their capabilities in both speech understanding and generation. While existing LSLMs often concentrate on augmenting speech generation or tackling a diverse array of short-speech tasks, the efficient processing of long-form speech remains a critical yet underexplored challenge. This gap is primarily attributed to the scarcity of long-speech training datasets and the high computational costs associated with long sequences. To address these limitations, we introduce FastLongSpeech, a novel framework designed to extend LSLM capabilities for efficient long-speech processing without necessitating dedicated long-speech training data. FastLongSpeech incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. To adapt LSLMs for long-speech inputs, it introduces a dynamic compression training approach, which exposes the model to short-speech sequences at varying compression ratios, thereby transferring the capabilities of LSLMs to long-speech tasks. To assess the long-speech capabilities of LSLMs, we develop a long-speech understanding benchmark called LongSpeech-Eval. Experiments show that our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.
comment: NeurIPS 2025. The code is at https://github.com/ictnlp/FastLongSpeech. This model is at https://huggingface.co/ICTNLP/FastLongSpeech. The dataset is at https://huggingface.co/datasets/ICTNLP/LongSpeech-Eval
♻ ☆ Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
Reinforcement learning with verifiable rewards (RLVR) has recently enhanced the reasoning capabilities of large language models (LLMs), particularly for mathematical problem solving. However, a fundamental limitation remains: as the sampling budget increases, the advantage of RLVR-trained models over their pretrained bases often diminishes or even vanishes, revealing a strong dependence on the base model's restricted search space. We attribute this phenomenon to the widespread use of the reverse Kullback-Leibler (KL) divergence regularizer, whose mode-seeking behavior keeps the policy trapped inside the base model's support region and hampers wider exploration. To address this issue, we propose RAPO (Rewards-Aware Policy Optimization), an algorithm to promote broader yet focused exploration. Our method (i) utilizes the forward KL penalty to replace the reverse KL penalty for out-of-distribution exploration, and (ii) reweights the reference policy to facilitate adaptive in-distribution exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset, without supervised fine-tuning, and evaluate them on AIME2024 and AIME2025. Results show that RAPO consistently improves problem-solving performance. Notably, RAPO enables models to surpass the base model's performance ceiling and solves previously intractable problems, advancing the frontier of RLVR for challenging reasoning tasks.
♻ ☆ Mathematics Isn't Culture-Free: Probing Cultural Gaps via Entity and Scenario Perturbations
Although mathematics is often considered culturally neutral, the way mathematical problems are presented can carry implicit cultural context. Existing benchmarks like GSM8K are predominantly rooted in Western norms, including names, currencies, and everyday scenarios. In this work, we create culturally adapted variants of the GSM8K test set for five regions Africa, India, China, Korea, and Japan using prompt-based transformations followed by manual verification. We evaluate six large language models (LLMs), ranging from 8B to 72B parameters, across five prompting strategies to assess their robustness to cultural variation in math problem presentation. Our findings reveal a consistent performance gap: models perform best on the original US-centric dataset and comparatively worse on culturally adapted versions. However, models with reasoning capabilities are more resilient to these shifts, suggesting that deeper reasoning helps bridge cultural presentation gaps in mathematical tasks
♻ ☆ Normative Reasoning in Large Language Models: A Comparative Benchmark from Logical and Modal Perspectives EMNLP 2025
Normative reasoning is a type of reasoning that involves normative or deontic modality, such as obligation and permission. While large language models (LLMs) have demonstrated remarkable performance across various reasoning tasks, their ability to handle normative reasoning remains underexplored. In this paper, we systematically evaluate LLMs' reasoning capabilities in the normative domain from both logical and modal perspectives. Specifically, to assess how well LLMs reason with normative modals, we make a comparison between their reasoning with normative modals and their reasoning with epistemic modals, which share a common formal structure. To this end, we introduce a new dataset covering a wide range of formal patterns of reasoning in both normative and epistemic domains, while also incorporating non-formal cognitive factors that influence human reasoning. Our results indicate that, although LLMs generally adhere to valid reasoning patterns, they exhibit notable inconsistencies in specific types of normative reasoning and display cognitive biases similar to those observed in psychological studies of human reasoning. These findings highlight challenges in achieving logical consistency in LLMs' normative reasoning and provide insights for enhancing their reliability. All data and code are released publicly at https://github.com/kmineshima/NeuBAROCO.
comment: Accepted to the 8th BlackboxNLP Workshop at EMNLP 2025
♻ ☆ AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy NeurIPS 2025
Large Language Models (LLMs) are being explored for applications in scientific research, including their capabilities to synthesize literature, answer research questions, generate research ideas, and even conduct computational experiments. Ultimately, our goal is for these to help scientists derive novel scientific insights. In many areas of science, such insights often arise from processing and visualizing data to understand its patterns. However, evaluating whether an LLM-mediated scientific workflow produces outputs conveying the correct scientific insights is challenging to evaluate and has not been addressed in past work. We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. AstroVisBench judges a language model's ability to both (1) create astronomy-specific workflows to process and analyze data and (2) visualize the results of these workflows through complex plots. Our evaluation of visualizations uses a novel LLM-as-a-judge workflow, which is validated against annotation by five professional astronomers. Using AstroVisBench we present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants. This evaluation provides a strong end-to-end evaluation for AI scientists that offers a path forward for the development of visualization-based workflows, which are central to a broad range of domains from physics to biology.
comment: Accepted at NeurIPS 2025 Datasets & Benchmarks Track
♻ ☆ R$^2$ec: Towards Large Recommender Models with Reasoning
Large recommender models have extended LLMs as powerful recommenders via encoding or item generation, and recent breakthroughs in LLM reasoning synchronously motivate the exploration of reasoning in recommendation. In this work, we propose R$^2$ec, a unified large recommender model with intrinsic reasoning capability. R$^2$ec introduces a dual-head architecture that supports both reasoning chain generation and efficient item prediction in a single model, significantly reducing inference latency. To overcome the lack of annotated reasoning data, we design RecPO, a reinforcement learning framework that optimizes reasoning and recommendation jointly with a novel fused reward mechanism. Extensive experiments on three datasets demonstrate that R$^2$ec outperforms traditional, LLM-based, and reasoning-augmented recommender baselines, while further analyses validate its competitive efficiency among conventional LLM-based recommender baselines and strong adaptability to diverse recommendation scenarios. Code and checkpoints available at https://github.com/YRYangang/RRec.
comment: Accepted by Neurips 2025
♻ ☆ KAT-Coder Technical Report
Recent advances in large language models (LLMs) have enabled progress in agentic coding, where models autonomously reason, plan, and act within interactive software development workflows. However, bridging the gap between static text-based training and dynamic real-world agentic execution remains a core challenge. In this technical report, we present KAT-Coder, a large-scale agentic code model trained through a multi-stage curriculum encompassing Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and Reinforcement-to-Deployment Adaptation. The Mid-Term stage enhances reasoning, planning, and reflection capabilities through a corpus of real software engineering data and synthetic agentic interactions. The SFT stage constructs a million-sample dataset balancing twenty programming languages, ten development contexts, and ten task archetypes. The RFT stage introduces a novel multi-ground-truth reward formulation for stable and sample-efficient policy optimization. Finally, the Reinforcement-to-Deployment phase adapts the model to production-grade IDE environments using Error-Masked SFT and Tree-Structured Trajectory Training. In summary, these stages enable KAT-Coder to achieve robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. Our KAT series 32B model, KAT-Dev, has been open-sourced on https://huggingface.co/Kwaipilot/KAT-Dev.
♻ ☆ Training a Generally Curious Agent ICML 2025
Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present Paprika, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, Paprika teaches models to explore and adapt their behavior on a new task based on environment feedback in-context without more gradient updates. Experimental results show that models fine-tuned with Paprika can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. Unlike traditional training, our approach's primary bottleneck lies in sampling useful interaction data instead of model updates. To improve sample efficiency, we propose a curriculum learning strategy that prioritizes sampling trajectories from tasks with high learning potential. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interactions with the external world.
comment: ICML 2025. Project Website: https://paprika-llm.github.io
♻ ☆ Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Vision-Language Models (VLMs) often struggle to balance visual and textual information when summarizing complex multimodal inputs, such as entire TV show episodes. In this paper, we propose a zero-shot video-to-text summarization approach that builds its own screenplay representation of an episode, effectively integrating key video moments, dialogue, and character information into a unified document. Unlike previous approaches, we simultaneously generate screenplays and name the characters in zero-shot, using only the audio, video, and transcripts as input. Additionally, we highlight that existing summarization metrics can fail to assess the multimodal content in summaries. To address this, we introduce MFactSum, a multimodal metric that evaluates summaries with respect to both vision and text modalities. Using MFactSum, we evaluate our screenplay summaries on the SummScreen3D dataset, demonstrating superiority against state-of-the-art VLMs such as Gemini 1.5 by generating summaries containing 20% more relevant visual information while requiring 75% less of the video as input.
♻ ☆ HiRA: A Hierarchical Reasoning Framework for Decoupled Planning and Execution in Deep Search
Complex information needs in real-world search scenarios demand deep reasoning and knowledge synthesis across diverse sources, which traditional retrieval-augmented generation (RAG) pipelines struggle to address effectively. Current reasoning-based approaches suffer from a fundamental limitation: they use a single model to handle both high-level planning and detailed execution, leading to inefficient reasoning and limited scalability. In this paper, we introduce HiRA, a hierarchical framework that separates strategic planning from specialized execution. Our approach decomposes complex search tasks into focused subtasks, assigns each subtask to domain-specific agents equipped with external tools and reasoning capabilities, and coordinates the results through a structured integration mechanism. This separation prevents execution details from disrupting high-level reasoning while enabling the system to leverage specialized expertise for different types of information processing. Experiments on four complex, cross-modal deep search benchmarks demonstrate that HiRA significantly outperforms state-of-the-art RAG and agent-based systems. Our results show improvements in both answer quality and system efficiency, highlighting the effectiveness of decoupled planning and execution for multi-step information seeking tasks. Our code is available at https://github.com/ignorejjj/HiRA.
comment: 9 pages
♻ ☆ E2Rank: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker
Text embedding models serve as a fundamental component in real-world search applications. By mapping queries and documents into a shared embedding space, they deliver competitive retrieval performance with high efficiency. However, their ranking fidelity remains limited compared to dedicated rerankers, especially recent LLM-based listwise rerankers, which capture fine-grained query-document and document-document interactions. In this paper, we propose a simple yet effective unified framework E2Rank, means Efficient Embedding-based Ranking (also means Embedding-to-Rank), which extends a single text embedding model to perform both high-quality retrieval and listwise reranking through continued training under a listwise ranking objective, thereby achieving strong effectiveness with remarkable efficiency. By applying cosine similarity between the query and document embeddings as a unified ranking function, the listwise ranking prompt, which is constructed from the original query and its candidate documents, serves as an enhanced query enriched with signals from the top-K documents, akin to pseudo-relevance feedback (PRF) in traditional retrieval models. This design preserves the efficiency and representational quality of the base embedding model while significantly improving its reranking performance. Empirically, E2Rank achieves state-of-the-art results on the BEIR reranking benchmark and demonstrates competitive performance on the reasoning-intensive BRIGHT benchmark, with very low reranking latency. We also show that the ranking training process improves embedding performance on the MTEB benchmark. Our findings indicate that a single embedding model can effectively unify retrieval and reranking, offering both computational efficiency and competitive ranking accuracy.
comment: Code and models are avaliable at https://alibaba-nlp.github.io/E2Rank
♻ ☆ Multilingual State Space Models for Structured Question Answering in Indic Languages NAACL 2025
The diversity and complexity of Indic languages present unique challenges for natural language processing (NLP) tasks, particularly in the domain of question answering (QA).To address these challenges, this paper explores the application of State Space Models (SSMs),to build efficient and contextually aware QA systems tailored for Indic languages. SSMs are particularly suited for this task due to their ability to model long-term and short-term dependencies in sequential data, making them well-equipped to handle the rich morphology, complex syntax, and contextual intricacies characteristic of Indian languages. We evaluated multiple SSM architectures across diverse datasets representing various Indic languages and conducted a comparative analysis of their performance. Our results demonstrate that these models effectively capture linguistic subtleties, leading to significant improvements in question interpretation, context alignment, and answer generation. This work represents the first application of SSMs to question answering tasks in Indic languages, establishing a foundational benchmark for future research in this domain. We propose enhancements to existing SSM frameworks, optimizing their applicability to low-resource settings and multilingual scenarios prevalent in Indic languages.
comment: NAACL 2025 Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT)
♻ ☆ DiagramEval: Evaluating LLM-Generated Diagrams via Graphs EMNLP 2025
Diagrams play a central role in research papers for conveying ideas, yet they are often notoriously complex and labor-intensive to create. Although diagrams are presented as images, standard image generative models struggle to produce clear diagrams with well-defined structure. We argue that a promising direction is to generate demonstration diagrams directly in textual form as SVGs, which can leverage recent advances in large language models (LLMs). However, due to the complexity of components and the multimodal nature of diagrams, sufficiently discriminative and explainable metrics for evaluating the quality of LLM-generated diagrams remain lacking. In this paper, we propose DiagramEval, a novel evaluation metric designed to assess demonstration diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges, and evaluates diagram quality using two new groups of metrics: node alignment and path alignment. For the first time, we effectively evaluate diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of our metrics. Furthermore, we show how the enhanced explainability of our proposed metrics offers valuable insights into the characteristics of LLM-generated diagrams. Code: https://github.com/ulab-uiuc/diagram-eval.
comment: EMNLP 2025 Main
♻ ☆ FUSE : A Ridge and Random Forest-Based Metric for Evaluating MT in Indigenous Languages NAACL 2025
This paper presents the winning submission of the RaaVa team to the AmericasNLP 2025 Shared Task 3 on Automatic Evaluation Metrics for Machine Translation (MT) into Indigenous Languages of America, where our system ranked first overall based on average Pearson correlation with the human annotations. We introduce Feature-Union Scorer (FUSE) for Evaluation, FUSE integrates Ridge regression and Gradient Boosting to model translation quality. In addition to FUSE, we explore five alternative approaches leveraging different combinations of linguistic similarity features and learning paradigms. FUSE Score highlights the effectiveness of combining lexical, phonetic, semantic, and fuzzy token similarity with learning-based modeling to improve MT evaluation for morphologically rich and low-resource languages. MT into Indigenous languages poses unique challenges due to polysynthesis, complex morphology, and non-standardized orthography. Conventional automatic metrics such as BLEU, TER, and ChrF often fail to capture deeper aspects like semantic adequacy and fluency. Our proposed framework, formerly referred to as FUSE, incorporates multilingual sentence embeddings and phonological encodings to better align with human evaluation. We train supervised models on human-annotated development sets and evaluate held-out test data. Results show that FUSE consistently achieves higher Pearson and Spearman correlations with human judgments, offering a robust and linguistically informed solution for MT evaluation in low-resource settings.
comment: NAACL 2025 Workshop on NLP for Indigenous Languages of the Americas
♻ ☆ MindSearch: Mimicking Human Minds Elicits Deep AI Searcher ICLR2025
Information seeking and integration is a complex cognitive task that consumes enormous time and effort. Inspired by the remarkable progress of Large Language Models, recent works attempt to solve this task by combining LLMs and search engines. However, these methods still obtain unsatisfying performance due to three challenges: (1) complex requests often cannot be accurately and completely retrieved by the search engine once (2) corresponding information to be integrated is spread over multiple web pages along with massive noise, and (3) a large number of web pages with long contents may quickly exceed the maximum context length of LLMs. Inspired by the cognitive process when humans solve these problems, we introduce MindSearch to mimic the human minds in web information seeking and integration, which can be instantiated by a simple yet effective LLM-based multi-agent framework. The WebPlanner models the human mind of multi-step information seeking as a dynamic graph construction process: it decomposes the user query into atomic sub-questions as nodes in the graph and progressively extends the graph based on the search result from WebSearcher. Tasked with each sub-question, WebSearcher performs hierarchical information retrieval with search engines and collects valuable information for WebPlanner. The multi-agent design of MindSearch enables the whole framework to seek and integrate information parallelly from larger-scale (e.g., more than 300) web pages in 3 minutes, which is worth 3 hours of human effort. MindSearch demonstrates significant improvement in the response quality in terms of depth and breadth, on both close-set and open-set QA problems. Besides, responses from MindSearch based on InternLM2.5-7B are preferable by humans to ChatGPT-Web and Perplexity.ai applications, which implies that MindSearch can already deliver a competitive solution to the proprietary AI search engine.
comment: ICLR2025. Project Page: https://mindsearch.netlify.app Code: https://github.com/InternLM/MindSearch
♻ ☆ Prompt-MII: Meta-Learning Instruction Induction for LLMs
A popular method to adapt large language models (LLMs) to new tasks is in-context learning (ICL), which is effective but incurs high inference costs as context length grows. In this paper we propose a method to perform instruction induction, where we take training examples and reduce them to a compact but descriptive prompt that can achieve performance comparable to ICL over the full training set. Specifically, we propose PROMPT-MII, a reinforcement learning (RL) based framework to meta-learn an instruction induction model that can generate compact instructions on the fly for an arbitrary new dataset. We train on over 3,000 diverse classification datasets from the HuggingFace hub, and evaluate on 90 unseen tasks. PROMPT-MII improves downstream model quality by 4-9 F1 points (10-20% relative), matching ICL performance while requiring 3-13x fewer tokens.
♻ ☆ NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium EuroSys'26
AI accelerators, customized to AI workloads, provide cost-effective and high-performance solutions for training and inference. Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM training and inference through its heterogeneous architecture. However, leveraging Trainium architecture for high performance can be challenging because of its systolic array architecture and special requirement on data layout. In this paper, we design high-performance matrix multiplication (matmul), a critical compute kernel, for LLM inference on Trainium. We introduce a series of techniques customized to Trainium based on kernel fusion and novel caching strategies to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transpose. Evaluating with nine datasets and four recent LLMs, we show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the level of matmul kernel, it achieves an average 1.35x speedup (up to 2.22x), which translates to an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference.
comment: 12 pages, 8 figures, submitted to the Proceedings of the Twenty-First European Conference on Computer Systems (EuroSys'26)
♻ ☆ From Memorization to Reasoning in the Spectrum of Loss Curvature
We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on the loss landscape curvature. This insight is based on prior theoretical and empirical work showing that the curvature for memorized training points is much sharper than non memorized, meaning ordering weight components from high to low curvature can reveal a distinction without explicit labels. This motivates a weight editing procedure that suppresses far more recitation of untargeted memorized data more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the basis of curvature has a natural interpretation for shared structure in model weights, we analyze the editing procedure extensively on its effect on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently negatively affected, even though open book fact retrieval and general logical reasoning is conserved. We posit these tasks rely heavily on specialized directions in weight space rather than general purpose mechanisms, regardless of whether those individual datapoints are memorized. We support this by showing a correspondence between task data's activation strength with low curvature components that we edit out, and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly-used structures involved in solving tasks like math and fact retrieval.
♻ ☆ Context Tuning for In-Context Optimization ICML 2025
We introduce Context Tuning, a simple and effective method to significantly enhance few-shot adaptation of language models (LLMs) without fine-tuning model parameters. While prompt-based adaptation techniques have demonstrated the effectiveness of lightweight adaptation methods for LLMs, they typically initialize a trainable prompt or prefix with irrelevant tokens for the task at hand. In contrast, Context Tuning initializes the trainable prompt or prefix with task-specific demonstration examples, leveraging the model's inherent In-Context Learning (ICL) ability to extract relevant information for improved few-shot learning performance. Extensive evaluations on benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC demonstrate that Context Tuning outperforms traditional prompt-based adaptation methods and achieves competitive accuracy to Test-Time Training with significantly higher training efficiency.
comment: A short version of this paper was accepted at ICML 2025 Workshop on Test-Time Adaptation
♻ ☆ PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue
Recent rapid progress in the field of computational pathology has been enabled by foundation models. These models are beginning to move beyond encoding image patches towards whole-slide understanding but their clinical utility remains limited. In this work, we present PRISM2, a multimodal slide-level foundation model trained on data from 700,000 diagnostic specimen-report pairs, the largest vision (2.3 million whole slide images) and language (14M question-answer pairs) histopathology dataset to date. By learning through clinical-dialogue supervision, PRISM2 aligns histomorphologic features with the language of diagnostic reasoning, producing slide-level representations that support both direct diagnostic question-answering and transferable embeddings for downstream tasks. Without additional training, PRISM2 matches or exceeds the cancer-detection performance of clinical-grade products. This is observed without loss of generality on other tasks, where PRISM2 achieves top performance. Finally, using survival prediction as the example, we show that task-specific finetuning with a large dataset can outperform task-specific models, further improving performance. These results demonstrate how language-supervised pretraining provides a scalable, clinically grounded signal for learning generalizable pathology representations, bridging human diagnostic reasoning and foundation-model performance.
♻ ☆ CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists EMNLP 2025
Existing LLM-as-a-Judge approaches for evaluating text generation suffer from rating inconsistencies, with low agreement and high rating variance across different evaluator models. We attribute this to subjective evaluation criteria combined with Likert scale scoring in existing protocols. To address this issue, we introduce CheckEval, a checklist-based evaluation framework that improves rating reliability via decomposed binary questions. Through experiments with 12 evaluator models across multiple datasets, we first demonstrate that CheckEval strongly correlates with human judgments. More importantly, CheckEval dramatically improves the average agreement across evaluator models by 0.45 and reduces the score variance. CheckEval scores furthermore have the benefit of being more interpretable because it decomposes evaluation criteria into traceable binary decisions, allowing analyses of specific attributes driving quality judgments.
comment: EMNLP 2025
♻ ☆ A Comprehensive Evaluation of Cognitive Biases in LLMs
We present a large-scale evaluation of 30 cognitive biases in 20 state-of-the-art large language models (LLMs) under various decision-making scenarios. Our contributions include a novel general-purpose test framework for reliable and large-scale generation of tests for LLMs, a benchmark dataset with 30,000 tests for detecting cognitive biases in LLMs, and a comprehensive assessment of the biases found in the 20 evaluated LLMs. Our work confirms and broadens previous findings suggesting the presence of cognitive biases in LLMs by reporting evidence of all 30 tested biases in at least some of the 20 LLMs. We publish our framework code to encourage future research on biases in LLMs: https://github.com/simonmalberg/cognitive-biases-in-llms
comment: Published in "Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities"
♻ ☆ Do LLM Evaluators Prefer Themselves for a Reason?
Large language models (LLMs) are increasingly used as automatic evaluators in applications like benchmarking, reward modeling, and self-refinement. Prior work highlights a potential self-preference bias where LLMs favor their own generated responses, a tendency often intensifying with model size and capability. This raises a critical question: Is self-preference harmful, or does it simply reflect the genuinely higher-quality outputs of stronger models? Answering this has been difficult as previous studies relied primarily on subjective tasks. These tasks lack an objective ground truth, meaning that either preference can be reasonably justified. To address this ambiguity, we investigate self-preference using verifiable benchmarks (mathematical reasoning, factual knowledge, code generation) that allow objective ground-truth assessment. This enables us to distinguish harmful self-preference (favoring objectively worse responses) from legitimate self-preference (favoring genuinely superior ones). We conduct large-scale experiments under controlled evaluation conditions across diverse model families (e.g., Llama, Qwen, Gemma, Mistral, Phi, GPT, DeepSeek). Our findings reveal three key insights: (1) While stronger models exhibit greater self-preference, much of this preference aligns with objectively superior performance, indicating stronger models prefer themselves mostly legitimately. (2) Harmful self-preference persists when evaluator models err as generators, and stronger models display more pronounced harmful self-preference when they do err. This suggests stronger models struggle more to recognize when they are wrong. (3) Inference-time scaling strategies, such as generating a long Chain-of-Thought before evaluation, effectively reduce harmful self-preference. These results provide a more nuanced understanding of LLM-based evaluation and practical insights for improving its reliability.
comment: Preprint. Under review
♻ ☆ AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many realworld workloads such as classification, question answering, recommendation, and text embedding rely solely on the prefill stage of inference, where the model encodes input sequences without performing autoregressive decoding. In these prefill only scenarios, the self-attention computation becomes the primary performance bottleneck due to its quadratic complexity with respect to sequence length. In this paper, we observe that semantically different sentences often produce similar attention maps across layers and heads. Building on this insight, we propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. Based on an attention map memorization database, AttnCache employs efficient caching and similarity search techniques to identify and reuse pre-cached attention maps during inference, thereby reducing the computational overhead of self-attention. Experimental results show that AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.
comment: 10 pages, 6 figures, submitted to Ninth Annual Conference on Machine Learning and Systems (MLSys'26)
♻ ☆ Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses EMNLP
Evaluating long-form answers in high-stakes domains such as law or medicine remains a fundamental challenge. Standard metrics like BLEU and ROUGE fail to capture semantic correctness, and current LLM-based evaluators often reduce nuanced aspects of answer quality into a single undifferentiated score. We introduce DeCE, a decomposed LLM evaluation framework that separates precision (factual accuracy and relevance) and recall (coverage of required concepts), using instance-specific criteria automatically extracted from gold answer requirements. DeCE is model-agnostic and domain-general, requiring no predefined taxonomies or handcrafted rubrics. We instantiate DeCE to evaluate different LLMs on a real-world legal QA task involving multi-jurisdictional reasoning and citation grounding. DeCE achieves substantially stronger correlation with expert judgments ($r=0.78$), compared to traditional metrics ($r=0.12$), pointwise LLM scoring ($r=0.35$), and modern multidimensional evaluators ($r=0.48$). It also reveals interpretable trade-offs: generalist models favor recall, while specialized models favor precision. Importantly, only 11.95% of LLM-generated criteria required expert revision, underscoring DeCE's scalability. DeCE offers an interpretable and actionable LLM evaluation framework in expert domains.
comment: Accepted by 2025 EMNLP industry track
♻ ☆ BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation
The rapid progress of graph generation has raised new security concerns, particularly regarding backdoor vulnerabilities. While prior work has explored backdoor attacks in image diffusion and unconditional graph generation, conditional, especially text-guided graph generation remains largely unexamined. This paper proposes BadGraph, a backdoor attack method against latent diffusion models for text-guided graph generation. BadGraph leverages textual triggers to poison training data, covertly implanting backdoors that induce attacker-specified subgraphs during inference when triggers appear, while preserving normal performance on clean inputs. Extensive experiments on four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the effectiveness and stealth of the attack: less than 10% poisoning rate can achieves 50% attack success rate, while 24% suffices for over 80% success rate, with negligible performance degradation on benign samples. Ablation studies further reveal that the backdoor is implanted during VAE and diffusion training rather than pretraining. These findings reveal the security vulnerabilities in latent diffusion models of text-guided graph generation, highlight the serious risks in models' applications such as drug discovery and underscore the need for robust defenses against the backdoor attack in such diffusion models.
Information Retrieval 21
☆ Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it particularly well-for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
☆ Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.
☆ Interact-RAG: Reason and Interact with the Corpus, Beyond Black-Box Retrieval
Retrieval-Augmented Generation (RAG) has significantly enhanced LLMs by incorporating external information. However, prevailing agentic RAG approaches are constrained by a critical limitation: they treat the retrieval process as a black-box querying operation. This confines agents' actions to query issuing, hindering its ability to tackle complex information-seeking tasks. To address this, we introduce Interact-RAG, a new paradigm that elevates the LLM agent from a passive query issuer into an active manipulator of the retrieval process. We dismantle the black-box with a Corpus Interaction Engine, equipping the agent with a set of action primitives for fine-grained control over information retrieval. To further empower the agent on the entire RAG pipeline, we first develop a reasoning-enhanced workflow, which enables both zero-shot execution and the synthesis of interaction trajectories. We then leverage this synthetic data to train a fully autonomous end-to-end agent via Supervised Fine-Tuning (SFT), followed by refinement with Reinforcement Learning (RL). Extensive experiments across six benchmarks demonstrate that Interact-RAG significantly outperforms other advanced methods, validating the efficacy of our reasoning-interaction strategy.
☆ Pairwise and Attribute-Aware Decision Tree-Based Preference Elicitation for Cold-Start Recommendation
Recommender systems (RSs) are intelligent filtering methods that suggest items to users based on their inferred preferences, derived from their interaction history on the platform. Collaborative filtering-based RSs rely on users past interactions to generate recommendations. However, when a user is new to the platform, referred to as a cold-start user, there is no historical data available, making it difficult to provide personalized recommendations. To address this, rating elicitation techniques can be used to gather initial ratings or preferences on selected items, helping to build an early understanding of the user's tastes. Rating elicitation approaches are generally categorized into two types: non-personalized and personalized. Decision tree-based rating elicitation is a personalized method that queries users about their preferences at each node of the tree until sufficient information is gathered. In this paper, we propose an extension to the decision tree approach for rating elicitation in the context of music recommendation. Our method: (i) elicits not only item ratings but also preferences on attributes such as genres to better cluster users, and (ii) uses item pairs instead of single items at each node to more effectively learn user preferences. Experimental results demonstrate that both proposed enhancements lead to improved performance, particularly with a reduced number of queries.
☆ Traceable Drug Recommendation over Medical Knowledge Graphs CIKM2025
Drug recommendation (DR) systems aim to support healthcare professionals in selecting appropriate medications based on patients' medical conditions. State-of-the-art approaches utilize deep learning techniques for improving DR, but fall short in providing any insights on the derivation process of recommendations -- a critical limitation in such high-stake applications. We propose TraceDR, a novel DR system operating over a medical knowledge graph (MKG), which ensures access to large-scale and high-quality information. TraceDR simultaneously predicts drug recommendations and related evidence within a multi-task learning framework, enabling traceability of medication recommendations. For covering a more diverse set of diseases and drugs than existing works, we devise a framework for automatically constructing patient health records and release DrugRec, a new large-scale testbed for DR.
comment: Accepted to MediKS@CIKM2025
☆ Research Output of Webology Journal (2013-2017): A Scientometric Analysis
Webology is an international peer-reviewed journal in English devoted to the field of the World Wide Web and serves as a forum for discussion and experimentation. It serves as a forum for new research in information dissemination and communication processes in general, and in the context of the World Wide Web in particular. This paper presents a Scientometric analysis of the Webology Journal. The paper analyses the pattern of growth of the research output published in the journal, pattern of authorship, author productivity, and subjects covered to the papers over the period (2013-2017). It is found that 62 papers were published during the period of study (2013-2017). The maximum numbers of articles were collaborative in nature. The subject concentration of the journal noted was Social Networking/Web 2.0/Library 2.0 and Scientometrics or Bibliometrics. Iranian researchers contributed the maximum number of articles (37.10%). The study applied standard formula and statistical tools to bring out the factual result.
comment: 13 pages, 3 figures, Research Paper
☆ Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT-a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M token context windows (with and without retrieval-augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5%-12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.
☆ DRAMA: Unifying Data Retrieval and Analysis for Open-Domain Analytic Queries SIGMOD 2026
Manually conducting real-world data analyses is labor-intensive and inefficient. Despite numerous attempts to automate data science workflows, none of the existing paradigms or systems fully demonstrate all three key capabilities required to support them effectively: (1) open-domain data collection, (2) structured data transformation, and (3) analytic reasoning. To overcome these limitations, we propose DRAMA, an end-to-end paradigm that answers users' analytic queries in natural language on large-scale open-domain data. DRAMA unifies data collection, transformation, and analysis as a single pipeline. To quantitatively evaluate system performance on tasks representative of DRAMA, we construct a benchmark, DRAMA-Bench, consisting of two categories of tasks: claim verification and question answering, each comprising 100 instances. These tasks are derived from real-world applications that have gained significant public attention and require the retrieval and analysis of open-domain data. We develop DRAMA-Bot, a multi-agent system designed following DRAMA. It comprises a data retriever that collects and transforms data by coordinating the execution of sub-agents, and a data analyzer that performs structured reasoning over the retrieved data. We evaluate DRAMA-Bot on DRAMA-Bench together with five state-of-the-art baseline agents. DRAMA-Bot achieves 86.5% task accuracy at a cost of $0.05, outperforming all baselines with up to 6.9 times the accuracy and less than 1/6 of the cost. DRAMA is publicly available at https://github.com/uiuc-kang-lab/drama.
comment: Accepted to SIGMOD 2026
☆ A Survey on Deep Text Hashing: Efficient Semantic Text Retrieval with Binary Representation
With the rapid growth of textual content on the Internet, efficient large-scale semantic text retrieval has garnered increasing attention from both academia and industry. Text hashing, which projects original texts into compact binary hash codes, is a crucial method for this task. By using binary codes, the semantic similarity computation for text pairs is significantly accelerated via fast Hamming distance calculations, and storage costs are greatly reduced. With the advancement of deep learning, deep text hashing has demonstrated significant advantages over traditional, data-independent hashing techniques. By leveraging deep neural networks, these methods can learn compact and semantically rich binary representations directly from data, overcoming the performance limitations of earlier approaches. This survey investigates current deep text hashing methods by categorizing them based on their core components: semantic extraction, hash code quality preservation, and other key technologies. We then present a detailed evaluation schema with results on several popular datasets, followed by a discussion of practical applications and open-source tools for implementation. Finally, we conclude by discussing key challenges and future research directions, including the integration of deep text hashing with large language models to further advance the field. The project for this survey can be accessed at https://github.com/hly1998/DeepTextHashing.
☆ A Survey on Generative Recommendation: Data, Model, and Tasks
Recommender systems serve as foundational infrastructure in modern information ecosystems, helping users navigate digital content and discover items aligned with their preferences. At their core, recommender systems address a fundamental problem: matching users with items. Over the past decades, the field has experienced successive paradigm shifts, from collaborative filtering and matrix factorization in the machine learning era to neural architectures in the deep learning era. Recently, the emergence of generative models, especially large language models (LLMs) and diffusion models, have sparked a new paradigm: generative recommendation, which reconceptualizes recommendation as a generation task rather than discriminative scoring. This survey provides a comprehensive examination through a unified tripartite framework spanning data, model, and task dimensions. Rather than simply categorizing works, we systematically decompose approaches into operational stages-data augmentation and unification, model alignment and training, task formulation and execution. At the data level, generative models enable knowledge-infused augmentation and agent-based simulation while unifying heterogeneous signals. At the model level, we taxonomize LLM-based methods, large recommendation models, and diffusion approaches, analyzing their alignment mechanisms and innovations. At the task level, we illuminate new capabilities including conversational interaction, explainable reasoning, and personalized content generation. We identify five key advantages: world knowledge integration, natural language understanding, reasoning capabilities, scaling laws, and creative generation. We critically examine challenges in benchmark design, model robustness, and deployment efficiency, while charting a roadmap toward intelligent recommendation assistants that fundamentally reshape human-information interaction.
☆ Compass: General Filtered Search across Vector and Structured Data
The increasing prevalence of hybrid vector and relational data necessitates efficient, general support for queries that combine high-dimensional vector search with complex relational filtering. However, existing filtered search solutions are fundamentally limited by specialized indices, which restrict arbitrary filtering and hinder integration with general-purpose DBMSs. This work introduces \textsc{Compass}, a unified framework that enables general filtered search across vector and structured data without relying on new index designs. Compass leverages established index structures -- such as HNSW and IVF for vector attributes, and B+-trees for relational attributes -- implementing a principled cooperative query execution strategy that coordinates candidate generation and predicate evaluation across modalities. Uniquely, Compass maintains generality by allowing arbitrary conjunctions, disjunctions, and range predicates, while ensuring robustness even with highly-selective or multi-attribute filters. Comprehensive empirical evaluations demonstrate that Compass consistently outperforms NaviX, the only existing performant general framework, across diverse hybrid query workloads. It also matches the query throughput of specialized single-attribute indices in their favorite settings with only a single attribute involved, all while maintaining full generality and DBMS compatibility. Overall, Compass offers a practical and robust solution for achieving truly general filtered search in vector database systems.
☆ IL-PCSR: Legal Corpus for Prior Case and Statute Retrieval EMNLP 2025
Identifying/retrieving relevant statutes and prior cases/precedents for a given legal situation are common tasks exercised by law practitioners. Researchers to date have addressed the two tasks independently, thus developing completely different datasets and models for each task; however, both retrieval tasks are inherently related, e.g., similar cases tend to cite similar statutes (due to similar factual situation). In this paper, we address this gap. We propose IL-PCR (Indian Legal corpus for Prior Case and Statute Retrieval), which is a unique corpus that provides a common testbed for developing models for both the tasks (Statute Retrieval and Precedent Retrieval) that can exploit the dependence between the two. We experiment extensively with several baseline models on the tasks, including lexical models, semantic models and ensemble based on GNNs. Further, to exploit the dependence between the two tasks, we develop an LLM-based re-ranking approach that gives the best performance.
comment: Accepted at EMNLP 2025 (Main)
☆ Effectiveness of LLMs in Temporal User Profiling for Recommendation ICDM 2025
Effectively modeling the dynamic nature of user preferences is crucial for enhancing recommendation accuracy and fostering transparency in recommender systems. Traditional user profiling often overlooks the distinction between transitory short-term interests and stable long-term preferences. This paper examines the capability of leveraging Large Language Models (LLMs) to capture these temporal dynamics, generating richer user representations through distinct short-term and long-term textual summaries of interaction histories. Our observations suggest that while LLMs tend to improve recommendation quality in domains with more active user engagement, their benefits appear less pronounced in sparser environments. This disparity likely stems from the varying distinguishability of short-term and long-term preferences across domains; the approach shows greater utility where these temporal interests are more clearly separable (e.g., Movies\&TV) compared to domains with more stable user profiles (e.g., Video Games). This highlights a critical trade-off between enhanced performance and computational costs, suggesting context-dependent LLM application. Beyond predictive capability, this LLM-driven approach inherently provides an intrinsic potential for interpretability through its natural language profiles and attention weights. This work contributes insights into the practical capability and inherent interpretability of LLM-driven temporal user profiling, outlining new research directions for developing adaptive and transparent recommender systems.
comment: Accepted to the IEEE International Conference on Data Mining (ICDM 2025), Workshop on User Modeling and Recommendation (UMRec). To appear in the IEEE ICDMW 2025 proceedings
♻ ☆ UNGER: Generative Recommendation with A Unified Code via Semantic and Collaborative Integration
With the rise of generative paradigms, generative recommendation has garnered increasing attention. The core component is the item code, generally derived by quantizing collaborative or semantic representations to serve as candidate items identifiers in the context. However, existing methods typically construct separate codes for each modality, leading to higher computational and storage costs and hindering the integration of their complementary strengths. Considering this limitation, we seek to integrate two different modalities into a unified code, fully unleashing the potential of complementary nature among modalities. Nevertheless, the integration remains challenging: the integrated embedding obtained by the common concatenation method would lead to underutilization of collaborative knowledge, thereby resulting in limited effectiveness. To address this, we propose a novel method, named UNGER, which integrates semantic and collaborative knowledge into a unified code for generative recommendation. Specifically, we propose to adaptively learn an integrated embedding through the joint optimization of cross-modality knowledge alignment and next item prediction tasks. Subsequently, to mitigate the information loss caused by the quantization process, we introduce an intra-modality knowledge distillation task, using the integrated embeddings as supervised signals to compensate. Extensive experiments on three widely used benchmarks demonstrate the superiority of our approach compared to existing methods.
comment: Accepted by TOIS 2025
♻ ☆ LLM Based Long Code Translation using Identifier Replacement
In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often struggle with long source codes that don't fit into the context window, which produces inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting user-given long identifiers with generalized placeholders during translation, our method allows the LLM to focus on the logical structure of the code, by reducing token count and memory usage, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translation results with reduced tokens.
♻ ☆ R$^2$ec: Towards Large Recommender Models with Reasoning
Large recommender models have extended LLMs as powerful recommenders via encoding or item generation, and recent breakthroughs in LLM reasoning synchronously motivate the exploration of reasoning in recommendation. In this work, we propose R$^2$ec, a unified large recommender model with intrinsic reasoning capability. R$^2$ec introduces a dual-head architecture that supports both reasoning chain generation and efficient item prediction in a single model, significantly reducing inference latency. To overcome the lack of annotated reasoning data, we design RecPO, a reinforcement learning framework that optimizes reasoning and recommendation jointly with a novel fused reward mechanism. Extensive experiments on three datasets demonstrate that R$^2$ec outperforms traditional, LLM-based, and reasoning-augmented recommender baselines, while further analyses validate its competitive efficiency among conventional LLM-based recommender baselines and strong adaptability to diverse recommendation scenarios. Code and checkpoints available at https://github.com/YRYangang/RRec.
comment: Accepted by Neurips 2025
♻ ☆ HiRA: A Hierarchical Reasoning Framework for Decoupled Planning and Execution in Deep Search
Complex information needs in real-world search scenarios demand deep reasoning and knowledge synthesis across diverse sources, which traditional retrieval-augmented generation (RAG) pipelines struggle to address effectively. Current reasoning-based approaches suffer from a fundamental limitation: they use a single model to handle both high-level planning and detailed execution, leading to inefficient reasoning and limited scalability. In this paper, we introduce HiRA, a hierarchical framework that separates strategic planning from specialized execution. Our approach decomposes complex search tasks into focused subtasks, assigns each subtask to domain-specific agents equipped with external tools and reasoning capabilities, and coordinates the results through a structured integration mechanism. This separation prevents execution details from disrupting high-level reasoning while enabling the system to leverage specialized expertise for different types of information processing. Experiments on four complex, cross-modal deep search benchmarks demonstrate that HiRA significantly outperforms state-of-the-art RAG and agent-based systems. Our results show improvements in both answer quality and system efficiency, highlighting the effectiveness of decoupled planning and execution for multi-step information seeking tasks. Our code is available at https://github.com/ignorejjj/HiRA.
comment: 9 pages
♻ ☆ CogPlanner: Unveiling the Potential of Agentic Multimodal Retrieval Augmented Generation with Planning SIGIR
Multimodal Retrieval Augmented Generation (MRAG) systems have shown promise in enhancing the generation capabilities of multimodal large language models (MLLMs). However, existing MRAG frameworks primarily adhere to rigid, single-step retrieval strategies that fail to address real-world challenges of information acquisition and query reformulation. In this work, we introduce the task of Multimodal Retrieval Augmented Generation Planning (MRAG Planning) that aims at effective information seeking and integration while minimizing computational overhead. Specifically, we propose CogPlanner, an agentic plug-and-play framework inspired by human cognitive processes, which iteratively determines query reformulation and retrieval strategies to generate accurate and contextually relevant responses. CogPlanner supports parallel and sequential modeling paradigms. Furthermore, we introduce CogBench, a new benchmark designed to rigorously evaluate the MRAG Planning task and facilitate lightweight CogPlanner integration with resource-efficient MLLMs, such as Qwen2-VL-7B-Cog. Experimental results demonstrate that CogPlanner significantly outperforms existing MRAG baselines, offering improvements in both accuracy and efficiency with minimal additional computational costs.
comment: Accepted by SIGIR-AP 2025
♻ ☆ E2Rank: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker
Text embedding models serve as a fundamental component in real-world search applications. By mapping queries and documents into a shared embedding space, they deliver competitive retrieval performance with high efficiency. However, their ranking fidelity remains limited compared to dedicated rerankers, especially recent LLM-based listwise rerankers, which capture fine-grained query-document and document-document interactions. In this paper, we propose a simple yet effective unified framework E2Rank, means Efficient Embedding-based Ranking (also means Embedding-to-Rank), which extends a single text embedding model to perform both high-quality retrieval and listwise reranking through continued training under a listwise ranking objective, thereby achieving strong effectiveness with remarkable efficiency. By applying cosine similarity between the query and document embeddings as a unified ranking function, the listwise ranking prompt, which is constructed from the original query and its candidate documents, serves as an enhanced query enriched with signals from the top-K documents, akin to pseudo-relevance feedback (PRF) in traditional retrieval models. This design preserves the efficiency and representational quality of the base embedding model while significantly improving its reranking performance. Empirically, E2Rank achieves state-of-the-art results on the BEIR reranking benchmark and demonstrates competitive performance on the reasoning-intensive BRIGHT benchmark, with very low reranking latency. We also show that the ranking training process improves embedding performance on the MTEB benchmark. Our findings indicate that a single embedding model can effectively unify retrieval and reranking, offering both computational efficiency and competitive ranking accuracy.
comment: Code and models are avaliable at https://alibaba-nlp.github.io/E2Rank
♻ ☆ Improving Product Search Relevance with EAR-MP: A Solution for the CIKM 2025 AnalytiCup
Multilingual e-commerce search is challenging due to linguistic diversity and the noise inherent in user-generated queries. This paper documents the solution employed by our team (EAR-MP) for the CIKM 2025 AnalytiCup, which addresses two core tasks: Query-Category (QC) relevance and Query-Item (QI) relevance. Our approach first normalizes the multilingual dataset by translating all text into English, then mitigates noise through extensive data cleaning and normalization. For model training, we build on DeBERTa-v3-large and improve performance with label smoothing, self-distillation, and dropout. In addition, we introduce task-specific upgrades, including hierarchical token injection for QC and a hybrid scoring mechanism for QI. Under constrained compute, our method achieves competitive results, attaining an F1 score of 0.8796 on QC and 0.8744 on QI. These findings underscore the importance of systematic data preprocessing and tailored training strategies for building robust, resource-efficient multilingual relevance systems.
♻ ☆ Multimodal Item Scoring for Natural Language Recommendation via Gaussian Process Regression with LLM Relevance Judgments
Natural Language Recommendation (NLRec) generates item suggestions based on the relevance between user-issued NL requests and NL item description passages. Existing NLRec approaches often use Dense Retrieval (DR) to compute item relevance scores from aggregation of inner products between user request embeddings and relevant passage embeddings. However, DR views the request as the sole relevance label, thus leading to a unimodal scoring function centered on the query embedding that is often a weak proxy for query relevance. To better capture the potential multimodal distribution of the relevance scoring function that may arise from complex NLRec data, we propose GPR-LLM that uses Gaussian Process Regression (GPR) with LLM relevance judgments for a subset of candidate passages. Experiments on four NLRec datasets and two LLM backbones demonstrate that GPR-LLM with an RBF kernel, capable of modeling multimodal relevance scoring functions, consistently outperforms simpler unimodal kernels (dot product, cosine similarity), as well as baseline methods including DR, cross-encoder, and pointwise LLM-based relevance scoring by up to 65%. Overall, GPR-LLM provides an efficient and effective approach to NLRec within a minimal LLM labeling budget.
comment: 16 pages,20 figures
Multimedia 5
☆ Referee: Reference-aware Audiovisual Deepfake Detection
Since deepfakes generated by advanced generative models have rapidly posed serious threats, existing audiovisual deepfake detection approaches struggle to generalize to unseen forgeries. We propose a novel reference-aware audiovisual deepfake detection method, called Referee. Speaker-specific cues from only one-shot examples are leveraged to detect manipulations beyond spatiotemporal artifacts. By matching and aligning identity-related queries from reference and target content into cross-modal features, Referee jointly reasons about audiovisual synchrony and identity consistency. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art performance on cross-dataset and cross-language evaluation protocols. Experimental results highlight the importance of cross-modal identity verification for future deepfake detection. The code is available at https://github.com/ewha-mmai/referee.
comment: In Progress
☆ HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition
Three-dimensional scene generation holds significant potential in gaming, film, and virtual reality. However, most existing methods adopt a single-step generation process, making it difficult to balance scene complexity with minimal user input. Inspired by the human cognitive process in scene modeling, which progresses from global to local, focuses on key elements, and completes the scene through semantic association, we propose HiGS, a hierarchical generative framework for multi-step associative semantic spatial composition. HiGS enables users to iteratively expand scenes by selecting key semantic objects, offering fine-grained control over regions of interest while the model completes peripheral areas automatically. To support structured and coherent generation, we introduce the Progressive Hierarchical Spatial-Semantic Graph (PHiSSG), which dynamically organizes spatial relationships and semantic dependencies across the evolving scene structure. PHiSSG ensures spatial and geometric consistency throughout the generation process by maintaining a one-to-one mapping between graph nodes and generated objects and supporting recursive layout optimization. Experiments demonstrate that HiGS outperforms single-stage methods in layout plausibility, style consistency, and user preference, offering a controllable and extensible paradigm for efficient 3D scene construction.
☆ LongCat-Flash-Omni Technical Report
We introduce LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters, excelling at real-time audio-visual interaction. By adopting a curriculum-inspired progressive training strategy that transitions from simpler to increasingly complex modality sequence modeling tasks, LongCat-Flash-Omni attains comprehensive multimodal capabilities while maintaining strong unimodal capability. Building upon LongCat-Flash, which adopts a high-performance Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, LongCat-Flash-Omni integrates efficient multimodal perception and speech reconstruction modules. Despite its immense size of 560B parameters (with 27B activated), LongCat-Flash-Omni achieves low-latency real-time audio-visual interaction. For training infrastructure, we developed a modality-decoupled parallelism scheme specifically designed to manage the data and model heterogeneity inherent in large-scale multimodal training. This innovative approach demonstrates exceptional efficiency by sustaining over 90% of the throughput achieved by text-only training. Extensive evaluations show that LongCat-Flash-Omni achieves state-of-the-art performance on omni-modal benchmarks among open-source models. Furthermore, it delivers highly competitive results across a wide range of modality-specific tasks, including text, image, and video understanding, as well as audio understanding and generation. We provide a comprehensive overview of the model architecture design, training procedures, and data strategies, and open-source the model to foster future research and development in the community.
♻ ☆ Mano Technical Report
Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decisionmaking capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
♻ ☆ Symmetric Entropy-Constrained Video Coding for Machines
As video transmission increasingly serves machine vision systems (MVS) instead of human vision systems (HVS), video coding for machines (VCM) has become a critical research topic. Existing VCM methods often bind codecs to specific downstream models, requiring retraining or supervised data, thus limiting generalization in multi-task scenarios. Recently, unified VCM frameworks have employed visual backbones (VB) and visual foundation models (VFM) to support multiple video understanding tasks with a single codec. They mainly utilize VB/VFM to maintain semantic consistency or suppress non-semantic information, but seldom explore how to directly link video coding with understanding under VB/VFM guidance. Hence, we propose a Symmetric Entropy-Constrained Video Coding framework for Machines (SEC-VCM). It establishes a symmetric alignment between the video codec and VB, allowing the codec to leverage VB's representation capabilities to preserve semantics and discard MVS-irrelevant information. Specifically, a bi-directional entropy-constraint (BiEC) mechanism ensures symmetry between the process of video decoding and VB encoding by suppressing conditional entropy. This helps the codec to explicitly handle semantic information beneficial to MVS while squeezing useless information. Furthermore, a semantic-pixel dual-path fusion (SPDF) module injects pixel-level priors into the final reconstruction. Through semantic-pixel fusion, it suppresses artifacts harmful to MVS and improves machine-oriented reconstruction quality. Experimental results show our framework achieves state-of-the-art~(SOTA) in rate-task performance, with significant bitrate savings over VTM on video instance segmentation (37.4%), video object segmentation (29.8%), object detection (46.2%), and multiple object tracking (44.9%). We will release our code soon.
comment: This paper is submitted to the IEEE Transactions
Image and Video Processing 17
☆ Dark-Field X-Ray Imaging Significantly Improves Deep-Learning based Detection of Synthetic Early-Stage Lung Tumors in Preclinical Models
Low-dose computed tomography (LDCT) is the current standard for lung cancer screening, yet its adoption and accessibility remain limited. Many regions lack LDCT infrastructure, and even among those screened, early-stage cancer detection often yield false positives, as shown in the National Lung Screening Trial (NLST) with a sensitivity of 93.8 percent and a false-positive rate of 26.6 percent. We aim to investigate whether X-ray dark-field imaging (DFI) radiograph, a technique sensitive to small-angle scatter from alveolar microstructure and less susceptible to organ shadowing, can significantly improve early-stage lung tumor detection when coupled with deep-learning segmentation. Using paired attenuation (ATTN) and DFI radiograph images of euthanized mouse lungs, we generated realistic synthetic tumors with irregular boundaries and intensity profiles consistent with physical lung contrast. A U-Net segmentation network was trained on small patches using either ATTN, DFI, or a combination of ATTN and DFI channels.Results show that the DFI-only model achieved a true-positive detection rate of 83.7 percent, compared with 51 percent for ATTN-only, while maintaining comparable specificity (90.5 versus 92.9 percent). The combined ATTN and DFI input achieved 79.6 percent sensitivity and 97.6 percent specificity. In conclusion, DFI substantially improves early-tumor detectability in comparison to standard attenuation radiography and shows potential as an accessible, low-cost, low-dose alternative for pre-clinical or limited-resource screening where LDCT is unavailable.
☆ Bayesian model selection and misspecification testing in imaging inverse problems only from noisy and partial measurements
Modern imaging techniques heavily rely on Bayesian statistical models to address difficult image reconstruction and restoration tasks. This paper addresses the objective evaluation of such models in settings where ground truth is unavailable, with a focus on model selection and misspecification diagnosis. Existing unsupervised model evaluation methods are often unsuitable for computational imaging due to their high computational cost and incompatibility with modern image priors defined implicitly via machine learning models. We herein propose a general methodology for unsupervised model selection and misspecification detection in Bayesian imaging sciences, based on a novel combination of Bayesian cross-validation and data fission, a randomized measurement splitting technique. The approach is compatible with any Bayesian imaging sampler, including diffusion and plug-and-play samplers. We demonstrate the methodology through experiments involving various scoring rules and types of model misspecification, where we achieve excellent selection and detection accuracy with a low computational cost.
☆ Navigated hepatic tumor resection using intraoperative ultrasound imaging
Purpose: This proof-of-concept study evaluates feasibility and accuracy of an ultrasound-based navigation system for open liver surgery. Unlike most conventional systems that rely on registration to preoperative imaging, the proposed system provides navigation-guided resection using 3D models generated from intraoperative ultrasound. Methods: A pilot study was conducted in 25 patients undergoing resection of liver metastases. The first five cases served to optimize the workflow. Intraoperatively, an electromagnetic sensor compensated for organ motion, after which an ultrasound volume was acquired. Vasculature was segmented automatically and tumors semi-automatically using region-growing (n=15) or a deep learning algorithm (n=5). The resulting 3D model was visualized alongside tracked surgical instruments. Accuracy was assessed by comparing the distance between surgical clips and tumors in the navigation software with the same distance on a postoperative CT of the resected specimen. Results: Navigation was successfully established in all 20 patients. However, four cases were excluded from accuracy assessment due to intraoperative sensor detachment (n=3) or incorrect data recording (n=1). The complete navigation workflow was operational within 5-10 minutes. In 16 evaluable patients, 78 clip-to-tumor distances were analyzed. The median navigation accuracy was 3.2 mm [IQR: 2.8-4.8 mm], and an R0 resection was achieved in 15/16 (93.8%) patients and one patient had an R1 vascular resection. Conclusion: Navigation based solely on intra-operative ultrasound is feasible and accurate for liver surgery. This registration-free approach paves the way for simpler and more accurate image guidance systems.
☆ Combined fluorescence and photoacoustic imaging of tozuleristide in muscle tissue in vitro -- toward optically-guided solid tumor surgery: feasibility studies
Near-infrared fluorescence (NIRF) can deliver high-contrast, video-rate, non-contact imaging of tumor-targeted contrast agents with the potential to guide surgeries excising solid tumors. However, it has been met with skepticism for wide-margin excision due to sensitivity and resolution limitations at depths larger than ~5 mm in tissue. To address this limitation, fast-sweep photoacoustic-ultrasound (PAUS) imaging is proposed to complement NIRF. In an exploratory in vitro feasibility study using dark-red bovine muscle tissue, we observed that PAUS scanning can identify tozuleristide, a clinical stage investigational imaging agent, at a concentration of 20 uM from the background at depths of up to ~34 mm, highly extending the capabilities of NIRF alone. The capability of spectroscopic PAUS imaging was tested by direct injection of 20 uM tozuleristide into bovine muscle tissue at a depth of ~ 8 mm. It is shown that laser-fluence compensation and strong clutter suppression enabled by the unique capabilities of the fast-sweep approach greatly improve spectroscopic accuracy and the PA detection limit, and strongly reduce image artifacts. Thus, the combined NIRF-PAUS approach can be promising for comprehensive pre- (with PA) and intra- (with NIRF) operative solid tumor detection and wide-margin excision in optically guided solid tumor surgery.
comment: 24 pages, 10 figures
☆ Towards robust quantitative photoacoustic tomography via learned iterative methods
Photoacoustic tomography (PAT) is a medical imaging modality that can provide high-resolution tissue images based on the optical absorption. Classical reconstruction methods for quantifying the absorption coefficients rely on sufficient prior information to overcome noisy and imperfect measurements. As these methods utilize computationally expensive forward models, the computation becomes slow, limiting their potential for time-critical applications. As an alternative approach, deep learning-based reconstruction methods have been established for faster and more accurate reconstructions. However, most of these methods rely on having a large amount of training data, which is not the case in practice. In this work, we adopt the model-based learned iterative approach for the use in Quantitative PAT (QPAT), in which additional information from the model is iteratively provided to the updating networks, allowing better generalizability with scarce training data. We compare the performance of different learned updates based on gradient descent, Gauss-Newton, and Quasi-Newton methods. The learning tasks are formulated as greedy, requiring iterate-wise optimality, as well as end-to-end, where all networks are trained jointly. The implemented methods are tested with ideal simulated data as well as against a digital twin dataset that emulates scarce training data and high modeling error.
☆ A fragile zero-watermarking method based on dual quaternion matrix decomposition
Medical images play a crucial role in assisting diagnosis, remote consultation, and academic research. However, during the transmission and sharing process, they face serious risks of copyright ownership and content tampering. Therefore, protecting medical images is of great importance. As an effective means of image copyright protection, zero-watermarking technology focuses on constructing watermarks without modifying the original carrier by extracting its stable features, which provides an ideal approach for protecting medical images. This paper aims to propose a fragile zero-watermarking model based on dual quaternion matrix decomposition, which utilizes the operational relationship between the standard part and the dual part of dual quaternions to correlate the original carrier image with the watermark image, and generates zero-watermarking information based on the characteristics of dual quaternion matrix decomposition, ultimately achieving copyright protection and content tampering detection for medical images.
comment: 18 pages, 6 figures, 3 tables
☆ Dark-Field X-Ray Imaging Significantly Improves Deep-Learning based Detection of Synthetic Early-Stage Lung Tumors in Preclinical Models
Low-dose computed tomography (LDCT) is the current standard for lung cancer screening, yet its adoption and accessibility remain limited. Many regions lack LDCT infrastructure, and even among those screened, early-stage cancer detection often yield false positives, as shown in the National Lung Screening Trial (NLST) with a sensitivity of 93.8 percent and a false-positive rate of 26.6 percent. We aim to investigate whether X-ray dark-field imaging (DFI) radiograph, a technique sensitive to small-angle scatter from alveolar microstructure and less susceptible to organ shadowing, can significantly improve early-stage lung tumor detection when coupled with deep-learning segmentation. Using paired attenuation (ATTN) and DFI radiograph images of euthanized mouse lungs, we generated realistic synthetic tumors with irregular boundaries and intensity profiles consistent with physical lung contrast. A U-Net segmentation network was trained on small patches using either ATTN, DFI, or a combination of ATTN and DFI channels. Results show that the DFI-only model achieved a true-positive detection rate of 83.7 percent, compared with 51 percent for ATTN-only, while maintaining comparable specificity (90.5 versus 92.9 percent). The combined ATTN and DFI input achieved 79.6 percent sensitivity and 97.6 percent specificity. In conclusion, DFI substantially improves early-tumor detectability in comparison to standard attenuation radiography and shows potential as an accessible, low-cost, low-dose alternative for pre-clinical or limited-resource screening where LDCT is unavailable.
☆ An Efficient and Generalizable Transfer Learning Method for Weather Condition Detection on Ground Terminals
The increasing adoption of satellite Internet with low-Earth-orbit (LEO) satellites in mega-constellations allows ubiquitous connectivity to rural and remote areas. However, weather events have a significant impact on the performance and reliability of satellite Internet. Adverse weather events such as snow and rain can disturb the performance and operations of satellite Internet's essential ground terminal components, such as satellite antennas, significantly disrupting the space-ground link conditions between LEO satellites and ground stations. This challenge calls for not only region-based weather forecasts but also fine-grained detection capability on ground terminal components of fine-grained weather conditions. Such a capability can assist in fault diagnostics and mitigation for reliable satellite Internet, but its solutions are lacking, not to mention the effectiveness and generalization that are essential in real-world deployments. This paper discusses an efficient transfer learning (TL) method that can enable a ground component to locally detect representative weather-related conditions. The proposed method can detect snow, wet, and other conditions resulting from adverse and typical weather events and shows superior performance compared to the typical deep learning methods, such as YOLOv7, YOLOv9, Faster R-CNN, and R-YOLO. Our TL method also shows the advantage of being generalizable to various scenarios.
♻ ☆ Intelligent Software System for Low-Cost, Brightfield Segmentation: Algorithmic Implementation for Cytometric Auto-Analysis
Bright-field microscopy, a cost-effective solution for live-cell culture, is often the only resource available, along with standard CPUs, for many low-budget labs. The inherent challenges of bright-field images -- their noisiness, low contrast, and dynamic morphology -- coupled with a lack of GPU resources and complex software interfaces, hinder the desired research output. This article presents a novel microscopy image analysis framework designed for low-budget labs equipped with a standard CPU desktop. The Python-based program enables cytometric analysis of live, unstained cells in culture through an advanced computer vision and machine learning pipeline. Crucially, the framework operates on label-free data, requiring no manually annotated training data or training phase. It is accessible via a user-friendly, cross-platform GUI that requires no programming skills, while also providing a scripting interface for programmatic control and integration by developers. The end-to-end workflow performs semantic and instance segmentation, feature extraction, analysis, evaluation, and automated report generation. Its modular architecture supports easy maintenance and flexible integration while supporting both single-image and batch processing. Validated on several unstained cell types from the public dataset of livecells, the framework demonstrates superior accuracy and reproducibility compared to contemporary tools like Cellpose and StarDist. Its competitive segmentation speed on a CPU-based platform highlights its significant potential for basic research and clinical applications -- particularly in cell transplantation for personalised medicine and muscle regeneration therapies. The access to the application is available for reproducibility
♻ ☆ LV-UNet: A Lightweight and Vanilla Model for Medical Image Segmentation
While large models have achieved significant progress in computer vision, challenges such as optimization complexity, the intricacy of transformer architectures, computational constraints, and practical application demands highlight the importance of simpler model designs in medical image segmentation. This need is particularly pronounced in mobile medical devices, which require lightweight, deployable models with real-time performance. However, existing lightweight models often suffer from poor robustness across datasets, limiting their widespread adoption. To address these challenges, this paper introduces LV-UNet, a lightweight and vanilla model that leverages pre-trained MobileNetv3-Large backbones and incorporates fusible modules. LV-UNet employs an enhanced deep training strategy and switches to a deployment mode during inference by re-parametrization, significantly reducing parameter count and computational overhead. Experimental results on ISIC 2016, BUSI, CVC-ClinicDB, CVC-ColonDB, and Kvair-SEG datasets demonstrate a better trade-off between performance and the computational load. The code will be released at https://github.com/juntaoJianggavin/LV-UNet.
comment: Accepted by IEEE BIBM2024 ML4BMI workshop
♻ ☆ Augmented Reality-based Guidance with Deformable Registration in Head and Neck Tumor Resection MICCAI 2025
Head and neck squamous cell carcinoma (HNSCC) has one of the highest rates of recurrence cases among solid malignancies. Recurrence rates can be reduced by improving positive margins localization. Frozen section analysis (FSA) of resected specimens is the gold standard for intraoperative margin assessment. However, because of the complex 3D anatomy and the significant shrinkage of resected specimens, accurate margin relocation from specimen back onto the resection site based on FSA results remains challenging. We propose a novel deformable registration framework that uses both the pre-resection upper surface and the post-resection site of the specimen to incorporate thickness information into the registration process. The proposed method significantly improves target registration error (TRE), demonstrating enhanced adaptability to thicker specimens. In tongue specimens, the proposed framework improved TRE by up to 33% as compared to prior deformable registration. Notably, tongue specimens exhibit complex 3D anatomies and hold the highest clinical significance compared to other head and neck specimens from the buccal and skin. We analyzed distinct deformation behaviors in different specimens, highlighting the need for tailored deformation strategies. To further aid intraoperative visualization, we also integrated this framework with an augmented reality-based auto-alignment system. The combined system can accurately and automatically overlay the deformed 3D specimen mesh with positive margin annotation onto the resection site. With a pilot study of the AR guided framework involving two surgeons, the integrated system improved the surgeons' average target relocation error from 9.8 cm to 4.8 cm.
comment: Accepted at MICCAI 2025
♻ ☆ Highly Undersampled MRI Reconstruction via a Single Posterior Sampling of Diffusion Models
Incoherent k-space undersampling and deep learning-based reconstruction methods have shown great success in accelerating MRI. However, the performance of most previous methods will degrade dramatically under high acceleration factors, e.g., 8$\times$ or higher. Recently, denoising diffusion models (DM) have demonstrated promising results in solving this issue; however, one major drawback of the DM methods is the long inference time due to a dramatic number of iterative reverse posterior sampling steps. In this work, a Single Step Diffusion Model-based reconstruction framework, namely SSDM-MRI, is proposed for restoring MRI images from highly undersampled k-space. The proposed method achieves one-step reconstruction by first training a conditional DM and then iteratively distilling this model four times using an iterative selective distillation algorithm, which works synergistically with a shortcut reverse sampling strategy for model inference. Comprehensive experiments were carried out on both publicly available fastMRI brain and knee images, as well as an in-house multi-echo GRE (QSM) subject. Overall, the results showed that SSDM-MRI outperformed other methods in terms of numerical metrics (e.g., PSNR and SSIM), error maps, image fine details, and latent susceptibility information hidden in MRI phase images. In addition, the reconstruction time for a 320$\times$320 brain slice of SSDM-MRI is only 0.45 second, which is only comparable to that of a simple U-net, making it a highly effective solution for MRI reconstruction tasks.
♻ ☆ Spatially Regularized Super-Resolved Constrained Spherical Deconvolution (SR$^2$-CSD) of Diffusion MRI Data
Constrained Spherical Deconvolution (CSD) is widely used to estimate the white matter fiber orientation distribution (FOD) from diffusion MRI data. Its angular resolution depends on the maximum spherical harmonic order ($l_{max}$): low $l_{max}$ yields smooth but poorly resolved FODs, while high $l_{max}$, as in Super-CSD, enables resolving fiber crossings with small inter-fiber angles but increases sensitivity to noise. In this proof-of-concept study, we introduce Spatially Regularized Super-Resolved CSD (SR$^2$-CSD), a novel method that regularizes Super-CSD using a spatial FOD prior estimated via a self-calibrated total variation denoiser. We evaluated SR$^2$-CSD against CSD and Super-CSD across four datasets: (i) the HARDI-2013 challenge numerical phantom, assessing angular and peak number errors across multiple signal-to-noise ratio (SNR) levels and CSD variants (single-/multi-shell, single-/multi-tissue); (ii) the Sherbrooke in vivo dataset, evaluating spatial coherence of FODs; (iii) a six-subject test-retest dataset acquired with both full (96 gradient directions) and subsampled (45 directions) protocols, assessing reproducibility; and (iv) the DiSCo phantom, evaluating tractography accuracy under varying SNR levels and multiple noise repetitions. Across all evaluations, SR$^2$-CSD consistently reduced angular and peak number errors, improved spatial coherence, enhanced test-retest reproducibility, and yielded connectivity matrices more strongly correlated with ground-truth. Most improvements were statistically significant under multiple-comparison correction. These results demonstrate that incorporating spatial priors into CSD is feasible, mitigates estimation instability, and improves FOD reconstruction accuracy.
comment: 21 pages, 9 figures; Supplementary Material appended after the References
♻ ☆ Poisson Informed Retinex Network for Extreme Low-Light Image Enhancement
Low-light image denoising and enhancement are challenging, especially when traditional noise assumptions, such as Gaussian noise, do not hold in majority. In many real-world scenarios, such as low-light imaging, noise is signal-dependent and is better represented as Poisson noise. In this work, we address the problem of denoising images degraded by Poisson noise under extreme low-light conditions. We introduce a light-weight deep learning-based method that integrates Retinex based decomposition with Poisson denoising into a unified encoder-decoder network. The model simultaneously enhances illumination and suppresses noise by incorporating a Poisson denoising loss to address signal-dependent noise. Without prior requirement for reflectance and illumination, the network learns an effective decomposition process while ensuring consistent reflectance and smooth illumination without causing any form of color distortion. The experimental results demonstrate the effectiveness and practicality of the proposed low-light illumination enhancement method. Our method significantly improves visibility and brightness in low-light conditions, while preserving image structure and color constancy under ambient illumination.
comment: 10 pages, 5 figures and 1 table
♻ ☆ Panoramic Out-of-Distribution Segmentation for Autonomous Driving
Panoramic imaging enables capturing 360{\deg} images with an ultra-wide Field-of-View (FoV) for dense omnidirectional perception, which is critical to applications, such as autonomous driving and augmented reality, etc. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to background clutter and pixel distortions. To address these issues, we introduce a new task, Panoramic Out-of-distribution Segmentation (PanOoS), with the aim of achieving comprehensive and safe scene understanding. Furthermore, we propose the first solution, POS, which adapts to the characteristics of panoramic images through text-guided prompt distribution learning. Specifically, POS integrates a disentanglement strategy designed to materialize the cross-domain generalization capability of CLIP. The proposed Prompt-based Restoration Attention (PRA) optimizes semantic decoding by prompt guidance and self-adaptive correction, while Bilevel Prompt Distribution Learning (BPDL) refines the manifold of per-pixel mask embeddings via semantic prototype supervision. Besides, to compensate for the scarcity of PanOoS datasets, we establish two benchmarks: DenseOoS, which features diverse outliers in complex environments, and QuadOoS, captured by a quadruped robot with a panoramic annular lens system. Extensive experiments demonstrate superior performance of POS, with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS, outperforming state-of-the-art pinhole-OoS methods. Moreover, POS achieves leading closed-set segmentation capabilities and advances the development of panoramic understanding. Code and datasets will be available at https://github.com/MengfeiD/PanOoS.
comment: Code and datasets will be available at https://github.com/MengfeiD/PanOoS
♻ ☆ Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning
We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
comment: This work is to appear in the Proceedings of MICAD 2025, the 6th International Conference on Medical Imaging and Computer-Aided Diagnosis
♻ ☆ Symmetric Entropy-Constrained Video Coding for Machines
As video transmission increasingly serves machine vision systems (MVS) instead of human vision systems (HVS), video coding for machines (VCM) has become a critical research topic. Existing VCM methods often bind codecs to specific downstream models, requiring retraining or supervised data, thus limiting generalization in multi-task scenarios. Recently, unified VCM frameworks have employed visual backbones (VB) and visual foundation models (VFM) to support multiple video understanding tasks with a single codec. They mainly utilize VB/VFM to maintain semantic consistency or suppress non-semantic information, but seldom explore how to directly link video coding with understanding under VB/VFM guidance. Hence, we propose a Symmetric Entropy-Constrained Video Coding framework for Machines (SEC-VCM). It establishes a symmetric alignment between the video codec and VB, allowing the codec to leverage VB's representation capabilities to preserve semantics and discard MVS-irrelevant information. Specifically, a bi-directional entropy-constraint (BiEC) mechanism ensures symmetry between the process of video decoding and VB encoding by suppressing conditional entropy. This helps the codec to explicitly handle semantic information beneficial to MVS while squeezing useless information. Furthermore, a semantic-pixel dual-path fusion (SPDF) module injects pixel-level priors into the final reconstruction. Through semantic-pixel fusion, it suppresses artifacts harmful to MVS and improves machine-oriented reconstruction quality. Experimental results show our framework achieves state-of-the-art~(SOTA) in rate-task performance, with significant bitrate savings over VTM on video instance segmentation (37.4%), video object segmentation (29.8%), object detection (46.2%), and multiple object tracking (44.9%). We will release our code soon.
comment: This paper is submitted to the IEEE Transactions
Computer Vision and Pattern Recognition 150
☆ Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io
comment: Project Page: https://video-cof.github.io
☆ OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes
There are two prevalent ways to constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.
comment: Project page: https://yukun-huang.github.io/OmniX/
☆ Masked Diffusion Captioning for Visual Feature Learning EMNLP 2025
We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
comment: EMNLP 2025 (Findings). Project page: https://cfeng16.github.io/mdlm4vfl/
☆ SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting
Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate See4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.
comment: 26 pages; 21 figures; 3 tables; project page: https://see-4d.github.io/
☆ Scaling Image Geo-Localization to Continent Level NeurIPS 2025
Determining the precise geographic location of an image at a global scale remains an unsolved challenge. Standard image retrieval techniques are inefficient due to the sheer volume of images (>100M) and fail when coverage is insufficient. Scalable solutions, however, involve a trade-off: global classification typically yields coarse results (10+ kilometers), while cross-view retrieval between ground and aerial imagery suffers from a domain gap and has been primarily studied on smaller regions. This paper introduces a hybrid approach that achieves fine-grained geo-localization across a large geographic expanse the size of a continent. We leverage a proxy classification task during training to learn rich feature representations that implicitly encode precise location information. We combine these learned prototypes with embeddings of aerial imagery to increase robustness to the sparsity of ground-level data. This enables direct, fine-grained retrieval over areas spanning multiple countries. Our extensive evaluation demonstrates that our approach can localize within 200m more than 68\% of queries of a dataset covering a large part of Europe. The code is publicly available at https://scaling-geoloc.github.io.
comment: NeurIPS 2025
☆ The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.
☆ HEIR: Learning Graph-Based Motion Hierarchies
Hierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks. Project Page: https://light.princeton.edu/HEIR/
comment: Code link: https://github.com/princeton-computational-imaging/HEIR
☆ Clone Deterministic 3D Worlds with Geometrically-Regularized World Models
A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. Despite rapid progress, current world models remain brittle and degrade over long horizons. We argue that a central cause is representation quality: exteroceptive inputs (e.g., images) are high-dimensional, and lossy or entangled latents make dynamics learning unnecessarily hard. We therefore ask whether improving representation learning alone can substantially improve world-model performance. In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone and overfit to a deterministic 3D world. We propose Geometrically-Regularized World Models (GRWM), which enforces that consecutive points along a natural sensory trajectory remain close in latent representation space. This approach yields significantly improved latent representations that align closely with the true topology of the environment. GRWM is plug-and-play, requires only minimal architectural modification, scales with trajectory length, and is compatible with diverse latent generative backbones. Across deterministic 3D settings and long-horizon prediction tasks, GRWM significantly increases rollout fidelity and stability. Analyses show that its benefits stem from learning a latent manifold with superior geometric structure. These findings support a clear takeaway: improving representation learning is a direct and useful path to robust world models, delivering reliable long-horizon predictions without enlarging the dynamics module.
☆ ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs' capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.
☆ Surpassing state of the art on AMD area estimation from RGB fundus images through careful selection of U-Net architectures and loss functions for class imbalance
Age-related macular degeneration (AMD) is one of the leading causes of irreversible vision impairment in people over the age of 60. This research focuses on semantic segmentation for AMD lesion detection in RGB fundus images, a non-invasive and cost-effective imaging technique. The results of the ADAM challenge - the most comprehensive AMD detection from RGB fundus images research competition and open dataset to date - serve as a benchmark for our evaluation. Taking the U-Net connectivity as a base of our framework, we evaluate and compare several approaches to improve the segmentation model's architecture and training pipeline, including pre-processing techniques, encoder (backbone) deep network types of varying complexity, and specialized loss functions to mitigate class imbalances on image and pixel levels. The main outcome of this research is the final configuration of the AMD detection framework, which outperforms all the prior ADAM challenge submissions on the multi-class segmentation of different AMD lesion types in non-invasive RGB fundus images. The source code used to conduct the experiments presented in this paper is made freely available.
☆ SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models
This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. Our steering module requires learning parameters equal to 0.14% of the original VLM's size. Our steering module gains model control through dimension-wise activation modulation and adaptive steering across layers without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and proposes a robust solution for multimodal model control through activation engineering.
☆ MORE: Multi-Organ Medical Image REconstruction Dataset
CT reconstruction provides radiologists with images for diagnosis and treatment, yet current deep learning methods are typically limited to specific anatomies and datasets, hindering generalization ability to unseen anatomies and lesions. To address this, we introduce the Multi-Organ medical image REconstruction (MORE) dataset, comprising CT scans across 9 diverse anatomies with 15 lesion types. This dataset serves two key purposes: (1) enabling robust training of deep learning models on extensive, heterogeneous data, and (2) facilitating rigorous evaluation of model generalization for CT reconstruction. We further establish a strong baseline solution that outperforms prior approaches under these challenging conditions. Our results demonstrate that: (1) a comprehensive dataset helps improve the generalization capability of models, and (2) optimization-based methods offer enhanced robustness for unseen anatomies. The MORE dataset is freely accessible under CC-BY-NC 4.0 at our project page https://more-med.github.io/
comment: Accepted to ACMMM 2025
☆ ProstNFound+: A Prospective Study using Medical Foundation Models for Prostate Cancer Detection
Purpose: Medical foundation models (FMs) offer a path to build high-performance diagnostic systems. However, their application to prostate cancer (PCa) detection from micro-ultrasound ({\mu}US) remains untested in clinical settings. We present ProstNFound+, an adaptation of FMs for PCa detection from {\mu}US, along with its first prospective validation. Methods: ProstNFound+ incorporates a medical FM, adapter tuning, and a custom prompt encoder that embeds PCa-specific clinical biomarkers. The model generates a cancer heatmap and a risk score for clinically significant PCa. Following training on multi-center retrospective data, the model is prospectively evaluated on data acquired five years later from a new clinical site. Model predictions are benchmarked against standard clinical scoring protocols (PRI-MUS and PI-RADS). Results: ProstNFound+ shows strong generalization to the prospective data, with no performance degradation compared to retrospective evaluation. It aligns closely with clinical scores and produces interpretable heatmaps consistent with biopsy-confirmed lesions. Conclusion: The results highlight its potential for clinical deployment, offering a scalable and interpretable alternative to expert-driven protocols.
☆ The Impact and Outlook of 3D Gaussian Splatting
Since its introduction, 3D Gaussian Splatting (3DGS) has rapidly transformed the landscape of 3D scene representations, inspiring an extensive body of associated research. Follow-up work includes analyses and contributions that enhance the efficiency, scalability, and real-world applicability of 3DGS. In this summary, we present an overview of several key directions that have emerged in the wake of 3DGS. We highlight advances enabling resource-efficient training and rendering, the evolution toward dynamic (or four-dimensional, 4DGS) representations, and deeper exploration of the mathematical foundations underlying its appearance modeling and rendering process. Furthermore, we examine efforts to bring 3DGS to mobile and virtual reality platforms, its extension to massive-scale environments, and recent progress toward near-instant radiance field reconstruction via feed-forward or distributed computation. Collectively, these developments illustrate how 3DGS has evolved from a breakthrough representation into a versatile and foundational tool for 3D vision and graphics.
comment: Article written for Frontiers of Science Award, International Congress on Basic Science, 2025
☆ Process Integrated Computer Vision for Real-Time Failure Prediction in Steel Rolling Mill
We present a long-term deployment study of a machine vision-based anomaly detection system for failure prediction in a steel rolling mill. The system integrates industrial cameras to monitor equipment operation, alignment, and hot bar motion in real time along the process line. Live video streams are processed on a centralized video server using deep learning models, enabling early prediction of equipment failures and process interruptions, thereby reducing unplanned breakdown costs. Server-based inference minimizes the computational load on industrial process control systems (PLCs), supporting scalable deployment across production lines with minimal additional resources. By jointly analyzing sensor data from data acquisition systems and visual inputs, the system identifies the location and probable root causes of failures, providing actionable insights for proactive maintenance. This integrated approach enhances operational reliability, productivity, and profitability in industrial manufacturing environments.
☆ Improving Classification of Occluded Objects through Scene Context
The presence of occlusions has provided substantial challenges to typically-powerful object recognition algorithms. Additional sources of information can be extremely valuable to reduce errors caused by occlusions. Scene context is known to aid in object recognition in biological vision. In this work, we attempt to add robustness into existing Region Proposal Network-Deep Convolutional Neural Network (RPN-DCNN) object detection networks through two distinct scene-based information fusion techniques. We present one algorithm under each methodology: the first operates prior to prediction, selecting a custom object network to use based on the identified background scene, and the second operates after detection, fusing scene knowledge into initial object scores output by the RPN. We demonstrate our algorithms on challenging datasets featuring partial occlusions, which show overall improvement in both recall and precision against baseline methods. In addition, our experiments contrast multiple training methodologies for occlusion handling, finding that training on a combination of both occluded and unoccluded images demonstrates an improvement over the others. Our method is interpretable and can easily be adapted to other datasets, offering many future directions for research and practical applications.
☆ BRIQA: Balanced Reweighting in Image Quality Assessment of Pediatric Brain MRI
Assessing the severity of artifacts in pediatric brain Magnetic Resonance Imaging (MRI) is critical for diagnostic accuracy, especially in low-field systems where the signal-to-noise ratio is reduced. Manual quality assessment is time-consuming and subjective, motivating the need for robust automated solutions. In this work, we propose BRIQA (Balanced Reweighting in Image Quality Assessment), which addresses class imbalance in artifact severity levels. BRIQA uses gradient-based loss reweighting to dynamically adjust per-class contributions and employs a rotating batching scheme to ensure consistent exposure to underrepresented classes. Through experiments, no single architecture performs best across all artifact types, emphasizing the importance of architectural diversity. The rotating batching configuration improves performance across metrics by promoting balanced learning when combined with cross-entropy loss. BRIQA improves average macro F1 score from 0.659 to 0.706, with notable gains in Noise (0.430), Zipper (0.098), Positioning (0.097), Contrast (0.217), Motion (0.022), and Banding (0.012) artifact severity classification. The code is available at https://github.com/BioMedIA-MBZUAI/BRIQA.
☆ Towards Reliable Sea Ice Drift Estimation in the Arctic Deep Learning Optical Flow on RADARSAT-2
Accurate estimation of sea ice drift is critical for Arctic navigation, climate research, and operational forecasting. While optical flow, a computer vision technique for estimating pixel wise motion between consecutive images, has advanced rapidly in computer vision, its applicability to geophysical problems and to satellite SAR imagery remains underexplored. Classical optical flow methods rely on mathematical models and strong assumptions about motion, which limit their accuracy in complex scenarios. Recent deep learning based approaches have substantially improved performance and are now the standard in computer vision, motivating their application to sea ice drift estimation. We present the first large scale benchmark of 48 deep learning optical flow models on RADARSAT 2 ScanSAR sea ice imagery, evaluated with endpoint error (EPE) and Fl all metrics against GNSS tracked buoys. Several models achieve sub kilometer accuracy (EPE 6 to 8 pixels, 300 to 400 m), a small error relative to the spatial scales of sea ice motion and typical navigation requirements in the Arctic. Our results demonstrate that the models are capable of capturing consistent regional drift patterns and that recent deep learning based optical flow methods, which have substantially improved motion estimation accuracy compared to classical methods, can be effectively transferred to polar remote sensing. Optical flow produces spatially continuous drift fields, providing motion estimates for every image pixel rather than at sparse buoy locations, offering new opportunities for navigation and climate modeling.
☆ All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles
Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.
☆ SAMRI: Segment Anything Model for MRI
Accurate magnetic resonance imaging (MRI) segmentation is crucial for clinical decision-making, but remains labor-intensive when performed manually. Convolutional neural network (CNN)-based methods can be accurate and efficient, but often generalize poorly to MRI's variable contrast, intensity inhomogeneity, and protocols. Although the transformer-based Segment Anything Model (SAM) has demonstrated remarkable generalizability in natural images, existing adaptations often treat MRI as another imaging modality, overlooking these modality-specific challenges. We present SAMRI, an MRI-specialized SAM trained and validated on 1.1 million labeled MR slices spanning whole-body organs and pathologies. We demonstrate that SAM can be effectively adapted to MRI by simply fine-tuning its mask decoder using a two-stage strategy, reducing training time by 94% and trainable parameters by 96% versus full-model retraining. Across diverse MRI segmentation tasks, SAMRI achieves a mean Dice of 0.87, delivering state-of-the-art accuracy across anatomical regions and robust generalization on unseen structures, particularly small and clinically important structures.
☆ PT-DETR: Small Target Detection Based on Partially-Aware Detail Focus
To address the challenges in UAV object detection, such as complex backgrounds, severe occlusion, dense small objects, and varying lighting conditions,this paper proposes PT-DETR based on RT-DETR, a novel detection algorithm specifically designed for small objects in UAV imagery. In the backbone network, we introduce the Partially-Aware Detail Focus (PADF) Module to enhance feature extraction for small objects. Additionally,we design the Median-Frequency Feature Fusion (MFFF) module,which effectively improves the model's ability to capture small-object details and contextual information. Furthermore,we incorporate Focaler-SIoU to strengthen the model's bounding box matching capability and increase its sensitivity to small-object features, thereby further enhancing detection accuracy and robustness. Compared with RT-DETR, our PT-DETR achieves mAP improvements of 1.6% and 1.7% on the VisDrone2019 dataset with lower computational complexity and fewer parameters, demonstrating its robustness and feasibility for small-object detection tasks.
☆ Spiking Patches: Asynchronous, Sparse, and Efficient Tokens for Event Cameras
We propose tokenization of events and present a tokenizer, Spiking Patches, specifically designed for event cameras. Given a stream of asynchronous and spatially sparse events, our goal is to discover an event representation that preserves these properties. Prior works have represented events as frames or as voxels. However, while these representations yield high accuracy, both frames and voxels are synchronous and decrease the spatial sparsity. Spiking Patches gives the means to preserve the unique properties of event cameras and we show in our experiments that this comes without sacrificing accuracy. We evaluate our tokenizer using a GNN, PCN, and a Transformer on gesture recognition and object detection. Tokens from Spiking Patches yield inference times that are up to 3.4x faster than voxel-based tokens and up to 10.4x faster than frames. We achieve this while matching their accuracy and even surpassing in some cases with absolute improvements up to 3.8 for gesture recognition and up to 1.4 for object detection. Thus, tokenization constitutes a novel direction in event-based vision and marks a step towards methods that preserve the properties of event cameras.
☆ CYPRESS: Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing
Accurate and timely crop yield prediction is crucial for global food security and modern agricultural management. Traditional methods often lack the scalability and granularity required for precision farming. This paper introduces CYPRESS (Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing), a deep learning model designed for high-resolution, intra-field canola yield prediction. CYPRESS leverages a pre-trained, large-scale geospatial foundation model (Prithvi-EO-2.0-600M) and adapts it for a continuous regression task, transforming multi-temporal satellite imagery into dense, pixel-level yield maps. Evaluated on a comprehensive dataset from the Canadian Prairies, CYPRESS demonstrates superior performance over existing deep learning-based yield prediction models, highlighting the effectiveness of fine-tuning foundation models for specialized agricultural applications. By providing a continuous, high-resolution output, CYPRESS offers a more actionable tool for precision agriculture than conventional classification or county-level aggregation methods. This work validates a novel approach that bridges the gap between large-scale Earth observation and on-farm decision-making, offering a scalable solution for detailed agricultural monitoring.
☆ ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching
Computational Super-Resolution (CSR) in fluorescence microscopy has, despite being an ill-posed problem, a long history. At its very core, CSR is about finding a prior that can be used to extrapolate frequencies in a micrograph that have never been imaged by the image-generating microscope. It stands to reason that, with the advent of better data-driven machine learning techniques, stronger prior can be learned and hence CSR can lead to better results. Here, we present ResMatching, a novel CSR method that uses guided conditional flow matching to learn such improved data-priors. We evaluate ResMatching on 4 diverse biological structures from the BioSR dataset and compare its results against 7 baselines. ResMatching consistently achieves competitive results, demonstrating in all cases the best trade-off between data fidelity and perceptual realism. We observe that CSR using ResMatching is particularly effective in cases where a strong prior is hard to learn, e.g. when the given low-resolution images contain a lot of noise. Additionally, we show that ResMatching can be used to sample from an implicitly learned posterior distribution and that this distribution is calibrated for all tested use-cases, enabling our method to deliver a pixel-wise data-uncertainty term that can guide future users to reject uncertain predictions.
comment: 5 pages, 4 figures
☆ Emu3.5: Native Multimodal Models are World Learners
We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.
comment: project page: https://emu.world
☆ CATCH: A Modular Cross-domain Adaptive Template with Hook
Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and not scalable across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier to identify the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, requiring no retraining of the backbone model. Experimental results across four domain-specific VQA benchmarks demonstrate that our framework achieves consistent performance gains without retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results highlight that CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.
☆ Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios
In real-world environments, AI systems often face unfamiliar scenarios without labeled data, creating a major challenge for conventional scene understanding models. The inability to generalize across unseen contexts limits the deployment of vision-based applications in dynamic, unstructured settings. This work introduces a Dynamic Context-Aware Scene Reasoning framework that leverages Vision-Language Alignment to address zero-shot real-world scenarios. The goal is to enable intelligent systems to infer and adapt to new environments without prior task-specific training. The proposed approach integrates pre-trained vision transformers and large language models to align visual semantics with natural language descriptions, enhancing contextual comprehension. A dynamic reasoning module refines predictions by combining global scene cues and object-level interactions guided by linguistic priors. Extensive experiments on zero-shot benchmarks such as COCO, Visual Genome, and Open Images demonstrate up to 18% improvement in scene understanding accuracy over baseline models in complex and unseen environments. Results also show robust performance in ambiguous or cluttered scenes due to the synergistic fusion of vision and language. This framework offers a scalable and interpretable approach for context-aware reasoning, advancing zero-shot generalization in dynamic real-world settings.
comment: Preprint under review at IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
☆ Comparative Analysis of Deep Learning Models for Olive Tree Crown and Shadow Segmentation Towards Biovolume Estimation
Olive tree biovolume estimation is a key task in precision agriculture, supporting yield prediction and resource management, especially in Mediterranean regions severely impacted by climate-induced stress. This study presents a comparative analysis of three deep learning models U-Net, YOLOv11m-seg, and Mask RCNN for segmenting olive tree crowns and their shadows in ultra-high resolution UAV imagery. The UAV dataset, acquired over Vicopisano, Italy, includes manually annotated crown and shadow masks. Building on these annotations, the methodology emphasizes spatial feature extraction and robust segmentation; per-tree biovolume is then estimated by combining crown projected area with shadow-derived height using solar geometry. In testing, Mask R-CNN achieved the best overall accuracy (F1 = 0.86; mIoU = 0.72), while YOLOv11m-seg provided the fastest throughput (0.12 second per image). The estimated biovolumes spanned from approximately 4 to 24 cubic meters, reflecting clear structural differences among trees. These results indicate Mask R-CNN is preferable when biovolume accuracy is paramount, whereas YOLOv11m-seg suits large-area deployments where speed is critical; U-Net remains a lightweight, high-sensitivity option. The framework enables accurate, scalable orchard monitoring and can be further strengthened with DEM or DSM integration and field calibration for operational decision support.
comment: 6 pages, 2025 IEEE International Workshop on Metrology for Agriculture and Forestry (MetroAgriFor)
☆ AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping
Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under Curve, Spearman, and Kendall.
comment: Accepted at 32nd International Conference on MultiMedia Modeling
☆ SA$^{2}$Net: Scale-Adaptive Structure-Affinity Transformation for Spine Segmentation from Ultrasound Volume Projection Imaging
Spine segmentation, based on ultrasound volume projection imaging (VPI), plays a vital role for intelligent scoliosis diagnosis in clinical applications. However, this task faces several significant challenges. Firstly, the global contextual knowledge of spines may not be well-learned if we neglect the high spatial correlation of different bone features. Secondly, the spine bones contain rich structural knowledge regarding their shapes and positions, which deserves to be encoded into the segmentation process. To address these challenges, we propose a novel scale-adaptive structure-aware network (SA$^{2}$Net) for effective spine segmentation. First, we propose a scale-adaptive complementary strategy to learn the cross-dimensional long-distance correlation features for spinal images. Second, motivated by the consistency between multi-head self-attention in Transformers and semantic level affinity, we propose structure-affinity transformation to transform semantic features with class-specific affinity and combine it with a Transformer decoder for structure-aware reasoning. In addition, we adopt a feature mixing loss aggregation method to enhance model training. This method improves the robustness and accuracy of the segmentation process. The experimental results demonstrate that our SA$^{2}$Net achieves superior segmentation performance compared to other state-of-the-art methods. Moreover, the adaptability of SA$^{2}$Net to various backbones enhances its potential as a promising tool for advanced scoliosis diagnosis using intelligent spinal image analysis. The code and experimental demo are available at https://github.com/taetiseo09/SA2Net.
comment: Accepted by Computerized Medical Imaging and Graphics (CMIG)
☆ Analysis of the Robustness of an Edge Detector Based on Cellular Automata Optimized by Particle Swarm
The edge detection task is essential in image processing aiming to extract relevant information from an image. One recurring problem in this task is the weaknesses found in some detectors, such as the difficulty in detecting loose edges and the lack of context to extract relevant information from specific problems. To address these weaknesses and adapt the detector to the properties of an image, an adaptable detector described by two-dimensional cellular automaton and optimized by meta-heuristic combined with transfer learning techniques was developed. This study aims to analyze the impact of expanding the search space of the optimization phase and the robustness of the adaptability of the detector in identifying edges of a set of natural images and specialized subsets extracted from the same image set. The results obtained prove that expanding the search space of the optimization phase was not effective for the chosen image set. The study also analyzed the adaptability of the model through a series of experiments and validation techniques and found that, regardless of the validation, the model was able to adapt to the input and the transfer learning techniques applied to the model showed no significant improvements.
☆ Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (i.e., tail data). This leads to an imbalanced optimization that drives the model to prioritize simple reasoning skills, while hindering its ability to tackle more complex reasoning tasks. Over iterations, this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew effect"--which ultimately hinders further model improvement and leads to performance bottlenecks. To counteract this challenge, we introduce four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks demonstrate that our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
comment: Preprint
☆ Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
☆ Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection
Few-shot anomaly detection (FSAD) methods identify anomalous regions with few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, due to the lack of detailed textual descriptions, these methods can only pre-define image-level descriptions to match each visual patch token to identify potential anomalous regions, which leads to the semantic misalignment between image descriptions and patch-level visual anomalies, achieving sub-optimal localization performance. To address the above issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC) to provide multi-level and fine-grained textual descriptions for existing anomaly detection datasets with automatic construction pipeline. Based on the MFSC, we propose a novel framework named FineGrainedAD to improve anomaly localization performance, which consists of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through automatic replacement and concatenation mechanism, while MLSA designs region aggregation strategy and multi-level alignment training to facilitate learnable prompts better align with corresponding visual regions. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings on MVTec-AD and VisA datasets.
comment: 12 pages, 7 figures
☆ PointSt3R: Point Tracking through 3D Grounded Correspondence
Recent advances in foundational 3D reconstruction models, such as DUSt3R and MASt3R, have shown great potential in 2D and 3D correspondence in static scenes. In this paper, we propose to adapt them for the task of point tracking through 3D grounded correspondence. We first demonstrate that these models are competitive point trackers when focusing on static points, present in current point tracking benchmarks ($+33.5\%$ on EgoPoints vs. CoTracker2). We propose to combine the reconstruction loss with training for dynamic correspondence along with a visibility head, and fine-tuning MASt3R for point tracking using a relatively small amount of synthetic data. Importantly, we only train and evaluate on pairs of frames where one contains the query point, effectively removing any temporal context. Using a mix of dynamic and static point correspondences, we achieve competitive or superior point tracking results on four datasets (e.g. competitive on TAP-Vid-DAVIS 73.8 $\delta_{avg}$ / 85.8\% occlusion acc. for PointSt3R compared to 75.7 / 88.3\% for CoTracker2; and significantly outperform CoTracker3 on EgoPoints 61.3 vs 54.2 and RGB-S 87.0 vs 82.8). We also present results on 3D point tracking along with several ablations on training datasets and percentage of dynamic correspondences.
comment: http://rhodriguerrier.github.io/PointSt3R
☆ A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models
Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs' reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always have optimal angular separation between class-wise textual features, which implies overlooking the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. We show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy through extensive experiments with various backbones on different datasets. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. Our code will be made publicly available.
comment: 23 pages, 14 figures
☆ LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation
Recently text-to-video generation has made impressive progress in producing short, high-quality clips, but evaluating long-form outputs remains a major challenge especially when processing complex prompts. Existing benchmarks mostly rely on simplified prompts and focus on low-level metrics, overlooking fine-grained alignment with prompts and abstract dimensions such as narrative coherence and thematic expression. To address these gaps, we propose LoCoT2V-Bench, a benchmark specifically designed for long video generation (LVG) under complex input conditions. Based on various real-world videos, LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating elements like scene transitions and event dynamics. Moreover, it constructs a multi-dimensional evaluation framework that includes our newly proposed metrics such as event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD) that focuses on more abstract attributes like narrative flow, emotional response, and character development. Using this framework, we conduct a comprehensive evaluation of nine representative LVG models, finding that while current methods perform well on basic visual and temporal aspects, they struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence, etc. Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for evaluating long-form complex text-to-video generation and highlights critical directions for future method improvement.
☆ EEG-Driven Image Reconstruction with Saliency-Guided Diffusion Models
Existing EEG-driven image reconstruction methods often overlook spatial attention mechanisms, limiting fidelity and semantic coherence. To address this, we propose a dual-conditioning framework that combines EEG embeddings with spatial saliency maps to enhance image generation. Our approach leverages the Adaptive Thinking Mapper (ATM) for EEG feature extraction and fine-tunes Stable Diffusion 2.1 via Low-Rank Adaptation (LoRA) to align neural signals with visual semantics, while a ControlNet branch conditions generation on saliency maps for spatial control. Evaluated on THINGS-EEG, our method achieves a significant improvement in the quality of low- and high-level image features over existing approaches. Simultaneously, strongly aligning with human visual attention. The results demonstrate that attentional priors resolve EEG ambiguities, enabling high-fidelity reconstructions with applications in medical diagnostics and neuroadaptive interfaces, advancing neural decoding through efficient adaptation of pre-trained diffusion models.
comment: Demo paper
☆ SPG-CDENet: Spatial Prior-Guided Cross Dual Encoder Network for Multi-Organ Segmentation
Multi-organ segmentation is a critical task in computer-aided diagnosis. While recent deep learning methods have achieved remarkable success in image segmentation, huge variations in organ size and shape challenge their effectiveness in multi-organ segmentation. To address these challenges, we propose a Spatial Prior-Guided Cross Dual Encoder Network (SPG-CDENet), a novel two-stage segmentation paradigm designed to improve multi-organ segmentation accuracy. Our SPG-CDENet consists of two key components: a spatial prior network and a cross dual encoder network. The prior network generates coarse localization maps that delineate the approximate ROI, serving as spatial guidance for the dual encoder network. The cross dual encoder network comprises four essential components: a global encoder, a local encoder, a symmetric cross-attention module, and a flow-based decoder. The global encoder captures global semantic features from the entire image, while the local encoder focuses on features from the prior network. To enhance the interaction between the global and local encoders, a symmetric cross-attention module is proposed across all layers of the encoders to fuse and refine features. Furthermore, the flow-based decoder directly propagates high-level semantic features from the final encoder layer to all decoder layers, maximizing feature preservation and utilization. Extensive qualitative and quantitative experiments on two public datasets demonstrate the superior performance of SPG-CDENet compared to existing segmentation methods. Furthermore, ablation studies further validate the effectiveness of the proposed modules in improving segmentation accuracy.
☆ CorVS: Person Identification via Video Trajectory-Sensor Correspondence in a Real-World Warehouse
Worker location data is key to higher productivity in industrial sites. Cameras are a promising tool for localization in logistics warehouses since they also offer valuable environmental contexts such as package status. However, identifying individuals with only visual data is often impractical. Accordingly, several prior studies identified people in videos by comparing their trajectories and wearable sensor measurements. While this approach has advantages such as independence from appearance, the existing methods may break down under real-world conditions. To overcome this challenge, we propose CorVS, a novel data-driven person identification method based on correspondence between visual tracking trajectories and sensor measurements. Firstly, our deep learning model predicts correspondence probabilities and reliabilities for every pair of a trajectory and sensor measurements. Secondly, our algorithm matches the trajectories and sensor measurements over time using the predicted probabilities and reliabilities. We developed a dataset with actual warehouse operations and demonstrated the method's effectiveness for real-world applications.
comment: 7 pages, 3 figures, accepted to IPIN 2025
☆ AgriGS-SLAM: Orchard Mapping Across Seasons via Multi-View Gaussian Splatting SLAM
Autonomous robots in orchards require real-time 3D scene understanding despite repetitive row geometry, seasonal appearance changes, and wind-driven foliage motion. We present AgriGS-SLAM, a Visual--LiDAR SLAM framework that couples direct LiDAR odometry and loop closures with multi-camera 3D Gaussian Splatting (3DGS) rendering. Batch rasterization across complementary viewpoints recovers orchard structure under occlusions, while a unified gradient-driven map lifecycle executed between keyframes preserves fine details and bounds memory. Pose refinement is guided by a probabilistic LiDAR-based depth consistency term, back-propagated through the camera projection to tighten geometry-appearance coupling. We deploy the system on a field platform in apple and pear orchards across dormancy, flowering, and harvesting, using a standardized trajectory protocol that evaluates both training-view and novel-view synthesis to reduce 3DGS overfitting in evaluation. Across seasons and sites, AgriGS-SLAM delivers sharper, more stable reconstructions and steadier trajectories than recent state-of-the-art 3DGS-SLAM baselines while maintaining real-time performance on-tractor. While demonstrated in orchard monitoring, the approach can be applied to other outdoor domains requiring robust multimodal perception.
☆ GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model? ICLR 2026
Image super-resolution(SR) is fundamental to many vision system-from surveillance and autonomy to document analysis and retail analytics-because recovering high-frequency details, especially scene-text, enables reliable downstream perception. Scene-text, i.e., text embedded in natural images such as signs, product labels, and storefronts, often carries the most actionable information; when characters are blurred or hallucinated, optical character recognition(OCR) and subsequent decisions fail even if the rest of the image appears sharp. Yet previous SR research has often been tuned to distortion (PSNR/SSIM) or learned perceptual metrics (LIPIS, MANIQA, CLIP-IQA, MUSIQ) that are largely insensitive to character-level errors. Furthermore, studies that do address text SR often focus on simplified benchmarks with isolated characters, overlooking the challenges of text within complex natural scenes. As a result, scene-text is effectively treated as generic texture. For SR to be effective in practical deployments, it is therefore essential to explicitly optimize for both text legibility and perceptual quality. We present GLYPH-SR, a vision-language-guided diffusion framework that aims to achieve both objectives jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet(TS-ControlNet) guided by OCR data, and a ping-pong scheduler that alternates between text- and scene-centric guidance. To enable targeted text restoration, we train these components on a synthetic corpus while keeping the main SR branch frozen. Across SVT, SCUT-CTW1500, and CUTE80 at x4, and x8, GLYPH-SR improves OCR F1 by up to +15.18 percentage points over diffusion/GAN baseline (SVT x8, OpenOCR) while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ. GLYPH-SR is designed to satisfy both objectives simultaneously-high readability and high visual realism-delivering SR that looks right and reds right.
comment: 11 pages, 6 figures. Includes supplementary material. Under review as a conference paper at ICLR 2026
☆ A Hybrid Framework Bridging CNN and ViT based on Theory of Evidence for Diabetic Retinopathy Grading
Diabetic retinopathy (DR) is a leading cause of vision loss among middle-aged and elderly people, which significantly impacts their daily lives and mental health. To improve the efficiency of clinical screening and enable the early detection of DR, a variety of automated DR diagnosis systems have been recently established based on convolutional neural network (CNN) or vision Transformer (ViT). However, due to the own shortages of CNN / ViT, the performance of existing methods using single-type backbone has reached a bottleneck. One potential way for the further improvements is integrating different kinds of backbones, which can fully leverage the respective strengths of them (\emph{i.e.,} the local feature extraction capability of CNN and the global feature capturing ability of ViT). To this end, we propose a novel paradigm to effectively fuse the features extracted by different backbones based on the theory of evidence. Specifically, the proposed evidential fusion paradigm transforms the features from different backbones into supporting evidences via a set of deep evidential networks. With the supporting evidences, the aggregated opinion can be accordingly formed, which can be used to adaptively tune the fusion pattern between different backbones and accordingly boost the performance of our hybrid model. We evaluated our method on two publicly available DR grading datasets. The experimental results demonstrate that our hybrid model not only improves the accuracy of DR grading, compared to the state-of-the-art frameworks, but also provides the excellent interpretability for feature fusion and decision-making.
☆ Exploring the correlation between the type of music and the emotions evoked: A study using subjective questionnaires and EEG
The subject of this work is to check how different types of music affect human emotions. While listening to music, a subjective survey and brain activity measurements were carried out using an EEG helmet. The aim is to demonstrate the impact of different music genres on emotions. The research involved a diverse group of participants of different gender and musical preferences. This had the effect of capturing a wide range of emotional responses to music. After the experiment, a relationship analysis of the respondents' questionnaires with EEG signals was performed. The analysis revealed connections between emotions and observed brain activity.
comment: Published at IWAIPR 2025 conference
☆ Towards Realistic Earth-Observation Constellation Scheduling: Benchmark and Methodology
Agile Earth Observation Satellites (AEOSs) constellations offer unprecedented flexibility for monitoring the Earth's surface, but their scheduling remains challenging under large-scale scenarios, dynamic environments, and stringent constraints. Existing methods often simplify these complexities, limiting their real-world performance. We address this gap with a unified framework integrating a standardized benchmark suite and a novel scheduling model. Our benchmark suite, AEOS-Bench, contains $3,907$ finely tuned satellite assets and $16,410$ scenarios. Each scenario features $1$ to $50$ satellites and $50$ to $300$ imaging tasks. These scenarios are generated via a high-fidelity simulation platform, ensuring realistic satellite behavior such as orbital dynamics and resource constraints. Ground truth scheduling annotations are provided for each scenario. To our knowledge, AEOS-Bench is the first large-scale benchmark suite tailored for realistic constellation scheduling. Building upon this benchmark, we introduce AEOS-Former, a Transformer-based scheduling model that incorporates a constraint-aware attention mechanism. A dedicated internal constraint module explicitly models the physical and operational limits of each satellite. Through simulation-based iterative learning, AEOS-Former adapts to diverse scenarios, offering a robust solution for AEOS constellation scheduling. Experimental results demonstrate that AEOS-Former outperforms baseline models in task completion and energy efficiency, with ablation studies highlighting the contribution of each component. Code and data are provided in https://github.com/buaa-colalab/AEOSBench.
☆ Leveraging Large-Scale Face Datasets for Deep Periocular Recognition via Ocular Cropping
We focus on ocular biometrics, specifically the periocular region (the area around the eye), which offers high discrimination and minimal acquisition constraints. We evaluate three Convolutional Neural Network architectures of varying depth and complexity to assess their effectiveness for periocular recognition. The networks are trained on 1,907,572 ocular crops extracted from the large-scale VGGFace2 database. This significantly contrasts with existing works, which typically rely on small-scale periocular datasets for training having only a few thousand images. Experiments are conducted with ocular images from VGGFace2-Pose, a subset of VGGFace2 containing in-the-wild face images, and the UFPR-Periocular database, which consists of selfies captured via mobile devices with user guidance on the screen. Due to the uncontrolled conditions of VGGFace2, the Equal Error Rates (EERs) obtained with ocular crops range from 9-15%, noticeably higher than the 3-6% EERs achieved using full-face images. In contrast, UFPR-Periocular yields significantly better performance (EERs of 1-2%), thanks to higher image quality and more consistent acquisition protocols. To the best of our knowledge, these are the lowest reported EERs on the UFPR dataset to date.
comment: Published at IWAIPR 2025 conference
☆ Beyond Imitation: Constraint-Aware Trajectory Generation with Flow Matching For End-to-End Autonomous Driving
Planning is a critical component of end-to-end autonomous driving. However, prevailing imitation learning methods often suffer from mode collapse, failing to produce diverse trajectory hypotheses. Meanwhile, existing generative approaches struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. To address these limitations, we propose CATG, a novel planning framework that leverages Constrained Flow Matching. Concretely, CATG explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our primary contribution is the novel imposition of explicit constraints directly within the flow matching process, ensuring that the generated trajectories adhere to vital safety and kinematic rules. Secondly, CATG parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Notably, on the NavSim v2 challenge, CATG achieved 2nd place with an EPDMS score of 51.31 and was honored with the Innovation Award.
☆ Exploring Complementarity and Explainability in CNNs for Periocular Verification Across Acquisition Distances
We study the complementarity of different CNNs for periocular verification at different distances on the UBIPr database. We train three architectures of increasing complexity (SqueezeNet, MobileNetv2, and ResNet50) on a large set of eye crops from VGGFace2. We analyse performance with cosine and chi2 metrics, compare different network initialisations, and apply score-level fusion via logistic regression. In addition, we use LIME heatmaps and Jensen-Shannon divergence to compare attention patterns of the CNNs. While ResNet50 consistently performs best individually, the fusion provides substantial gains, especially when combining all three networks. Heatmaps show that networks usually focus on distinct regions of a given image, which explains their complementarity. Our method significantly outperforms previous works on UBIPr, achieving a new state-of-the-art.
comment: Accepted at BIOSIG 2025 conference
☆ Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws NeurIPS 2025
Existing infrared and visible image fusion methods often face the dilemma of balancing modal information. Generative fusion methods reconstruct fused images by learning from data distributions, but their generative capabilities remain limited. Moreover, the lack of interpretability in modal information selection further affects the reliability and consistency of fusion results in complex scenarios. This manuscript revisits the essence of generative image fusion under the inspiration of human cognitive laws and proposes a novel infrared and visible image fusion method, termed HCLFuse. First, HCLFuse investigates the quantification theory of information mapping in unsupervised fusion networks, which leads to the design of a multi-scale mask-regulated variational bottleneck encoder. This encoder applies posterior probability modeling and information decomposition to extract accurate and concise low-level modal information, thereby supporting the generation of high-fidelity structural details. Furthermore, the probabilistic generative capability of the diffusion model is integrated with physical laws, forming a time-varying physical guidance mechanism that adaptively regulates the generation process at different stages, thereby enhancing the ability of the model to perceive the intrinsic structure of data and reducing dependence on data quality. Experimental results show that the proposed method achieves state-of-the-art fusion performance in qualitative and quantitative evaluations across multiple datasets and significantly improves semantic segmentation metrics. This fully demonstrates the advantages of this generative image fusion method, drawing inspiration from human cognition, in enhancing structural consistency and detail quality.
comment: NeurIPS 2025 spotlight
☆ Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT)-whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
comment: 10 pages
☆ OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation
Document AI has advanced rapidly and is attracting increasing attention. Yet, while most efforts have focused on document layout analysis (DLA), its generative counterpart, document layout generation, remains underexplored. A major obstacle lies in the scarcity of diverse layouts: academic papers with Manhattan-style structures dominate existing studies, while open-world genres such as newspapers and magazines remain severely underrepresented. To address this gap, we curate OmniLayout-1M, the first million-scale dataset of diverse document layouts, covering six common document types and comprising contemporary layouts collected from multiple sources. Moreover, since existing methods struggle in complex domains and often fail to arrange long sequences coherently, we introduce OmniLayout-LLM, a 0.5B model with designed two-stage Coarse-to-Fine learning paradigm: 1) learning universal layout principles from OmniLayout-1M with coarse category definitions, and 2) transferring the knowledge to a specific domain with fine-grained annotations. Extensive experiments demonstrate that our approach achieves strong performance on multiple domains in M$^{6}$Doc dataset, substantially surpassing both existing layout generation experts and several latest general-purpose LLMs. Our code, models, and dataset will be publicly released.
comment: TL;DR: With OmniLayout-1M dataset and LLM-based coarse-to-fine learning, we enable universal and diverse document layout generation
☆ Developing a Multi-task Ensemble Geometric Deep Network for Supply Chain Sustainability and Risk Management
The sustainability of supply chain plays a key role in achieving optimal performance in controlling the supply chain. The management of risks that occur in a supply chain is a fundamental problem for the purpose of developing the sustainability of the network and elevating the performance efficiency of the supply chain. The correct classification of products is another essential element in a sustainable supply chain. Acknowledging recent breakthroughs in the context of deep networks, several architectural options have been deployed to analyze supply chain datasets. A novel geometric deep network is used to propose an ensemble deep network. The proposed Chebyshev ensemble geometric network (Ch-EGN) is a hybrid convolutional and geometric deep learning. This network is proposed to leverage the information dependencies in supply chain to derive invisible states of samples in the database. The functionality of the proposed deep network is assessed on the two different databases. The SupplyGraph Dataset and DataCo are considered in this research. The prediction of delivery status of DataCo supply chain is done for risk administration. The product classification and edge classification are performed using the SupplyGraph database to enhance the sustainability of the supply network. An average accuracy of 98.95% is obtained for the ensemble network for risk management. The average accuracy of 100% and 98.07% are obtained for sustainable supply chain in terms of 5 product group classification and 4 product relation classification, respectively. The average accuracy of 92.37% is attained for 25 company relation classification. The results confirm an average improvement and efficiency of the proposed method compared to the state-of-the-art approaches.
☆ Sketch2PoseNet: Efficient and Generalized Sketch to 3D Human Pose Prediction SIGGRAPH
3D human pose estimation from sketches has broad applications in computer animation and film production. Unlike traditional human pose estimation, this task presents unique challenges due to the abstract and disproportionate nature of sketches. Previous sketch-to-pose methods, constrained by the lack of large-scale sketch-3D pose annotations, primarily relied on optimization with heuristic rules-an approach that is both time-consuming and limited in generalizability. To address these challenges, we propose a novel approach leveraging a "learn from synthesis" strategy. First, a diffusion model is trained to synthesize sketch images from 2D poses projected from 3D human poses, mimicking disproportionate human structures in sketches. This process enables the creation of a synthetic dataset, SKEP-120K, consisting of 120k accurate sketch-3D pose annotation pairs across various sketch styles. Building on this synthetic dataset, we introduce an end-to-end data-driven framework for estimating human poses and shapes from diverse sketch styles. Our framework combines existing 2D pose detectors and generative diffusion priors for sketch feature extraction with a feed-forward neural network for efficient 2D pose estimation. Multiple heuristic loss functions are incorporated to guarantee geometric coherence between the derived 3D poses and the detected 2D poses while preserving accurate self-contacts. Qualitative, quantitative, and subjective evaluations collectively show that our model substantially surpasses previous ones in both estimation accuracy and speed for sketch-to-pose tasks.
comment: SIGGRAPH Asia 2025
☆ ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts NeurIPS 2025
Dataset bias, where data points are skewed to certain concepts, is ubiquitous in machine learning datasets. Yet, systematically identifying these biases is challenging without costly, fine-grained attribute annotations. We present ConceptScope, a scalable and automated framework for analyzing visual datasets by discovering and quantifying human-interpretable concepts using Sparse Autoencoders trained on representations from vision foundation models. ConceptScope categorizes concepts into target, context, and bias types based on their semantic relevance and statistical correlation to class labels, enabling class-level dataset characterization, bias identification, and robustness evaluation through concept-based subgrouping. We validate that ConceptScope captures a wide range of visual concepts, including objects, textures, backgrounds, facial attributes, emotions, and actions, through comparisons with annotated datasets. Furthermore, we show that concept activations produce spatial attributions that align with semantically meaningful image regions. ConceptScope reliably detects known biases (e.g., background bias in Waterbirds) and uncovers previously unannotated ones (e.g, co-occurring objects in ImageNet), offering a practical tool for dataset auditing and model diagnostics.
comment: Published in the Thirty-Ninth Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ MoTDiff: High-resolution Motion Trajectory estimation from a single blurred image using Diffusion models
Accurate estimation of motion information is crucial in diverse computational imaging and computer vision applications. Researchers have investigated various methods to extract motion information from a single blurred image, including blur kernels and optical flow. However, existing motion representations are often of low quality, i.e., coarse-grained and inaccurate. In this paper, we propose the first high-resolution (HR) Motion Trajectory estimation framework using Diffusion models (MoTDiff). Different from existing motion representations, we aim to estimate an HR motion trajectory with high-quality from a single motion-blurred image. The proposed MoTDiff consists of two key components: 1) a new conditional diffusion framework that uses multi-scale feature maps extracted from a single blurred image as a condition, and 2) a new training method that can promote precise identification of a fine-grained motion trajectory, consistent estimation of overall shape and position of a motion path, and pixel connectivity along a motion trajectory. Our experiments demonstrate that the proposed MoTDiff can outperform state-of-the-art methods in both blind image deblurring and coded exposure photography applications.
comment: 10 pages, 6 figures
☆ Self-localization on a 3D map by fusing global and local features from a monocular camera
Self-localization on a 3D map by using an inexpensive monocular camera is required to realize autonomous driving. Self-localization based on a camera often uses a convolutional neural network (CNN) that can extract local features that are calculated by nearby pixels. However, when dynamic obstacles, such as people, are present, CNN does not work well. This study proposes a new method combining CNN with Vision Transformer, which excels at extracting global features that show the relationship of patches on whole image. Experimental results showed that, compared to the state-of-the-art method (SOTA), the accuracy improvement rate in a CG dataset with dynamic obstacles is 1.5 times higher than that without dynamic obstacles. Moreover, the self-localization error of our method is 20.1% smaller than that of SOTA on public datasets. Additionally, our robot using our method can localize itself with 7.51cm error on average, which is more accurate than SOTA.
☆ CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.
☆ Detecting Unauthorized Vehicles using Deep Learning for Smart Cities: A Case Study on Bangladesh
Modes of transportation vary across countries depending on geographical location and cultural context. In South Asian countries rickshaws are among the most common means of local transport. Based on their mode of operation, rickshaws in cities across Bangladesh can be broadly classified into non-auto (pedal-powered) and auto-rickshaws (motorized). Monitoring the movement of auto-rickshaws is necessary as traffic rules often restrict auto-rickshaws from accessing certain routes. However, existing surveillance systems make it quite difficult to monitor them due to their similarity to other vehicles, especially non-auto rickshaws whereas manual video analysis is too time-consuming. This paper presents a machine learning-based approach to automatically detect auto-rickshaws in traffic images. In this system, we used real-time object detection using the YOLOv8 model. For training purposes, we prepared a set of 1,730 annotated images that were captured under various traffic conditions. The results show that our proposed model performs well in real-time auto-rickshaw detection and offers an mAP50 of 83.447% and binary precision and recall values above 78%, demonstrating its effectiveness in handling both dense and sparse traffic scenarios. The dataset has been publicly released for further research.
comment: 16 pages
☆ MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction ICCV 2025
Large annotated datasets are essential for training robust Computer-Aided Diagnosis (CAD) models for breast cancer detection or risk prediction. However, acquiring such datasets with fine-detailed annotation is both costly and time-consuming. Vision-Language Models (VLMs), such as CLIP, which are pre-trained on large image-text pairs, offer a promising solution by enhancing robustness and data efficiency in medical imaging tasks. This paper introduces a novel Multi-View Mammography and Language Model for breast cancer classification and risk prediction, trained on a dataset of paired mammogram images and synthetic radiology reports. Our MV-MLM leverages multi-view supervision to learn rich representations from extensive radiology data by employing cross-modal self-supervision across image-text pairs. This includes multiple views and the corresponding pseudo-radiology reports. We propose a novel joint visual-textual learning strategy to enhance generalization and accuracy performance over different data types and tasks to distinguish breast tissues or cancer characteristics(calcification, mass) and utilize these patterns to understand mammography images and predict cancer risk. We evaluated our method on both private and publicly available datasets, demonstrating that the proposed model achieves state-of-the-art performance in three classification tasks: (1) malignancy classification, (2) subtype classification, and (3) image-based cancer risk prediction. Furthermore, the model exhibits strong data efficiency, outperforming existing fully supervised or VLM baselines while trained on synthetic text reports and without the need for actual radiology reports.
comment: Accepted to Computer Vision for Automated Medical Diagnosis (CVAMD) Workshop at ICCV 2025
☆ BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation
Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we propose a strong baseline BasicAVSR for AVSR by integrating four key components: 1) adaptive multi-scale frequency priors generated from image Laplacian pyramids, 2) a flow-guided propagation unit to aggregate spatiotemporal information from adjacent frames, 3) a second-order motion compensation unit for more accurate spatial alignment of adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and content-independent upsampling kernels. To meet diverse application demands, we instantiate three propagation variants: (i) a unidirectional RNN unit for strictly online inference, (ii) a unidirectional RNN unit empowered with a limited lookahead that tolerates a small output delay, and (iii) a bidirectional RNN unit designed for offline tasks where computational resources are less constrained. Experimental results demonstrate the effectiveness and adaptability of our model across these different scenarios. Through extensive experiments, we show that BasicAVSR significantly outperforms existing methods in terms of super-resolution quality, generalization ability, and inference speed. Our work not only advances the state-of-the-art in AVSR but also extends its core components to multiple frameworks for diverse scenarios. The code is available at https://github.com/shangwei5/BasicAVSR.
comment: 13 pages, 10 figures, 5 tables
☆ StructLayoutFormer:Conditional Structured Layout Generation via Structure Serialization and Disentanglement
Structured layouts are preferable in many 2D visual contents (\eg, GUIs, webpages) since the structural information allows convenient layout editing. Computational frameworks can help create structured layouts but require heavy labor input. Existing data-driven approaches are effective in automatically generating fixed layouts but fail to produce layout structures. We present StructLayoutFormer, a novel Transformer-based approach for conditional structured layout generation. We use a structure serialization scheme to represent structured layouts as sequences. To better control the structures of generated layouts, we disentangle the structural information from the element placements. Our approach is the first data-driven approach that achieves conditional structured layout generation and produces realistic layout structures explicitly. We compare our approach with existing data-driven layout generation approaches by including post-processing for structure extraction. Extensive experiments have shown that our approach exceeds these baselines in conditional structured layout generation. We also demonstrate that our approach is effective in extracting and transferring layout structures. The code is publicly available at %\href{https://github.com/Teagrus/StructLayoutFormer} {https://github.com/Teagrus/StructLayoutFormer}.
☆ FullPart: Generating each 3D Part at Full Resolution
Part-based 3D generation holds great potential for various applications. Previous part generators that represent parts using implicit vector-set tokens often suffer from insufficient geometric details. Another line of work adopts an explicit voxel representation but shares a global voxel grid among all parts; this often causes small parts to occupy too few voxels, leading to degraded quality. In this paper, we propose FullPart, a novel framework that combines both implicit and explicit paradigms. It first derives the bounding box layout through an implicit box vector-set diffusion process, a task that implicit diffusion handles effectively since box tokens contain little geometric detail. Then, it generates detailed parts, each within its own fixed full-resolution voxel grid. Instead of sharing a global low-resolution space, each part in our method - even small ones - is generated at full resolution, enabling the synthesis of intricate details. We further introduce a center-point encoding strategy to address the misalignment issue when exchanging information between parts of different actual sizes, thereby maintaining global coherence. Moreover, to tackle the scarcity of reliable part data, we present PartVerse-XL, the largest human-annotated 3D part dataset to date with 40K objects and 320K parts. Extensive experiments demonstrate that FullPart achieves state-of-the-art results in 3D part generation. We will release all code, data, and model to benefit future research in 3D part generation.
comment: Project page: https://fullpart3d.github.io
☆ Exploring Object-Aware Attention Guided Frame Association for RGB-D SLAM
Attention models have recently emerged as a powerful approach, demonstrating significant progress in various fields. Visualization techniques, such as class activation mapping, provide visual insights into the reasoning of convolutional neural networks (CNNs). Using network gradients, it is possible to identify regions where the network pays attention during image recognition tasks. Furthermore, these gradients can be combined with CNN features to localize more generalizable, task-specific attentive (salient) regions within scenes. However, explicit use of this gradient-based attention information integrated directly into CNN representations for semantic object understanding remains limited. Such integration is particularly beneficial for visual tasks like simultaneous localization and mapping (SLAM), where CNN representations enriched with spatially attentive object locations can enhance performance. In this work, we propose utilizing task-specific network attention for RGB-D indoor SLAM. Specifically, we integrate layer-wise attention information derived from network gradients with CNN feature representations to improve frame association performance. Experimental results indicate improved performance compared to baseline methods, particularly for large environments.
comment: double-column 5 pages, 3 figures
☆ WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
Vision-based end-to-end (E2E) driving has garnered significant interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios, failing to adequately test the true potential of these systems. Furthermore, existing open-loop evaluation metrics often fall short in capturing the multi-modal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that that are rare in daily life with an occurring frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes the high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate the E2E driving performance on these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional metrics that measure the distance between predicted way points and the logs, RFS measures how closely the predicted trajectory matches rater-annotated trajectory preference labels. We have released rater preference labels for all WOD-E2E validation set segments, while the held out test set labels have been used for the 2025 WOD-E2E Challenge. Through our work, we aim to foster state of the art research into generalizable, robust, and safe end-to-end autonomous driving agents capable of handling complex real-world situations.
☆ JOGS: Joint Optimization of Pose Estimation and 3D Gaussian Splatting
Traditional novel view synthesis methods heavily rely on external camera pose estimation tools such as COLMAP, which often introduce computational bottlenecks and propagate errors. To address these challenges, we propose a unified framework that jointly optimizes 3D Gaussian points and camera poses without requiring pre-calibrated inputs. Our approach iteratively refines 3D Gaussian parameters and updates camera poses through a novel co-optimization strategy, ensuring simultaneous improvements in scene reconstruction fidelity and pose accuracy. The key innovation lies in decoupling the joint optimization into two interleaved phases: first, updating 3D Gaussian parameters via differentiable rendering with fixed poses, and second, refining camera poses using a customized 3D optical flow algorithm that incorporates geometric and photometric constraints. This formulation progressively reduces projection errors, particularly in challenging scenarios with large viewpoint variations and sparse feature distributions, where traditional methods struggle. Extensive evaluations on multiple datasets demonstrate that our approach significantly outperforms existing COLMAP-free techniques in reconstruction quality, and also surpasses the standard COLMAP-based baseline in general.
☆ OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research
As one of the earliest writing systems, Oracle Bone Script (OBS) preserves the cultural and intellectual heritage of ancient civilizations. However, current OBS research faces two major challenges: (1) the interpretation of OBS involves a complex workflow comprising multiple serial and parallel sub-tasks, and (2) the efficiency of OBS information organization and retrieval remains a critical bottleneck, as scholars often spend substantial effort searching for, compiling, and managing relevant resources. To address these challenges, we present OracleAgent, the first agent system designed for the structured management and retrieval of OBS-related information. OracleAgent seamlessly integrates multiple OBS analysis tools, empowered by large language models (LLMs), and can flexibly orchestrate these components. Additionally, we construct a comprehensive domain-specific multimodal knowledge base for OBS, which is built through a rigorous multi-year process of data collection, cleaning, and expert annotation. The knowledge base comprises over 1.4M single-character rubbing images and 80K interpretation texts. OracleAgent leverages this resource through its multimodal tools to assist experts in retrieval tasks of character, document, interpretation text, and rubbing image. Extensive experiments demonstrate that OracleAgent achieves superior performance across a range of multimodal reasoning and generation tasks, surpassing leading mainstream multimodal large language models (MLLMs) (e.g., GPT-4o). Furthermore, our case study illustrates that OracleAgent can effectively assist domain experts, significantly reducing the time cost of OBS research. These results highlight OracleAgent as a significant step toward the practical deployment of OBS-assisted research and automated interpretation systems.
☆ EgoExo-Con: Exploring View-Invariant Video Temporal Understanding
Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performances. (2) When naively finetuned with synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. For improvements, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates its superiority over naive SFT and GRPO, especially for improving cross-view consistency. All resources will be made publicly available.
comment: project page: \url{https://minjoong507.github.io/projects/EgoExo-Con/}
☆ Security Risk of Misalignment between Text and Image in Multi-modal Model
Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigations reveal that the alignment between textual and Image modalities in existing diffusion models is inadequate. This misalignment presents significant risks, especially in the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To this end, we propose a novel attack called Prompt-Restricted Multi-modal Attack (PReMA) to manipulate the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs by solely creating adversarial images, distinguishing itself from prior methods that primarily generate adversarial prompts to produce NSFW content. Consequently, PReMA poses a novel threat to the integrity of multi-modal diffusion models, particularly in image-editing applications that operate with fixed prompts. Comprehensive evaluations conducted on image inpainting and style transfer tasks across various models confirm the potent efficacy of PReMA.
☆ Dynamic VLM-Guided Negative Prompting for Diffusion Models NeurIPS 2025
We propose a novel approach for dynamic negative prompting in diffusion models that leverages Vision-Language Models (VLMs) to adaptively generate negative prompts during the denoising process. Unlike traditional Negative Prompting methods that use fixed negative prompts, our method generates intermediate image predictions at specific denoising steps and queries a VLM to produce contextually appropriate negative prompts. We evaluate our approach on various benchmark datasets and demonstrate the trade-offs between negative guidance strength and text-image alignment.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: The First Workshop on Generative and Protective AI for Content Creation
☆ FlexICL: A Flexible Visual In-context Learning Framework for Elbow and Wrist Ultrasound Segmentation
Elbow and wrist fractures are the most common fractures in pediatric populations. Automatic segmentation of musculoskeletal structures in ultrasound (US) can improve diagnostic accuracy and treatment planning. Fractures appear as cortical defects but require expert interpretation. Deep learning (DL) can provide real-time feedback and highlight key structures, helping lightly trained users perform exams more confidently. However, pixel-wise expert annotations for training remain time-consuming and costly. To address this challenge, we propose FlexICL, a novel and flexible in-context learning (ICL) framework for segmenting bony regions in US images. We apply it to an intra-video segmentation setting, where experts annotate only a small subset of frames, and the model segments unseen frames. We systematically investigate various image concatenation techniques and training strategies for visual ICL and introduce novel concatenation methods that significantly enhance model performance with limited labeled data. By integrating multiple augmentation strategies, FlexICL achieves robust segmentation performance across four wrist and elbow US datasets while requiring only 5% of the training images. It outperforms state-of-the-art visual ICL models like Painter, MAE-VQGAN, and conventional segmentation models like U-Net and TransUNet by 1-27% Dice coefficient on 1,252 US sweeps. These initial results highlight the potential of FlexICL as an efficient and scalable solution for US image segmentation well suited for medical imaging use cases where labeled data is scarce.
☆ Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
Knowledge distillation (KD) is an effective method for model compression and transferring knowledge between models. However, its effect on model's robustness against spurious correlations that degrade performance on out-of-distribution data remains underexplored. This study investigates the effect of knowledge distillation on the transferability of ``debiasing'' capabilities from teacher models to student models on natural language inference (NLI) and image classification tasks. Through extensive experiments, we illustrate several key findings: (i) overall the debiasing capability of a model is undermined post-KD; (ii) training a debiased model does not benefit from injecting teacher knowledge; (iii) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases; and (iv) we pin-point the internal attention pattern and circuit that causes the distinct behavior post-KD. Given the above findings, we propose three effective solutions to improve the distillability of debiasing methods: developing high quality data for augmentation, implementing iterative knowledge distillation, and initializing student models with weights obtained from teacher models. To the best of our knowledge, this is the first study on the effect of KD on debiasing and its interenal mechanism at scale. Our findings provide understandings on how KD works and how to design better debiasing methods.
☆ AD-SAM: Fine-Tuning the Segment Anything Vision Foundation Model for Autonomous Driving Perception
This paper presents the Autonomous Driving Segment Anything Model (AD-SAM), a fine-tuned vision foundation model for semantic segmentation in autonomous driving (AD). AD-SAM extends the Segment Anything Model (SAM) with a dual-encoder and deformable decoder tailored to spatial and geometric complexity of road scenes. The dual-encoder produces multi-scale fused representations by combining global semantic context from SAM's pretrained Vision Transformer (ViT-H) with local spatial detail from a trainable convolutional deep learning backbone (i.e., ResNet-50). A deformable fusion module aligns heterogeneous features across scales and object geometries. The decoder performs progressive multi-stage refinement using deformable attention. Training is guided by a hybrid loss that integrates Focal, Dice, Lovasz-Softmax, and Surface losses, improving semantic class balance, boundary precision, and optimization stability. Experiments on the Cityscapes and Berkeley DeepDrive 100K (BDD100K) benchmarks show that AD-SAM surpasses SAM, Generalized SAM (G-SAM), and a deep learning baseline (DeepLabV3) in segmentation accuracy. It achieves 68.1 mean Intersection over Union (mIoU) on Cityscapes and 59.5 mIoU on BDD100K, outperforming SAM, G-SAM, and DeepLabV3 by margins of up to +22.9 and +19.2 mIoU in structured and diverse road scenes, respectively. AD-SAM demonstrates strong cross-domain generalization with a 0.87 retention score (vs. 0.76 for SAM), and faster, more stable learning dynamics, converging within 30-40 epochs, enjoying double the learning speed of benchmark models. It maintains 0.607 mIoU with only 1000 samples, suggesting data efficiency critical for reducing annotation costs. These results confirm that targeted architectural and optimization enhancements to foundation models enable reliable and scalable AD perception.
comment: Submitted to IEEE Transactions on Intelligent Transportation Systems (IEEE T-ITS)
☆ A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics
Visual reasoning, particularly spatial reasoning, is a challenging cognitive task that requires understanding object relationships and their interactions within complex environments, especially in robotics domain. Existing vision_language models (VLMs) excel at perception tasks but struggle with fine-grained spatial reasoning due to their implicit, correlation-driven reasoning and reliance solely on images. We propose a novel neuro_symbolic framework that integrates both panoramic-image and 3D point cloud information, combining neural perception with symbolic reasoning to explicitly model spatial and logical relationships. Our framework consists of a perception module for detecting entities and extracting attributes, and a reasoning module that constructs a structured scene graph to support precise, interpretable queries. Evaluated on the JRDB-Reasoning dataset, our approach demonstrates superior performance and reliability in crowded, human_built environments while maintaining a lightweight design suitable for robotics and embodied AI applications.
☆ VitalLens 2.0: High-Fidelity rPPG for Heart Rate Variability Estimation from Face Video
This report introduces VitalLens 2.0, a new deep learning model for estimating physiological signals from face video. This new model demonstrates a significant leap in accuracy for remote photoplethysmography (rPPG), enabling the robust estimation of not only heart rate (HR) and respiratory rate (RR) but also Heart Rate Variability (HRV) metrics. This advance is achieved through a combination of a new model architecture and a substantial increase in the size and diversity of our training data, now totaling 1,413 unique individuals. We evaluate VitalLens 2.0 on a new, combined test set of 422 unique individuals from four public and private datasets. When averaging results by individual, VitalLens 2.0 achieves a Mean Absolute Error (MAE) of 1.57 bpm for HR, 1.08 bpm for RR, 10.18 ms for HRV-SDNN, and 16.45 ms for HRV-RMSSD. These results represent a new state-of-the-art, significantly outperforming previous methods. This model is now available to developers via the VitalLens API at https://rouast.com/api.
comment: Technical Report. 8 pages, 5 figures. Introduces the VitalLens 2.0 model for rPPG and Heart Rate Variability (HRV) estimation. Project website: https://rouast.com/api
☆ Incremental Human-Object Interaction Detection with Invariant Relation Representation Learning
In open-world environments, human-object interactions (HOIs) evolve continuously, challenging conventional closed-world HOI detection models. Inspired by humans' ability to progressively acquire knowledge, we explore incremental HOI detection (IHOID) to develop agents capable of discerning human-object relations in such dynamic environments. This setup confronts not only the common issue of catastrophic forgetting in incremental learning but also distinct challenges posed by interaction drift and detecting zero-shot HOI combinations with sequentially arriving data. Therefore, we propose a novel exemplar-free incremental relation distillation (IRD) framework. IRD decouples the learning of objects and relations, and introduces two unique distillation losses for learning invariant relation features across different HOI combinations that share the same relation. Extensive experiments on HICO-DET and V-COCO datasets demonstrate the superiority of our method over state-of-the-art baselines in mitigating forgetting, strengthening robustness against interaction drift, and generalization on zero-shot HOIs. Code is available at \href{https://github.com/weiyana/ContinualHOI}{this HTTP URL}
☆ MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation
In this study, we propose MoME, a Mixture of Visual Language Medical Experts, for Medical Image Segmentation. MoME adapts the successful Mixture of Experts (MoE) paradigm, widely used in Large Language Models (LLMs), for medical vision-language tasks. The architecture enables dynamic expert selection by effectively utilizing multi-scale visual features tailored to the intricacies of medical imagery, enriched with textual embeddings. This work explores a novel integration of vision-language models for this domain. Utilizing an assembly of 10 datasets, encompassing 3,410 CT scans, MoME demonstrates strong performance on a comprehensive medical imaging segmentation benchmark. Our approach explores the integration of foundation models for medical imaging, benefiting from the established efficacy of MoE in boosting model performance by incorporating textual information. Demonstrating competitive precision across multiple datasets, MoME explores a novel architecture for achieving robust results in medical image analysis.
☆ Semantic Frame Aggregation-based Transformer for Live Video Comment Generation
Live commenting on video streams has surged in popularity on platforms like Twitch, enhancing viewer engagement through dynamic interactions. However, automatically generating contextually appropriate comments remains a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook an important aspect of prioritizing video frames that are most relevant to ongoing viewer interactions. This prioritization is crucial for producing contextually appropriate comments. To address this gap, we introduce a novel Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. This method not only leverages CLIP's visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to ongoing viewer conversation. It employs an efficient weighted sum of frames technique to emphasize informative frames while focusing less on irrelevant ones. Finally, our comment decoder with a cross-attention mechanism that attends to each modality ensures that the generated comment reflects contextual cues from both chats and video. Furthermore, to address the limitations of existing datasets, which predominantly focus on Chinese-language content with limited video categories, we have constructed a large scale, diverse, multimodal English video comments dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of our SFAT model by comparing it to existing methods for generating comments from live video and ongoing dialogue contexts.
☆ Using Salient Object Detection to Identify Manipulative Cookie Banners that Circumvent GDPR AAAI
The main goal of this paper is to study how often cookie banners that comply with the General Data Protection Regulation (GDPR) contain aesthetic manipulation, a design tactic to draw users' attention to the button that permits personal data sharing. As a byproduct of this goal, we also evaluate how frequently the banners comply with GDPR and the recommendations of national data protection authorities regarding banner designs. We visited 2,579 websites and identified the type of cookie banner implemented. Although 45% of the relevant websites have fully compliant banners, we found aesthetic manipulation on 38% of the compliant banners. Unlike prior studies of aesthetic manipulation, we use a computer vision model for salient object detection to measure how salient (i.e., attention-drawing) each banner element is. This enables the discovery of new types of aesthetic manipulation (e.g., button placement), and leads us to conclude that aesthetic manipulation is more common than previously reported (38% vs 27% of banners). To study the effects of user and/or website location on cookie banner design, we include websites within the European Union (EU), where privacy regulation enforcement is more stringent, and websites outside the EU. We visited websites from IP addresses in the EU and from IP addresses in the United States (US). We find that 13.9% of EU websites change their banner design when the user is from the US, and EU websites are roughly 48.3% more likely to use aesthetic manipulation than non-EU websites, highlighting their innovative responses to privacy regulation.
comment: Accepted to International AAAI Conference on Web and Social Media 2026 (ICWSM'26)
☆ SYNAPSE-Net: A Unified Framework with Lesion-Aware Hierarchical Gating for Robust Segmentation of Heterogeneous Brain Lesions
Automated segmentation of heterogeneous brain lesions from multi-modal MRI remains a critical challenge in clinical neuroimaging. Current deep learning models are typically specialized `point solutions' that lack generalization and high performance variance, limiting their clinical reliability. To address these gaps, we propose the Unified Multi-Stream SYNAPSE-Net, an adaptive framework designed for both generalization and robustness. The framework is built on a novel hybrid architecture integrating multi-stream CNN encoders, a Swin Transformer bottleneck for global context, a dynamic cross-modal attention fusion (CMAF) mechanism, and a hierarchical gated decoder for high-fidelity mask reconstruction. The architecture is trained with a variance reduction strategy that combines pathology specific data augmentation and difficulty-aware sampling method. The model was evaluated on three different challenging public datasets: the MICCAI 2017 WMH Challenge, the ISLES 2022 Challenge, and the BraTS 2020 Challenge. Our framework attained a state-of-the-art DSC value of 0.831 with the HD95 value of 3.03 in the WMH dataset. For ISLES 2022, it achieved the best boundary accuracy with a statistically significant difference (HD95 value of 9.69). For BraTS 2020, it reached the highest DSC value for the tumor core region (0.8651). These experimental findings suggest that our unified adaptive framework achieves state-of-the-art performance across multiple brain pathologies, providing a robust and clinically feasible solution for automated segmentation. The source code and the pre-trained models are available at https://github.com/mubid-01/SYNAPSE-Net-pre.
comment: 17 pages, 10 figures, 8 tables, submitted to "Medical Image Analysis" journal
☆ Scale-Aware Curriculum Learning for Ddata-Efficient Lung Nodule Detection with YOLOv11
Lung nodule detection in chest CT is crucial for early lung cancer diagnosis, yet existing deep learning approaches face challenges when deployed in clinical settings with limited annotated data. While curriculum learning has shown promise in improving model training, traditional static curriculum strategies fail in data-scarce scenarios. We propose Scale Adaptive Curriculum Learning (SACL), a novel training strategy that dynamically adjusts curriculum design based on available data scale. SACL introduces three key mechanisms:(1) adaptive epoch scheduling, (2) hard sample injection, and (3) scale-aware optimization. We evaluate SACL on the LUNA25 dataset using YOLOv11 as the base detector. Experimental results demonstrate that while SACL achieves comparable performance to static curriculum learning on the full dataset in mAP50, it shows significant advantages under data-limited conditions with 4.6%, 3.5%, and 2.0% improvements over baseline at 10%, 20%, and 50% of training data respectively. By enabling robust training across varying data scales without architectural modifications, SACL provides a practical solution for healthcare institutions to develop effective lung nodule detection systems despite limited annotation resources.
comment: 5 pages, 2 figures
☆ DC4GS: Directional Consistency-Driven Adaptive Density Control for 3D Gaussian Splatting NeurIPS 2025
We present a Directional Consistency (DC)-driven Adaptive Density Control (ADC) for 3D Gaussian Splatting (DC4GS). Whereas the conventional ADC bases its primitive splitting on the magnitudes of positional gradients, we further incorporate the DC of the gradients into ADC, and realize it through the angular coherence of the gradients. Our DC better captures local structural complexities in ADC, avoiding redundant splitting. When splitting is required, we again utilize the DC to define optimal split positions so that sub-primitives best align with the local structures than the conventional random placement. As a consequence, our DC4GS greatly reduces the number of primitives (up to 30% in our experiments) than the existing ADC, and also enhances reconstruction fidelity greatly.
comment: Accepted to NeurIPS 2025 / Project page: https://github.com/cgskku/dc4gs
☆ Generative diffusion modeling protocols for improving the Kikuchi pattern indexing in electron back-scatter diffraction
Electron back-scatter diffraction (EBSD) has traditionally relied upon methods such as the Hough transform and dictionary Indexing to interpret diffraction patterns and extract crystallographic orientation. However, these methods encounter significant limitations, particularly when operating at high scanning speeds, where the exposure time per pattern is decreased beyond the operating sensitivity of CCD camera. Hence the signal to noise ratio decreases for the observed pattern which makes the pattern noisy, leading to reduced indexing accuracy. This research work aims to develop generative machine learning models for the post-processing or on-the-fly processing of Kikuchi patterns which are capable of restoring noisy EBSD patterns obtained at high scan speeds. These restored patterns can be used for the determination of crystal orientations to provide reliable indexing results. We compare the performance of such generative models in enhancing the quality of patterns captured at short exposure times (high scan speeds). An interesting observation is that the methodology is not data-hungry as typical machine learning methods.
☆ PF-DAformer: Proximal Femur Segmentation via Domain Adaptive Transformer for Dual-Center QCT
Quantitative computed tomography (QCT) plays a crucial role in assessing bone strength and fracture risk by enabling volumetric analysis of bone density distribution in the proximal femur. However, deploying automated segmentation models in practice remains difficult because deep networks trained on one dataset often fail when applied to another. This failure stems from domain shift, where scanners, reconstruction settings, and patient demographics vary across institutions, leading to unstable predictions and unreliable quantitative metrics. Overcoming this barrier is essential for multi-center osteoporosis research and for ensuring that radiomics and structural finite element analysis results remain reproducible across sites. In this work, we developed a domain-adaptive transformer segmentation framework tailored for multi-institutional QCT. Our model is trained and validated on one of the largest hip fracture related research cohorts to date, comprising 1,024 QCT images scans from Tulane University and 384 scans from Rochester, Minnesota for proximal femur segmentation. To address domain shift, we integrate two complementary strategies within a 3D TransUNet backbone: adversarial alignment via Gradient Reversal Layer (GRL), which discourages the network from encoding site-specific cues, and statistical alignment via Maximum Mean Discrepancy (MMD), which explicitly reduces distributional mismatches between institutions. This dual mechanism balances invariance and fine-grained alignment, enabling scanner-agnostic feature learning while preserving anatomical detail.
comment: 22 Pages, 5 Tables, 10 Figures. The combination of GRL and MMD achieved the most balanced performance, reducing contour deviations and enhancing surface smoothness
☆ ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts.These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
comment: project page: https://thinkmorph.github.io/
♻ ☆ Smoothing Slot Attention Iterations and Recurrences
Slot Attention (SA) and its variants lie at the heart of mainstream Object-Centric Learning (OCL). Objects in an image can be aggregated into respective slot vectors, by \textit{iteratively} refining cold-start query vectors, typically three times, via SA on image features. For video, such aggregation is \textit{recurrently} shared across frames, with queries cold-started on the first frame while transitioned from the previous frame's slots on non-first frames. However, the cold-start queries lack sample-specific cues thus hinder precise aggregation on the image or video's first frame; Also, non-first frames' queries are already sample-specific thus require transforms different from the first frame's aggregation. We address these issues for the first time with our \textit{SmoothSA}: (1) To smooth SA iterations on the image or video's first frame, we \textit{preheat} the cold-start queries with rich information of input features, via a tiny module self-distilled inside OCL; (2) To smooth SA recurrences across all video frames, we \textit{differentiate} the homogeneous transforms on the first and non-first frames, by using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and downstream benchmarks validate our method's effectiveness. Further analyses intuitively illuminate how our method smooths SA iterations and recurrences. Our source code, model checkpoints and training logs are available on https://github.com/Genera1Z/SmoothSA.
♻ ☆ Predicting Video Slot Attention Queries from Random Slot-Feature Pairs
Unsupervised video Object-Centric Learning (OCL) is promising as it enables object-level scene representation and dynamics modeling as we humans do. Mainstream video OCL methods adopt a recurrent architecture: An aggregator aggregates current video frame into object features, termed slots, under some queries; A transitioner transits current slots to queries for the next frame. This is an effective architecture but all existing implementations both (\textit{i1}) neglect to incorporate next frame features, the most informative source for query prediction, and (\textit{i2}) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (\textit{t1}) We design a new transitioner to incorporate both slots and features, which provides more information for query prediction; (\textit{t2}) We train the transitioner to predict queries from slot-feature pairs randomly sampled from available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpass existing video OCL methods significantly, e.g., up to 10 points on object discovery, setting new state-of-the-art. Such superiority also benefits downstream tasks like dynamics modeling. Our core source code, model checkpoints and training logs are available on https://github.com/Genera1Z/RandSF.Q.
♻ ☆ Locality in Image Diffusion Models Emerges from Data Statistics
Recent work has shown that the generalization ability of image diffusion models arises from the locality properties of the trained neural network. In particular, when denoising a particular pixel, the model relies on a limited neighborhood of the input image around that pixel, which, according to the previous work, is tightly related to the ability of these models to produce novel images. Since locality is central to generalization, it is crucial to understand why diffusion models learn local behavior in the first place, as well as the factors that govern the properties of locality patterns. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset and is not due to the inductive bias of convolutional neural networks, as suggested in previous work. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to deep neural denoisers. We show, both theoretically and experimentally, that this locality arises directly from pixel correlations present in the image datasets. Moreover, locality patterns are drastically different on specialized datasets, approximating principal components of the data's covariance. We use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than prior expert-crafted alternatives. Our key takeaway is that while neural network architectures influence generation quality, their primary role is to capture locality patterns inherent in the data.
comment: 31 pages, 20 figures, 7 tables
♻ ☆ ScoreAdv: Score-based Targeted Generation of Natural Adversarial Examples via Diffusion Models
Despite the success of deep learning across various domains, it remains vulnerable to adversarial attacks. Although many existing adversarial attack methods achieve high success rates, they typically rely on $\ell_{p}$-norm perturbation constraints, which do not align with human perceptual capabilities. Consequently, researchers have shifted their focus toward generating natural, unrestricted adversarial examples (UAEs). GAN-based approaches suffer from inherent limitations, such as poor image quality due to instability and mode collapse. Meanwhile, diffusion models have been employed for UAE generation, but they still rely on iterative PGD perturbation injection, without fully leveraging their central denoising capabilities. In this paper, we introduce a novel approach for generating UAEs based on diffusion models, named ScoreAdv. This method incorporates an interpretable adversarial guidance mechanism to gradually shift the sampling distribution towards the adversarial distribution, while using an interpretable saliency map to inject the visual information of a reference image into the generated samples. Notably, our method is capable of generating an unlimited number of natural adversarial examples and can attack not only classification models but also retrieval models. We conduct extensive experiments on ImageNet and CelebA datasets, validating the performance of ScoreAdv across ten target models in both black-box and white-box settings. Our results demonstrate that ScoreAdv achieves state-of-the-art attack success rates and image quality, while maintaining inference efficiency. Furthermore, the dynamic balance between denoising and adversarial perturbation enables ScoreAdv to remain robust even under defensive measures.
♻ ☆ GSE: Group-wise Sparse and Explainable Adversarial Attacks
Sparse adversarial attacks fool deep neural networks (DNNs) through minimal pixel perturbations, often regularized by the $\ell_0$ norm. Recent efforts have replaced this norm with a structural sparsity regularizer, such as the nuclear group norm, to craft group-wise sparse adversarial attacks. The resulting perturbations are thus explainable and hold significant practical relevance, shedding light on an even greater vulnerability of DNNs. However, crafting such attacks poses an optimization challenge, as it involves computing norms for groups of pixels within a non-convex objective. We address this by presenting a two-phase algorithm that generates group-wise sparse attacks within semantically meaningful areas of an image. Initially, we optimize a quasinorm adversarial loss using the $1/2-$quasinorm proximal operator tailored for non-convex programming. Subsequently, the algorithm transitions to a projected Nesterov's accelerated gradient descent with $2-$norm regularization applied to perturbation magnitudes. Rigorous evaluations on CIFAR-10 and ImageNet datasets demonstrate a remarkable increase in group-wise sparsity, e.g., $50.9\%$ on CIFAR-10 and $38.4\%$ on ImageNet (average case, targeted attack). This performance improvement is accompanied by significantly faster computation times, improved explainability, and a $100\%$ attack success rate.
♻ ☆ Resource Efficient Multi-stain Kidney Glomeruli Segmentation via Self-supervision
Semantic segmentation under domain shift remains a fundamental challenge in computer vision, particularly when labelled training data is scarce. This challenge is particularly exemplified in histopathology image analysis, where the same tissue structures must be segmented across images captured under different imaging conditions (stains), each representing a distinct visual domain. Traditional deep learning methods like UNet require extensive labels, which is both costly and time-consuming, particularly when dealing with multiple domains (or stains). To mitigate this, various unsupervised domain adaptation based methods such as UDAGAN have been proposed, which reduce the need for labels by requiring only one (source) stain to be labelled. Nonetheless, obtaining source stain labels can still be challenging. This article shows that through self-supervised pre-training -- including SimCLR, BYOL, and a novel approach, HR-CS-CO -- the performance of these segmentation methods (UNet, and UDAGAN) can be retained even with 95% fewer labels. Notably, with self-supervised pre-training and using only 5% labels, the performance drops are minimal: 5.9% for UNet and 6.2% for UDAGAN, averaged over all stains, compared to their respective fully supervised counterparts (without pre-training, using 100% labels). Furthermore, these findings are shown to generalise beyond their training distribution to public benchmark datasets. Implementations and pre-trained models are publicly available \href{https://github.com/zeeshannisar/resource-effecient-multi-stain-kidney-glomeruli-segmentation.git}{online}.
comment: 39 pages, 10 figures, 4 Tables
♻ ☆ CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance and observational robustness. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
comment: 39 pages, 24 figures
♻ ☆ Fit for Purpose? Deepfake Detection in the Real World
The rapid proliferation of AI-generated content, driven by advances in generative adversarial networks, diffusion models, and multimodal large language models, has made the creation and dissemination of synthetic media effortless, heightening the risks of misinformation, particularly political deepfakes that distort truth and undermine trust in political institutions. In turn, governments, research institutions, and industry have strongly promoted deepfake detection initiatives as solutions. Yet, most existing models are trained and validated on synthetic, laboratory-controlled datasets, limiting their generalizability to the kinds of real-world political deepfakes circulating on social platforms that affect the public. In this work, we introduce the first systematic benchmark based on the Political Deepfakes Incident Database, a curated collection of real-world political deepfakes shared on social media since 2018. Our study includes a systematic evaluation of state-of-the-art deepfake detectors across academia, government, and industry. We find that the detectors from academia and government perform relatively poorly. While paid detection tools achieve relatively higher performance than free-access models, all evaluated detectors struggle to generalize effectively to authentic political deepfakes, and are vulnerable to simple manipulations, especially in the video domain. Results urge the need for politically contextualized deepfake detection frameworks to better safeguard the public in real-world settings.
♻ ☆ DDL: A Large-Scale Datasets for Deepfake Detection and Localization in Diversified Real-World Scenarios
Recent advances in AIGC have exacerbated the misuse of malicious deepfake content, making the development of reliable deepfake detection methods an essential means to address this challenge. Although existing deepfake detection models demonstrate outstanding performance in detection metrics, most methods only provide simple binary classification results, lacking interpretability. Recent studies have attempted to enhance the interpretability of classification results by providing spatial manipulation masks or temporal forgery segments. However, due to the limitations of forgery datasets, the practical effectiveness of these methods remains suboptimal. The primary reason lies in the fact that most existing deepfake datasets contain only binary labels, with limited variety in forgery scenarios, insufficient diversity in deepfake types, and relatively small data scales, making them inadequate for complex real-world scenarios.To address this predicament, we construct a novel large-scale deepfake detection and localization (\textbf{DDL}) dataset containing over $\textbf{1.4M+}$ forged samples and encompassing up to $\textbf{80}$ distinct deepfake methods. The DDL design incorporates four key innovations: (1) \textbf{Comprehensive Deepfake Methods} (covering 7 different generation architectures and a total of 80 methods), (2) \textbf{Varied Manipulation Modes} (incorporating 7 classic and 3 novel forgery modes), (3) \textbf{Diverse Forgery Scenarios and Modalities} (including 3 scenarios and 3 modalities), and (4) \textbf{Fine-grained Forgery Annotations} (providing 1.18M+ precise spatial masks and 0.23M+ precise temporal segments).Through these improvements, our DDL not only provides a more challenging benchmark for complex real-world forgeries but also offers crucial support for building next-generation deepfake detection, localization, and interpretability methods.
comment: This paper is a preliminary version, with an extended and comprehensive version currently under development
♻ ☆ HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis
Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily due to their reliance on implicit modeling of audio-facial motion correlations--an approach lacking explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation combining both implicit and explicit motion cues. Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker's superiority over state-of-the-art methods in visual quality and lip-sync accuracy.
♻ ☆ MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.
comment: 20 pages, 8 figures
♻ ☆ LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution
Generative models for Image Super-Resolution (SR) are increasingly powerful, yet their reliance on self-attention's quadratic complexity (O(N^2)) creates a major computational bottleneck. Linear Attention offers an O(N) solution, but its promise for photorealistic SR has remained largely untapped, historically hindered by a cascade of interrelated and previously unsolved challenges. This paper introduces LinearSR, a holistic framework that, for the first time, systematically overcomes these critical hurdles. Specifically, we resolve a fundamental, training instability that causes catastrophic model divergence using our novel "knee point"-based Early-Stopping Guided Fine-tuning (ESGF) strategy. Furthermore, we mitigate the classic perception-distortion trade-off with a dedicated SNR-based Mixture of Experts (MoE) architecture. Finally, we establish an effective and lightweight guidance paradigm, TAG, derived from our "precision-over-volume" principle. Our resulting LinearSR model simultaneously delivers state-of-the-art perceptual quality with exceptional efficiency. Its core diffusion forward pass (1-NFE) achieves SOTA-level speed, while its overall multi-step inference time remains highly competitive. This work provides the first robust methodology for applying Linear Attention in the photorealistic SR domain, establishing a foundational paradigm for future research in efficient generative super-resolution.
comment: 19 pages, 9 figures, 6 tables
♻ ☆ UV-Attack: Physical-World Adversarial Attacks for Person Detection via Dynamic-NeRF-based UV Mapping ICLR2025
In recent research, adversarial attacks on person detectors using patches or static 3D model-based texture modifications have struggled with low success rates due to the flexible nature of human movement. Modeling the 3D deformations caused by various actions has been a major challenge. Fortunately, advancements in Neural Radiance Fields (NeRF) for dynamic human modeling offer new possibilities. In this paper, we introduce UV-Attack, a groundbreaking approach that achieves high success rates even with extensive and unseen human actions. We address the challenge above by leveraging dynamic-NeRF-based UV mapping. UV-Attack can generate human images across diverse actions and viewpoints, and even create novel actions by sampling from the SMPL parameter space. While dynamic NeRF models are capable of modeling human bodies, modifying clothing textures is challenging because they are embedded in neural network parameters. To tackle this, UV-Attack generates UV maps instead of RGB images and modifies the texture stacks. This approach enables real-time texture edits and makes the attack more practical. We also propose a novel Expectation over Pose Transformation loss (EoPT) to improve the evasion success rate on unseen poses and views. Our experiments show that UV-Attack achieves a 92.7% attack success rate against the FastRCNN model across varied poses in dynamic video settings, significantly outperforming the state-of-the-art AdvCamou attack, which only had a 28.5% ASR. Moreover, we achieve 49.5% ASR on the latest YOLOv8 detector in black-box settings. This work highlights the potential of dynamic NeRF-based UV mapping for creating more effective adversarial attacks on person detectors, addressing key challenges in modeling human movement and texture modification. The code is available at https://github.com/PolyLiYJ/UV-Attack.
comment: 23 pages, 22 figures, accepted by ICLR2025
♻ ☆ A Continuous and Interpretable Morphometric for Robust Quantification of Dynamic Biological Shapes
We introduce the Push-Forward Signed Distance Morphometric (PF-SDM) for shape quantification in biomedical imaging. The PF-SDM compactly encodes geometric and topological properties of closed shapes, including their skeleton and symmetries. This provides robust and interpretable features for shape comparison and machine learning. The PF-SDM is mathematically smooth, providing access to gradients and differential-geometric quantities. It also extends to temporal dynamics and allows fusing spatial intensity distributions, such as genetic markers, with shape dynamics. We present the PF-SDM theory, benchmark it on synthetic data, and apply it to predicting body-axis formation in mouse gastruloids, outperforming a CNN baseline in both accuracy and speed.
♻ ☆ VerifIoU -- Robustness of Object Detection to Perturbations SC
We introduce a novel Interval Bound Propagation (IBP) approach for the formal verification of object detection models, specifically targeting the Intersection over Union (IoU) metric. The approach has been implemented in an open source code, named IBP IoU, compatible with popular abstract interpretation based verification tools. The resulting verifier is evaluated on landing approach runway detection and handwritten digit recognition case studies. Comparisons against a baseline (Vanilla IBP IoU) highlight the superior performance of IBP IoU in ensuring accuracy and stability, contributing to more secure and robust machine learning applications.
comment: 44th Digital Avionics Systems Conference (DASC), Sep 2025, Montreal, Canada
♻ ☆ ReCon-GS: Continuum-Preserved Gaussian Streaming for Fast and Compact Reconstruction of Dynamic Scenes NeurIPS 2025
Online free-viewpoint video (FVV) reconstruction is challenged by slow per-frame optimization, inconsistent motion estimation, and unsustainable storage demands. To address these challenges, we propose the Reconfigurable Continuum Gaussian Stream, dubbed ReCon-GS, a novel storage-aware framework that enables high fidelity online dynamic scene reconstruction and real-time rendering. Specifically, we dynamically allocate multi-level Anchor Gaussians in a density-adaptive fashion to capture inter-frame geometric deformations, thereby decomposing scene motion into compact coarse-to-fine representations. Then, we design a dynamic hierarchy reconfiguration strategy that preserves localized motion expressiveness through on-demand anchor re-hierarchization, while ensuring temporal consistency through intra-hierarchical deformation inheritance that confines transformation priors to their respective hierarchy levels. Furthermore, we introduce a storage-aware optimization mechanism that flexibly adjusts the density of Anchor Gaussians at different hierarchy levels, enabling a controllable trade-off between reconstruction fidelity and memory usage. Extensive experiments on three widely used datasets demonstrate that, compared to state-of-the-art methods, ReCon-GS improves training efficiency by approximately 15% and achieves superior FVV synthesis quality with enhanced robustness and stability. Moreover, at equivalent rendering quality, ReCon-GS slashes memory requirements by over 50% compared to leading state-of-the-art methods.
comment: Published in NeurIPS 2025
♻ ☆ StyleGuard: Preventing Text-to-Image-Model-based Style Mimicry Attacks by Style Perturbations NIPS2025
Recently, text-to-image diffusion models have been widely used for style mimicry and personalized customization through methods such as DreamBooth and Textual Inversion. This has raised concerns about intellectual property protection and the generation of deceptive content. Recent studies, such as Glaze and Anti-DreamBooth, have proposed using adversarial noise to protect images from these attacks. However, recent purification-based methods, such as DiffPure and Noise Upscaling, have successfully attacked these latest defenses, showing the vulnerabilities of these methods. Moreover, present methods show limited transferability across models, making them less effective against unknown text-to-image models. To address these issues, we propose a novel anti-mimicry method, StyleGuard. We propose a novel style loss that optimizes the style-related features in the latent space to make it deviate from the original image, which improves model-agnostic transferability. Additionally, to enhance the perturbation's ability to bypass diffusion-based purification, we designed a novel upscale loss that involves ensemble purifiers and upscalers during training. Extensive experiments on the WikiArt and CelebA datasets demonstrate that StyleGuard outperforms existing methods in robustness against various transformations and purifications, effectively countering style mimicry in various models. Moreover, StyleGuard is effective on different style mimicry methods, including DreamBooth and Textual Inversion. The code is available at https://github.com/PolyLiYJ/StyleGuard.
comment: Accepted by NIPS2025
♻ ☆ LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification
As an important task in intelligent transportation systems, Aerial-Ground person Re-IDentification (AG-ReID) aims to retrieve specific persons across heterogeneous cameras in different viewpoints. Previous methods typically adopt deep learning-based models, focusing on extracting view-invariant features. However, they usually overlook the semantic information in person attributes. In addition, existing training strategies often rely on full fine-tuning large-scale models, which significantly increases training costs. To address these issues, we propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to leverage attribute-based text knowledge. Specifically, with the Contrastive Language-Image Pre-training (CLIP) model, we first propose an Attribute-aware Image Encoder (AIE) to extract both global semantic features and attribute-aware features from input images. Then, with these features, we propose a Prompted Attribute Classifier Group (PACG) to predict person attributes and obtain attribute representations. Finally, we design a Coupled Prompt Template (CPT) to transform attribute representations and view information into structured sentences. These sentences are processed by the text encoder of CLIP to generate more discriminative features. As a result, our framework can fully leverage attribute-based text knowledge to improve AG-ReID performance. Extensive experiments on three AG-ReID benchmarks demonstrate the effectiveness of our proposed methods. The source code is available at https://github.com/kevinhu314/LATex.
comment: More modifications may be performed
♻ ☆ EmoAttack: Emotion-to-Image Diffusion Models for Emotional Backdoor Generation
Text-to-image diffusion models can generate realistic images based on textual inputs, enabling users to convey their opinions visually through language. Meanwhile, within language, emotion plays a crucial role in expressing personal opinions in our daily lives and the inclusion of maliciously negative content can lead users astray, exacerbating negative emotions. Recognizing the success of diffusion models and the significance of emotion, we investigate a previously overlooked risk associated with text-to-image diffusion models, that is, utilizing emotion in the input texts to introduce negative content and provoke unfavorable emotions in users. Specifically, we identify a new backdoor attack, i.e., emotion-aware backdoor attack (EmoAttack), which introduces malicious negative content triggered by emotional texts during image generation. We formulate such an attack as a diffusion personalization problem to avoid extensive model retraining and propose the EmoBooth. Unlike existing personalization methods, our approach fine-tunes a pre-trained diffusion model by establishing a mapping between a cluster of emotional words and a given reference image containing malicious negative content. To validate the effectiveness of our method, we built a dataset and conducted extensive analysis and discussion about its effectiveness. Given consumers' widespread use of diffusion models, uncovering this threat is critical for society.
♻ ☆ Two Heads are Better than One: Robust Learning Meets Multi-branch Models
Deep neural networks (DNNs) are vulnerable to adversarial examples, in which DNNs are misled to false outputs due to inputs containing imperceptible perturbations. Adversarial training, a reliable and effective method of defense, may significantly reduce the vulnerability of neural networks and becomes the de facto standard for robust learning. While many recent works practice the data-centric philosophy, such as how to generate better adversarial examples or use generative models to produce additional training data, we look back to the models themselves and revisit the adversarial robustness from the perspective of deep feature distribution as an insightful complementarity. In this paper, we propose \textit{Branch Orthogonality adveRsarial Training} (BORT) to obtain state-of-the-art performance with solely the original dataset for adversarial training. To practice our design idea of integrating multiple orthogonal solution spaces, we leverage a simple and straightforward multi-branch neural network that eclipses adversarial attacks with no increase in inference time. We heuristically propose a corresponding loss function, branch-orthogonal loss, to make each solution space of the multi-branch model orthogonal. We evaluate our approach on CIFAR-10, CIFAR-100 and SVHN against $\ell_{\infty}$ norm-bounded perturbations of size $\epsilon = 8/255$, respectively. Exhaustive experiments are conducted to show that our method goes beyond all state-of-the-art methods without any tricks. Compared to all methods that do not use additional data for training, our models achieve 67.3\% and 41.5\% robust accuracy on CIFAR-10 and CIFAR-100 (improving upon the state-of-the-art by +7.23\% and +9.07\%). We also outperform methods using a training set with a far larger scale than ours.
comment: Camera-ready version for ICPADS 2025
♻ ☆ Static for Dynamic: Towards a Deeper Understanding of Dynamic Facial Expressions Using Static Expression Data
Dynamic facial expression recognition (DFER) infers emotions from the temporal evolution of expressions, unlike static facial expression recognition (SFER), which relies solely on a single snapshot. This temporal analysis provides richer information and promises greater recognition capability. However, current DFER methods often exhibit unsatisfied performance largely due to fewer training samples compared to SFER. Given the inherent correlation between static and dynamic expressions, we hypothesize that leveraging the abundant SFER data can enhance DFER. To this end, we propose Static-for-Dynamic (S4D), a unified dual-modal learning framework that integrates SFER data as a complementary resource for DFER. Specifically, S4D employs dual-modal self-supervised pre-training on facial images and videos using a shared Vision Transformer (ViT) encoder-decoder architecture, yielding improved spatiotemporal representations. The pre-trained encoder is then fine-tuned on static and dynamic expression datasets in a multi-task learning setup to facilitate emotional information interaction. Unfortunately, vanilla multi-task learning in our study results in negative transfer. To address this, we propose an innovative Mixture of Adapter Experts (MoAE) module that facilitates task-specific knowledge acquisition while effectively extracting shared knowledge from both static and dynamic expression data. Extensive experiments demonstrate that S4D achieves a deeper understanding of DFER, setting new state-of-the-art performance on FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65\%, 58.44\%, and 76.68\%, respectively. Additionally, a systematic correlation analysis between SFER and DFER tasks is presented, which further elucidates the potential benefits of leveraging SFER.
comment: The code and model are publicly available here https://github.com/MSA-LMC/S4D
♻ ☆ Tunable-Generalization Diffusion Powered by Self-Supervised Contextual Sub-Data for Low-Dose CT Reconstruction
Current models based on deep learning for low-dose CT denoising rely heavily on paired data and generalize poorly. Even the more concerned diffusion models need to learn the distribution of clean data for reconstruction, which is difficult to satisfy in medical clinical applications. At the same time, self-supervised-based methods face the challenge of significant degradation of generalizability of models pre-trained for the current dose to expand to other doses. To address these issues, this work proposes a novel method of TUnable-geneRalizatioN Diffusion (TurnDiff) powered by self-supervised contextual sub-data for low-dose CT reconstruction. Firstly, a contextual subdata self-enhancing similarity strategy is designed for denoising centered on the LDCT projection domain, which provides an initial prior for the subsequent progress. Subsequently, the initial prior is used to combine knowledge distillation with a deep combination of latent diffusion models for optimizing image details. The pre-trained model is used for inference reconstruction, and the pixel-level self-correcting fusion technique is proposed for fine-grained reconstruction of the image domain to enhance the image fidelity, using the initial prior and the LDCT image as a guide. In addition, the technique is flexibly applied to the generalization of upper and lower doses or even unseen doses. Dual-domain strategy cascade for self-supervised LDCT denoising, TurnDiff requires only LDCT projection domain data for training and testing. Comprehensive evaluation on both benchmark datasets and real-world data demonstrates that TurnDiff consistently outperforms state-of-the-art methods in both reconstruction and generalization.
♻ ☆ Seeing Structural Failure Before it Happens: An Image-Based Physics-Informed Neural Network (PINN) for Spaghetti Bridge Load Prediction
Physics Informed Neural Networks (PINNs) are gaining attention for their ability to embed physical laws into deep learning models, which is particularly useful in structural engineering tasks with limited data. This paper aims to explore the use of PINNs to predict the weight of small scale spaghetti bridges, a task relevant to understanding load limits and potential failure modes in simplified structural models. Our proposed framework incorporates physics-based constraints to the prediction model for improved performance. In addition to standard PINNs, we introduce a novel architecture named Physics Informed Kolmogorov Arnold Network (PIKAN), which blends universal function approximation theory with physical insights. The structural parameters provided as input to the model are collected either manually or through computer vision methods. Our dataset includes 15 real bridges, augmented to 100 samples, and our best model achieves an $R^2$ score of 0.9603 and a mean absolute error (MAE) of 10.50 units. From applied perspective, we also provide a web based interface for parameter entry and prediction. These results show that PINNs can offer reliable estimates of structural weight, even with limited data, and may help inform early stage failure analysis in lightweight bridge designs. The complete data and code are available at https://github.com/OmerJauhar/PINNS-For-Spaghetti-Bridges.
comment: 12 pages, 17 figures. Preprint
♻ ☆ SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification
Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative models to maintain the identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust model is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model's ability to represent persons. To address these issues, we propose a novel generative framework named SD-ReID for AG-ReID, which leverages generative models to mimic the feature distribution of different views while extracting robust identity representations. More specifically, we first train a ViT-based model to extract person representations along with controllable conditions, including identity and view conditions. We then fine-tune the Stable Diffusion (SD) model to enhance person representations guided by these controllable conditions. Furthermore, we introduce the View-Refined Decoder (VRD) to bridge the gap between instance-level and global-level features. Finally, both person representations and all-view features are employed to retrieve target persons. Extensive experiments on five AG-ReID benchmarks (i.e., CARGO, AG-ReIDv1, AG-ReIDv2, LAGPeR and G2APS-ReID) demonstrate the effectiveness of our proposed method. The source code will be available.
comment: More modifications may performed
♻ ☆ TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection EMNLP 2025
Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model's ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.
comment: EMNLP 2025 Oral; Project Homepage: https://yanzehong.github.io/trust-vl/
♻ ☆ Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i)Visual Quality-semantic alignment with human posters, (ii)Textual Coherence-language fluency, (iii)Holistic Assessment-six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv)PaperQuiz-the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a)Parser distills the paper into a structured asset library; the (b)Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c)Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs-though visually appealing at first glance-often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g. based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster - all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.
comment: Project Page: https://github.com/Paper2Poster/Paper2Poster
♻ ☆ MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning? NeurIPS'25
Large foundation models face challenges in acquiring transferable, structured thinking abilities, especially when supervised with rigid templates or crowd-annotated instruction datasets. Unlike prior approaches, we focus on a thinking-centric data synthesis paradigm that enables models to evolve through self-generated, cognitively guided data. We propose MindGYM, a structured and scalable framework for question synthesis, composed of: (1) Cognitive Thinking Process Injection, which infuses high-level reasoning objectives to shape the model's synthesis behavior; (2) Seed Single-Hop Question Synthesis, generating atomic questions from diverse semantic types to encourage broader thinking; and (3) Challenging Multi-Hop QA Synthesis, composing more complex multi-hop questions based on QA seeds for deeper reasoning. Detailed analysis shows that synthetic data generated by our method achieves 16.7% higher average quality and 67.91% lower quality variance compared to baseline sources, highlighting that both high-quality and self-contained data are essential for effective, thinking-oriented fine-tuning. MindGYM improves performance on six reasoning benchmarks, achieving gains of up to 16% on MathVision using only 400 data samples, and generalizable improvements across different model sizes and architectures. MindGYM underscores the viability of self-challenging mechanisms in refining large model capabilities while minimizing human intervention and resource demands. Code and data are released to promote data-centric research into self-evolving foundation models driven by their internal reasoning capabilities.
comment: Accepted by NeurIPS'25. 30 pages, 2 figures, 13 tables
♻ ☆ D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning -- A Benchmark Dataset and Method ICDM
Dark humor in online memes poses unique challenges due to its reliance on implicit, sensitive, and culturally contextual cues. To address the lack of resources and methods for detecting dark humor in multimodal content, we introduce a novel dataset of 4,379 Reddit memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating (mild, moderate, severe). Building on this resource, we propose a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model (VLM). Through a Role-Reversal Self-Loop, VLM adopts the author's perspective to iteratively refine its explanations, ensuring completeness and alignment. We then extract textual features from both the OCR transcript and the self-refined reasoning via a text encoder, while visual features are obtained using a vision transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three streams, text, image, and reasoning, via pairwise attention mechanisms, producing a unified representation for classification. Experimental results demonstrate that our approach outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are released to facilitate further research in multimodal humor understanding and content moderation. Code and Dataset are available at: https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning
comment: Accepted at IEEE International Conference on Data Mining (ICDM) 2025
♻ ☆ Disentangled 4D Gaussian Splatting: Rendering High-Resolution Dynamic World at 343 FPS
While dynamic novel view synthesis from 2D videos has seen progress, achieving efficient reconstruction and rendering of dynamic scenes remains a challenging task. In this paper, we introduce Disentangled 4D Gaussian Splatting (Disentangled4DGS), a novel representation and rendering pipeline that achieves real-time performance without compromising visual fidelity. Disentangled4DGS decouples the temporal and spatial components of 4D Gaussians, avoiding the need for slicing first and four-dimensional matrix calculations in prior methods. By projecting temporal and spatial deformations into dynamic 2D Gaussians and deferring temporal processing, we minimize redundant computations of 4DGS. Our approach also features a gradient-guided flow loss and temporal splitting strategy to reduce artifacts. Experiments demonstrate a significant improvement in rendering speed and quality, achieving 343 FPS when render 1352*1014 resolution images on a single RTX3090 while reducing storage requirements by at least 4.5%. Our approach sets a new benchmark for dynamic novel view synthesis, outperforming existing methods on both multi-view and monocular dynamic scene datasets.
♻ ☆ GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution-its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage-while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.
comment: Project Page: https://jingw193.github.io/GRPO-Guard/
♻ ☆ RRCANet: Recurrent Reusable-Convolution Attention Network for Infrared Small Target Detection
Infrared small target detection is a challenging task due to its unique characteristics (e.g., small, dim, shapeless and changeable). Recently published CNN-based methods have achieved promising performance with heavy feature extraction and fusion modules. To achieve efficient and effective detection, we propose a recurrent reusable-convolution attention network (RRCA-Net) for infrared small target detection. Specifically, RRCA-Net incorporates reusable-convolution block (RuCB) in a recurrent manner without introducing extra parameters. With the help of the repetitive iteration in RuCB, the high-level information of small targets in the deep layers can be well maintained and further refined. Then, a dual interactive attention aggregation module (DIAAM) is proposed to promote the mutual enhancement and fusion of refined information. In this way, RRCA-Net can both achieve high-level feature refinement and enhance the correlation of contextual information between adjacent layers. Moreover, to achieve steady convergence, we design a target characteristic inspired loss function (DpT-k loss) by integrating physical and mathematical constraints. Experimental results on three benchmark datasets (e.g. NUAA-SIRST, IRSTD-1k, DenseSIRST) demonstrate that our RRCA-Net can achieve comparable performance to the state-of-the-art methods while maintaining a small number of parameters, and act as a plug and play module to introduce consistent performance improvement for several popular IRSTD methods.
comment: We have updated the journal reference and DOI
♻ ☆ Open3D-VQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space
Spatial reasoning is a fundamental capability of multimodal large language models (MLLMs), yet their performance in open aerial environments remains underexplored. In this work, we present Open3D-VQA, a novel benchmark for evaluating MLLMs' ability to reason about complex spatial relationships from an aerial perspective. The benchmark comprises 73k QA pairs spanning 7 general spatial reasoning tasks, including multiple-choice, true/false, and short-answer formats, and supports both visual and point cloud modalities. The questions are automatically generated from spatial relations extracted from both real-world and simulated aerial scenes. Evaluation on 13 popular MLLMs reveals that: 1) Models are generally better at answering questions about relative spatial relations than absolute distances, 2) 3D LLMs fail to demonstrate significant advantages over 2D LLMs, and 3) Fine-tuning solely on the simulated dataset can significantly improve the model's spatial reasoning performance in real-world scenarios. We release our benchmark, data generation pipeline, and evaluation toolkit to support further research: https://github.com/EmbodiedCity/Open3D-VQA.code.
♻ ☆ FASL-Seg: Anatomy and Tool Segmentation of Surgical Scenes ECAI
The growing popularity of robotic minimally invasive surgeries has made deep learning-based surgical training a key area of research. A thorough understanding of the surgical scene components is crucial, which semantic segmentation models can help achieve. However, most existing work focuses on surgical tools and overlooks anatomical objects. Additionally, current state-of-the-art (SOTA) models struggle to balance capturing high-level contextual features and low-level edge features. We propose a Feature-Adaptive Spatial Localization model (FASL-Seg), designed to capture features at multiple levels of detail through two distinct processing streams, namely a Low-Level Feature Projection (LLFP) and a High-Level Feature Projection (HLFP) stream, for varying feature resolutions - enabling precise segmentation of anatomy and surgical instruments. We evaluated FASL-Seg on surgical segmentation benchmark datasets EndoVis18 and EndoVis17 on three use cases. The FASL-Seg model achieves a mean Intersection over Union (mIoU) of 72.71% on parts and anatomy segmentation in EndoVis18, improving on SOTA by 5%. It further achieves a mIoU of 85.61% and 72.78% in EndoVis18 and EndoVis17 tool type segmentation, respectively, outperforming SOTA overall performance, with comparable per-class SOTA results in both datasets and consistent performance in various classes for anatomy and instruments, demonstrating the effectiveness of distinct processing streams for varying feature resolutions.
comment: 8 pages, 6 figures, In Proceedings of European Conference on Artificial Intelligence (ECAI) 2025
♻ ☆ Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, a first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference. (2) Spatial-Aware Prompt (SAP) incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset Omni-VFX via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.
♻ ☆ Defending Multimodal Backdoored Models by Repulsive Visual Prompt Tuning
Multimodal contrastive learning models (e.g., CLIP) can learn high-quality representations from large-scale image-text datasets, while they exhibit significant vulnerabilities to backdoor attacks, raising serious safety concerns. In this paper, we reveal that CLIP's vulnerabilities primarily stem from its tendency to encode features beyond in-dataset predictive patterns, compromising its visual feature resistivity to input perturbations. This makes its encoded features highly susceptible to being reshaped by backdoor triggers. To address this challenge, we propose Repulsive Visual Prompt Tuning (RVPT), a novel defense approach that employs deep visual prompt tuning with a specially designed feature-repelling loss. Specifically, RVPT adversarially repels the encoded features from deeper layers while optimizing the standard cross-entropy loss, ensuring that only predictive features in downstream tasks are encoded, thereby enhancing CLIP's visual feature resistivity against input perturbations and mitigating its susceptibility to backdoor attacks. Unlike existing multimodal backdoor defense methods that typically require the availability of poisoned data or involve fine-tuning the entire model, RVPT leverages few-shot downstream clean samples and only tunes a small number of parameters. Empirical results demonstrate that RVPT tunes only 0.27\% of the parameters in CLIP, yet it significantly outperforms state-of-the-art defense methods, reducing the attack success rate from 89.70\% to 2.76\% against the most advanced multimodal attacks on ImageNet and effectively generalizes its defensive capabilities across multiple datasets.
♻ ☆ FARMER: Flow AutoRegressive Transformer over Pixels
Directly modeling the explicit likelihood of the raw data distribution is key topic in the machine learning area, which achieves the scaling successes in Large Language Models by autoregressive modeling. However, continuous AR modeling over visual pixel data suffer from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference speed and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.
comment: Bytedance Seed Technical Report
♻ ☆ SPARKE: Scalable Prompt-Aware Diversity and Novelty Guidance in Diffusion Models via RKE Score
Diffusion models have demonstrated remarkable success in high-fidelity image synthesis and prompt-guided generative modeling. However, ensuring adequate diversity in generated samples of prompt-guided diffusion models remains a challenge, particularly when the prompts span a broad semantic spectrum and the diversity of generated data needs to be evaluated in a prompt-aware fashion across semantically similar prompts. Recent methods have introduced guidance via diversity measures to encourage more varied generations. In this work, we extend the diversity measure-based approaches by proposing the Scalable Prompt-Aware R\'eny Kernel Entropy Diversity Guidance (SPARKE) method for prompt-aware diversity guidance. SPARKE utilizes conditional entropy for diversity guidance, which dynamically conditions diversity measurement on similar prompts and enables prompt-aware diversity control. While the entropy-based guidance approach enhances prompt-aware diversity, its reliance on the matrix-based entropy scores poses computational challenges in large-scale generation settings. To address this, we focus on the special case of Conditional latent RKE Score Guidance, reducing entropy computation and gradient-based optimization complexity from the $O(n^3)$ of general entropy measures to $O(n)$. The reduced computational complexity allows for diversity-guided sampling over potentially thousands of generation rounds on different prompts. We numerically test the SPARKE method on several text-to-image diffusion models, demonstrating that the proposed method improves the prompt-aware diversity of the generated data without incurring significant computational costs. We release our code on the project page: https://mjalali.github.io/SPARKE
♻ ☆ Real-Time Neural Video Compression with Unified Intra and Inter Coding
Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 12.1% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.
comment: 10 pages
♻ ☆ TeleEgo: Benchmarking Egocentric AI Assistants in the Wild
Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce \textbf{TeleEgo}, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work \& study, lifestyle \& routines, social activities, and outings \& culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement.TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose two key metrics -- Real-Time Accuracy and Memory Persistence Time -- to jointly assess correctness, temporal responsiveness, and long-term retention. TeleEgo provides a realistic and comprehensive evaluation to advance the development of practical AI assistants.
♻ ☆ DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution NeurIPS 2025
Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require, make inference extremely slow. Sampling acceleration techniques, particularly single-step, provide a potential solution. Nonetheless, achieving one step in VSR remains challenging, due to the high training overhead on video data and stringent fidelity demands. To tackle the above issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (i.e., CogVideoX). To effectively train DOVE, we introduce the latent-pixel training strategy. The strategy employs a two-stage scheme to gradually adapt the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a 28$\times$ speed-up over existing methods such as MGLD-VSR. Code is available at: https://github.com/zhengchen1999/DOVE.
comment: Accepted to NeurIPS 2025. Code is available at: https://github.com/zhengchen1999/DOVE
♻ ☆ Language-guided Open-world Video Anomaly Detection under Weak Supervision
Video anomaly detection (VAD) aims to detect anomalies that deviate from what is expected. In open-world scenarios, the expected events may change as requirements change. For example, not wearing a mask may be considered abnormal during a flu outbreak but normal otherwise. However, existing methods assume that the definition of anomalies is invariable, and thus are not applicable to the open world. To address this, we propose a novel open-world VAD paradigm with variable definitions, allowing guided detection through user-provided natural language at inference time. This paradigm necessitates establishing a robust mapping from video and textual definition to anomaly scores. Therefore, we propose LaGoVAD (Language-guided Open-world Video Anomaly Detector), a model that dynamically adapts anomaly definitions under weak supervision with two regularization strategies: diversifying the relative durations of anomalies via dynamic video synthesis, and enhancing feature robustness through contrastive learning with negative mining. Training such adaptable models requires diverse anomaly definitions, but existing datasets typically provide labels without semantic descriptions. To bridge this gap, we collect PreVAD (Pre-training Video Anomaly Dataset), the largest and most diverse video anomaly dataset to date, featuring 35,279 annotated videos with multi-level category labels and descriptions that explicitly define anomalies. Zero-shot experiments on seven datasets demonstrate LaGoVAD's SOTA performance. Our dataset and code will be released at https://github.com/Kamino666/LaGoVAD-PreVAD.
♻ ☆ Buffer layers for Test-Time Adaptation NeurIPS 2025
In recent advancements in Test Time Adaptation (TTA), most existing methodologies focus on updating normalization layers to adapt to the test domain. However, the reliance on normalization-based adaptation presents key challenges. First, normalization layers such as Batch Normalization (BN) are highly sensitive to small batch sizes, leading to unstable and inaccurate statistics. Moreover, normalization-based adaptation is inherently constrained by the structure of the pre-trained model, as it relies on training-time statistics that may not generalize well to unseen domains. These issues limit the effectiveness of normalization-based TTA approaches, especially under significant domain shift. In this paper, we introduce a novel paradigm based on the concept of a Buffer layer, which addresses the fundamental limitations of normalization layer updates. Unlike existing methods that modify the core parameters of the model, our approach preserves the integrity of the pre-trained backbone, inherently mitigating the risk of catastrophic forgetting during online adaptation. Through comprehensive experimentation, we demonstrate that our approach not only outperforms traditional methods in mitigating domain shift and enhancing model robustness, but also exhibits strong resilience to forgetting. Furthermore, our Buffer layer is modular and can be seamlessly integrated into nearly all existing TTA frameworks, resulting in consistent performance improvements across various architectures. These findings validate the effectiveness and versatility of the proposed solution in real-world domain adaptation scenarios. The code is available at https://github.com/hyeongyu-kim/Buffer_TTA.
comment: Accepted at NeurIPS 2025
♻ ☆ Towards Predicting Any Human Trajectory In Context NeurIPS 2025
Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, the need to fine-tune for each new scenario is often impractical for deployment on edge devices. To address this challenge, we introduce \paper, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables adaptation without fine-tuning on the scenario-specific data at inference time without requiring weight updates. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. Project Page: https://fujiry0.github.io/TrajICL-project-page/.
comment: NeurIPS 2025
♻ ☆ SPLite Hand: Sparsity-Aware Lightweight 3D Hand Pose Estimation
With the increasing ubiquity of AR/VR devices, the deployment of deep learning models on edge devices has become a critical challenge. These devices require real-time inference, low power consumption, and minimal latency. Many framework designers face the conundrum of balancing efficiency and performance. We design a light framework that adopts an encoder-decoder architecture and introduces several key contributions aimed at improving both efficiency and accuracy. We apply sparse convolution on a ResNet-18 backbone to exploit the inherent sparsity in hand pose images, achieving a 42% end-to-end efficiency improvement. Moreover, we propose our SPLite decoder. This new architecture significantly boosts the decoding process's frame rate by 3.1x on the Raspberry Pi 5, while maintaining accuracy on par. To further optimize performance, we apply quantization-aware training, reducing memory usage while preserving accuracy (PA-MPJPE increases only marginally from 9.0 mm to 9.1 mm on FreiHAND). Overall, our system achieves a 2.98x speed-up on a Raspberry Pi 5 CPU (BCM2712 quad-core Arm A76 processor). Our method is also evaluated on compound benchmark datasets, demonstrating comparable accuracy to state-of-the-art approaches while significantly enhancing computational efficiency.
comment: Accepted to AICCC 2025
♻ ☆ From One to More: Contextual Part Latents for 3D Generation
Recent advances in 3D generation have transitioned from multi-view 2D rendering approaches to 3D-native latent diffusion frameworks that exploit geometric priors in ground truth data. Despite progress, three key limitations persist: (1) Single-latent representations fail to capture complex multi-part geometries, causing detail degradation; (2) Holistic latent coding neglects part independence and interrelationships critical for compositional design; (3) Global conditioning mechanisms lack fine-grained controllability. Inspired by human 3D design workflows, we propose CoPart - a part-aware diffusion framework that decomposes 3D objects into contextual part latents for coherent multi-part generation. This paradigm offers three advantages: i) Reduces encoding complexity through part decomposition; ii) Enables explicit part relationship modeling; iii) Supports part-level conditioning. We further develop a mutual guidance strategy to fine-tune pre-trained diffusion models for joint part latent denoising, ensuring both geometric coherence and foundation model priors. To enable large-scale training, we construct Partverse - a novel 3D part dataset derived from Objaverse through automated mesh segmentation and human-verified annotations. Extensive experiments demonstrate CoPart's superior capabilities in part-level editing, articulated object generation, and scene composition with unprecedented controllability.
comment: Project page: https://copart3d.github.io/
♻ ☆ Cycle Diffusion Model for Counterfactual Image Generation
Deep generative models have demonstrated remarkable success in medical image synthesis. However, ensuring conditioning faithfulness and high-quality synthetic images for direct or counterfactual generation remains a challenge. In this work, we introduce a cycle training framework to fine-tune diffusion models for improved conditioning adherence and enhanced synthetic image realism. Our approach, Cycle Diffusion Model (CDM), enforces consistency between generated and original images by incorporating cycle constraints, enabling more reliable direct and counterfactual generation. Experiments on a combined 3D brain MRI dataset (from ABCD, HCP aging & young adults, ADNI, and PPMI) show that our method improves conditioning accuracy and enhances image quality as measured by FID and SSIM. The results suggest that the cycle strategy used in CDM can be an effective method for refining diffusion-based medical image generation, with applications in data augmentation, counterfactual, and disease progression modeling.
♻ ☆ Empowering Agentic Video Analytics Systems with Video Language Models
AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively-significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%. The source code of AVA is available at https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at https://huggingface.co/datasets/iesc/Ava-100.
comment: Accepted to NDSI 2026, 19pages, 12 figures, complementary evaluations and appendix
♻ ☆ Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations NeurIPS 2025
Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as \textit{massive activations}, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We trace these dimension-concentrated massive activations and find that such concentration can be effectively localized by the zero-initialized Adaptive Layer Norm (AdaLN-zero). Building on these findings, we propose Diffusion Transformer Feature (DiTF), a training-free framework designed to extract semantic-discriminative features from DiTs. Specifically, DiTF employs AdaLN to adaptively localize and normalize massive activations with channel-wise modulation. In addition, we develop a channel discard strategy to further eliminate the negative impacts from massive activations. Experimental results demonstrate that our DiTF outperforms both DINO and SD-based models and establishes a new state-of-the-art performance for DiTs in different visual correspondence tasks (\eg, with +9.4\% on Spair-71k and +4.4\% on AP-10K-C.S.).
comment: NeurIPS 2025
♻ ☆ MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
comment: Code available at: https://github.com/HKUST-MINSys-Lab/MMEdge. Accepted by SenSys 2026
♻ ☆ Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features ICCV 2025
The ability of deep neural networks (DNNs) come from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbation that generalize more effectively, boosting black-box transferability. These features ubiquitously come from supervised learning in previous work. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA -- a generative dual self-supervised ViT features attack, that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, boast great adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we are rewarded with remarkable black-box transferability to models of various architectures that outperform state-of-the-arts. Code available at https://github.com/spencerwooo/dSVA.
comment: 14 pages, 9 figures, accepted at ICCV 2025
♻ ☆ Signal-SGN: A Spiking Graph Convolutional Network for Skeletal Action Recognition via Learning Temporal-Frequency Dynamics
For multimodal skeleton-based action recognition, Graph Convolutional Networks (GCNs) are effective models. Still, their reliance on floating-point computations leads to high energy consumption, limiting their applicability in battery-powered devices. While energy-efficient, Spiking Neural Networks (SNNs) struggle to model skeleton dynamics, leading to suboptimal solutions. We propose Signal-SGN (Spiking Graph Convolutional Network), which utilizes the temporal dimension of skeleton sequences as the spike time steps and represents features as multi-dimensional discrete stochastic signals for temporal-frequency domain feature extraction. It combines the 1D Spiking Graph Convolution (1D-SGC) module and the Frequency Spiking Convolution (FSC) module to extract features from the skeleton represented as spiking form. Additionally, the Multi-Scale Wavelet Transform Feature Fusion (MWTF) module is proposed to extract dynamic spiking features and capture frequency-specific characteristics, enhancing classification performance. Experiments across three large-scale datasets reveal Signal-SGN exceeding state-of-the-art SNN-based methods in accuracy and computational efficiency while attaining comparable performance with GCN methods and significantly reducing theoretical energy consumption.
♻ ☆ DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but often fail to generalize to long-tail scenarios due to the lack of essential world knowledge to understand and reason about surrounding environments. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module directly generating semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. The experiment on the ICCV 2025 Autonomous Grand Challenge leaderboard shows that DiffVLA++ achieves EPDMS of 49.12.
♻ ☆ ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models NeurIPS 2025
Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks -- where frontier models perform similarly and near saturation -- our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.
comment: NeurIPS 2025 Datasets & Benchmarks
♻ ☆ Neighborhood Feature Pooling for Remote Sensing Image Classification
In this work, we propose neighborhood feature pooling (NFP) as a novel texture feature extraction method for remote sensing image classification. The NFP layer captures relationships between neighboring inputs and efficiently aggregates local similarities across feature dimensions. Implemented using convolutional layers, NFP can be seamlessly integrated into any network. Results comparing the baseline models and the NFP method indicate that NFP consistently improves performance across diverse datasets and architectures while maintaining minimal parameter overhead.
comment: 9 pages, 5 figures
♻ ☆ GCVAMD: A Modified CausalVAE Model for Causal Age-related Macular Degeneration Risk Factor Detection and Prediction
Age Related Macular Degeneration(AMD) has been one of the most leading causes of permanent vision impairment in ophthalmology. Though treatments, such as anti VEGF drugs or photodynamic therapies, were developed to slow down the degenerative process of AMD, there is still no specific cure to reverse vision loss caused by AMD. Thus, for AMD, detecting existence of risk factors of AMD or AMD itself within the patient retina in early stages is a crucial task to reduce the possibility of vision impairment. Apart from traditional approaches, deep learning based methods, especially attention mechanism based CNNs and GradCAM based XAI analysis on OCT scans, exhibited successful performance in distinguishing AMD retina from normal retinas, making it possible to use AI driven models to aid medical diagnosis and analysis by ophthalmologists regarding AMD. However, though having significant success, previous works mostly focused on prediction performance itself, not pathologies or underlying causal mechanisms of AMD, which can prohibit intervention analysis on specific factors or even lead to less reliable decisions. Thus, this paper introduces a novel causal AMD analysis model: GCVAMD, which incorporates a modified CausalVAE approach that can extract latent causal factors from only raw OCT images. By considering causality in AMD detection, GCVAMD enables causal inference such as treatment simulation or intervention analysis regarding major risk factors: drusen and neovascularization, while returning informative latent causal features that can enhance downstream tasks. Results show that through GCVAMD, drusen status and neovascularization status can be identified with AMD causal mechanisms in GCVAMD latent spaces, which can in turn be used for various tasks from AMD detection(classification) to intervention analysis.
♻ ☆ GameFactory: Creating New Games with Generative Interactive Videos ICCV 2025
Generative videos have the potential to revolutionize game development by autonomously creating new content. In this paper, we present GameFactory, a framework for action-controlled scene-generalizable game video generation. We first address the fundamental challenge of action controllability by introducing GF-Minecraft, an action-annotated game video dataset without human bias, and developing an action control module that enables precise control over both keyboard and mouse inputs. We further extend to support autoregressive generation for unlimited-length interactive videos. More importantly, GameFactory tackles the critical challenge of scene-generalizable action control, which most existing methods fail to address. To enable the creation of entirely new and diverse games beyond fixed styles and scenes, we leverage the open-domain generative priors from pre-trained video diffusion models. To bridge the domain gap between open-domain priors and small-scale game datasets, we propose a multi-phase training strategy with a domain adapter that decouples game style learning from action control. This decoupling ensures that action control learning is no longer bound to specific game styles, thereby achieving scene-generalizable action control. Experimental results demonstrate that GameFactory effectively generates open-domain action-controllable game videos, representing a significant step forward in AI-driven game generation.
comment: ICCV 2025 Highlight, Project Page: https://yujiwen.github.io/gamefactory
♻ ☆ Reasoning Visual Language Model for Chest X-Ray Analysis
Vision-language models (VLMs) have shown strong promise for medical image analysis, but most remain opaque, offering predictions without the transparent, stepwise reasoning clinicians rely on. We present a framework that brings chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by reasoning-first training paradigms, our approach is designed to learn how experts reason, not just what they conclude, by aligning intermediate steps with observable image evidence and radiology workflow. Beyond accuracy, the explicit reasoning traces support clinical auditability: they reveal why a conclusion was reached, which alternatives were considered, and where uncertainty remains, enabling quality assurance, error analysis, and safer human-AI collaboration. Our model couples high-fidelity visual encoding with a two-stage training recipe: a reasoning-style supervised fine-tuning (SFT) followed by reinforcement learning (RL) that uses verifiable rewards over a list of X-ray abnormalities. The model outputs reasoning that mirrors radiologists systematic thought process, uncertainty, and differential diagnosis. In out-of-distribution evaluation, the approach achieves competitive multi-label classification while improving interpretability. In a reader study with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced time to finalize reports. We release code and the model NV-Reason-CXR-3B to support community progress toward trustworthy, explainable AI in chest radiography and other medical imaging tasks where reasoning quality is as critical as prediction quality.
comment: NV-Reason-CXR-3B
♻ ☆ DPA: A one-stop metric to measure bias amplification in classification datasets NeurIPS 2025
Most ML datasets today contain biases. When we train models on these datasets, they often not only learn these biases but can worsen them -- a phenomenon known as bias amplification. Several co-occurrence-based metrics have been proposed to measure bias amplification in classification datasets. They measure bias amplification between a protected attribute (e.g., gender) and a task (e.g., cooking). These metrics also support fine-grained bias analysis by identifying the direction in which a model amplifies biases. However, co-occurrence-based metrics have limitations -- some fail to measure bias amplification in balanced datasets, while others fail to measure negative bias amplification. To solve these issues, recent work proposed a predictability-based metric called leakage amplification (LA). However, LA cannot identify the direction in which a model amplifies biases. We propose Directional Predictability Amplification (DPA), a predictability-based metric that is (1) directional, (2) works with balanced and unbalanced datasets, and (3) correctly identifies positive and negative bias amplification. DPA eliminates the need to evaluate models on multiple metrics to verify these three aspects. DPA also improves over prior predictability-based metrics like LA: it is less sensitive to the choice of attacker function (a hyperparameter in predictability-based metrics), reports scores within a bounded range, and accounts for dataset bias by measuring relative changes in predictability. Our experiments on well-known datasets like COMPAS (a tabular dataset), COCO, and ImSitu (image datasets) show that DPA is the most reliable metric to measure bias amplification in classification problems. To compare DPA with existing bias amplification metrics, we released a one-stop library of major bias amplification metrics at https://github.com/kerner-lab/Bias-Amplification.
comment: Accepted at NeurIPS 2025
♻ ☆ SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding
Multimodal large language models (MLLMs) have achieved remarkable progress across a range of vision-language tasks and demonstrate strong potential for traffic accident understanding. However, existing MLLMs in this domain primarily focus on coarse-grained image-level or video-level comprehension and often struggle to handle fine-grained visual details or localized scene components, limiting their applicability in complex accident scenarios. To address these limitations, we propose SafePLUG, a novel framework that empowers MLLMs with both Pixel-Level Understanding and temporal Grounding for comprehensive traffic accident analysis. SafePLUG supports both arbitrary-shaped visual prompts for region-aware question answering and pixel-level segmentation based on language instructions, while also enabling the recognition of temporally anchored events in traffic accident scenarios. To advance the development of MLLMs for traffic accident understanding, we curate a new dataset containing multimodal question-answer pairs centered on diverse accident scenarios, with detailed pixel-level annotations and temporal event boundaries. Experimental results show that SafePLUG achieves strong performance on multiple tasks, including region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding. These capabilities lay a foundation for fine-grained understanding of complex traffic scenes, with the potential to improve driving safety and enhance situational awareness in smart transportation systems. The code, dataset, and model checkpoints will be made publicly available at: https://zihaosheng.github.io/SafePLUG
comment: The code, dataset, and model checkpoints will be made publicly available at: https://zihaosheng.github.io/SafePLUG
♻ ☆ $\mathtt{M^3VIR}$: A Large-Scale Multi-Modality Multi-View Synthesized Benchmark Dataset for Image Restoration and Content Creation
The gaming and entertainment industry is rapidly evolving, driven by immersive experiences and the integration of generative AI (GAI) technologies. Training such models effectively requires large-scale datasets that capture the diversity and context of gaming environments. However, existing datasets are often limited to specific domains or rely on artificial degradations, which do not accurately capture the unique characteristics of gaming content. Moreover, benchmarks for controllable video generation remain absent. To address these limitations, we introduce $\mathtt{M^3VIR}$, a large-scale, multi-modal, multi-view dataset specifically designed to overcome the shortcomings of current resources. Unlike existing datasets, $\mathtt{M^3VIR}$ provides diverse, high-fidelity gaming content rendered with Unreal Engine 5, offering authentic ground-truth LR-HR paired and multi-view frames across 80 scenes in 8 categories. It includes $\mathtt{M^3VIR\_MR}$ for super-resolution (SR), novel view synthesis (NVS), and combined NVS+SR tasks, and $\mathtt{M^3VIR\_{MS}}$, the first multi-style, object-level ground-truth set enabling research on controlled video generation. Additionally, we benchmark several state-of-the-art SR and NVS methods to establish performance baselines. While no existing approaches directly handle controlled video generation, $\mathtt{M^3VIR}$ provides a benchmark for advancing this area. By releasing the dataset, we aim to facilitate research in AI-powered restoration, compression, and controllable content generation for next-generation cloud gaming and entertainment.
♻ ☆ StateSpaceDiffuser: Bringing Long Context to Diffusion World Models
World models have recently gained prominence for action-conditioned visual prediction in complex environments. However, relying on only a few recent observations causes them to lose long-term context. Consequently, within a few steps, the generated scenes drift from what was previously observed, undermining temporal coherence. This limitation, common in state-of-the-art world models, which are diffusion-based, stems from the lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, where a diffusion model is enabled to perform long-context tasks by integrating features from a state-space model, representing the entire interaction history. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is highly effective in demonstrating both visual details and long-term memory. Project page: https://insait-institute.github.io/StateSpaceDiffuser/.
♻ ☆ Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering
Large vision-language models (VLMs) achieve strong performance in Visual Question Answering but still rely heavily on supervised fine-tuning (SFT) with massive labeled datasets, which is costly due to human annotations. Crucially, real-world datasets often exhibit human uncertainty (HU) -- variation in human confidence across annotations -- but standard SFT simply optimizes toward the most frequent label, disregarding HU distributions. This leaves two open questions: How does HU affect SFT, and how can HU be effectively leveraged in training? In this work, we first conduct a systematic evaluation of VLMs across varying HU levels. We have two key findings: (i) surprisingly, high-HU samples contribute little or even degrade model performance, and (ii) naively training on the full dataset yields under-calibrated models that fail to capture HU distributions. Motivated by these findings, we introduce HaDola, a human uncertainty-aware data selection and automatic labeling framework. HaDola operates in four stages -- discriminate, self-annotate, error trigger, and training -- to iteratively identify harmful samples, prioritize informative ones, and bootstrap from a small seed set (5\% of data). Our approach substantially reduces reliance on costly HU annotations and makes VLMs more accurate and better calibrated. Extensive experiments on VQAv2 and VizWiz datasets demonstrate that HaDola consistently matches or outperforms state-of-the-art baselines with less training data. Our work highlights the importance of explicitly modeling HU in SFT, suggesting that better utilization of HU is more effective than merely scaling up dataset size.
♻ ☆ How Should One Evaluate Monocular Depth Estimation?
Monocular depth estimation is an important task with rapid progress, but how to evaluate it remains an open question, as evidenced by a lack of standardization in existing literature and a large selection of evaluation metrics whose trade-offs and behaviors are not well understood. This paper contributes a novel, quantitative analysis of existing metrics in terms of their sensitivity to various types of perturbations of ground truth, emphasizing comparison to human judgment. Our analysis reveals that existing metrics are severely under-sensitive to curvature perturbation such as making flat surfaces wavy. To remedy this, we introduce a new metric based on relative surface normals, along with new depth visualization tools and a principled method to create composite metrics with better human alignment. Code and data are available at: https://github.com/princeton-vl/evalmde.
♻ ☆ CARE: Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams
The recognition of Activities of Daily Living (ADLs) from event-triggered ambient sensors is an essential task in Ambient Assisted Living, yet existing methods remain constrained by representation-level limitations. Sequence-based approaches preserve temporal order of sensor activations but are sensitive to noise and lack spatial awareness, while image-based approaches capture global patterns and implicit spatial correlations but compress fine-grained temporal dynamics and distort sensor layouts. Naive fusion (e.g., feature concatenation) fail to enforce alignment between sequence- and image-based representation views, underutilizing their complementary strengths. We propose Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams (CARE), an end-to-end framework that jointly optimizes representation learning via Sequence-Image Contrastive Alignment (SICA) and classification via cross-entropy, ensuring both cross-representation alignment and task-specific discriminability. CARE integrates (i) time-aware, noise-resilient sequence encoding with (ii) spatially-informed and frequency-sensitive image representations, and employs (iii) a joint contrastive-classification objective for end-to-end learning of aligned and discriminative embeddings. Evaluated on three CASAS datasets, CARE achieves state-of-the-art performance (89.8% on Milan, 88.9% on Cairo, and 73.3% on Kyoto7) and demonstrates robustness to sensor malfunctions and layout variability, highlighting its potential for reliable ADL recognition in smart homes.
♻ ☆ PROFIT: A Specialized Optimizer for Deep Fine Tuning NeurIPS 2025
The fine-tuning of pre-trained models has become ubiquitous in generative AI, computer vision, and robotics. Although much attention has been paid to improving the efficiency of fine-tuning model, there has been less scholarship around fine-tuning specifically for improved model performance. To remedy this gap, we present PROFIT, one of the first optimizers designed to incrementally fine-tune converged models on new tasks and/or datasets. Unlike traditional optimizers such as SGD or Adam, which make minimal assumptions due to random initializations, PROFIT takes the properties of a converged model into account explicitly to regularize the optimization process. Employing a temporal gradient-orthogonalization process, PROFIT outperforms fine-tuning methods in various tasks, from image classification to multimodal language model training to large-scale motion prediction. Moreover, PROFIT is encapsulated as a modular optimizer, which makes it easy to integrate directly into any training pipeline with minimal engineering effort.
comment: technical report, 23 pages, NeurIPS 2025 poster
♻ ☆ ε-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data
Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce {\epsilon}-Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center-region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse (0.05% of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster wrt. the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose a MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of {\epsilon}-Seg and baseline methods on 2 dense EM datasets of biological tissues and demonstrate the applicability of our method also on fluorescence microscopy data. Our results show that {\epsilon}-Seg is capable of achieving competitive sparsely-supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available.
comment: 10 pages main text, 17 pages total
Machine Learning 150
☆ OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes
There are two prevalent ways to constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.
comment: Project page: https://yukun-huang.github.io/OmniX/
☆ Scaling Image Geo-Localization to Continent Level NeurIPS 2025
Determining the precise geographic location of an image at a global scale remains an unsolved challenge. Standard image retrieval techniques are inefficient due to the sheer volume of images (>100M) and fail when coverage is insufficient. Scalable solutions, however, involve a trade-off: global classification typically yields coarse results (10+ kilometers), while cross-view retrieval between ground and aerial imagery suffers from a domain gap and has been primarily studied on smaller regions. This paper introduces a hybrid approach that achieves fine-grained geo-localization across a large geographic expanse the size of a continent. We leverage a proxy classification task during training to learn rich feature representations that implicitly encode precise location information. We combine these learned prototypes with embeddings of aerial imagery to increase robustness to the sparsity of ground-level data. This enables direct, fine-grained retrieval over areas spanning multiple countries. Our extensive evaluation demonstrates that our approach can localize within 200m more than 68\% of queries of a dataset covering a large part of Europe. The code is publicly available at https://scaling-geoloc.github.io.
comment: NeurIPS 2025
☆ Learning Pseudorandom Numbers with Transformers: Permuted Congruential Generators, Curricula, and Interpretability
We study the ability of Transformer models to learn sequences generated by Permuted Congruential Generators (PCGs), a widely used family of pseudo-random number generators (PRNGs). PCGs introduce substantial additional difficulty over linear congruential generators (LCGs) by applying a series of bit-wise shifts, XORs, rotations and truncations to the hidden state. We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants, in tasks that are beyond published classical attacks. In our experiments we scale moduli up to $2^{22}$ using up to $50$ million model parameters and datasets with up to $5$ billion tokens. Surprisingly, we find even when the output is truncated to a single bit, it can be reliably predicted by the model. When multiple distinct PRNGs are presented together during training, the model can jointly learn them, identifying structures from different permutations. We demonstrate a scaling law with modulus $m$: the number of in-context sequence elements required for near-perfect prediction grows as $\sqrt{m}$. For larger moduli, optimization enters extended stagnation phases; in our experiments, learning moduli $m \geq 2^{20}$ requires incorporating training data from smaller moduli, demonstrating a critical necessity for curriculum learning. Finally, we analyze embedding layers and uncover a novel clustering phenomenon: the model spontaneously groups the integer inputs into bitwise rotationally-invariant clusters, revealing how representations can transfer from smaller to larger moduli.
comment: 10+13 pages, 8+19 figures
☆ Defeating the Training-Inference Mismatch via FP16
Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that breaks the consistency between training and inference. In this work, we demonstrate that simply reverting to \textbf{FP16} effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks with only a few lines of code change, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
☆ Remote Labor Index: Measuring AI Automation of Remote Work
AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable projects designed to evaluate end-to-end agent performance in practical settings. AI agents perform near the floor on RLI, with the highest-performing agent achieving an automation rate of 2.5%. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking AI impacts and enabling stakeholders to proactively navigate AI-driven labor automation.
comment: Website: https://www.remotelabor.ai
☆ HEIR: Learning Graph-Based Motion Hierarchies
Hierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks. Project Page: https://light.princeton.edu/HEIR/
comment: Code link: https://github.com/princeton-computational-imaging/HEIR
☆ A Unified Theory for Causal Inference: Direct Debiased Machine Learning via Bregman-Riesz Regression
This note introduces a unified theory for causal inference that integrates Riesz regression, covariate balancing, density-ratio estimation (DRE), targeted maximum likelihood estimation (TMLE), and the matching estimator in average treatment effect (ATE) estimation. In ATE estimation, the balancing weights and the regression functions of the outcome play important roles, where the balancing weights are referred to as the Riesz representer, bias-correction term, and clever covariates, depending on the context. Riesz regression, covariate balancing, DRE, and the matching estimator are methods for estimating the balancing weights, where Riesz regression is essentially equivalent to DRE in the ATE context, the matching estimator is a special case of DRE, and DRE is in a dual relationship with covariate balancing. TMLE is a method for constructing regression function estimators such that the leading bias term becomes zero. Nearest Neighbor Matching is equivalent to Least Squares Density Ratio Estimation and Riesz Regression.
☆ Clone Deterministic 3D Worlds with Geometrically-Regularized World Models
A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. Despite rapid progress, current world models remain brittle and degrade over long horizons. We argue that a central cause is representation quality: exteroceptive inputs (e.g., images) are high-dimensional, and lossy or entangled latents make dynamics learning unnecessarily hard. We therefore ask whether improving representation learning alone can substantially improve world-model performance. In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone and overfit to a deterministic 3D world. We propose Geometrically-Regularized World Models (GRWM), which enforces that consecutive points along a natural sensory trajectory remain close in latent representation space. This approach yields significantly improved latent representations that align closely with the true topology of the environment. GRWM is plug-and-play, requires only minimal architectural modification, scales with trajectory length, and is compatible with diverse latent generative backbones. Across deterministic 3D settings and long-horizon prediction tasks, GRWM significantly increases rollout fidelity and stability. Analyses show that its benefits stem from learning a latent manifold with superior geometric structure. These findings support a clear takeaway: improving representation learning is a direct and useful path to robust world models, delivering reliable long-horizon predictions without enlarging the dynamics module.
☆ Surpassing state of the art on AMD area estimation from RGB fundus images through careful selection of U-Net architectures and loss functions for class imbalance
Age-related macular degeneration (AMD) is one of the leading causes of irreversible vision impairment in people over the age of 60. This research focuses on semantic segmentation for AMD lesion detection in RGB fundus images, a non-invasive and cost-effective imaging technique. The results of the ADAM challenge - the most comprehensive AMD detection from RGB fundus images research competition and open dataset to date - serve as a benchmark for our evaluation. Taking the U-Net connectivity as a base of our framework, we evaluate and compare several approaches to improve the segmentation model's architecture and training pipeline, including pre-processing techniques, encoder (backbone) deep network types of varying complexity, and specialized loss functions to mitigate class imbalances on image and pixel levels. The main outcome of this research is the final configuration of the AMD detection framework, which outperforms all the prior ADAM challenge submissions on the multi-class segmentation of different AMD lesion types in non-invasive RGB fundus images. The source code used to conduct the experiments presented in this paper is made freely available.
Pre-trained Forecasting Models: Strong Zero-Shot Feature Extractors for Time Series Classification NeurIPS 2025
Recent research on time series foundation models has primarily focused on forecasting, leaving it unclear how generalizable their learned representations are. In this study, we examine whether frozen pre-trained forecasting models can provide effective representations for classification. To this end, we compare different representation extraction strategies and introduce two model-agnostic embedding augmentations. Our experiments show that the best forecasting models achieve classification accuracy that matches or even surpasses that of state-of-the-art models pre-trained specifically for classification. Moreover, we observe a positive correlation between forecasting and classification performance. These findings challenge the assumption that task-specific pre-training is necessary, and suggest that learning to forecast may provide a powerful route toward constructing general-purpose time series foundation models.
comment: NeurIPS 2025 Workshop on Recent Advances in Time Series Foundation Models (BERT2S)
☆ Faithful and Fast Influence Function via Advanced Sampling
How can we explain the influence of training data on black-box models? Influence functions (IFs) offer a post-hoc solution by utilizing gradients and Hessians. However, computing the Hessian for an entire dataset is resource-intensive, necessitating a feasible alternative. A common approach involves randomly sampling a small subset of the training data, but this method often results in highly inconsistent IF estimates due to the high variance in sample configurations. To address this, we propose two advanced sampling techniques based on features and logits. These samplers select a small yet representative subset of the entire dataset by considering the stochastic distribution of features or logits, thereby enhancing the accuracy of IF estimations. We validate our approach through class removal experiments, a typical application of IFs, using the F1-score to measure how effectively the model forgets the removed class while maintaining inference consistency on the remaining classes. Our method reduces computation time by 30.1% and memory usage by 42.2%, or improves the F1-score by 2.5% compared to the baseline.
☆ STaMP: Sequence Transformation and Mixed Precision for Low-Precision Activation Quantization
Quantization is the key method for reducing inference latency, power and memory footprint of generative AI models. However, accuracy often degrades sharply when activations are quantized below eight bits. Recent work suggests that invertible linear transformations (e.g. rotations) can aid quantization, by reparameterizing feature channels and weights. In this paper, we propose \textit{Sequence Transformation and Mixed Precision} (STaMP) quantization, a novel strategy that applies linear transformations along the \textit{sequence} dimension to exploit the strong local correlation in language and visual data. By keeping a small number of tokens in each intermediate activation at higher precision, we can maintain model accuracy at lower (average) activations bit-widths. We evaluate STaMP on recent LVM and LLM architectures, demonstrating that it significantly improves low bit width activation quantization and complements established activation and weight quantization methods including recent feature transformations.
comment: 10 pages main text, 8 pages supplementary material
☆ SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models
This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. Our steering module requires learning parameters equal to 0.14% of the original VLM's size. Our steering module gains model control through dimension-wise activation modulation and adaptive steering across layers without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and proposes a robust solution for multimodal model control through activation engineering.
☆ The Oversight Game: Learning to Cooperatively Balance an AI Agent's Safety and Autonomy
As increasingly capable agents are deployed, a central safety question is how to retain meaningful human control without modifying the underlying system. We study a minimal control interface where an agent chooses whether to act autonomously (play) or defer (ask), while a human simultaneously chooses whether to be permissive (trust) or to engage in oversight (oversee). If the agent defers, the human's choice determines the outcome, potentially leading to a corrective action or a system shutdown. We model this interaction as a two-player Markov Game. Our analysis focuses on cases where this game qualifies as a Markov Potential Game (MPG), a class of games where we can provide an alignment guarantee: under a structural assumption on the human's value function, any decision by the agent to act more autonomously that benefits itself cannot harm the human's value. We also analyze extensions to this MPG framework. Theoretically, this perspective provides conditions for a specific form of intrinsic alignment. If the reward structures of the human-agent game meet these conditions, we have a formal guarantee that the agent improving its own outcome will not harm the human's. Practically, this model motivates a transparent control layer with predictable incentives where the agent learns to defer when risky and act when safe, while its pretrained policy and the environment's reward structure remain untouched. Our gridworld simulation shows that through independent learning, the agent and human discover their optimal oversight roles. The agent learns to ask when uncertain and the human learns when to oversee, leading to an emergent collaboration that avoids safety violations introduced post-training. This demonstrates a practical method for making misaligned models safer after deployment.
☆ Deep sequence models tend to memorize geometrically; it is unclear why
In sequence modeling, the parametric memory of atomic facts has been predominantly abstracted as a brute-force lookup of co-occurrences between entities. We contrast this associative view against a geometric view of how memory is stored. We begin by isolating a clean and analyzable instance of Transformer reasoning that is incompatible with memory as strictly a storage of the local co-occurrences specified during training. Instead, the model must have somehow synthesized its own geometry of atomic facts, encoding global relationships between all entities, including non-co-occurring ones. This in turn has simplified a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn 1-step geometric task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, despite optimizing over mere local associations, cannot be straightforwardly attributed to typical architectural or optimizational pressures. Counterintuitively, an elegant geometry is learned even when it is not more succinct than a brute-force lookup of associations. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that -- in contrast to prevailing theories -- indeed arises naturally despite the lack of various pressures. This analysis also points to practitioners a visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery and unlearning.
☆ Bridging the Gap between Empirical Welfare Maximization and Conditional Average Treatment Effect Estimation in Policy Learning
The goal of policy learning is to train a policy function that recommends a treatment given covariates to maximize population welfare. There are two major approaches in policy learning: the empirical welfare maximization (EWM) approach and the plug-in approach. The EWM approach is analogous to a classification problem, where one first builds an estimator of the population welfare, which is a functional of policy functions, and then trains a policy by maximizing the estimated welfare. In contrast, the plug-in approach is based on regression, where one first estimates the conditional average treatment effect (CATE) and then recommends the treatment with the highest estimated outcome. This study bridges the gap between the two approaches by showing that both are based on essentially the same optimization problem. In particular, we prove an exact equivalence between EWM and least squares over a reparameterization of the policy class. As a consequence, the two approaches are interchangeable in several respects and share the same theoretical guarantees under common conditions. Leveraging this equivalence, we propose a novel regularization method for policy learning. Our findings yield a convex and computationally efficient training procedure that avoids the NP-hard combinatorial step typically required in EWM.
☆ Non-Convex Over-the-Air Heterogeneous Federated Learning: A Bias-Variance Trade-off
Over-the-air (OTA) federated learning (FL) has been well recognized as a scalable paradigm that exploits the waveform superposition of the wireless multiple-access channel to aggregate model updates in a single use. Existing OTA-FL designs largely enforce zero-bias model updates by either assuming \emph{homogeneous} wireless conditions (equal path loss across devices) or forcing zero-bias updates to guarantee convergence. Under \emph{heterogeneous} wireless scenarios, however, such designs are constrained by the weakest device and inflate the update variance. Moreover, prior analyses of biased OTA-FL largely address convex objectives, while most modern AI models are highly non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient descent (SGD) for general smooth non-convex objectives under wireless heterogeneity. We develop novel OTA-FL SGD updates that allow a structured, time-invariant model bias while facilitating reduced variance updates. We derive a finite-time stationarity bound (expected time average squared gradient norm) that explicitly reveals a bias-variance trade-off. To optimize this trade-off, we pose a non-convex joint OTA power-control design and develop an efficient successive convex approximation (SCA) algorithm that requires only statistical CSI at the base station. Experiments on a non-convex image classification task validate the approach: the SCA-based design accelerates convergence via an optimized bias and improves generalization over prior OTA-FL baselines.
☆ On Purely Private Covariance Estimation
We present a simple perturbation mechanism for the release of $d$-dimensional covariance matrices $\Sigma$ under pure differential privacy. For large datasets with at least $n\geq d^2/\varepsilon$ elements, our mechanism recovers the provably optimal Frobenius norm error guarantees of \cite{nikolov2023private}, while simultaneously achieving best known error for all other $p$-Schatten norms, with $p\in [1,\infty]$. Our error is information-theoretically optimal for all $p\ge 2$, in particular, our mechanism is the first purely private covariance estimator that achieves optimal error in spectral norm. For small datasets $n< d^2/\varepsilon$, we further show that by projecting the output onto the nuclear norm ball of appropriate radius, our algorithm achieves the optimal Frobenius norm error $O(\sqrt{d\;\text{Tr}(\Sigma) /n})$, improving over the known bounds of $O(\sqrt{d/n})$ of \cite{nikolov2023private} and ${O}\big(d^{3/4}\sqrt{\text{Tr}(\Sigma)/n}\big)$ of \cite{dong2022differentially}.
comment: equal contribution
☆ LSM-MS2: A Foundation Model Bridging Spectral Identification and Biological Interpretation
A vast majority of mass spectrometry data remains uncharacterized, leaving much of its biological and chemical information untapped. Recent advances in machine learning have begun to address this gap, particularly for tasks such as spectral identification in tandem mass spectrometry data. Here, we present the latest generation of LSM-MS2, a large-scale deep learning foundation model trained on millions of spectra to learn a semantic chemical space. LSM-MS2 achieves state-of-the-art performance in spectral identification, improving on existing methods by 30% in accuracy of identifying challenging isomeric compounds, yielding 42% more correct identifications in complex biological samples, and maintaining robustness under low-concentration conditions. Furthermore, LSM-MS2 produces rich spectral embeddings that enable direct biological interpretation from minimal downstream data, successfully differentiating disease states and predicting clinical outcomes across diverse translational applications.
☆ On the limitation of evaluating machine unlearning using only a single training seed
Machine unlearning (MU) aims to remove the influence of certain data points from a trained model without costly retraining. Most practical MU algorithms are only approximate and their performance can only be assessed empirically. Care must therefore be taken to make empirical comparisons as representative as possible. A common practice is to run the MU algorithm multiple times independently starting from the same trained model. In this work, we demonstrate that this practice can give highly non-representative results because -- even for the same architecture and same dataset -- some MU methods can be highly sensitive to the choice of random number seed used for model training. We therefore recommend that empirical comphttps://info.arxiv.org/help/prep#commentsarisons of MU algorithms should also reflect the variability across different model training seeds.
comment: mini paper, 2 figures
☆ An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning
Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-$K$\ discards structural information and performs poorly in practice, while Top-$K$\ preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-$K$, an {All-Reduce}-Compatible Top-$K$ compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-$K$\ is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions. Empirically, ARC-Top-$K$\ matches the accuracy of Top-$K$\ while reducing wall-clock training time by up to 60.7\%, offering an efficient and scalable solution that combines the robustness of Rand-$K$\ with the strong performance of Top-$K$.
comment: 8 pages, 2 figures
☆ Value Drifts: Tracing Value Alignment During LLM Post-Training
As LLMs occupy an increasingly important role in society, they are more and more confronted with questions that require them not only to draw on their general knowledge but also to align with certain human value systems. Therefore, studying the alignment of LLMs with human values has become a crucial field of inquiry. Prior work, however, mostly focuses on evaluating the alignment of fully trained models, overlooking the training dynamics by which models learn to express human values. In this work, we investigate how and at which stage value alignment arises during the course of a model's post-training. Our analysis disentangles the effects of post-training algorithms and datasets, measuring both the magnitude and time of value drifts during training. Experimenting with Llama-3 and Qwen-3 models of different sizes and popular supervised fine-tuning (SFT) and preference optimization datasets and algorithms, we find that the SFT phase generally establishes a model's values, and subsequent preference optimization rarely re-aligns these values. Furthermore, using a synthetic preference dataset that enables controlled manipulation of values, we find that different preference optimization algorithms lead to different value alignment outcomes, even when preference data is held constant. Our findings provide actionable insights into how values are learned during post-training and help to inform data curation, as well as the selection of models and algorithms for preference optimization to improve model alignment to human values.
☆ Budgeted Multiple-Expert Deferral
Learning to defer uncertain predictions to costly experts offers a powerful strategy for improving the accuracy and efficiency of machine learning systems. However, standard training procedures for deferral algorithms typically require querying all experts for every training instance, an approach that becomes prohibitively expensive when expert queries incur significant computational or resource costs. This undermines the core goal of deferral: to limit unnecessary expert usage. To overcome this challenge, we introduce the budgeted deferral framework, which aims to train effective deferral algorithms while minimizing expert query costs during training. We propose new algorithms for both two-stage and single-stage multiple-expert deferral settings that selectively query only a subset of experts per training example. While inspired by active learning, our setting is fundamentally different: labels are already known, and the core challenge is to decide which experts to query in order to balance cost and predictive performance. We establish theoretical guarantees for both of our algorithms, including generalization bounds and label complexity analyses. Empirical results across several domains show that our algorithms substantially reduce training costs without sacrificing prediction accuracy, demonstrating the practical value of our budget-aware deferral algorithms.
☆ How Regularization Terms Make Invertible Neural Networks Bayesian Point Estimators
Can regularization terms in the training of invertible neural networks lead to known Bayesian point estimators in reconstruction? Invertible networks are attractive for inverse problems due to their inherent stability and interpretability. Recently, optimization strategies for invertible neural networks that approximate either a reconstruction map or the forward operator have been studied from a Bayesian perspective, but each has limitations. To address this, we introduce and analyze two regularization terms for the network training that, upon inversion of the network, recover properties of classical Bayesian point estimators: while the first can be connected to the posterior mean, the second resembles the MAP estimator. Our theoretical analysis characterizes how each loss shapes both the learned forward operator and its inverse reconstruction map. Numerical experiments support our findings and demonstrate how these loss-term regularizers introduce data-dependence in a stable and interpretable way.
comment: Preprint, under review
☆ Assessment of the conditional exchangeability assumption in causal machine learning models: a simulation study
Observational studies developing causal machine learning (ML) models for the prediction of individualized treatment effects (ITEs) seldom conduct empirical evaluations to assess the conditional exchangeability assumption. We aimed to evaluate the performance of these models under conditional exchangeability violations and the utility of negative control outcomes (NCOs) as a diagnostic. We conducted a simulation study to examine confounding bias in ITE estimates generated by causal forest and X-learner models under varying conditions, including the presence or absence of true heterogeneity. We simulated data to reflect real-world scenarios with differing levels of confounding, sample size, and NCO confounding structures. We then estimated and compared subgroup-level treatment effects on the primary outcome and NCOs across settings with and without unmeasured confounding. When conditional exchangeability was violated, causal forest and X-learner models failed to recover true treatment effect heterogeneity and, in some cases, falsely indicated heterogeneity when there was none. NCOs successfully identified subgroups affected by unmeasured confounding. Even when NCOs did not perfectly satisfy its ideal assumptions, it remained informative, flagging potential bias in subgroup level estimates, though not always pinpointing the subgroup with the largest confounding. Violations of conditional exchangeability substantially limit the validity of ITE estimates from causal ML models in routinely collected observational data. NCOs serve a useful empirical diagnostic tool for detecting subgroup-specific unmeasured confounding and should be incorporated into causal ML workflows to support the credibility of individualized inference.
☆ Kimi Linear: An Expressive, Efficient Attention Architecture
We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
comment: Kimi Linear tech report
☆ LoRAQuant: Mixed-Precision Quantization of LoRA to Ultra-Low Bits
Low-Rank Adaptation (LoRA) has become a popular technique for parameter-efficient fine-tuning of large language models (LLMs). In many real-world scenarios, multiple adapters are loaded simultaneously to enable LLM customization for personalized user experiences or to support a diverse range of tasks. Although each adapter is lightweight in isolation, their aggregate cost becomes substantial at scale. To address this, we propose LoRAQuant, a mixed-precision post-training quantization method tailored to LoRA. Specifically, LoRAQuant reparameterizes each adapter by singular value decomposition (SVD) to concentrate the most important information into specific rows and columns. This makes it possible to quantize the important components to higher precision, while quantizing the rest to ultra-low bitwidth. We conduct comprehensive experiments with LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B models on mathematical reasoning, coding, and summarization tasks. Results show that our LoRAQuant uses significantly lower bits than other quantization methods, but achieves comparable or even higher performance.
☆ FlowQ-Net: A Generative Framework for Automated Quantum Circuit Design
Designing efficient quantum circuits is a central bottleneck to exploring the potential of quantum computing, particularly for noisy intermediate-scale quantum (NISQ) devices, where circuit efficiency and resilience to errors are paramount. The search space of gate sequences grows combinatorially, and handcrafted templates often waste scarce qubit and depth budgets. We introduce \textsc{FlowQ-Net} (Flow-based Quantum design Network), a generative framework for automated quantum circuit synthesis based on Generative Flow Networks (GFlowNets). This framework learns a stochastic policy to construct circuits sequentially, sampling them in proportion to a flexible, user-defined reward function that can encode multiple design objectives such as performance, depth, and gate count. This approach uniquely enables the generation of a diverse ensemble of high-quality circuits, moving beyond single-solution optimization. We demonstrate the efficacy of \textsc{FlowQ-Net} through an extensive set of simulations. We apply our method to Variational Quantum Algorithm (VQA) ansatz design for molecular ground state estimation, Max-Cut, and image classification, key challenges in near-term quantum computing. Circuits designed by \textsc{FlowQ-Net} achieve significant improvements, yielding circuits that are 10$\times$-30$\times$ more compact in terms of parameters, gates, and depth compared to commonly used unitary baselines, without compromising accuracy. This trend holds even when subjected to error profiles from real-world quantum devices. Our results underline the potential of generative models as a general-purpose methodology for automated quantum circuit design, offering a promising path towards more efficient quantum algorithms and accelerating scientific discovery in the quantum domain.
☆ Tight Differentially Private PCA via Matrix Coherence
We revisit the task of computing the span of the top $r$ singular vectors $u_1, \ldots, u_r$ of a matrix under differential privacy. We show that a simple and efficient algorithm -- based on singular value decomposition and standard perturbation mechanisms -- returns a private rank-$r$ approximation whose error depends only on the \emph{rank-$r$ coherence} of $u_1, \ldots, u_r$ and the spectral gap $\sigma_r - \sigma_{r+1}$. This resolves a question posed by Hardt and Roth~\cite{hardt2013beyond}. Our estimator outperforms the state of the art -- significantly so in some regimes. In particular, we show that in the dense setting, it achieves the same guarantees for single-spike PCA in the Wishart model as those attained by optimal non-private algorithms, whereas prior private algorithms failed to do so. In addition, we prove that (rank-$r$) coherence does not increase under Gaussian perturbations. This implies that any estimator based on the Gaussian mechanism -- including ours -- preserves the coherence of the input. We conjecture that similar behavior holds for other structured models, including planted problems in graphs. We also explore applications of coherence to graph problems. In particular, we present a differentially private algorithm for Max-Cut and other constraint satisfaction problems under low coherence assumptions.
comment: SODA 2026; equal contribution
☆ Action-Driven Processes for Continuous-Time Control
At the heart of reinforcement learning are actions -- decisions made in response to observations of the environment. Actions are equally fundamental in the modeling of stochastic processes, as they trigger discontinuous state transitions and enable the flow of information through large, complex systems. In this paper, we unify the perspectives of stochastic processes and reinforcement learning through action-driven processes, and illustrate their application to spiking neural networks. Leveraging ideas from control-as-inference, we show that minimizing the Kullback-Leibler divergence between a policy-driven true distribution and a reward-driven model distribution for a suitably defined action-driven process is equivalent to maximum entropy reinforcement learning.
☆ Heuristic Adaptation of Potentially Misspecified Domain Support for Likelihood-Free Inference in Stochastic Dynamical Systems
In robotics, likelihood-free inference (LFI) can provide the domain distribution that adapts a learnt agent in a parametric set of deployment conditions. LFI assumes an arbitrary support for sampling, which remains constant as the initial generic prior is iteratively refined to more descriptive posteriors. However, a potentially misspecified support can lead to suboptimal, yet falsely certain, posteriors. To address this issue, we propose three heuristic LFI variants: EDGE, MODE, and CENTRE. Each interprets the posterior mode shift over inference steps in its own way and, when integrated into an LFI step, adapts the support alongside posterior inference. We first expose the support misspecification issue and evaluate our heuristics using stochastic dynamical benchmarks. We then evaluate the impact of heuristic support adaptation on parameter inference and policy learning for a dynamic deformable linear object (DLO) manipulation task. Inference results in a finer length and stiffness classification for a parametric set of DLOs. When the resulting posteriors are used as domain distributions for sim-based policy learning, they lead to more robust object-centric agent performance.
☆ Hybrid DQN-TD3 Reinforcement Learning for Autonomous Navigation in Dynamic Environments
This paper presents a hierarchical path-planning and control framework that combines a high-level Deep Q-Network (DQN) for discrete sub-goal selection with a low-level Twin Delayed Deep Deterministic Policy Gradient (TD3) controller for continuous actuation. The high-level module selects behaviors and sub-goals; the low-level module executes smooth velocity commands. We design a practical reward shaping scheme (direction, distance, obstacle avoidance, action smoothness, collision penalty, time penalty, and progress), together with a LiDAR-based safety gate that prevents unsafe motions. The system is implemented in ROS + Gazebo (TurtleBot3) and evaluated with PathBench metrics, including success rate, collision rate, path efficiency, and re-planning efficiency, in dynamic and partially observable environments. Experiments show improved success rate and sample efficiency over single-algorithm baselines (DQN or TD3 alone) and rule-based planners, with better generalization to unseen obstacle configurations and reduced abrupt control changes. Code and evaluation scripts are available at the project repository.
comment: 6 pages, 5 figures; ROS+Gazebo (TurtleBot3) implementation; evaluation with PathBench metrics; code (primary): https://github.com/MayaCHEN-github/HierarchicalRL-robot-navigation; mirror (for reproducibility): https://github.com/ShowyHe/DRL-robot-navigation
☆ Curly Flow Matching for Learning Non-gradient Field Dynamics NeurIPS 2025
Modeling the transport dynamics of natural processes from population-level observations is a ubiquitous problem in the natural sciences. Such models rely on key assumptions about the underlying process in order to enable faithful learning of governing dynamics that mimic the actual system behavior. The de facto assumption in current approaches relies on the principle of least action that results in gradient field dynamics and leads to trajectories minimizing an energy functional between two probability measures. However, many real-world systems, such as cell cycles in single-cell RNA, are known to exhibit non-gradient, periodic behavior, which fundamentally cannot be captured by current state-of-the-art methods such as flow and bridge matching. In this paper, we introduce Curly Flow Matching (Curly-FM), a novel approach that is capable of learning non-gradient field dynamics by designing and solving a Schr\"odinger bridge problem with a non-zero drift reference process -- in stark contrast to typical zero-drift reference processes -- which is constructed using inferred velocities in addition to population snapshot data. We showcase Curly-FM by solving the trajectory inference problems for single cells, computational fluid dynamics, and ocean currents with approximate velocities. We demonstrate that Curly-FM can learn trajectories that better match both the reference process and population marginals. Curly-FM expands flow matching models beyond the modeling of populations and towards the modeling of known periodic behavior in physical systems. Our code repository is accessible at: https://github.com/kpetrovicc/curly-flow-matching.git
comment: Accepted to NeurIPS 2025
☆ MSAD: A Deep Dive into Model Selection for Time series Anomaly Detection VLDB
Anomaly detection is a fundamental task for time series analytics with important implications for the downstream performance of many applications. Despite increasing academic interest and the large number of methods proposed in the literature, recent benchmarks and evaluation studies demonstrated that no overall best anomaly detection methods exist when applied to very heterogeneous time series datasets. Therefore, the only scalable and viable solution to solve anomaly detection over very different time series collected from diverse domains is to propose a model selection method that will select, based on time series characteristics, the best anomaly detection methods to run. Existing AutoML solutions are, unfortunately, not directly applicable to time series anomaly detection, and no evaluation of time series-based approaches for model selection exists. Towards that direction, this paper studies the performance of time series classification methods used as model selection for anomaly detection. In total, we evaluate 234 model configurations derived from 16 base classifiers across more than 1980 time series, and we propose the first extensive experimental evaluation of time series classification as model selection for anomaly detection. Our results demonstrate that model selection methods outperform every single anomaly detection method while being in the same order of magnitude regarding execution time. This evaluation is the first step to demonstrate the accuracy and efficiency of time series classification algorithms for anomaly detection, and represents a strong baseline that can then be used to guide the model selection step in general AutoML pipelines. Preprint version of an article accepted at the VLDB Journal.
comment: 25 pages, 13 figures, VLDB Journal
☆ Omnipresent Yet Overlooked: Heat Kernels in Combinatorial Bayesian Optimization
Bayesian Optimization (BO) has the potential to solve various combinatorial tasks, ranging from materials science to neural architecture search. However, BO requires specialized kernels to effectively model combinatorial domains. Recent efforts have introduced several combinatorial kernels, but the relationships among them are not well understood. To bridge this gap, we develop a unifying framework based on heat kernels, which we derive in a systematic way and express as simple closed-form expressions. Using this framework, we prove that many successful combinatorial kernels are either related or equivalent to heat kernels, and validate this theoretical claim in our experiments. Moreover, our analysis confirms and extends the results presented in Bounce: certain algorithms' performance decreases substantially when the unknown optima of the function do not have a certain structure. In contrast, heat kernels are not sensitive to the location of the optima. Lastly, we show that a fast and simple pipeline, relying on heat kernels, is able to achieve state-of-the-art results, matching or even outperforming certain slow or complex algorithms.
☆ Aeolus: A Multi-structural Flight Delay Dataset
We introduce Aeolus, a large-scale Multi-modal Flight Delay Dataset designed to advance research on flight delay prediction and support the development of foundation models for tabular data. Existing datasets in this domain are typically limited to flat tabular structures and fail to capture the spatiotemporal dynamics inherent in delay propagation. Aeolus addresses this limitation by providing three aligned modalities: (i) a tabular dataset with rich operational, meteorological, and airportlevel features for over 50 million flights; (ii) a flight chain module that models delay propagation along sequential flight legs, capturing upstream and downstream dependencies; and (iii) a flight network graph that encodes shared aircraft, crew, and airport resource connections, enabling cross-flight relational reasoning. The dataset is carefully constructed with temporal splits, comprehensive features, and strict leakage prevention to support realistic and reproducible machine learning evaluation. Aeolus supports a broad range of tasks, including regression, classification, temporal structure modeling, and graph learning, serving as a unified benchmark across tabular, sequential, and graph modalities. We release baseline experiments and preprocessing tools to facilitate adoption. Aeolus fills a key gap for both domain-specific modeling and general-purpose structured data research.Our source code and data can be accessed at https://github.com/Flnny/Delay-data
☆ CYPRESS: Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing
Accurate and timely crop yield prediction is crucial for global food security and modern agricultural management. Traditional methods often lack the scalability and granularity required for precision farming. This paper introduces CYPRESS (Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing), a deep learning model designed for high-resolution, intra-field canola yield prediction. CYPRESS leverages a pre-trained, large-scale geospatial foundation model (Prithvi-EO-2.0-600M) and adapts it for a continuous regression task, transforming multi-temporal satellite imagery into dense, pixel-level yield maps. Evaluated on a comprehensive dataset from the Canadian Prairies, CYPRESS demonstrates superior performance over existing deep learning-based yield prediction models, highlighting the effectiveness of fine-tuning foundation models for specialized agricultural applications. By providing a continuous, high-resolution output, CYPRESS offers a more actionable tool for precision agriculture than conventional classification or county-level aggregation methods. This work validates a novel approach that bridges the gap between large-scale Earth observation and on-farm decision-making, offering a scalable solution for detailed agricultural monitoring.
☆ Wasserstein Regression as a Variational Approximation of Probabilistic Trajectories through the Bernstein Basis
This paper considers the problem of regression over distributions, which is becoming increasingly important in machine learning. Existing approaches often ignore the geometry of the probability space or are computationally expensive. To overcome these limitations, a new method is proposed that combines the parameterization of probability trajectories using a Bernstein basis and the minimization of the Wasserstein distance between distributions. The key idea is to model a conditional distribution as a smooth probability trajectory defined by a weighted sum of Gaussian components whose parameters -- the mean and covariance -- are functions of the input variable constructed using Bernstein polynomials. The loss function is the averaged squared Wasserstein distance between the predicted Gaussian distributions and the empirical data, which takes into account the geometry of the distributions. An autodiff-based optimization method is used to train the model. Experiments on synthetic datasets that include complex trajectories demonstrated that the proposed method provides competitive approximation quality in terms of the Wasserstein distance, Energy Distance, and RMSE metrics, especially in cases of pronounced nonlinearity. The model demonstrates trajectory smoothness that is better than or comparable to alternatives and robustness to changes in data structure, while maintaining high interpretability due to explicit parameterization via control points. The developed approach represents a balanced solution that combines geometric accuracy, computational practicality, and interpretability. Prospects for further research include extending the method to non-Gaussian distributions, applying entropy regularization to speed up computations, and adapting the approach to working with high-dimensional data for approximating surfaces and more complex structures.
☆ Hybrid Physical-Neural Simulator for Fast Cosmological Hydrodynamics NeurIPS 2025
Cosmological field-level inference requires differentiable forward models that solve the challenging dynamics of gas and dark matter under hydrodynamics and gravity. We propose a hybrid approach where gravitational forces are computed using a differentiable particle-mesh solver, while the hydrodynamics are parametrized by a neural network that maps local quantities to an effective pressure field. We demonstrate that our method improves upon alternative approaches, such as an Enthalpy Gradient Descent baseline, both at the field and summary-statistic level. The approach is furthermore highly data efficient, with a single reference simulation of cosmological structure formation being sufficient to constrain the neural pressure model. This opens the door for future applications where the model is fit directly to observational data, rather than a training set of simulations.
comment: Accepted to the NeurIPS 2025 Workshop on Machine Learning and the Physical Sciences
☆ Physics-Informed Mixture Models and Surrogate Models for Precision Additive Manufacturing
In this study, we leverage a mixture model learning approach to identify defects in laser-based Additive Manufacturing (AM) processes. By incorporating physics based principles, we also ensure that the model is sensitive to meaningful physical parameter variations. The empirical evaluation was conducted by analyzing real-world data from two AM processes: Directed Energy Deposition and Laser Powder Bed Fusion. In addition, we also studied the performance of the developed framework over public datasets with different alloy type and experimental parameter information. The results show the potential of physics-guided mixture models to examine the underlying physical behavior of an AM system.
comment: Five pages, four figures, to be presented at the AI in Science Summit, Denmark, November, 2025
☆ Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding emerges as a solution, enabling the simultaneous generation and validation of multiple tokens. While recent approaches like EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures, they often neglect the impact of crucial system variables such as GPU devices and batch sizes. Therefore, we introduce a new dynamic tree decoding approach called CAST that takes into account inference costs, including factors such as GPU configurations and batch sizes, to dynamically refine the tree structure. Through comprehensive experimentation across six diverse tasks and utilizing six distinct LLMs, our methodology demonstrates remarkable results, achieving speeds up to 5.2 times faster than conventional decoding methods. Moreover, it generally outperforms existing state-of-the-art techniques from 5% to 20%.
☆ Multiclass Local Calibration With the Jensen-Shannon Distance
Developing trustworthy Machine Learning (ML) models requires their predicted probabilities to be well-calibrated, meaning they should reflect true-class frequencies. Among calibration notions in multiclass classification, strong calibration is the most stringent, as it requires all predicted probabilities to be simultaneously calibrated across all classes. However, existing approaches to multiclass calibration lack a notion of distance among inputs, which makes them vulnerable to proximity bias: predictions in sparse regions of the feature space are systematically miscalibrated. This is especially relevant in high-stakes settings, such as healthcare, where the sparse instances are exactly those most at risk of biased treatment. In this work, we address this main shortcoming by introducing a local perspective on multiclass calibration. First, we formally define multiclass local calibration and establish its relationship with strong calibration. Second, we theoretically analyze the pitfalls of existing evaluation metrics when applied to multiclass local calibration. Third, we propose a practical method for enhancing local calibration in Neural Networks, which enforces alignment between predicted probabilities and local estimates of class frequencies using the Jensen-Shannon distance. Finally, we empirically validate our approach against existing multiclass calibration techniques.
☆ On Measuring Localization of Shortcuts in Deep Networks
Shortcuts, spurious rules that perform well during training but fail to generalize, present a major challenge to the reliability of deep networks (Geirhos et al., 2020). However, the impact of shortcuts on feature representations remains understudied, obstructing the design of principled shortcut-mitigation methods. To overcome this limitation, we investigate the layer-wise localization of shortcuts in deep models. Our novel experiment design quantifies the layer-wise contribution to accuracy degradation caused by a shortcut-inducing skew by counterfactual training on clean and skewed datasets. We employ our design to study shortcuts on CIFAR-10, Waterbirds, and CelebA datasets across VGG, ResNet, DeiT, and ConvNeXt architectures. We find that shortcut learning is not localized in specific layers but distributed throughout the network. Different network parts play different roles in this process: shallow layers predominantly encode spurious features, while deeper layers predominantly forget core features that are predictive on clean data. We also analyze the differences in localization and describe its principal axes of variation. Finally, our analysis of layer-wise shortcut-mitigation strategies suggests the hardness of designing general methods, supporting dataset- and architecture-specific approaches instead.
☆ Boosted Trees on a Diet: Compact Models for Resource-Constrained Devices
Deploying machine learning models on compute-constrained devices has become a key building block of modern IoT applications. In this work, we present a compression scheme for boosted decision trees, addressing the growing need for lightweight machine learning models. Specifically, we provide techniques for training compact boosted decision tree ensembles that exhibit a reduced memory footprint by rewarding, among other things, the reuse of features and thresholds during training. Our experimental evaluation shows that models achieved the same performance with a compression ratio of 4-16x compared to LightGBM models using an adapted training process and an alternative memory layout. Once deployed, the corresponding IoT devices can operate independently of constant communication or external energy supply, and, thus, autonomously, requiring only minimal computing power and energy. This capability opens the door to a wide range of IoT applications, including remote monitoring, edge analytics, and real-time decision making in isolated or power-limited environments.
☆ Adaptive Inverse Kinematics Framework for Learning Variable-Length Tool Manipulation in Robotics
Conventional robots possess a limited understanding of their kinematics and are confined to preprogrammed tasks, hindering their ability to leverage tools efficiently. Driven by the essential components of tool usage - grasping the desired outcome, selecting the most suitable tool, determining optimal tool orientation, and executing precise manipulations - we introduce a pioneering framework. Our novel approach expands the capabilities of the robot's inverse kinematics solver, empowering it to acquire a sequential repertoire of actions using tools of varying lengths. By integrating a simulation-learned action trajectory with the tool, we showcase the practicality of transferring acquired skills from simulation to real-world scenarios through comprehensive experimentation. Remarkably, our extended inverse kinematics solver demonstrates an impressive error rate of less than 1 cm. Furthermore, our trained policy achieves a mean error of 8 cm in simulation. Noteworthy, our model achieves virtually indistinguishable performance when employing two distinct tools of different lengths. This research provides an indication of potential advances in the exploration of all four fundamental aspects of tool usage, enabling robots to master the intricate art of tool manipulation across diverse tasks.
comment: 10 pages, 5 figures. Demonstrates a reinforcement learning framework for adaptive tool manipulation with variable-length extensions
☆ The Structure of Relation Decoding Linear Operators in Large Language Models NeurIPS 2025
This paper investigates the structure of linear operators introduced in Hernandez et al. [2023] that decode specific relational facts in transformer language models. We extend their single-relation findings to a collection of relations and systematically chart their organization. We show that such collections of relation decoders can be highly compressed by simple order-3 tensor networks without significant loss in decoding accuracy. To explain this surprising redundancy, we develop a cross-evaluation protocol, in which we apply each linear decoder operator to the subjects of every other relation. Our results reveal that these linear maps do not encode distinct relations, but extract recurring, coarse-grained semantic properties (e.g., country of capital city and country of food are both in the country-of-X property). This property-centric structure clarifies both the operators' compressibility and highlights why they generalize only to new relations that are semantically close. Our findings thus interpret linear relational decoding in transformer language models as primarily property-based, rather than relation-specific.
comment: NeurIPS 2025 (Spotlight)
☆ A Three-Stage Bayesian Transfer Learning Framework to Improve Predictions in Data-Scarce Domains
The use of ML in engineering has grown steadily to support a wide array of applications. Among these methods, deep neural networks have been widely adopted due to their performance and accessibility, but they require large, high-quality datasets. Experimental data are often sparse, noisy, or insufficient to build resilient data-driven models. Transfer learning, which leverages relevant data-abundant source domains to assist learning in data-scarce target domains, has shown efficacy. Parameter transfer, where pretrained weights are reused, is common but degrades under large domain shifts. Domain-adversarial neural networks (DANNs) help address this issue by learning domain-invariant representations, thereby improving transfer under greater domain shifts in a semi-supervised setting. However, DANNs can be unstable during training and lack a native means for uncertainty quantification. This study introduces a fully-supervised three-stage framework, the staged Bayesian domain-adversarial neural network (staged B-DANN), that combines parameter transfer and shared latent space adaptation. In Stage 1, a deterministic feature extractor is trained on the source domain. This feature extractor is then adversarially refined using a DANN in Stage 2. In Stage 3, a Bayesian neural network is built on the adapted feature extractor for fine-tuning on the target domain to handle conditional shifts and yield calibrated uncertainty estimates. This staged B-DANN approach was first validated on a synthetic benchmark, where it was shown to significantly outperform standard transfer techniques. It was then applied to the task of predicting critical heat flux in rectangular channels, leveraging data from tube experiments as the source domain. The results of this study show that the staged B-DANN method can improve predictive accuracy and generalization, potentially assisting other domains in nuclear engineering.
comment: Submitted to Engineering Applications of Artificial Intelligence
☆ Higher-Order Regularization Learning on Hypergraphs
Higher-Order Hypergraph Learning (HOHL) was recently introduced as a principled alternative to classical hypergraph regularization, enforcing higher-order smoothness via powers of multiscale Laplacians induced by the hypergraph structure. Prior work established the well- and ill-posedness of HOHL through an asymptotic consistency analysis in geometric settings. We extend this theoretical foundation by proving the consistency of a truncated version of HOHL and deriving explicit convergence rates when HOHL is used as a regularizer in fully supervised learning. We further demonstrate its strong empirical performance in active learning and in datasets lacking an underlying geometric structure, highlighting HOHL's versatility and robustness across diverse learning settings.
☆ Polybasic Speculative Decoding Through a Theoretical Perspective
Inference latency stands as a critical bottleneck in the large-scale deployment of Large Language Models (LLMs). Speculative decoding methods have recently shown promise in accelerating inference without compromising the output distribution. However, existing work typically relies on a dualistic draft-verify framework and lacks rigorous theoretical grounding. In this paper, we introduce a novel \emph{polybasic} speculative decoding framework, underpinned by a comprehensive theoretical analysis. Specifically, we prove a fundamental theorem that characterizes the optimal inference time for multi-model speculative decoding systems, shedding light on how to extend beyond the dualistic approach to a more general polybasic paradigm. Through our theoretical investigation of multi-model token generation, we expose and optimize the interplay between model capabilities, acceptance lengths, and overall computational cost. Our framework supports both standalone implementation and integration with existing speculative techniques, leading to accelerated performance in practice. Experimental results across multiple model families demonstrate that our approach yields speedup ratios ranging from $3.31\times$ to $4.01\times$ for LLaMA2-Chat 7B, up to $3.87 \times$ for LLaMA3-8B, up to $4.43 \times$ for Vicuna-7B and up to $3.85 \times$ for Qwen2-7B -- all while preserving the original output distribution. We release our theoretical proofs and implementation code to facilitate further investigation into polybasic speculative decoding.
☆ Think Outside the Policy: In-Context Steered Policy Optimization
Existing Reinforcement Learning from Verifiable Rewards (RLVR) methods, such as Group Relative Policy Optimization (GRPO), have achieved remarkable progress in improving the reasoning capabilities of Large Reasoning Models (LRMs). However, they exhibit limited exploration due to reliance on on-policy rollouts where confined to the current policy's distribution, resulting in narrow trajectory diversity. Recent approaches attempt to expand policy coverage by incorporating trajectories generated from stronger expert models, yet this reliance increases computational cost and such advaned models are often inaccessible. To address these issues, we propose In-Context Steered Policy Optimization (ICPO), a unified framework that leverages the inherent in-context learning capability of LRMs to provide expert guidance using existing datasets. ICPO introduces Mixed-Policy GRPO with Implicit Expert Forcing, which expands exploration beyond the current policy distribution without requiring advanced LRM trajectories. To further stabilize optimization, ICPO integrates Expert Region Reject Sampling to filter unreliable off-policy trajectories and Annealed Expert-Bonus Reward Shaping to balance early expert guidance with later autonomous improvement. Results demonstrate that ICPO consistently enhances reinforcement learning performance and training stability on mathematical reasoning benchmarks, revealing a scalable and effective RLVR paradigm for LRMs.
comment: Work in progress
☆ Inside CORE-KG: Evaluating Structured Prompting and Coreference Resolution for Knowledge Graphs ICDM 2025
Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer critical insights but are often unstructured, lexically dense, and filled with ambiguous or shifting references, which pose significant challenges for automated knowledge graph (KG) construction. While recent LLM-based approaches improve over static templates, they still generate noisy, fragmented graphs with duplicate nodes due to the absence of guided extraction and coreference resolution. The recently proposed CORE-KG framework addresses these limitations by integrating a type-aware coreference module and domain-guided structured prompts, significantly reducing node duplication and legal noise. In this work, we present a systematic ablation study of CORE-KG to quantify the individual contributions of its two key components. Our results show that removing coreference resolution results in a 28.32% increase in node duplication and a 4.32% increase in noisy nodes, while removing structured prompts leads to a 4.34% increase in node duplication and a 73.33% increase in noisy nodes. These findings offer empirical insights for designing robust LLM-based pipelines for extracting structured representations from complex legal texts.
comment: ICDM 2025 Workshop
☆ LLMs as In-Context Meta-Learners for Model and Hyperparameter Selection
Model and hyperparameter selection are critical but challenging in machine learning, typically requiring expert intuition or expensive automated search. We investigate whether large language models (LLMs) can act as in-context meta-learners for this task. By converting each dataset into interpretable metadata, we prompt an LLM to recommend both model families and hyperparameters. We study two prompting strategies: (1) a zero-shot mode relying solely on pretrained knowledge, and (2) a meta-informed mode augmented with examples of models and their performance on past tasks. Across synthetic and real-world benchmarks, we show that LLMs can exploit dataset metadata to recommend competitive models and hyperparameters without search, and that improvements from meta-informed prompting demonstrate their capacity for in-context meta-learning. These results highlight a promising new role for LLMs as lightweight, general-purpose assistants for model selection and hyperparameter optimization.
comment: 27 pages, 6 figures
☆ Enhancing ECG Classification Robustness with Lightweight Unsupervised Anomaly Detection Filters
Continuous electrocardiogram (ECG) monitoring via wearables offers significant potential for early cardiovascular disease (CVD) detection. However, deploying deep learning models for automated analysis in resource-constrained environments faces reliability challenges due to inevitable Out-of-Distribution (OOD) data. OOD inputs, such as unseen pathologies or noisecorrupted signals, often cause erroneous, high-confidence predictions by standard classifiers, compromising patient safety. Existing OOD detection methods either neglect computational constraints or address noise and unseen classes separately. This paper explores Unsupervised Anomaly Detection (UAD) as an independent, upstream filtering mechanism to improve robustness. We benchmark six UAD approaches, including Deep SVDD, reconstruction-based models, Masked Anomaly Detection, normalizing flows, and diffusion models, optimized via Neural Architecture Search (NAS) under strict resource constraints (at most 512k parameters). Evaluation on PTB-XL and BUT QDB datasets assessed detection of OOD CVD classes and signals unsuitable for analysis due to noise. Results show Deep SVDD consistently achieves the best trade-off between detection and efficiency. In a realistic deployment simulation, integrating the optimized Deep SVDD filter with a diagnostic classifier improved accuracy by up to 21 percentage points over a classifier-only baseline. This study demonstrates that optimized UAD filters can safeguard automated ECG analysis, enabling safer, more reliable continuous cardiovascular monitoring on wearables.
comment: Submitted to the 24th International Conference on Pervasive Computing and Communications (PerCom 2026)
☆ Data-Efficient RLVR via Off-Policy Influence Guidance
Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic-based, lacking theoretical guarantees and generalizability. This work proposes a theoretically-grounded approach using influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop \textbf{C}urriculum \textbf{R}L with \textbf{O}ff-\textbf{P}olicy \text{I}nfluence guidance (\textbf{CROPI}), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training. On a 1.5B model, it achieves a 2.66x step-level acceleration while using only 10\% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
☆ Quantum Gated Recurrent GAN with Gaussian Uncertainty for Network Anomaly Detection
Anomaly detection in time-series data is a critical challenge with significant implications for network security. Recent quantum machine learning approaches, such as quantum kernel methods and variational quantum circuits, have shown promise in capturing complex data distributions for anomaly detection but remain constrained by limited qubit counts. We introduce in this work a novel Quantum Gated Recurrent Unit (QGRU)-based Generative Adversarial Network (GAN) employing Successive Data Injection (SuDaI) and a multi-metric gating strategy for robust network anomaly detection. Our model uniquely utilizes a quantum-enhanced generator that outputs parameters (mean and log-variance) of a Gaussian distribution via reparameterization, combined with a Wasserstein critic to stabilize adversarial training. Anomalies are identified through a novel gating mechanism that initially flags potential anomalies based on Gaussian uncertainty estimates and subsequently verifies them using a composite of critic scores and reconstruction errors. Evaluated on benchmark datasets, our method achieves a high time-series aware F1 score (TaF1) of 89.43% demonstrating superior capability in detecting anomalies accurately and promptly as compared to existing classical and quantum models. Furthermore, the trained QGRU-WGAN was deployed on real IBM Quantum hardware, where it retained high anomaly detection performance, confirming its robustness and practical feasibility on current noisy intermediate-scale quantum (NISQ) devices.
☆ LINK-KG: LLM-Driven Coreference-Resolved Knowledge Graphs for Human Smuggling Networks
Human smuggling networks are complex and constantly evolving, making them difficult to analyze comprehensively. Legal case documents offer rich factual and procedural insights into these networks but are often long, unstructured, and filled with ambiguous or shifting references, posing significant challenges for automated knowledge graph (KG) construction. Existing methods either overlook coreference resolution or fail to scale beyond short text spans, leading to fragmented graphs and inconsistent entity linking. We propose LINK-KG, a modular framework that integrates a three-stage, LLM-guided coreference resolution pipeline with downstream KG extraction. At the core of our approach is a type-specific Prompt Cache, which consistently tracks and resolves references across document chunks, enabling clean and disambiguated narratives for structured knowledge graph construction from both short and long legal texts. LINK-KG reduces average node duplication by 45.21% and noisy nodes by 32.22% compared to baseline methods, resulting in cleaner and more coherent graph structures. These improvements establish LINK-KG as a strong foundation for analyzing complex criminal networks.
comment: Accepted in ICKG 2025 Conference, 8 Pages, 2 Figures
☆ ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems
Adapting large language models (LLMs) via reinforcement learning (RL) is often bottlenecked by the generation stage, which can consume over 75\% of the training time. Speculative decoding (SD) accelerates autoregressive generation in serving systems, but its behavior under RL training remains largely unexplored. We identify three critical gaps that hinder the naive integration of SD into RL systems: diminishing speedups at large batch sizes, drafter staleness under continual actor updates, and drafter-induced policy degradation. To address these gaps, we present ReSpec, a system that adapts SD to RL through three complementary mechanisms: dynamically tuning SD configurations, evolving the drafter via knowledge distillation, and weighting updates by rollout rewards. On Qwen models (3B--14B), ReSpec achieves up to 4.5x speedup while preserving reward convergence and training stability, providing a practical solution for efficient RL-based LLM adaptation.
☆ Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (i.e., tail data). This leads to an imbalanced optimization that drives the model to prioritize simple reasoning skills, while hindering its ability to tackle more complex reasoning tasks. Over iterations, this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew effect"--which ultimately hinders further model improvement and leads to performance bottlenecks. To counteract this challenge, we introduce four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks demonstrate that our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
comment: Preprint
☆ Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
☆ Vectorized Context-Aware Embeddings for GAT-Based Collaborative Filtering
Recommender systems often struggle with data sparsity and cold-start scenarios, limiting their ability to provide accurate suggestions for new or infrequent users. This paper presents a Graph Attention Network (GAT) based Collaborative Filtering (CF) framework enhanced with Large Language Model (LLM) driven context aware embeddings. Specifically, we generate concise textual user profiles and unify item metadata (titles, genres, overviews) into rich textual embeddings, injecting these as initial node features in a bipartite user item graph. To further optimize ranking performance, we introduce a hybrid loss function that combines Bayesian Personalized Ranking (BPR) with a cosine similarity term and robust negative sampling, ensuring explicit negative feedback is distinguished from unobserved data. Experiments on the MovieLens 100k and 1M datasets show consistent improvements over state-of-the-art baselines in Precision, NDCG, and MAP while demonstrating robustness for users with limited interaction history. Ablation studies confirm the critical role of LLM-augmented embeddings and the cosine similarity term in capturing nuanced semantic relationships. Our approach effectively mitigates sparsity and cold-start limitations by integrating LLM-derived contextual understanding into graph-based architectures. Future directions include balancing recommendation accuracy with coverage and diversity, and introducing fairness-aware constraints and interpretability features to enhance system performance further.
☆ Robust Graph Condensation via Classification Complexity Mitigation
Graph condensation (GC) has gained significant attention for its ability to synthesize smaller yet informative graphs. However, existing studies often overlook the robustness of GC in scenarios where the original graph is corrupted. In such cases, we observe that the performance of GC deteriorates significantly, while existing robust graph learning technologies offer only limited effectiveness. Through both empirical investigation and theoretical analysis, we reveal that GC is inherently an intrinsic-dimension-reducing process, synthesizing a condensed graph with lower classification complexity. Although this property is critical for effective GC performance, it remains highly vulnerable to adversarial perturbations. To tackle this vulnerability and improve GC robustness, we adopt the geometry perspective of graph data manifold and propose a novel Manifold-constrained Robust Graph Condensation framework named MRGC. Specifically, we introduce three graph data manifold learning modules that guide the condensed graph to lie within a smooth, low-dimensional manifold with minimal class ambiguity, thereby preserving the classification complexity reduction capability of GC and ensuring robust performance under universal adversarial attacks. Extensive experiments demonstrate the robustness of \ModelName\ across diverse attack scenarios.
☆ Personalized Treatment Outcome Prediction from Scarce Data via Dual-Channel Knowledge Distillation and Adaptive Fusion
Personalized treatment outcome prediction based on trial data for small-sample and rare patient groups is critical in precision medicine. However, the costly trial data limit the prediction performance. To address this issue, we propose a cross-fidelity knowledge distillation and adaptive fusion network (CFKD-AFN), which leverages abundant but low-fidelity simulation data to enhance predictions on scarce but high-fidelity trial data. CFKD-AFN incorporates a dual-channel knowledge distillation module to extract complementary knowledge from the low-fidelity model, along with an attention-guided fusion module to dynamically integrate multi-source information. Experiments on treatment outcome prediction for the chronic obstructive pulmonary disease demonstrates significant improvements of CFKD-AFN over state-of-the-art methods in prediction accuracy, ranging from 6.67\% to 74.55\%, and strong robustness to varying high-fidelity dataset sizes. Furthermore, we extend CFKD-AFN to an interpretable variant, enabling the exploration of latent medical semantics to support clinical decision-making.
☆ Co-Evolving Latent Action World Models
Adapting pre-trained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamic model in LAM with a powerful world model and training them jointly, but it is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with the pre-trained world model. This unlocks a co-evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA-World matches or outperforms prior two-stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.
☆ Autograder+: A Multi-Faceted AI Framework for Rich Pedagogical Feedback in Programming Education
The rapid growth of programming education has outpaced traditional assessment tools, leaving faculty with limited means to provide meaningful, scalable feedback. Conventional autograders, while efficient, act as black-box systems that simply return pass/fail results, offering little insight into student thinking or learning needs. Autograder+ is designed to shift autograding from a purely summative process to a formative learning experience. It introduces two key capabilities: automated feedback generation using a fine-tuned Large Language Model, and visualization of student code submissions to uncover learning patterns. The model is fine-tuned on curated student code and expert feedback to ensure pedagogically aligned, context-aware guidance. In evaluation across 600 student submissions from multiple programming tasks, the system produced feedback with strong semantic alignment to instructor comments. For visualization, contrastively learned code embeddings trained on 1,000 annotated submissions enable grouping solutions into meaningful clusters based on functionality and approach. The system also supports prompt-pooling, allowing instructors to guide feedback style through selected prompt templates. By integrating AI-driven feedback, semantic clustering, and interactive visualization, Autograder+ reduces instructor workload while supporting targeted instruction and promoting stronger learning outcomes.
☆ Multi-Output Robust and Conjugate Gaussian Processes
Multi-output Gaussian process (MOGP) regression allows modelling dependencies among multiple correlated response variables. Similarly to standard Gaussian processes, MOGPs are sensitive to model misspecification and outliers, which can distort predictions within individual outputs. This situation can be further exacerbated by multiple anomalous response variables whose errors propagate due to correlations between outputs. To handle this situation, we extend and generalise the robust and conjugate Gaussian process (RCGP) framework introduced by Altamirano et al. (2024). This results in the multi-output RCGP (MO-RCGP): a provably robust MOGP that is conjugate, and jointly captures correlations across outputs. We thoroughly evaluate our approach through applications in finance and cancer research.
☆ Multi-Task Learning Based on Support Vector Machines and Twin Support Vector Machines: A Comprehensive Survey
Multi-task learning (MTL) enables simultaneous training across related tasks, leveraging shared information to improve generalization, efficiency, and robustness, especially in data-scarce or high-dimensional scenarios. While deep learning dominates recent MTL research, Support Vector Machines (SVMs) and Twin SVMs (TWSVMs) remain relevant due to their interpretability, theoretical rigor, and effectiveness with small datasets. This chapter surveys MTL approaches based on SVM and TWSVM, highlighting shared representations, task regularization, and structural coupling strategies. Special attention is given to emerging TWSVM extensions for multi-task settings, which show promise but remain underexplored. We compare these models in terms of theoretical properties, optimization strategies, and empirical performance, and discuss applications in fields such as computer vision, natural language processing, and bioinformatics. Finally, we identify research gaps and outline future directions for building scalable, interpretable, and reliable margin-based MTL frameworks. This work provides a comprehensive resource for researchers and practitioners interested in SVM- and TWSVM-based multi-task learning.
☆ Adaptive Context Length Optimization with Low-Frequency Truncation for Multi-Agent Reinforcement Learning
Recently, deep multi-agent reinforcement learning (MARL) has demonstrated promising performance for solving challenging tasks, such as long-term dependencies and non-Markovian environments. Its success is partly attributed to conditioning policies on large fixed context length. However, such large fixed context lengths may lead to limited exploration efficiency and redundant information. In this paper, we propose a novel MARL framework to obtain adaptive and effective contextual information. Specifically, we design a central agent that dynamically optimizes context length via temporal gradient analysis, enhancing exploration to facilitate convergence to global optima in MARL. Furthermore, to enhance the adaptive optimization capability of the context length, we present an efficient input representation for the central agent, which effectively filters redundant information. By leveraging a Fourier-based low-frequency truncation method, we extract global temporal trends across decentralized agents, providing an effective and efficient representation of the MARL environment. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance on long-term dependency tasks, including PettingZoo, MiniGrid, Google Research Football (GRF), and StarCraft Multi-Agent Challenge v2 (SMACv2).
☆ Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings
The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmarking items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks (`cold-start'), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we challenge this paradigm and propose a item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves, rather than on model-specific failure patterns. We instantiate this item-centric efficient benchmarking approach via a novel method, Scales++, where data selection is based on the cognitive demands of the benchmark samples. Empirically, we show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. On the Open LLM Leaderboard, using just a 0.5\% data subset, we predict full benchmark scores with a 2.9% mean absolute error. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation, while also providing better cold-start performance and more interpretable benchmarking.
comment: 9 pages, 2 figures, 4 tables
☆ Efficient Generative AI Boosts Probabilistic Forecasting of Sudden Stratospheric Warmings
Sudden Stratospheric Warmings (SSWs) are key sources of subseasonal predictability and major drivers of extreme winter weather. Yet, their accurate and efficient forecast remains a persistent challenge for numerical weather prediction (NWP) systems due to limitations in physical representation, initialization, and the immense computational demands of ensemble forecasts. While data-driven forecasting is rapidly evolving, its application to the complex, three-dimensional dynamics of SSWs, particularly for probabilistic forecast, remains underexplored. Here, we bridge this gap by developing a Flow Matching-based generative AI model (FM-Cast) for efficient and skillful probabilistic forecasting of the spatiotemporal evolution of stratospheric circulation. Evaluated across 18 major SSW events (1998-2024), FM-Cast skillfully forecasts the onset, intensity, and morphology of 10 events up to 20 days in advance, achieving ensemble accuracies above 50%. Its performance is comparable to or exceeds leading NWP systems while requiring only two minutes for a 50-member, 30-day forecast on a consumer GPU. Furthermore, leveraging FM-Cast as a scientific tool, we demonstrate through idealized experiments that SSW predictability is fundamentally linked to its underlying physical drivers, distinguishing between events forced from the troposphere and those driven by internal stratospheric dynamics. Our work thus establishes a computationally efficient paradigm for probabilistic forecasting stratospheric anomalies and showcases generative AI's potential to deepen the physical understanding of atmosphere-climate dynamics.
☆ CorVS: Person Identification via Video Trajectory-Sensor Correspondence in a Real-World Warehouse
Worker location data is key to higher productivity in industrial sites. Cameras are a promising tool for localization in logistics warehouses since they also offer valuable environmental contexts such as package status. However, identifying individuals with only visual data is often impractical. Accordingly, several prior studies identified people in videos by comparing their trajectories and wearable sensor measurements. While this approach has advantages such as independence from appearance, the existing methods may break down under real-world conditions. To overcome this challenge, we propose CorVS, a novel data-driven person identification method based on correspondence between visual tracking trajectories and sensor measurements. Firstly, our deep learning model predicts correspondence probabilities and reliabilities for every pair of a trajectory and sensor measurements. Secondly, our algorithm matches the trajectories and sensor measurements over time using the predicted probabilities and reliabilities. We developed a dataset with actual warehouse operations and demonstrated the method's effectiveness for real-world applications.
comment: 7 pages, 3 figures, accepted to IPIN 2025
☆ Towards Explainable and Reliable AI in Finance
Financial forecasting increasingly uses large neural network models, but their opacity raises challenges for trust and regulatory compliance. We present several approaches to explainable and reliable AI in finance. \emph{First}, we describe how Time-LLM, a time series foundation model, uses a prompt to avoid a wrong directional forecast. \emph{Second}, we show that combining foundation models for time series forecasting with a reliability estimator can filter our unreliable predictions. \emph{Third}, we argue for symbolic reasoning encoding domain rules for transparent justification. These approaches shift emphasize executing only forecasts that are both reliable and explainable. Experiments on equity and cryptocurrency data show that the architecture reduces false positives and supports selective execution. By integrating predictive performance with reliability estimation and rule-based reasoning, our framework advances transparent and auditable financial AI systems.
☆ UnifiedFL: A Dynamic Unified Learning Framework for Equitable Federation
Federated learning (FL) has emerged as a key paradigm for collaborative model training across multiple clients without sharing raw data, enabling privacy-preserving applications in areas such as radiology and pathology. However, works on collaborative training across clients with fundamentally different neural architectures and non-identically distributed datasets remain scarce. Existing FL frameworks face several limitations. Despite claiming to support architectural heterogeneity, most recent FL methods only tolerate variants within a single model family (e.g., shallower, deeper, or wider CNNs), still presuming a shared global architecture and failing to accommodate federations where clients deploy fundamentally different network types (e.g., CNNs, GNNs, MLPs). Moreover, existing approaches often address only statistical heterogeneity while overlooking the domain-fracture problem, where each client's data distribution differs markedly from that faced at testing time, undermining model generalizability. When clients use different architectures, have non-identically distributed data, and encounter distinct test domains, current methods perform poorly. To address these challenges, we propose UnifiedFL, a dynamic federated learning framework that represents heterogeneous local networks as nodes and edges in a directed model graph optimized by a shared graph neural network (GNN). UnifiedFL introduces (i) a common GNN to parameterize all architectures, (ii) distance-driven clustering via Euclidean distances between clients' parameters, and (iii) a two-tier aggregation policy balancing convergence and diversity. Experiments on MedMNIST classification and hippocampus segmentation benchmarks demonstrate UnifiedFL's superior performance. Code and data: https://github.com/basiralab/UnifiedFL
☆ Reinforcement Learning for Pollution Detection in a Randomized, Sparse and Nonstationary Environment with an Autonomous Underwater Vehicle
Reinforcement learning (RL) algorithms are designed to optimize problem-solving by learning actions that maximize rewards, a task that becomes particularly challenging in random and nonstationary environments. Even advanced RL algorithms are often limited in their ability to solve problems in these conditions. In applications such as searching for underwater pollution clouds with autonomous underwater vehicles (AUVs), RL algorithms must navigate reward-sparse environments, where actions frequently result in a zero reward. This paper aims to address these challenges by revisiting and modifying classical RL approaches to efficiently operate in sparse, randomized, and nonstationary environments. We systematically study a large number of modifications, including hierarchical algorithm changes, multigoal learning, and the integration of a location memory as an external output filter to prevent state revisits. Our results demonstrate that a modified Monte Carlo-based approach significantly outperforms traditional Q-learning and two exhaustive search patterns, illustrating its potential in adapting RL to complex environments. These findings suggest that reinforcement learning approaches can be effectively adapted for use in random, nonstationary, and reward-sparse environments.
☆ MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data
Health-related misinformation is very prevalent and potentially harmful. It is difficult to identify, especially when claims distort or misinterpret scientific findings. We investigate the impact of synthetic data generation and lightweight fine-tuning techniques on the ability of large language models (LLMs) to recognize fallacious arguments using the MISSCI dataset and framework. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples, which are then used to fine-tune an LLM model. Our results show substantial accuracy gains with fine-tuned models compared to vanilla baselines. For instance, the LLaMA 3.1 8B fine-tuned model achieved an over 35% F1-score absolute improvement on the MISSCI test split over its vanilla baseline. We demonstrate that introducing synthetic fallacy data to augment limited annotated resources can significantly enhance zero-shot LLM classification performance on real-world scientific misinformation tasks, even with limited computational resources. The code and synthetic dataset are available on https://github.com/mxpoliakov/MisSynth.
☆ Linear Causal Discovery with Interventional Constraints
Incorporating causal knowledge and mechanisms is essential for refining causal models and improving downstream tasks such as designing new treatments. In this paper, we introduce a novel concept in causal discovery, termed interventional constraints, which differs fundamentally from interventional data. While interventional data require direct perturbations of variables, interventional constraints encode high-level causal knowledge in the form of inequality constraints on causal effects. For instance, in the Sachs dataset (Sachs et al.\ 2005), Akt has been shown to be activated by PIP3, meaning PIP3 exerts a positive causal effect on Akt. Existing causal discovery methods allow enforcing structural constraints (for example, requiring a causal path from PIP3 to Akt), but they may still produce incorrect causal conclusions such as learning that "PIP3 inhibits Akt". Interventional constraints bridge this gap by explicitly constraining the total causal effect between variable pairs, ensuring learned models respect known causal influences. To formalize interventional constraints, we propose a metric to quantify total causal effects for linear causal models and formulate the problem as a constrained optimization task, solved using a two-stage constrained optimization method. We evaluate our approach on real-world datasets and demonstrate that integrating interventional constraints not only improves model accuracy and ensures consistency with established findings, making models more explainable, but also facilitates the discovery of new causal relationships that would otherwise be costly to identify.
☆ SABER: Symbolic Regression-based Angle of Arrival and Beam Pattern Estimator
Accurate Angle-of-arrival (AoA) estimation is essential for next-generation wireless communication systems to enable reliable beamforming, high-precision localization, and integrated sensing. Unfortunately, classical high-resolution techniques require multi-element arrays and extensive snapshot collection, while generic Machine Learning (ML) approaches often yield black-box models that lack physical interpretability. To address these limitations, we propose a Symbolic Regression (SR)-based ML framework. Namely, Symbolic Regression-based Angle of Arrival and Beam Pattern Estimator (SABER), a constrained symbolic-regression framework that automatically discovers closed-form beam pattern and AoA models from path loss measurements with interpretability. SABER achieves high accuracy while bridging the gap between opaque ML methods and interpretable physics-driven estimators. First, we validate our approach in a controlled free-space anechoic chamber, showing that both direct inversion of the known $\cos^n$ beam and a low-order polynomial surrogate achieve sub-0.5 degree Mean Absolute Error (MAE). A purely unconstrained SR method can further reduce the error of the predicted angles, but produces complex formulas that lack physical insight. Then, we implement the same SR-learned inversions in a real-world, Reconfigurable Intelligent Surface (RIS)-aided indoor testbed. SABER and unconstrained SR models accurately recover the true AoA with near-zero error. Finally, we benchmark SABER against the Cram\'er-Rao Lower Bounds (CRLBs). Our results demonstrate that SABER is an interpretable and accurate alternative to state-of-the-art and black-box ML-based methods for AoA estimation.
comment: 12 pages, 11 figures
☆ Agent Skills Enable a New Class of Realistic and Trivially Simple Prompt Injections
Enabling continual learning in LLMs remains a key unresolved research challenge. In a recent announcement, a frontier LLM company made a step towards this by introducing Agent Skills, a framework that equips agents with new knowledge based on instructions stored in simple markdown files. Although Agent Skills can be a very useful tool, we show that they are fundamentally insecure, since they enable trivially simple prompt injections. We demonstrate how to hide malicious instructions in long Agent Skill files and referenced scripts to exfiltrate sensitive data, such as internal files or passwords. Importantly, we show how to bypass system-level guardrails of a popular coding agent: a benign, task-specific approval with the "Don't ask again" option can carry over to closely related but harmful actions. Overall, we conclude that despite ongoing research efforts and scaling model capabilities, frontier LLMs remain vulnerable to very simple prompt injections in realistic scenarios. Our code is available at https://github.com/aisa-group/promptinject-agent-skills.
☆ Posterior Sampling by Combining Diffusion Models with Annealed Langevin Dynamics NeurIPS 2025
Given a noisy linear measurement $y = Ax + \xi$ of a distribution $p(x)$, and a good approximation to the prior $p(x)$, when can we sample from the posterior $p(x \mid y)$? Posterior sampling provides an accurate and fair framework for tasks such as inpainting, deblurring, and MRI reconstruction, and several heuristics attempt to approximate it. Unfortunately, approximate posterior sampling is computationally intractable in general. To sidestep this hardness, we focus on (local or global) log-concave distributions $p(x)$. In this regime, Langevin dynamics yields posterior samples when the exact scores of $p(x)$ are available, but it is brittle to score--estimation error, requiring an MGF bound (sub-exponential error). By contrast, in the unconditional setting, diffusion models succeed with only an $L^2$ bound on the score error. We prove that combining diffusion models with an annealed variant of Langevin dynamics achieves conditional sampling in polynomial time using merely an $L^4$ bound on the score error.
comment: NeurIPS 2025
☆ On the Impact of Weight Discretization in QUBO-Based SVM Training ECML
Training Support Vector Machines (SVMs) can be formulated as a QUBO problem, enabling the use of quantum annealing for model optimization. In this work, we study how the number of qubits - linked to the discretization level of dual weights - affects predictive performance across datasets. We compare QUBO-based SVM training to the classical LIBSVM solver and find that even low-precision QUBO encodings (e.g., 1 bit per parameter) yield competitive, and sometimes superior, accuracy. While increased bit-depth enables larger regularization parameters, it does not always improve classification. Our findings suggest that selecting the right support vectors may matter more than their precise weighting. Although current hardware limits the size of solvable QUBOs, our results highlight the potential of quantum annealing for efficient SVM training as quantum devices scale.
comment: Presented at the 7th DSO Workshop at ECML PKDD 2025
☆ Model Inversion with Layer-Specific Modeling and Alignment for Data-Free Continual Learning NeurIPS 2025
Continual learning (CL) aims to incrementally train a model on a sequence of tasks while retaining performance on prior ones. However, storing and replaying data is often infeasible due to privacy or security constraints and impractical for arbitrary pre-trained models. Data-free CL seeks to update models without access to previous data. Beyond regularization, we employ model inversion to synthesize data from the trained model, enabling replay without storing samples. Yet, model inversion in predictive models faces two challenges: (1) generating inputs solely from compressed output labels causes drift between synthetic and real data, and replaying such data can erode prior knowledge; (2) inversion is computationally expensive since each step backpropagates through the full model. These issues are amplified in large pre-trained models such as CLIP. To improve efficiency, we propose Per-layer Model Inversion (PMI), inspired by faster convergence in single-layer optimization. PMI provides strong initialization for full-model inversion, substantially reducing iterations. To mitigate feature shift, we model class-wise features via Gaussian distributions and contrastive model, ensuring alignment between synthetic and real features. Combining PMI and feature modeling, our approach enables continual learning of new classes by generating pseudo-images from semantic-aware projected features, achieving strong effectiveness and compatibility across multiple CL settings.
comment: Accepted in NeurIPS 2025
☆ A Survey of Heterogeneous Graph Neural Networks for Cybersecurity Anomaly Detection
Anomaly detection is a critical task in cybersecurity, where identifying insider threats, access violations, and coordinated attacks is essential for ensuring system resilience. Graph-based approaches have become increasingly important for modeling entity interactions, yet most rely on homogeneous and static structures, which limits their ability to capture the heterogeneity and temporal evolution of real-world environments. Heterogeneous Graph Neural Networks (HGNNs) have emerged as a promising paradigm for anomaly detection by incorporating type-aware transformations and relation-sensitive aggregation, enabling more expressive modeling of complex cyber data. However, current research on HGNN-based anomaly detection remains fragmented, with diverse modeling strategies, limited comparative evaluation, and an absence of standardized benchmarks. To address this gap, we provide a comprehensive survey of HGNN-based anomaly detection methods in cybersecurity. We introduce a taxonomy that classifies approaches by anomaly type and graph dynamics, analyze representative models, and map them to key cybersecurity applications. We also review commonly used benchmark datasets and evaluation metrics, highlighting their strengths and limitations. Finally, we identify key open challenges related to modeling, data, and deployment, and outline promising directions for future research. This survey aims to establish a structured foundation for advancing HGNN-based anomaly detection toward scalable, interpretable, and practically deployable solutions.
comment: 37 pages, 4 figures, 86 references. Submitted to Journal of Computer Security (under review)
☆ Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime
Adam [Kingma and Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with $\ell_\infty$-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and we show that its bias can deviate from the full-batch behavior. To illustrate this, we construct a class of structured datasets where incremental Adam provably converges to the $\ell_2$-max-margin classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch Adam. For general datasets, we develop a proxy algorithm that captures the limiting behavior of incremental Adam as $\beta_2 \to 1$ and we characterize its convergence direction via a data-dependent dual fixed-point formulation. Finally, we prove that, unlike Adam, Signum [Bernstein et al., 2018] converges to the $\ell_\infty$-max-margin classifier for any batch size by taking $\beta$ close enough to 1. Overall, our results highlight that the implicit bias of Adam crucially depends on both the batching scheme and the dataset, while Signum remains invariant.
comment: 50 pages
☆ Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens
Contrastive Language-Image Pre-training (CLIP) delivers strong cross modal generalization by aligning images and texts in a shared embedding space, yet it persistently fails at compositional reasoning over objects, attributes, and relations often behaving like a bag-of-words matcher. Prior causal accounts typically model text as a single vector, obscuring token-level structure and leaving core phenomena-such as prompt sensitivity and failures on hard negatives unexplained. We address this gap with a token-aware causal representation learning (CRL) framework grounded in a sequential, language-token SCM. Our theory extends block identifiability to tokenized text, proving that CLIP's contrastive objective can recover the modal-invariant latent variable under both sentence-level and token-level SCMs. Crucially, token granularity yields the first principled explanation of CLIP's compositional brittleness: composition nonidentifiability. We show the existence of pseudo-optimal text encoders that achieve perfect modal-invariant alignment yet are provably insensitive to SWAP, REPLACE, and ADD operations over atomic concepts, thereby failing to distinguish correct captions from hard negatives despite optimizing the same training objective as true-optimal encoders. The analysis further links language-side nonidentifiability to visual-side failures via the modality gap and shows how iterated composition operators compound hardness, motivating improved negative mining strategies.
☆ Offline Clustering of Preference Learning with Active-data Augmentation
Preference learning from pairwise feedback is a widely adopted framework in applications such as reinforcement learning with human feedback and recommendations. In many practical settings, however, user interactions are limited or costly, making offline preference learning necessary. Moreover, real-world preference learning often involves users with different preferences. For example, annotators from different backgrounds may rank the same responses differently. This setting presents two central challenges: (1) identifying similarity across users to effectively aggregate data, especially under scenarios where offline data is imbalanced across dimensions, and (2) handling the imbalanced offline data where some preference dimensions are underrepresented. To address these challenges, we study the Offline Clustering of Preference Learning problem, where the learner has access to fixed datasets from multiple users with potentially different preferences and aims to maximize utility for a test user. To tackle the first challenge, we first propose Off-C$^2$PL for the pure offline setting, where the learner relies solely on offline data. Our theoretical analysis provides a suboptimality bound that explicitly captures the tradeoff between sample noise and bias. To address the second challenge of inbalanced data, we extend our framework to the setting with active-data augmentation where the learner is allowed to select a limited number of additional active-data for the test user based on the cluster structure learned by Off-C$^2$PL. In this setting, our second algorithm, A$^2$-Off-C$^2$PL, actively selects samples that target the least-informative dimensions of the test user's preference. We prove that these actively collected samples contribute more effectively than offline ones. Finally, we validate our theoretical results through simulations on synthetic and real-world datasets.
☆ Unravelling the Mechanisms of Manipulating Numbers in Language Models
Recent work has shown that different large language models (LLMs) converge to similar and accurate input embedding representations for numbers. These findings conflict with the documented propensity of LLMs to produce erroneous outputs when dealing with numeric information. In this work, we aim to explain this conflict by exploring how language models manipulate numbers and quantify the lower bounds of accuracy of these mechanisms. We find that despite surfacing errors, different language models learn interchangeable representations of numbers that are systematic, highly accurate and universal across their hidden states and the types of input contexts. This allows us to create universal probes for each LLM and to trace information -- including the causes of output errors -- to specific layers. Our results lay a fundamental understanding of how pre-trained LLMs manipulate numbers and outline the potential of more accurate probing techniques in addressed refinements of LLMs' architectures.
☆ Empirical Bayesian Multi-Bandit Learning
Multi-task learning in contextual bandits has attracted significant research interest due to its potential to enhance decision-making across multiple related tasks by leveraging shared structures and task-specific heterogeneity. In this article, we propose a novel hierarchical Bayesian framework for learning in various bandit instances. This framework captures both the heterogeneity and the correlations among different bandit instances through a hierarchical Bayesian model, enabling effective information sharing while accommodating instance-specific variations. Unlike previous methods that overlook the learning of the covariance structure across bandits, we introduce an empirical Bayesian approach to estimate the covariance matrix of the prior distribution.This enhances both the practicality and flexibility of learning across multi-bandits. Building on this approach, we develop two efficient algorithms: ebmTS (Empirical Bayesian Multi-Bandit Thompson Sampling) and ebmUCB (Empirical Bayesian Multi-Bandit Upper Confidence Bound), both of which incorporate the estimated prior into the decision-making process. We provide the frequentist regret upper bounds for the proposed algorithms, thereby filling a research gap in the field of multi-bandit problems. Extensive experiments on both synthetic and real-world datasets demonstrate the superior performance of our algorithms, particularly in complex environments. Our methods achieve lower cumulative regret compared to existing techniques, highlighting their effectiveness in balancing exploration and exploitation across multi-bandits.
comment: 33 pages, 13 figures
☆ Distributional Multi-objective Black-box Optimization for Diffusion-model Inference-time Multi-Target Generation
Diffusion models have been successful in learning complex data distributions. This capability has driven their application to high-dimensional multi-objective black-box optimization problem. Existing approaches often employ an external optimization loop, such as an evolutionary algorithm, to the diffusion model. However, these approaches treat the diffusion model as a black-box refiner, which overlooks the internal distribution transition of the diffusion generation process, limiting their efficiency. To address these challenges, we propose the Inference-time Multi-target Generation (IMG) algorithm, which optimizes the diffusion process at inference-time to generate samples that simultaneously satisfy multiple objectives. Specifically, our IMG performs weighted resampling during the diffusion generation process according to the expected aggregated multi-objective values. This weighted resampling strategy ensures the diffusion-generated samples are distributed according to our desired multi-target Boltzmann distribution. We further derive that the multi-target Boltzmann distribution has an interesting log-likelihood interpretation, where it is the optimal solution to the distributional multi-objective optimization problem. We implemented IMG for a multi-objective molecule generation task. Experiments show that IMG, requiring only a single generation pass, achieves a significantly higher hypervolume than baseline optimization algorithms that often require hundreds of diffusion generations. Notably, our algorithm can be viewed as an optimized diffusion process and can be integrated into existing methods to further improve their performance.
☆ A Research Roadmap for Augmenting Software Engineering Processes and Software Products with Generative AI
Generative AI (GenAI) is rapidly transforming software engineering (SE) practices, influencing how SE processes are executed, as well as how software systems are developed, operated, and evolved. This paper applies design science research to build a roadmap for GenAI-augmented SE. The process consists of three cycles that incrementally integrate multiple sources of evidence, including collaborative discussions from the FSE 2025 "Software Engineering 2030" workshop, rapid literature reviews, and external feedback sessions involving peers. McLuhan's tetrads were used as a conceptual instrument to systematically capture the transforming effects of GenAI on SE processes and software products.The resulting roadmap identifies four fundamental forms of GenAI augmentation in SE and systematically characterizes their related research challenges and opportunities. These insights are then consolidated into a set of future research directions. By grounding the roadmap in a rigorous multi-cycle process and cross-validating it among independent author teams and peers, the study provides a transparent and reproducible foundation for analyzing how GenAI affects SE processes, methods and tools, and for framing future research within this rapidly evolving area. Based on these findings, the article finally makes ten predictions for SE in the year 2030.
☆ PVMark: Enabling Public Verifiability for LLM Watermarking Schemes
Watermarking schemes for large language models (LLMs) have been proposed to identify the source of the generated text, mitigating the potential threats emerged from model theft. However, current watermarking solutions hardly resolve the trust issue: the non-public watermark detection cannot prove itself faithfully conducting the detection. We observe that it is attributed to the secret key mostly used in the watermark detection -- it cannot be public, or the adversary may launch removal attacks provided the key; nor can it be private, or the watermarking detection is opaque to the public. To resolve the dilemma, we propose PVMark, a plugin based on zero-knowledge proof (ZKP), enabling the watermark detection process to be publicly verifiable by third parties without disclosing any secret key. PVMark hinges upon the proof of `correct execution' of watermark detection on which a set of ZKP constraints are built, including mapping, random number generation, comparison, and summation. We implement multiple variants of PVMark in Python, Rust and Circom, covering combinations of three watermarking schemes, three hash functions, and four ZKP protocols, to show our approach effectively works under a variety of circumstances. By experimental results, PVMark efficiently enables public verifiability on the state-of-the-art LLM watermarking schemes yet without compromising the watermarking performance, promising to be deployed in practice.
comment: This work has been submitted to the IEEE for possible publication
☆ Likely Interpolants of Generative Models
Interpolation in generative models allows for controlled generation, model inspection, and more. Unfortunately, most generative models lack a principal notion of interpolants without restrictive assumptions on either the model or data dimension. In this paper, we develop a general interpolation scheme that targets likely transition paths compatible with different metrics and probability distributions. We consider interpolants analogous to a geodesic constrained to a suitable data distribution and derive a novel algorithm for computing these curves, which requires no additional training. Theoretically, we show that our method locally can be considered as a geodesic under a suitable Riemannian metric. We quantitatively show that our interpolation scheme traverses higher density regions than baselines across a range of models and datasets.
♻ ☆ UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection
The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and codes will be made publicly available at https://github.com/quanlin-wu/unisite.
♻ ☆ S-CFE: Simple Counterfactual Explanations
We study the problem of finding optimal sparse, manifold-aligned counterfactual explanations for classifiers. Canonically, this can be formulated as an optimization problem with multiple non-convex components, including classifier loss functions and manifold alignment (or \emph{plausibility}) metrics. The added complexity of enforcing \emph{sparsity}, or shorter explanations, complicates the problem further. Existing methods often focus on specific models and plausibility measures, relying on convex $\ell_1$ regularizers to enforce sparsity. In this paper, we tackle the canonical formulation using the accelerated proximal gradient (APG) method, a simple yet efficient first-order procedure capable of handling smooth non-convex objectives and non-smooth $\ell_p$ (where $0 \leq p < 1$) regularizers. This enables our approach to seamlessly incorporate various classifiers and plausibility measures while producing sparser solutions. Our algorithm only requires differentiable data-manifold regularizers and supports box constraints for bounded feature ranges, ensuring the generated counterfactuals remain \emph{actionable}. Finally, experiments on real-world datasets demonstrate that our approach effectively produces sparse, manifold-aligned counterfactual explanations while maintaining proximity to the factual data and computational efficiency.
♻ ☆ Direct Debiased Machine Learning via Bregman Divergence Minimization
We develop a direct debiased machine learning framework comprising Neyman targeted estimation and generalized Riesz regression. Our framework unifies Riesz regression for automatic debiased machine learning, covariate balancing, targeted maximum likelihood estimation (TMLE), and density-ratio estimation. In many problems involving causal effects or structural models, the parameters of interest depend on regression functions. Plugging regression functions estimated by machine learning methods into the identifying equations can yield poor performance because of first-stage bias. To reduce such bias, debiased machine learning employs Neyman orthogonal estimating equations. Debiased machine learning typically requires estimation of the Riesz representer and the regression function. For this problem, we develop a direct debiased machine learning framework with an end-to-end algorithm. We formulate estimation of the nuisance parameters, the regression function and the Riesz representer, as minimizing the discrepancy between Neyman orthogonal scores computed with known and unknown nuisance parameters, which we refer to as Neyman targeted estimation. Neyman targeted estimation includes Riesz representer estimation, and we measure discrepancies using the Bregman divergence. The Bregman divergence encompasses various loss functions as special cases, where the squared loss yields Riesz regression and the Kullback-Leibler divergence yields entropy balancing. We refer to this Riesz representer estimation as generalized Riesz regression. Neyman targeted estimation also yields TMLE as a special case for regression function estimation. Furthermore, for specific pairs of models and Riesz representer estimation methods, we can automatically obtain the covariate balancing property without explicitly solving the covariate balancing objective.
♻ ☆ Adversarial generalization of unfolding (model-based) networks NeurIPS2025
Unfolding networks are interpretable networks emerging from iterative algorithms, incorporate prior knowledge of data structure, and are designed to solve inverse problems like compressed sensing, which deals with recovering data from noisy, missing observations. Compressed sensing finds applications in critical domains, from medical imaging to cryptography, where adversarial robustness is crucial to prevent catastrophic failures. However, a solid theoretical understanding of the performance of unfolding networks in the presence of adversarial attacks is still in its infancy. In this paper, we study the adversarial generalization of unfolding networks when perturbed with $l_2$-norm constrained attacks, generated by the fast gradient sign method. Particularly, we choose a family of state-of-the-art overaparameterized unfolding networks and deploy a new framework to estimate their adversarial Rademacher complexity. Given this estimate, we provide adversarial generalization error bounds for the networks under study, which are tight with respect to the attack level. To our knowledge, this is the first theoretical analysis on the adversarial generalization of unfolding networks. We further present a series of experiments on real-world data, with results corroborating our derived theory, consistently for all data. Finally, we observe that the family's overparameterization can be exploited to promote adversarial robustness, shedding light on how to efficiently robustify neural networks.
comment: Accepted at NeurIPS2025
♻ ☆ Advancing Local Clustering on Graphs via Compressive Sensing: Semi-supervised and Unsupervised Methods
Local clustering aims to identify specific substructures within a large graph without any additional structural information of the graph. These substructures are typically small compared to the overall graph, enabling the problem to be approached by finding a sparse solution to a linear system associated with the graph Laplacian. In this work, we first propose a method for identifying specific local clusters when very few labeled data are given, which we term semi-supervised local clustering. We then extend this approach to the unsupervised setting when no prior information on labels is available. The proposed methods involve randomly sampling the graph, applying diffusion through local cluster extraction, then examining the overlap among the results to find each cluster. We establish the co-membership conditions for any pair of nodes, and rigorously prove the correctness of our methods. Additionally, we conduct extensive experiments to demonstrate that the proposed methods achieve state of the art results in the low-label rates regime.
♻ ☆ Partially-Supervised Neural Network Model For Quadratic Multiparametric Programming
Neural Networks (NN) with ReLU activation functions are used to model multiparametric quadratic optimization problems (mp-QP) in diverse engineering applications. Researchers have suggested leveraging the piecewise affine property of deep NN models to solve mp-QP with linear constraints, which also exhibit piecewise affine behaviour. However, traditional deep NN applications to mp-QP fall short of providing optimal and feasible predictions, even when trained on large datasets. This study proposes a partially-supervised NN (PSNN) architecture that directly represents the mathematical structure of the global solution function. In contrast to generic NN training approaches, the proposed PSNN method derives a large proportion of model weights directly from the mathematical properties of the optimization problem, producing more accurate solutions despite significantly smaller training data sets. Many energy management problems are formulated as QP, so we apply the proposed approach to energy systems (specifically DC optimal power flow) to demonstrate proof of concept. Model performance in terms of solution accuracy and speed of predictions was compared against a commercial solver and a generic Deep NN model based on classical training. Results show KKT sufficient conditions for PSNN consistently outperform generic NN architectures with classical training using far less data, including when tested on extreme, out-of-training distribution test data. Given its speed advantages over traditional solvers, the PSNN model can quickly produce optimal and feasible solutions within a second for millions of input parameters sampled from a distribution of stochastic demands and renewable generator dispatches, which can be used for simulations and long term planning.
comment: 36 pages including references and appendix
♻ ☆ Integrating Protein Sequence and Expression Level to Analysis Molecular Characterization of Breast Cancer Subtypes
Breast cancer's complexity and variability pose significant challenges in understanding its progression and guiding effective treatment. This study aims to integrate protein sequence data with expression levels to improve the molecular characterization of breast cancer subtypes and predict clinical outcomes. Using ProtGPT2, a language model specifically designed for protein sequences, we generated embeddings that capture the functional and structural properties of proteins. These embeddings were integrated with protein expression levels to form enriched biological representations, which were analyzed using machine learning methods, such as ensemble K-means for clustering and XGBoost for classification. Our approach enabled the successful clustering of patients into biologically distinct groups and accurately predicted clinical outcomes such as survival and biomarker status, achieving high performance metrics, notably an F1 score of 0.88 for survival and 0.87 for biomarker status prediction. Feature importance analysis identified KMT2C, CLASP2, and MYO1B as key proteins involved in hormone signaling, cytoskeletal remodeling, and therapy resistance in hormone receptor-positive and triple-negative breast cancer, with potential influence on breast cancer subtype behavior and progression. Furthermore, protein-protein interaction networks and correlation analyses revealed functional interdependencies among proteins that may influence the behavior and progression of breast cancer subtypes. These findings suggest that integrating protein sequence and expression data provides valuable insights into tumor biology and has significant potential to enhance personalized treatment strategies in breast cancer care.
♻ ☆ When Kernels Multiply, Clusters Unify: Fusing Embeddings with the Kronecker Product
State-of-the-art embeddings often capture distinct yet complementary discriminative features: For instance, one image embedding model may excel at distinguishing fine-grained textures, while another focuses on object-level structure. Motivated by this observation, we propose a principled approach to fuse such complementary representations through kernel multiplication. Multiplying the kernel similarity functions of two embeddings allows their discriminative structures to interact, producing a fused representation whose kernel encodes the union of the clusters identified by each parent embedding. This formulation also provides a natural way to construct joint kernels for paired multi-modal data (e.g., image-text tuples), where the product of modality-specific kernels inherits structure from both domains. We highlight that this kernel product is mathematically realized via the Kronecker product of the embedding feature maps, yielding our proposed KrossFuse framework for embedding fusion. To address the computational cost of the resulting high-dimensional Kronecker space, we further develop RP-KrossFuse, a scalable variant that leverages random projections for efficient approximation. As a key application, we use this framework to bridge the performance gap between cross-modal embeddings (e.g., CLIP, BLIP) and unimodal experts (e.g., DINOv2, E5). Experiments show that RP-KrossFuse effectively integrates these models, enhancing modality-specific performance while preserving cross-modal alignment. The project code is available at https://github.com/yokiwuuu/KrossFuse.
♻ ☆ RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models: https://huggingface.co/collections/nvidia/reward-models-10-2025
comment: Added link to access models: https://huggingface.co/collections/nvidia/reward-models-10-2025
♻ ☆ Curriculum Abductive Learning NeurIPS 2025
Abductive Learning (ABL) integrates machine learning with logical reasoning in a loop: a learning model predicts symbolic concept labels from raw inputs, which are revised through abduction using domain knowledge and then fed back for retraining. However, due to the nondeterminism of abduction, the training process often suffers from instability, especially when the knowledge base is large and complex, resulting in a prohibitively large abduction space. While prior works focus on improving candidate selection within this space, they typically treat the knowledge base as a static black box. In this work, we propose Curriculum Abductive Learning (C-ABL), a method that explicitly leverages the internal structure of the knowledge base to address the ABL training challenges. C-ABL partitions the knowledge base into a sequence of sub-bases, progressively introduced during training. This reduces the abduction space throughout training and enables the model to incorporate logic in a stepwise, smooth way. Experiments across multiple tasks show that C-ABL outperforms previous ABL implementations, significantly improves training stability, convergence speed, and final accuracy, especially under complex knowledge setting.
comment: Accepted by NeurIPS 2025, 22 pages, 6 figures
♻ ☆ Guided Model Merging for Hybrid Data Learning: Leveraging Centralized Data to Refine Decentralized Models WACV 2026
Current network training paradigms primarily focus on either centralized or decentralized data regimes. However, in practice, data availability often exhibits a hybrid nature, where both regimes coexist. This hybrid setting presents new opportunities for model training, as the two regimes offer complementary trade-offs: decentralized data is abundant but subject to heterogeneity and communication constraints, while centralized data, though limited in volume and potentially unrepresentative, enables better curation and high-throughput access. Despite its potential, effectively combining these paradigms remains challenging, and few frameworks are tailored to hybrid data regimes. To address this, we propose a novel framework that constructs a model atlas from decentralized models and leverages centralized data to refine a global model within this structured space. The refined model is then used to reinitialize the decentralized models. Our method synergizes federated learning (to exploit decentralized data) and model merging (to utilize centralized data), enabling effective training under hybrid data availability. Theoretically, we show that our approach achieves faster convergence than methods relying solely on decentralized data, due to variance reduction in the merging process. Extensive experiments demonstrate that our framework consistently outperforms purely centralized, purely decentralized, and existing hybrid-adaptable methods. Notably, our method remains robust even when the centralized and decentralized data domains differ or when decentralized data contains noise, significantly broadening its applicability.
comment: Accepted at WACV 2026
♻ ☆ GSE: Group-wise Sparse and Explainable Adversarial Attacks
Sparse adversarial attacks fool deep neural networks (DNNs) through minimal pixel perturbations, often regularized by the $\ell_0$ norm. Recent efforts have replaced this norm with a structural sparsity regularizer, such as the nuclear group norm, to craft group-wise sparse adversarial attacks. The resulting perturbations are thus explainable and hold significant practical relevance, shedding light on an even greater vulnerability of DNNs. However, crafting such attacks poses an optimization challenge, as it involves computing norms for groups of pixels within a non-convex objective. We address this by presenting a two-phase algorithm that generates group-wise sparse attacks within semantically meaningful areas of an image. Initially, we optimize a quasinorm adversarial loss using the $1/2-$quasinorm proximal operator tailored for non-convex programming. Subsequently, the algorithm transitions to a projected Nesterov's accelerated gradient descent with $2-$norm regularization applied to perturbation magnitudes. Rigorous evaluations on CIFAR-10 and ImageNet datasets demonstrate a remarkable increase in group-wise sparsity, e.g., $50.9\%$ on CIFAR-10 and $38.4\%$ on ImageNet (average case, targeted attack). This performance improvement is accompanied by significantly faster computation times, improved explainability, and a $100\%$ attack success rate.
♻ ☆ Understanding Generalization in Node and Link Prediction
Using message-passing graph neural networks (MPNNs) for node and link prediction is crucial in various scientific and industrial domains, which has led to the development of diverse MPNN architectures. Besides working well in practical settings, their ability to generalize beyond the training set remains poorly understood. While some studies have explored MPNNs' generalization in graph-level prediction tasks, much less attention has been given to node- and link-level predictions. Existing works often rely on unrealistic i.i.d.\@ assumptions, overlooking possible correlations between nodes or links, and assuming fixed aggregation and impractical loss functions while neglecting the influence of graph structure. In this work, we introduce a unified framework to analyze the generalization properties of MPNNs in inductive and transductive node and link prediction settings, incorporating diverse architectural parameters and loss functions and quantifying the influence of graph structure. Additionally, our proposed generalization framework can be applied beyond graphs to any classification task under the inductive or transductive setting. Our empirical study supports our theoretical insights, deepening our understanding of MPNNs' generalization capabilities in these tasks.
comment: arXiv admin note: text overlap with arXiv:2412.07106
♻ ☆ RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching
Activation patching is a standard method in mechanistic interpretability for localizing the components of a model responsible for specific behaviors, but it is computationally expensive to apply at scale. Attribution patching offers a faster, gradient-based approximation, yet suffers from noise and reduced reliability in deep, highly non-linear networks. In this work, we introduce Relevance Patching (RelP), which replaces the local gradients in attribution patching with propagation coefficients derived from Layer-wise Relevance Propagation (LRP). LRP propagates the network's output backward through the layers, redistributing relevance to lower-level components according to local propagation rules that ensure properties such as relevance conservation or improved signal-to-noise ratio. Like attribution patching, RelP requires only two forward passes and one backward pass, maintaining computational efficiency while improving faithfulness. We validate RelP across a range of models and tasks, showing that it more accurately approximates activation patching than standard attribution patching, particularly when analyzing residual stream and MLP outputs in the Indirect Object Identification (IOI) task. For instance, for MLP outputs in GPT-2 Large, attribution patching achieves a Pearson correlation of 0.006, whereas RelP reaches 0.956, highlighting the improvement offered by RelP. Additionally, we compare the faithfulness of sparse feature circuits identified by RelP and Integrated Gradients (IG), showing that RelP achieves comparable faithfulness without the extra computational cost associated with IG.
♻ ☆ PepCompass: Navigating peptide embedding spaces using Riemannian Geometry
Antimicrobial peptide discovery is challenged by the astronomical size of peptide space and the relative scarcity of active peptides. Generative models provide continuous latent "maps" of peptide space, but conventionally ignore decoder-induced geometry and rely on flat Euclidean metrics, rendering exploration and optimization distorted and inefficient. Prior manifold-based remedies assume fixed intrinsic dimensionality, which critically fails in practice for peptide data. Here, we introduce PepCompass, a geometry-aware framework for peptide exploration and optimization. At its core, we define a Union of $\kappa$-Stable Riemannian Manifolds $\mathbb{M}^{\kappa}$, a family of decoder-induced manifolds that captures local geometry while ensuring computational stability. We propose two local exploration methods: Second-Order Riemannian Brownian Efficient Sampling, which provides a convergent second-order approximation to Riemannian Brownian motion, and Mutation Enumeration in Tangent Space, which reinterprets tangent directions as discrete amino-acid substitutions. Combining these yields Local Enumeration Bayesian Optimization (LE-BO), an efficient algorithm for local activity optimization. Finally, we introduce Potential-minimizing Geodesic Search (PoGS), which interpolates between prototype embeddings along property-enriched geodesics, biasing discovery toward seeds, i.e. peptides with favorable activity. In-vitro validation confirms the effectiveness of PepCompass: PoGS yields four novel seeds, and subsequent optimization with LE-BO discovers 25 highly active peptides with broad-spectrum activity, including against resistant bacterial strains. These results demonstrate that geometry-informed exploration provides a powerful new paradigm for antimicrobial peptide design.
♻ ☆ C-LoRA: Contextual Low-Rank Adaptation for Uncertainty Estimation in Large Language Models NeurIPS 2025
Low-Rank Adaptation (LoRA) offers a cost-effective solution for fine-tuning large language models (LLMs), but it often produces overconfident predictions in data-scarce few-shot settings. To address this issue, several classical statistical learning approaches have been repurposed for scalable uncertainty-aware LoRA fine-tuning. However, these approaches neglect how input characteristics affect the predictive uncertainty estimates. To address this limitation, we propose Contextual Low-Rank Adaptation (C-LoRA) as a novel uncertainty-aware and parameter efficient fine-tuning approach, by developing new lightweight LoRA modules contextualized to each input data sample to dynamically adapt uncertainty estimates. Incorporating data-driven contexts into the parameter posteriors, C-LoRA mitigates overfitting, achieves well-calibrated uncertainties, and yields robust predictions. Extensive experiments on LLaMA2-7B models demonstrate that C-LoRA consistently outperforms the state-of-the-art uncertainty-aware LoRA methods in both uncertainty quantification and model generalization. Ablation studies further confirm the critical role of our contextual modules in capturing sample-specific uncertainties. C-LoRA sets a new standard for robust, uncertainty-aware LLM fine-tuning in few-shot regimes. Although our experiments are limited to 7B models, our method is architecture-agnostic and, in principle, applies beyond this scale; studying its scaling to larger models remains an open problem. Our code is available at https://github.com/ahra99/c_lora.
comment: Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.
comment: 20 pages, 8 figures
♻ ☆ LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback
Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.
♻ ☆ SignalLLM: A General-Purpose LLM Agent Framework for Automated Signal Processing
Modern signal processing (SP) pipelines, whether model-based or data-driven, often constrained by complex and fragmented workflow, rely heavily on expert knowledge and manual engineering, and struggle with adaptability and generalization under limited data. In contrast, Large Language Models (LLMs) offer strong reasoning capabilities, broad general-purpose knowledge, in-context learning, and cross-modal transfer abilities, positioning them as powerful tools for automating and generalizing SP workflows. Motivated by these potentials, we introduce SignalLLM, the first general-purpose LLM-based agent framework for general SP tasks. Unlike prior LLM-based SP approaches that are limited to narrow applications or tricky prompting, SignalLLM introduces a principled, modular architecture. It decomposes high-level SP goals into structured subtasks via in-context learning and domain-specific retrieval, followed by hierarchical planning through adaptive retrieval-augmented generation (RAG) and refinement; these subtasks are then executed through prompt-based reasoning, cross-modal reasoning, code synthesis, model invocation, or data-driven LLM-assisted modeling. Its generalizable design enables the flexible selection of problem solving strategies across different signal modalities, task types, and data conditions. We demonstrate the versatility and effectiveness of SignalLLM through five representative tasks in communication and sensing, such as radar target detection, human activity recognition, and text compression. Experimental results show superior performance over traditional and existing LLM-based methods, particularly in few-shot and zero-shot settings.
comment: 11 pages
♻ ☆ Differentiation Through Black-Box Quadratic Programming Solvers
Differentiable optimization has attracted significant research interest, particularly for quadratic programming (QP). Existing approaches for differentiating the solution of a QP with respect to its defining parameters often rely on specific integrated solvers. This integration limits their applicability, including their use in neural network architectures and bi-level optimization tasks, restricting users to a narrow selection of solver choices. To address this limitation, we introduce dQP, a modular and solver-agnostic framework for plug-and-play differentiation of virtually any QP solver. A key insight we leverage to achieve modularity is that, once the active set of inequality constraints is known, both the solution and its derivative can be expressed using simplified linear systems that share the same matrix. This formulation fully decouples the computation of the QP solution from its differentiation. Building on this result, we provide a minimal-overhead, open-source implementation ( https://github.com/cwmagoon/dQP ) that seamlessly integrates with over 15 state-of-the-art solvers. Comprehensive benchmark experiments demonstrate dQP's robustness and scalability, particularly highlighting its advantages in large-scale sparse problems.
♻ ☆ Smart Exploration in Reinforcement Learning using Bounded Uncertainty Models
Reinforcement learning (RL) is a powerful framework for decision-making in uncertain environments, but it often requires large amounts of data to learn an optimal policy. We address this challenge by incorporating prior model knowledge to guide exploration and accelerate the learning process. Specifically, we assume access to a model set that contains the true transition kernel and reward function. We optimize over this model set to obtain upper and lower bounds on the Q-function, which are then used to guide the exploration of the agent. We provide theoretical guarantees on the convergence of the Q-function to the optimal Q-function under the proposed class of exploring policies. Furthermore, we also introduce a data-driven regularized version of the model set optimization problem that ensures the convergence of the class of exploring policies to the optimal policy. Lastly, we show that when the model set has a specific structure, namely the bounded-parameter MDP (BMDP) framework, the regularized model set optimization problem becomes convex and simple to implement. In this setting, we also prove finite-time convergence to the optimal policy under mild assumptions. We demonstrate the effectiveness of the proposed exploration strategy, which we call BUMEX (Bounded Uncertainty Model-based Exploration), in a simulation study. The results indicate that the proposed method can significantly accelerate learning in benchmark examples. A toolbox is available at https://github.com/JvHulst/BUMEX.
comment: Accepted for Presentation at 64th IEEE Conference on Decision and Control, CDC 2025, Rio de Janeiro, Brazil, 2025
♻ ☆ Epistemic Diversity and Knowledge Collapse in Large Language Models
Large language models (LLMs) tend to generate lexically, semantically, and stylistically homogenous texts. This poses a risk of knowledge collapse, where homogenous LLMs mediate a shrinking in the range of accessible information over time. Existing works on homogenization are limited by a focus on closed-ended multiple-choice setups or fuzzy semantic features, and do not look at trends across time and cultural contexts. To overcome this, we present a new methodology to measure epistemic diversity, i.e., variation in real-world claims in LLM outputs, which we use to perform a broad empirical study of LLM knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations sourced from real user chats. For the topics in our study, we show that while newer models tend to generate more diverse claims, nearly all models are less epistemically diverse than a basic web search. We find that model size has a negative impact on epistemic diversity, while retrieval-augmented generation (RAG) has a positive impact, though the improvement from RAG varies by the cultural context. Finally, compared to a traditional knowledge source (Wikipedia), we find that country-specific claims reflect the English language more than the local one, highlighting a gap in epistemic representation
comment: 16 pages; 8 figures, 4 tables; v2 changelog: Fixed the modeling for table 3, random effect is the model version; v3 changelog: Fixed minor formatting issues in tables 2 and 3; v4 changelog: Fixed some typos and model description
♻ ☆ Incentivizing LLMs to Self-Verify Their Answers
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks through both post-training and test-time scaling laws. While prevalent test-time scaling approaches are often realized by using external reward models to guide the model generation process, we find that only marginal gains can be acquired when scaling a model post-trained on specific reasoning tasks. We identify that the limited improvement stems from distribution discrepancies between the specific post-trained generator and the general reward model. To address this, we propose a framework that incentivizes LLMs to self-verify their own answers. By unifying answer generation and verification within a single reinforcement learning (RL) process, we train models that can effectively assess the correctness of their own solutions. The trained model can further scale its performance at inference time by verifying its generations, without the need for external verifiers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B, demonstrating their capabilities across varying reasoning context lengths. Experiments on multiple mathematical reasoning benchmarks show that our models can not only improve post-training performance but also enable effective test-time scaling.
♻ ☆ Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach
While several high profile video games have served as testbeds for Deep Reinforcement Learning (DRL), this technique has rarely been employed by the game industry for crafting authentic AI behaviors. Previous research focuses on training super-human agents with large models, which is impractical for game studios with limited resources aiming for human-like agents. This paper proposes a sample-efficient DRL method tailored for training and fine-tuning agents in industrial settings such as the video game industry. Our method improves sample efficiency of value-based DRL by leveraging pre-collected data and increasing network plasticity. We evaluate our method training a goalkeeper agent in EA SPORTS FC 25, one of the best-selling football simulations today. Our agent outperforms the game's built-in AI by 10% in ball saving rate. Ablation studies show that our method trains agents 50% faster compared to standard DRL methods. Finally, qualitative evaluation from domain experts indicates that our approach creates more human-like gameplay compared to hand-crafted agents. As a testament to the impact of the approach, the method has been adopted for use in the most recent release of the series.
♻ ☆ A mathematical certification for positivity conditions in Neural Networks with applications to partial monotonicity and Trustworthy AI
Artificial Neural Networks (ANNs) have become a powerful tool for modeling complex relationships in large-scale datasets. However, their black-box nature poses trustworthiness challenges. In certain situations, ensuring trust in predictions might require following specific partial monotonicity constraints. However, certifying if an already-trained ANN is partially monotonic is challenging. Therefore, ANNs are often disregarded in some critical applications, such as credit scoring, where partial monotonicity is required. To address this challenge, this paper presents a novel algorithm (LipVor) that certifies if a black-box model, such as an ANN, is positive based on a finite number of evaluations. Consequently, since partial monotonicity can be expressed as a positivity condition on partial derivatives, LipVor can certify whether an ANN is partially monotonic. To do so, for every positively evaluated point, the Lipschitzianity of the black-box model is used to construct a specific neighborhood where the function remains positive. Next, based on the Voronoi diagram of the evaluated points, a sufficient condition is stated to certify if the function is positive in the domain. Unlike prior methods, our approach certifies partial monotonicity without constrained architectures or piece-wise linear activations. Therefore, LipVor could open up the possibility of using unconstrained ANN in some critical fields. Moreover, some other properties of an ANN, such as convexity, can be posed as positivity conditions, and therefore, LipVor could also be applied.
comment: 16 pages, 4 figures
♻ ☆ GFlowNets for Learning Better Drug-Drug Interaction Representations ICANN 2025
Drug-drug interactions pose a significant challenge in clinical pharmacology, with severe class imbalance among interaction types limiting the effectiveness of predictive models. Common interactions dominate datasets, while rare but critical interactions remain underrepresented, leading to poor model performance on infrequent cases. Existing methods often treat DDI prediction as a binary problem, ignoring class-specific nuances and exacerbating bias toward frequent interactions. To address this, we propose a framework combining Generative Flow Networks (GFlowNet) with Variational Graph Autoencoders (VGAE) to generate synthetic samples for rare classes, improving model balance and generate effective and novel DDI pairs. Our approach enhances predictive performance across interaction types, ensuring better clinical reliability.
comment: Accepted to ICANN 2025:AIDD and NeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling (https://openreview.net/forum?id=LZW1jSgfCI)
♻ ☆ Exploring the Early Universe with Deep Learning
Hydrogen is the most abundant element in our Universe. The first generation of stars and galaxies produced photons that ionized hydrogen gas, driving a cosmological event known as the Epoch of Reionization (EoR). The upcoming Square Kilometre Array Observatory (SKAO) will map the distribution of neutral hydrogen during this era, aiding in the study of the properties of these first-generation objects. Extracting astrophysical information will be challenging, as SKAO will produce a tremendous amount of data where the hydrogen signal will be contaminated with undesired foreground contamination and instrumental systematics. To address this, we develop the latest deep learning techniques to extract information from the 2D power spectra of the hydrogen signal expected from SKAO. We apply a series of neural network models to these measurements and quantify their ability to predict the history of cosmic hydrogen reionization, which is connected to the increasing number and efficiency of early photon sources. We show that the study of the early Universe benefits from modern deep learning technology. In particular, we demonstrate that dedicated machine learning algorithms can achieve more than a $0.95$ $R^2$ score on average in recovering the reionization history. This enables accurate and precise cosmological and astrophysical inference of structure formation in the early Universe.
comment: EPIA 2025 preprint version, 12 pages, 3 figures
♻ ☆ Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling NeurIPS 2025
Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS, simultaneously influencing (1) reasoning performance and (2) compute efficiency, due to the quality and computational cost of verification. In this work, we challenge the conventional paradigms of verification, and make the first attempt toward systematically investigating the impact of verification granularity-that is, how frequently the verifier is invoked during generation, beyond verifying only the final output or individual generation steps. To this end, we introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g. Extensive experiments with VG-Search under varying compute budgets, generator-verifier configurations, and task attributes reveal that dynamically selecting g can improve the compute efficiency and scaling behavior. Building on these findings, we propose adaptive VG-Search strategies that achieve accuracy gains of up to 3.1\% over Beam Search and 3.6\% over Best-of-N, while reducing FLOPs by over 52\%. We will open-source the code to support future research.
comment: Accepted at NeurIPS 2025
♻ ☆ A geometric framework for momentum-based optimizers for low-rank training
Low-rank pre-training and fine-tuning have recently emerged as promising techniques for reducing the computational and storage costs of large neural networks. Training low-rank parameterizations typically relies on conventional optimizers such as heavy ball momentum methods or Adam. In this work, we identify and analyze potential difficulties that these training methods encounter when used to train low-rank parameterizations of weights. In particular, we show that classical momentum methods can struggle to converge to a local optimum due to the geometry of the underlying optimization landscape. To address this, we introduce novel training strategies derived from dynamical low-rank approximation, which explicitly account for the underlying geometric structure. Our approach leverages and combines tools from dynamical low-rank approximation and momentum-based optimization to design optimizers that respect the intrinsic geometry of the parameter space. We validate our methods through numerical experiments, demonstrating faster convergence, and stronger validation metrics at given parameter budgets.
♻ ☆ LLM-Driven Treatment Effect Estimation Under Inference Time Text Confounding
Estimating treatment effects is crucial for personalized decision-making in medicine, but this task faces unique challenges in clinical practice. At training time, models for estimating treatment effects are typically trained on well-structured medical datasets that contain detailed patient information. However, at inference time, predictions are often made using textual descriptions (e.g., descriptions with self-reported symptoms), which are incomplete representations of the original patient information. In this work, we make three contributions. (1) We show that the discrepancy between the data available during training time and inference time can lead to biased estimates of treatment effects. We formalize this issue as an inference time text confounding problem, where confounders are fully observed during training time but only partially available through text at inference time. (2) To address this problem, we propose a novel framework for estimating treatment effects that explicitly accounts for inference time text confounding. Our framework leverages large language models together with a custom doubly robust learner to mitigate biases caused by the inference time text confounding. (3) Through a series of experiments, we demonstrate the effectiveness of our framework in real-world applications.
♻ ☆ MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks NeurIPS 2025
The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task-specific, evidence-based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi-agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open-sourced at https://medagentboard.netlify.app/.
comment: Accepted by NeurIPS 2025 Datasets & Benchmarks Track
♻ ☆ Distributed optimization: designed for federated learning
Federated learning (FL), as a distributed collaborative machine learning (ML) framework under privacy-preserving constraints, has garnered increasing research attention in cross-organizational data collaboration scenarios. This paper proposes a class of distributed optimization algorithms based on the augmented Lagrangian technique, designed to accommodate diverse communication topologies in both centralized and decentralized FL settings. Furthermore, we develop multiple termination criteria and parameter update mechanisms to enhance computational efficiency, accompanied by rigorous theoretical guarantees of convergence. By generalizing the augmented Lagrangian relaxation through the incorporation of proximal relaxation and quadratic approximation, our framework systematically recovers a broad of classical unconstrained optimization methods, including proximal algorithm, classic gradient descent, and stochastic gradient descent, among others. Notably, the convergence properties of these methods can be naturally derived within the proposed theoretical framework. Numerical experiments demonstrate that the proposed algorithm exhibits strong performance in large-scale settings with significant statistical heterogeneity across clients.
comment: 16 pages, 6 figures
♻ ☆ On the Impact of Performative Risk Minimization for Binary Random Variables ICML 2025
Performativity, the phenomenon where outcomes are influenced by predictions, is particularly prevalent in social contexts where individuals strategically respond to a deployed model. In order to preserve the high accuracy of machine learning models under distribution shifts caused by performativity, Perdomo et al. (2020) introduced the concept of performative risk minimization (PRM). While this framework ensures model accuracy, it overlooks the impact of the PRM on the underlying distributions and the predictions of the model. In this paper, we initiate the analysis of the impact of PRM, by studying performativity for a sequential performative risk minimization problem with binary random variables and linear performative shifts. We formulate two natural measures of impact. In the case of full information, where the distribution dynamics are known, we derive explicit formulas for the PRM solution and our impact measures. In the case of partial information, we provide performative-aware statistical estimators, as well as simulations. Our analysis contrasts PRM to alternatives that do not model data shift and indicates that PRM can have amplified side effects compared to such methods.
comment: ICML 2025 camera-ready
♻ ☆ Machine-learning competition to grade EEG background patterns in newborns with hypoxic-ischaemic encephalopathy
Machine learning (ML) has the potential to support and improve expert performance in monitoring the brain function of at-risk newborns. Developing accurate and reliable ML models depends on access to high-quality, annotated data, a resource in short supply. ML competitions address this need by providing researchers access to expertly annotated datasets, fostering shared learning through direct model comparisons, and leveraging the benefits of crowdsourcing diverse expertise. We compiled a retrospective dataset containing 353 hours of EEG from 102 individual newborns from a multi-centre study. The data was fully anonymised and divided into training, testing, and held-out validation datasets. EEGs were graded for the severity of abnormal background patterns. Next, we created a web-based competition platform and hosted a machine learning competition to develop ML models for classifying the severity of EEG background patterns in newborns. After the competition closed, the top 4 performing models were evaluated offline on a separate held-out validation dataset. Although a feature-based model ranked first on the testing dataset, deep learning models generalised better on the validation sets. All methods had a significant decline in validation performance compared to the testing performance. This highlights the challenges for model generalisation on unseen data, emphasising the need for held-out validation datasets in ML studies with neonatal EEG. The study underscores the importance of training ML models on large and diverse datasets to ensure robust generalisation. The competition's outcome demonstrates the potential for open-access data and collaborative ML development to foster a collaborative research environment and expedite the development of clinical decision-support tools for neonatal neuromonitoring.
comment: 29 pages, supplementary materials: "supplementary materials ML Comp.docx"
♻ ☆ Wisdom and Delusion of LLM Ensembles for Code Generation and Repair
Today's pursuit of a single Large Language Model (LMM) for all software engineering tasks is resource-intensive and overlooks the potential benefits of complementarity, where different models contribute unique strengths. However, the degree to which coding LLMs complement each other and the best strategy for maximizing an ensemble's potential are unclear, leaving practitioners without a clear path to move beyond single-model systems. To address this gap, we empirically compare ten individual LLMs from five families, and three ensembles of these LLMs across three software engineering benchmarks covering code generation and program repair. We assess the complementarity between models and the performance gap between the best individual model and the ensembles. Next, we evaluate various selection heuristics to identify correct solutions from an ensemble's candidate pool. We find that the theoretical upperbound for an ensemble's performance can be 83% above the best single model. Our results show that consensus-based strategies for selecting solutions fall into a "popularity trap," amplifying common but incorrect outputs. In contrast, a diversity-based strategy realizes up to 95% of this theoretical potential, and proves effective even in small two-model ensembles, enabling a cost-efficient way to enhance performance by leveraging multiple LLMs.
comment: Added Acknowledgments section and hyphenated last names
♻ ☆ Cybersecurity threat detection based on a UEBA framework using Deep Autoencoders
User and Entity Behaviour Analytics (UEBA) is a broad branch of data analytics that attempts to build a normal behavioural profile in order to detect anomalous events. Among the techniques used to detect anomalies, Deep Autoencoders constitute one of the most promising deep learning models on UEBA tasks, allowing explainable detection of security incidents that could lead to the leak of personal data, hijacking of systems, or access to sensitive business information. In this study, we introduce the first implementation of an explainable UEBA-based anomaly detection framework that leverages Deep Autoencoders in combination with Doc2Vec to process both numerical and textual features. Additionally, based on the theoretical foundations of neural networks, we offer a novel proof demonstrating the equivalence of two widely used definitions for fully-connected neural networks. The experimental results demonstrate the proposed framework capability to detect real and synthetic anomalies effectively generated from real attack data, showing that the models provide not only correct identification of anomalies but also explainable results that enable the reconstruction of the possible origin of the anomaly. Our findings suggest that the proposed UEBA framework can be seamlessly integrated into enterprise environments, complementing existing security systems for explainable threat detection.
comment: Published in AIMS Mathematics (2025), 10(10): 23496-23517. DOI: 10.3934/math.20251043
♻ ☆ Monopoly Deal: A Benchmark Environment for Bounded One-Sided Response Games
Card games are widely used to study sequential decision-making under uncertainty, with real-world analogues in negotiation, finance, and cybersecurity. These games typically fall into three categories based on the flow of control: strictly sequential (players alternate single actions), deterministic response (some actions trigger a fixed outcome), and unbounded reciprocal response (alternating counterplays are permitted). A less-explored but strategically rich structure is the bounded one-sided response, where a player's action briefly transfers control to the opponent, who must satisfy a fixed condition through one or more moves before the turn resolves. We term games featuring this mechanism Bounded One-Sided Response Games (BORGs). We introduce a modified version of Monopoly Deal as a benchmark environment that isolates this dynamic, where a Rent action forces the opponent to choose payment assets. The gold-standard algorithm, Counterfactual Regret Minimization (CFR), converges on effective strategies without novel algorithmic extensions. A lightweight full-stack research platform unifies the environment, a parallelized CFR runtime, and a human-playable web interface. The trained CFR agent and source code are available at https://monopolydeal.ai.
comment: 24 pages, 7 figures
♻ ☆ In Defence of Post-hoc Explainability NeurIPS 2024
This position paper defends post-hoc explainability methods as legitimate tools for scientific knowledge production in machine learning. Addressing criticism of these methods' reliability and epistemic status, we develop a philosophical framework grounded in mediated understanding and bounded factivity. We argue that scientific insights can emerge through structured interpretation of model behaviour without requiring complete mechanistic transparency, provided explanations acknowledge their approximative nature and undergo rigorous empirical validation. Through analysis of recent biomedical ML applications, we demonstrate how post-hoc methods, when properly integrated into scientific practice, generate novel hypotheses and advance phenomenal understanding.
comment: v1 presented at the Interpretable AI: Past, Present, and Future Workshop at NeurIPS 2024 (non-archival)
♻ ☆ On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
Open-vocabulary object detection (OVD) models offer remarkable flexibility by detecting objects from arbitrary text queries. However, their zero-shot performance in specialized domains like Remote Sensing (RS) is often compromised by the inherent ambiguity of natural language, limiting critical downstream applications. For instance, an OVD model may struggle to distinguish between fine-grained classes such as "fishing boat" and "yacht" since their embeddings are similar and often inseparable. This can hamper specific user goals, such as monitoring illegal fishing, by producing irrelevant detections. To address this, we propose a cascaded approach that couples the broad generalization of a large pre-trained OVD model with a lightweight few-shot classifier. Our method first employs the zero-shot model to generate high-recall object proposals. These proposals are then refined for high precision by a compact classifier trained in real-time on only a handful of user-annotated examples - drastically reducing the high costs of RS imagery annotation.The core of our framework is FLAME, a one-step active learning strategy that selects the most informative samples for training. FLAME identifies, on the fly, uncertain marginal candidates near the decision boundary using density estimation, followed by clustering to ensure sample diversity. This efficient sampling technique achieves high accuracy without costly full-model fine-tuning and enables instant adaptation, within less then a minute, which is significantly faster than state-of-the-art alternatives.Our method consistently surpasses state-of-the-art performance on RS benchmarks, establishing a practical and resource-efficient framework for adapting foundation models to specific user needs.
♻ ☆ Seeing Structural Failure Before it Happens: An Image-Based Physics-Informed Neural Network (PINN) for Spaghetti Bridge Load Prediction
Physics Informed Neural Networks (PINNs) are gaining attention for their ability to embed physical laws into deep learning models, which is particularly useful in structural engineering tasks with limited data. This paper aims to explore the use of PINNs to predict the weight of small scale spaghetti bridges, a task relevant to understanding load limits and potential failure modes in simplified structural models. Our proposed framework incorporates physics-based constraints to the prediction model for improved performance. In addition to standard PINNs, we introduce a novel architecture named Physics Informed Kolmogorov Arnold Network (PIKAN), which blends universal function approximation theory with physical insights. The structural parameters provided as input to the model are collected either manually or through computer vision methods. Our dataset includes 15 real bridges, augmented to 100 samples, and our best model achieves an $R^2$ score of 0.9603 and a mean absolute error (MAE) of 10.50 units. From applied perspective, we also provide a web based interface for parameter entry and prediction. These results show that PINNs can offer reliable estimates of structural weight, even with limited data, and may help inform early stage failure analysis in lightweight bridge designs. The complete data and code are available at https://github.com/OmerJauhar/PINNS-For-Spaghetti-Bridges.
comment: 12 pages, 17 figures. Preprint
♻ ☆ TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
Distributed inference of large language models (LLMs) can introduce overheads of up to 20% even over GPUs connected via high-speed interconnects such as NVLink. Multiple techniques have been proposed to mitigate these overheads by decomposing computations into finer-grained tasks and overlapping communication with sub-tasks as they complete. However, fine-grained decomposition of a large computation into many smaller computations on GPUs results in overheads. Furthermore, the communication itself uses many streaming multiprocessors (SMs), adding to the overhead. We present TokenWeave to address these challenges. TokenWeave proposes a Token-Splitting technique that divides the tokens in the inference batch into two approximately equal subsets in a wave-aware manner. The communication of one subset is then overlapped with the computation of the other. In addition, TokenWeave optimizes the order of the layer normalization computation with respect to communication operations and implements a novel fused AllReduce--RMSNorm kernel that carefully leverages Multimem instruction support available on Hopper and Blackwell NVIDIA GPUs. These optimizations allow TokenWeave to perform communication and RMSNorm using only 2-8 SMs. Moreover, our kernel enables the memory-bound RMSNorm to be overlapped with the other batch's computation, providing additional gains. Our evaluations demonstrate up to 1.29x speedup in latency and 1.26x higher throughput across multiple models and workloads. In several settings, TokenWeave results in better performance compared to an equivalent model with all communication removed.
comment: 14 pages, 16 figures. For source code, see https://github.com/microsoft/tokenweave. In version 2, Figure 6 shows All-Reduce bandwidth instead of Reduce-Scatter. The Multimem Reduce-Scatter bandwidth formula differs slightly from the ring-based version. Fixed x-ticks in Figure 7
♻ ☆ Nearly Minimax Discrete Distribution Estimation in Kullback-Leibler Divergence with High Probability
We consider the fundamental problem of estimating a discrete distribution on a domain of size~$K$ with high probability in Kullback-Leibler divergence. We provide upper and lower bounds on the minimax estimation rate, which show that the optimal rate is between $\big(K + \ln(K)\ln(1/\delta)\big) /n$ and $\big(K\ln\ln(K) + \ln(K)\ln(1/\delta)\big) /n$ at error probability $\delta$ and sample size $n$, which pins down the rate up to the doubly logarithmic factor $\ln \ln K$ that multiplies $K$. Our upper bound uses techniques from online learning to construct a novel estimator via online-to-batch conversion. Perhaps surprisingly, the tail behavior of the minimax rate is worse than for the squared total variation and squared Hellinger distance, for which it is $\big(K + \ln(1/\delta)\big) /n$, i.e.\ without the $\ln K$ multiplying $\ln (1/\delta)$. As a consequence, we cannot obtain a fully tight lower bound from the usual reduction to these smaller distances. Moreover, we show that this lower bound cannot be achieved by the standard lower bound approach based on a reduction to hypothesis testing, and instead we need to introduce a new reduction to what we call weak hypothesis testing. We investigate the source of the gap with other divergences further in refined results, which show that the total variation rate is achievable for Kullback-Leibler divergence after all (in fact by he maximum likelihood estimator) if we rule out outcome probabilities smaller than $O(\ln(K/\delta) / n)$, which is a vanishing set as $n$ increases for fixed $K$ and~$\delta$. This explains why minimax Kullback-Leibler estimation is more difficult than asymptotic estimation.
♻ ☆ Predictive Causal Inference via Spatio-Temporal Modeling and Penalized Empirical Likelihood
This study introduces an integrated framework for predictive causal inference designed to overcome limitations inherent in conventional single model approaches. Specifically, we combine a Hidden Markov Model (HMM) for spatial health state estimation with a Multi Task and Multi Graph Convolutional Network (MTGCN) for capturing temporal outcome trajectories. The framework asymmetrically treats temporal and spatial information regarding them as endogenous variables in the outcome regression, and exogenous variables in the propensity score model, thereby expanding the standard doubly robust treatment effect estimation to jointly enhance bias correction and predictive accuracy. To demonstrate its utility, we focus on clinical domains such as cancer, dementia, and Parkinson disease, where treatment effects are challenging to observe directly. Simulation studies are conducted to emulate latent disease dynamics and evaluate the model performance under varying conditions. Overall, the proposed framework advances predictive causal inference by structurally adapting to spatiotemporal complexities common in biomedical data.
♻ ☆ Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts
In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity. For example, only about 600 transcripts are available in AphasiaBank yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when such are sparse. Therefore, this study constructs and validates two methods to generate synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method leverages a procedural programming approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The methods generate transcripts across four severity levels (Mild, Moderate, Severe, Very Severe) through word dropping, filler insertion, and paraphasia substitution. Overall, we found, compared to human-elicited transcripts, Mistral 7b Instruct best captures key aspects of linguistic degradation observed in aphasia, showing realistic directional changes in NDW, word count, and word length amongst the synthetic generation methods. Based on the results, future work should plan to create a larger dataset, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of the synthetic transcripts.
comment: 19 pages, 1 figure, 7 tables
♻ ☆ SafEDMD: A Koopman-based data-driven controller design framework for nonlinear dynamical systems
The Koopman operator serves as the theoretical backbone for machine learning of dynamical control systems, where the operator is heuristically approximated by extended dynamic mode decomposition (EDMD). In this paper, we propose SafEDMD, a novel stability- and feedback-oriented EDMD-based controller design framework. Our approach leverages a reliable surrogate model generated in a data-driven fashion in order to provide closed-loop guarantees. In particular, we establish a controller design based on semi-definite programming with guaranteed stabilization of the underlying nonlinear system. As central ingredient, we derive proportional error bounds that vanish at the origin and are tailored to control tasks. We illustrate the developed method by means of several benchmark examples and highlight the advantages over state-of-the-art methods.
comment: Accepted for publication in Automatica
♻ ☆ Mind the Gap: Removing the Discretization Gap in Differentiable Logic Gate Networks NeurIPS 2025
Modern neural networks demonstrate state-of-the-art performance on numerous existing benchmarks; however, their high computational requirements and energy consumption prompt researchers to seek more efficient solutions for real-world deployment. Logic gate networks (LGNs) learns a large network of logic gates for efficient image classification. However, learning a network that can solve a simple problem like CIFAR-10 can take days to weeks to train. Even then, almost half of the network remains unused, causing a discretization gap. This discretization gap hinders real-world deployment of LGNs, as the performance drop between training and inference negatively impacts accuracy. We inject Gumbel noise with a straight-through estimator during training to significantly speed up training, improve neuron utilization, and decrease the discretization gap. We theoretically show that this results from implicit Hessian regularization, which improves the convergence properties of LGNs. We train networks $4.5 \times$ faster in wall-clock time, reduce the discretization gap by $98\%$, and reduce the number of unused gates by $100\%$.
comment: Accepted to NeurIPS 2025 (main track)
♻ ☆ Elementary, My Dear Watson: Non-Invasive Neural Keyword Spotting in the LibriBrain Dataset
Non-invasive brain-computer interfaces (BCIs) are beginning to benefit from large, public benchmarks. However, current benchmarks target relatively simple, foundational tasks like Speech Detection and Phoneme Classification, while application-ready results on tasks like Brain-to-Text remain elusive. We propose Keyword Spotting (KWS) as a practically applicable, privacy-aware intermediate task. Using the deep 52-hour, within-subject LibriBrain corpus, we provide standardized train/validation/test splits for reproducible benchmarking, and adopt an evaluation protocol tailored to extreme class imbalance. Concretely, we use area under the precision-recall curve (AUPRC) as a robust evaluation metric, complemented by false alarms per hour (FA/h) at fixed recall to capture user-facing trade-offs. To simplify deployment and further experimentation within the research community, we are releasing an updated version of the pnpl library with word-level dataloaders and Colab-ready tutorials. As an initial reference model, we present a compact 1-D Conv/ResNet baseline with focal loss and top-k pooling that is trainable on a single consumer-class GPU. The reference model achieves approximately 13x the permutation baseline AUPRC on held-out sessions, demonstrating the viability of the task. Exploratory analyses reveal: (i) predictable within-subject scaling - performance improves log-linearly with more training hours - and (ii) the existence of word-level factors (frequency and duration) that systematically modulate detectability.
comment: 16 pages, 7 figures, 6 tables; updated acknowledgments
♻ ☆ AIMeter: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads
The rapid advancement of AI, particularly large language models (LLMs), has raised significant concerns about the energy use and carbon emissions associated with model training and inference. However, existing tools for measuring and reporting such impacts are often fragmented, lacking systematic metric integration and offering limited support for correlation analysis among them. This paper presents AIMeter, a comprehensive software toolkit for the measurement, analysis, and visualization of energy use, power draw, hardware performance, and carbon emissions across AI workloads. By seamlessly integrating with existing AI frameworks, AIMeter offers standardized reports and exports fine-grained time-series data to support benchmarking and reproducibility in a lightweight manner. It further enables in-depth correlation analysis between hardware metrics and model performance and thus facilitates bottleneck identification and performance enhancement. By addressing critical limitations in existing tools, AIMeter encourages the research community to weigh environmental impact alongside raw performance of AI workloads and advances the shift toward more sustainable "Green AI" practices. The code is available at https://github.com/SusCom-Lab/AIMeter.
comment: 11 pages, 7 figures and 5 tables
♻ ☆ Stability and Sharper Risk Bounds with Convergence Rate $\tilde{O}(1/n^2)$
Prior work (Klochkov $\&$ Zhivotovskiy, 2021) establishes at most $O\left(\log (n)/n\right)$ excess risk bounds via algorithmic stability for strongly-convex learners with high probability. We show that under the similar common assumptions -- - Polyak-Lojasiewicz condition, smoothness, and Lipschitz continous for losses -- - rates of $O\left(\log^2(n)/n^2\right)$ are at most achievable. To our knowledge, our analysis also provides the tightest high-probability bounds for gradient-based generalization gaps in nonconvex settings.
♻ ☆ Hysteresis Activation Function for Efficient Inference NeurIPS
The widely used ReLU is favored for its hardware efficiency, {as the implementation at inference is a one bit sign case,} yet suffers from issues such as the ``dying ReLU'' problem, where during training, neurons fail to activate and constantly remain at zero, as highlighted by Lu et al. Traditional approaches to mitigate this issue often introduce more complex and less hardware-friendly activation functions. In this work, we propose a Hysteresis Rectified Linear Unit (HeLU), an efficient activation function designed to address the ``dying ReLU'' problem with minimal complexity. Unlike traditional activation functions with fixed thresholds for training and inference, HeLU employs a variable threshold that refines the backpropagation. This refined mechanism allows simpler activation functions to achieve competitive performance comparable to their more complex counterparts without introducing unnecessary complexity or requiring inductive biases. Empirical evaluations demonstrate that HeLU enhances model generalization across diverse datasets, offering a promising solution for efficient and effective inference suitable for a wide range of neural network architectures.
comment: Accepted to 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024)
♻ ☆ GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution-its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage-while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.
comment: Project Page: https://jingw193.github.io/GRPO-Guard/
♻ ☆ Oryx: a Scalable Sequence Model for Many-Agent Coordination in Offline MARL NeurIPS 2025
A key challenge in offline multi-agent reinforcement learning (MARL) is achieving effective many-agent multi-step coordination in complex environments. In this work, we propose Oryx, a novel algorithm for offline cooperative MARL to directly address this challenge. Oryx adapts the recently proposed retention-based architecture Sable and combines it with a sequential form of implicit constraint Q-learning (ICQ), to develop a novel offline autoregressive policy update scheme. This allows Oryx to solve complex coordination challenges while maintaining temporal coherence over long trajectories. We evaluate Oryx across a diverse set of benchmarks from prior works -- SMAC, RWARE, and Multi-Agent MuJoCo -- covering tasks of both discrete and continuous control, varying in scale and difficulty. Oryx achieves state-of-the-art performance on more than 80% of the 65 tested datasets, outperforming prior offline MARL methods and demonstrating robust generalisation across domains with many agents and long horizons. Finally, we introduce new datasets to push the limits of many-agent coordination in offline MARL, and demonstrate Oryx's superior ability to scale effectively in such settings.
comment: Published at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Neurosymbolic Diffusion Models NeurIPS 2025
Neurosymbolic (NeSy) predictors combine neural perception with symbolic reasoning to solve tasks like visual reasoning. However, standard NeSy predictors assume conditional independence between the symbols they extract, thus limiting their ability to model interactions and uncertainty - often leading to overconfident predictions and poor out-of-distribution generalisation. To overcome the limitations of the independence assumption, we introduce neurosymbolic diffusion models (NeSyDMs), a new class of NeSy predictors that use discrete diffusion to model dependencies between symbols. Our approach reuses the independence assumption from NeSy predictors at each step of the diffusion process, enabling scalable learning while capturing symbol dependencies and uncertainty quantification. Across both synthetic and real-world benchmarks - including high-dimensional visual path planning and rule-based autonomous driving - NeSyDMs achieve state-of-the-art accuracy among NeSy predictors and demonstrate strong calibration.
comment: Accepted to NeurIPS 2025
♻ ☆ AnomalyMatch: Discovering Rare Objects of Interest with Semi-supervised and Active Learning
Anomaly detection in large datasets is essential in astronomy and computer vision. However, due to a scarcity of labelled data, it is often infeasible to apply supervised methods to anomaly detection. We present AnomalyMatch, an anomaly detection framework combining the semi-supervised FixMatch algorithm using EfficientNet classifiers with active learning. AnomalyMatch is tailored for large-scale applications and integrated into the ESA Datalabs science platform. In this method, we treat anomaly detection as a binary classification problem and efficiently utilise limited labelled and abundant unlabelled images for training. We enable active learning via a user interface for verification of high-confidence anomalies and correction of false positives. Evaluations on the GalaxyMNIST astronomical dataset and the miniImageNet natural-image benchmark under severe class imbalance display strong performance. Starting from five to ten labelled anomalies, we achieve an average AUROC of 0.96 (miniImageNet) and 0.89 (GalaxyMNIST), with respective AUPRC of 0.82 and 0.77. After three active learning cycles, anomalies are ranked with 76% (miniImageNet) to 94% (GalaxyMNIST) precision in the top 1% of the highest-ranking images by score. We compare to the established Astronomaly software on selected 'odd' galaxies from the 'Galaxy Zoo - The Galaxy Challenge' dataset, achieving comparable performance with an average AUROC of 0.83. Our results underscore the exceptional utility and scalability of this approach for anomaly discovery, highlighting the value of specialised approaches for domains characterised by severe label scarcity.
comment: Journal submission in preparation to RASTI; 15 pages; 12 figures
♻ ☆ Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data NeurIPS'25
Fine-tuning large language models (LLMs) using diverse datasets is crucial for enhancing their overall performance across various domains. In practical scenarios, existing methods based on modeling the mixture proportions of data composition often struggle with data whose domain labels are missing, imprecise or non-normalized, while methods based on data selection usually encounter difficulties in balancing multi-domain performance. To address these challenges, in this work, we investigate the role of data diversity in enhancing the overall abilities of LLMs by empirically constructing contrastive data pools and theoretically deriving explanations. Building upon the insights gained, we propose a new method that gives the LLM a dual identity: an output model to cognitively probe and select data based on diversity reward, as well as an input model to be tuned with the selected data. Extensive experiments show that the proposed method notably boosts performance across domain-undetermined data and a series of foundational downstream tasks when applied to various advanced LLMs. We release our code and hope this study can shed light on the understanding of data diversity and advance feedback-driven data-model co-design for LLMs.
comment: Accepted by NeurIPS'25 main track. 47 pages, 21 figures, 32 tables
♻ ☆ Truncated Kernel Stochastic Gradient Descent with General Losses and Spherical Radial Basis Functions
In this paper, we propose a novel kernel stochastic gradient descent (SGD) algorithm for large-scale supervised learning with general losses. Compared to traditional kernel SGD, our algorithm improves efficiency and scalability through an innovative regularization strategy. By leveraging the infinite series expansion of spherical radial basis functions, this strategy projects the stochastic gradient onto a finite-dimensional hypothesis space, which is adaptively scaled according to the bias-variance trade-off, thereby enhancing generalization performance. Based on a new estimation of the spectral structure of the kernel-induced covariance operator, we develop an analytical framework that unifies optimization and generalization analyses. We prove that both the last iterate and the suffix average converge at minimax-optimal rates, and we further establish optimal strong convergence in the reproducing kernel Hilbert space. Our framework accommodates a broad class of classical loss functions, including least-squares, Huber, and logistic losses. Moreover, the proposed algorithm significantly reduces computational complexity and achieves optimal storage complexity by incorporating coordinate-wise updates from linear SGD, thereby avoiding the costly pairwise operations typical of kernel SGD and enabling efficient processing of streaming data. Finally, extensive numerical experiments demonstrate the efficiency of our approach.
comment: 54 pages, 20 figures
♻ ☆ Experiments with Optimal Model Trees
Model trees provide an appealing way to perform interpretable machine learning for both classification and regression problems. In contrast to ``classic'' decision trees with constant values in their leaves, model trees can use linear combinations of predictor variables in their leaf nodes to form predictions, which can help achieve higher accuracy and smaller trees. Typical algorithms for learning model trees from training data work in a greedy fashion, growing the tree in a top-down manner by recursively splitting the data into smaller and smaller subsets. Crucially, the selected splits are only locally optimal, potentially rendering the tree overly complex and less accurate than a tree whose structure is globally optimal for the training data. In this paper, we empirically investigate the effect of constructing globally optimal model trees for classification and regression with linear support vector machines at the leaf nodes. To this end, we present mixed-integer linear programming formulations to learn optimal trees, compute such trees for a large collection of benchmark data sets, and compare their performance against greedily grown model trees in terms of interpretability and accuracy. We also compare to classic optimal and greedily grown decision trees, random forests, and support vector machines. Our results show that optimal model trees can achieve competitive accuracy with very small trees. We also investigate the effect on the accuracy of replacing axis-parallel splits with multivariate ones, foregoing interpretability while potentially obtaining greater accuracy.
♻ ☆ Decoding for Punctured Convolutional and Turbo Codes: A Deep Learning Solution for Protocols Compliance
Neural network-based decoding methods show promise in enhancing error correction performance but face challenges with punctured codes. In particular, existing methods struggle to adapt to variable code rates or meet protocol compatibility requirements. This paper proposes a unified long short-term memory (LSTM)-based neural decoder for punctured convolutional and Turbo codes to address these challenges. The key component of the proposed LSTM-based neural decoder is puncturing-aware embedding, which integrates puncturing patterns directly into the neural network to enable seamless adaptation to different code rates. Moreover, a balanced bit error rate training strategy is designed to ensure the decoder's robustness across various code lengths, rates, and channels. In this way, the protocol compatibility requirement can be realized. Extensive simulations in both additive white Gaussian noise (AWGN) and Rayleigh fading channels demonstrate that the proposed neural decoder outperforms conventional decoding techniques, offering significant improvements in decoding accuracy and robustness.
♻ ☆ Beyond likelihood ratio bias: Nested multi-time-scale stochastic approximation for likelihood-free parameter estimation
We study parameter inference in simulation-based stochastic models where the analytical form of the likelihood is unknown. The main difficulty is that score evaluation as a ratio of noisy Monte Carlo estimators induces bias and instability, which we overcome with a ratio-free nested multi-time-scale (NMTS) stochastic approximation (SA) method that simultaneously tracks the score and drives the parameter update. We provide a comprehensive theoretical analysis of the proposed NMTS algorithm for solving likelihood-free inference problems, including strong convergence, asymptotic normality, and convergence rates. We show that our algorithm can eliminate the original asymptotic bias $O\big(\sqrt{\frac{1}{N}}\big)$ and accelerate the convergence rate from $O\big(\beta_k+\sqrt{\frac{1}{N}}\big)$ to $O\big(\frac{\beta_k}{\alpha_k}+\sqrt{\frac{\alpha_k}{N}}\big)$, where $N$ is the fixed batch size, $\alpha_k$ and $\beta_k$ are decreasing step sizes with $\alpha_k$, $\beta_k$, $\beta_k/\alpha_k\rightarrow 0$. With proper choice of $\alpha_k$ and $\beta_k$, our convergence rates can match the optimal rate in the multi-time-scale SA literature. Numerical experiments demonstrate that our algorithm can improve the estimation accuracy by one to two orders of magnitude at the same computational cost, making it efficient for parameter estimation in stochastic systems.
♻ ☆ Chaos-based reinforcement learning with TD3
Chaos-based reinforcement learning (CBRL) is a method in which the agent's internal chaotic dynamics drives exploration. However, the learning algorithms in CBRL have not been thoroughly developed in previous studies, nor have they incorporated recent advances in reinforcement learning. This study introduced Twin Delayed Deep Deterministic Policy Gradients (TD3), which is one of the state-of-the-art deep reinforcement learning algorithms that can treat deterministic and continuous action spaces, to CBRL. The validation results provide several insights. First, TD3 works as a learning algorithm for CBRL in a simple goal-reaching task. Second, CBRL agents with TD3 can autonomously suppress their exploratory behavior as learning progresses and resume exploration when the environment changes. Finally, examining the effect of the agent's chaoticity on learning shows that there exists a suitable range of chaos strength in the agent's model to flexibly switch between exploration and exploitation and adapt to environmental changes.
comment: Accepted for publication in Neural Networks
Human-Computer Interaction 22
☆ Human-AI Complementarity: A Goal for Amplified Oversight
Human feedback is critical for aligning AI systems to human values. As AI capabilities improve and AI is used to tackle more challenging tasks, verifying quality and safety becomes increasingly challenging. This paper explores how we can leverage AI to improve the quality of human oversight. We focus on an important safety problem that is already challenging for humans: fact-verification of AI outputs. We find that combining AI ratings and human ratings based on AI rater confidence is better than relying on either alone. Giving humans an AI fact-verification assistant further improves their accuracy, but the type of assistance matters. Displaying AI explanation, confidence, and labels leads to over-reliance, but just showing search results and evidence fosters more appropriate trust. These results have implications for Amplified Oversight -- the challenge of combining humans and AI to supervise AI systems even as they surpass human expert performance.
☆ Metacognition and Confidence Dynamics in Advice Taking from Generative AI
Generative Artificial Intelligence (GenAI) can aid humans in a wide range of tasks, but its effectiveness critically depends on users being able to evaluate the accuracy of GenAI outputs and their own expertise. Here we asked how confidence in self and GenAI contributes to decisions to seek and rely on advice from GenAI ('prospective confidence'), and how advice-taking in turn shapes this confidence ('retrospective confidence'). In a novel paradigm involving text generation, participants formulated plans for events, and could request advice from a GenAI (Study 1; N=200) or were randomly assigned to receive advice (Study 2; N=300), which they could rely on or ignore. Advice requests in Study 1 were related to higher prospective confidence in GenAI and lower confidence in self. Advice-seekers showed increased retrospective confidence in GenAI, while those who declined advice showed increased confidence in self. Random assignment in Study 2 revealed that advice exposure increases confidence in GenAI and in self, suggesting that GenAI advice-taking causally boosts retrospective confidence. These results were mirrored in advice reliance, operationalised as the textual similarity between GenAI advice and participants' responses, with reliance associated with increased retrospective confidence in both GenAI and self. Critically, participants who chose to obtain/rely on advice provided more detailed responses (likely due to the output's verbosity), but failed to check the output thoroughly, missing key information. These findings underscore a key role for confidence in interactions with GenAI, shaped by both prior beliefs about oneself and the reliability of AI, and context-dependent exposure to advice.
☆ Scaffolding Creativity: How Divergent and Convergent LLM Personas Shape Human Machine Creative Problem-Solving
Large language models (LLMs) are increasingly shaping creative work and problem-solving; however, prior research suggests that they may diminish unassisted creativity. To address this tension, a coach-like LLM environment was developed that embodies divergent and convergent thinking personas as two complementary processes. Effectiveness and user behavior were assessed through a controlled experiment in which participants interacted with either persona, while a control group engaged with a standard LLM providing direct answers. Notably, users' perceptions of which persona best supported their creativity often diverged from objective performance measures. Trait-based analyses revealed that individual differences predict when people utilize divergent versus convergent personas, suggesting opportunities for adaptive sequencing. Furthermore, interaction patterns reflected the design thinking model, demonstrating how persona-guided support shapes creative problem-solving. Our findings provide design principles for creativity support systems that strike a balance between exploration and convergence through persona-based guidance and personalization. These insights advance human-AI collaboration tools that scaffold rather than overshadow human creativity.
☆ Look at That Distractor: Dynamic Translation Gain under Low Perceptual Load in Virtual Reality
Redirected walking utilizes gain adjustments within perceptual thresholds to allow natural navigation in large scale virtual environments within confined physical environments. Previous research has found that when users are distracted by some scene elements, they are less sensitive to gain values. However, the effects on detection thresholds have not been quantitatively measured. In this paper, we present a novel method that dynamically adjusts translation gain by leveraging visual distractors. We place distractors within the user's field of view and apply a larger translation gain when their attention is drawn to them. Because the magnitude of gain adjustment depends on the user's level of engagement with the distractors, the redirection process remains smooth and unobtrusive. To evaluate our method, we developed a task oriented virtual environment for a user study. Results show that introducing distractors in the virtual environment significantly raises users' translation gain thresholds. Furthermore, assessments using the Simulator Sickness Questionnaire and Igroup Presence Questionnaire indicate that the method maintains user comfort and acceptance, supporting its effectiveness for RDW systems.
☆ Avatar Appearance Beyond Pixels -- User Ratings and Avatar Preferences within Health Applications
The appearance of a virtual avatar significantly influences its perceived appropriateness and the user's experience, particularly in healthcare applications. This study analyzed interactions with six avatars of varying characteristics in a patient-reported outcome measures (PROMs) application to investigate correlations between avatar ratings and user preferences. Forty-seven participants completed a healthcare survey involving 30 PROMIS items (Global Health and Physical Function) and then rated the avatars on warmth, competence, attractiveness, and human-likeness, as well as their willingness to share personal data. The results showed that competence was the most critical factor in avatar selection, while human-likeness had minimal impact on health data disclosure. Gender did not significantly affect the ratings, but clothing style played a key role, with male avatars in professional attire rated higher in competence due to gender-stereotypical expectations. In contrast, professional female avatars were rated lower in warmth and attractiveness. These findings underline the importance of thoughtful avatar design in healthcare applications to enhance user experience and engagement.
comment: 17 pages, 3 figures, 1 table
☆ Structurally Valid Log Generation using FSM-GFlowNets
Generating structurally valid and behaviorally diverse synthetic event logs for interaction-aware models is a challenging yet crucial problem, particularly in settings with limited or privacy constrained user data. Existing methods such as heuristic simulations and LLM based generators often lack structural coherence or controllability, producing synthetic data that fails to accurately represent real world system interactions. This paper presents a framework that integrates Finite State Machines or FSMs with Generative Flow Networks or GFlowNets to generate structured, semantically valid, and diverse synthetic event logs. Our FSM-constrained GFlowNet ensures syntactic validity and behavioral variation through dynamic action masking and guided sampling. The FSM, derived from expert traces, encodes domain-specific rules, while the GFlowNet is trained using a flow matching objective with a hybrid reward balancing FSM compliance and statistical fidelity. We instantiate the framework in the context of UI interaction logs using the UIC HCI dataset, but the approach generalizes to any symbolic sequence domain. Experimental results based on distributional metrics show that our FSM GFlowNet produces realistic, structurally consistent logs, achieving, for instance, under the real user logs baseline, a KL divergence of 0.2769 and Chi squared distance of 0.3522, significantly outperforming GPT-4o's 2.5294/13.8020 and Gemini's 3.7233/63.0355, alongside a leading bigram overlap of 0.1214 vs. GPT 4o's 0.0028 and Gemini's 0.0007. A downstream use case intent classification demonstrates that classifiers trained solely on our synthetic logs produced from FSM-GFlowNet achieve competitive accuracy compared to real data.
☆ Linking Heterogeneous Data with Coordinated Agent Flows for Social Media Analysis
Social media platforms generate massive volumes of heterogeneous data, capturing user behaviors, textual content, temporal dynamics, and network structures. Analyzing such data is crucial for understanding phenomena such as opinion dynamics, community formation, and information diffusion. However, discovering insights from this complex landscape is exploratory, conceptually challenging, and requires expertise in social media mining and visualization. Existing automated approaches, though increasingly leveraging large language models (LLMs), remain largely confined to structured tabular data and cannot adequately address the heterogeneity of social media analysis. We present SIA (Social Insight Agents), an LLM agent system that links heterogeneous multi-modal data -- including raw inputs (e.g., text, network, and behavioral data), intermediate outputs, mined analytical results, and visualization artifacts -- through coordinated agent flows. Guided by a bottom-up taxonomy that connects insight types with suitable mining and visualization techniques, SIA enables agents to plan and execute coherent analysis strategies. To ensure multi-modal integration, it incorporates a data coordinator that unifies tabular, textual, and network data into a consistent flow. Its interactive interface provides a transparent workflow where users can trace, validate, and refine the agent's reasoning, supporting both adaptability and trustworthiness. Through expert-centered case studies and quantitative evaluation, we show that SIA effectively discovers diverse and meaningful insights from social media while supporting human-agent collaboration in complex analytical tasks.
☆ Interaction-Augmented Instruction: Modeling the Synergy of Prompts and Interactions in Human-GenAI Collaboration
Text prompt is the most common way for human-generative AI (GenAI) communication. Though convenient, it is challenging to convey fine-grained and referential intent. One promising solution is to combine text prompts with precise GUI interactions, like brushing and clicking. However, there lacks a formal model to model synergistic designs between prompts and interactions, hindering their comparison and innovation. To fill this gap, via an iterative and deductive process, we develop the Interaction-Augmented Instruction (IAI) model, a compact entity-relation graph formalizing how the combination of interactions and text prompts enhances human-generative AI communication. With the model, we distill twelve recurring and composable atomic interaction paradigms from prior tools, verifying our model's capability to facilitate systematic design characterization and comparison. Case studies further demonstrate the model's utility in applying, refining, and extending these paradigms. These results illustrate our IAI model's descriptive, discriminative, and generative power for shaping future GenAI systems.
comment: 26 pages
☆ Can AI be Accountable?
The AI we use is powerful, and its power is increasing rapidly. If this powerful AI is to serve the needs of consumers, voters, and decision makers, then it is imperative that the AI is accountable. In general, an agent is accountable to a forum if the forum can request information from the agent about its actions, if the forum and the agent can discuss this information, and if the forum can sanction the agent. Unfortunately, in too many cases today's AI is not accountable -- we cannot question it, enter into a discussion with it, let alone sanction it. In this chapter we relate the general definition of accountability to AI, we illustrate what it means for AI to be accountable and unaccountable, and we explore approaches that can improve our chances of living in a world where all AI is accountable to those who are affected by it.
comment: To be published as a chapter in Daniele Quercia and Marios Constantinides (Eds.). Operationalizing Responsible AI. Cambridge University Press. Forthcoming
☆ FractalBrain: A Neuro-interactive Virtual Reality Experience using Electroencephalogram (EEG) for Mindfulness
Mindfulness has been studied and practiced in enhancing psychological well-being while reducing neuroticism and psychopathological indicators. However, practicing mindfulness with continuous attention is challenging, especially for beginners. In the proposed system, FractalBrain, we utilize an interactive audiovisual fractal with a geometric repetitive pattern that has been demonstrated to induce meditative effects. FractalBrain presents an experience combining a surreal virtual reality (VR) program with an electroencephalogram (EEG) interface. While viewing an ever-changing fractal-inspired artwork in an immersive environment, the user's EEG stream is analyzed and mapped into VR. These EEG data adaptively manipulates the audiovisual parameters in real-time, generating a distinct experience for each user. The pilot feedback suggests the potential of the FractalBrain to facilitate mindfulness and enhance attention.
comment: Extended Abstracts (Interactivity), 4 pages. Published at the 2024 CHI Conference on Human Factors in Computing Systems
☆ VitalLens 2.0: High-Fidelity rPPG for Heart Rate Variability Estimation from Face Video
This report introduces VitalLens 2.0, a new deep learning model for estimating physiological signals from face video. This new model demonstrates a significant leap in accuracy for remote photoplethysmography (rPPG), enabling the robust estimation of not only heart rate (HR) and respiratory rate (RR) but also Heart Rate Variability (HRV) metrics. This advance is achieved through a combination of a new model architecture and a substantial increase in the size and diversity of our training data, now totaling 1,413 unique individuals. We evaluate VitalLens 2.0 on a new, combined test set of 422 unique individuals from four public and private datasets. When averaging results by individual, VitalLens 2.0 achieves a Mean Absolute Error (MAE) of 1.57 bpm for HR, 1.08 bpm for RR, 10.18 ms for HRV-SDNN, and 16.45 ms for HRV-RMSSD. These results represent a new state-of-the-art, significantly outperforming previous methods. This model is now available to developers via the VitalLens API at https://rouast.com/api.
comment: Technical Report. 8 pages, 5 figures. Introduces the VitalLens 2.0 model for rPPG and Heart Rate Variability (HRV) estimation. Project website: https://rouast.com/api
☆ AIOT based Smart Education System: A Dual Layer Authentication and Context-Aware Tutoring Framework for Learning Environments
The AIoT-Based Smart Education System integrates Artificial Intelligence and IoT to address persistent challenges in contemporary classrooms: attendance fraud, lack of personalization, student disengagement, and inefficient resource use. The unified platform combines four core modules: (1) a dual-factor authentication system leveraging RFID-based ID scans and WiFi verification for secure, fraud-resistant attendance; (2) an AI-powered assistant that provides real-time, context-aware support and dynamic quiz generation based on instructor-supplied materials; (3) automated test generators to streamline adaptive assessment and reduce administrative overhead; and (4) the EcoSmart Campus module, which autonomously regulates classroom lighting, air quality, and temperature using IoT sensors and actuators. Simulated evaluations demonstrate the system's effectiveness in delivering robust real-time monitoring, fostering inclusive engagement, preventing fraudulent practices, and supporting operational scalability. Collectively, the AIoT-Based Smart Education System offers a secure, adaptive, and efficient learning environment, providing a scalable blueprint for future educational innovation and improved student outcomes through the synergistic application of artificial intelligence and IoT technologies.
☆ Using Salient Object Detection to Identify Manipulative Cookie Banners that Circumvent GDPR AAAI
The main goal of this paper is to study how often cookie banners that comply with the General Data Protection Regulation (GDPR) contain aesthetic manipulation, a design tactic to draw users' attention to the button that permits personal data sharing. As a byproduct of this goal, we also evaluate how frequently the banners comply with GDPR and the recommendations of national data protection authorities regarding banner designs. We visited 2,579 websites and identified the type of cookie banner implemented. Although 45% of the relevant websites have fully compliant banners, we found aesthetic manipulation on 38% of the compliant banners. Unlike prior studies of aesthetic manipulation, we use a computer vision model for salient object detection to measure how salient (i.e., attention-drawing) each banner element is. This enables the discovery of new types of aesthetic manipulation (e.g., button placement), and leads us to conclude that aesthetic manipulation is more common than previously reported (38% vs 27% of banners). To study the effects of user and/or website location on cookie banner design, we include websites within the European Union (EU), where privacy regulation enforcement is more stringent, and websites outside the EU. We visited websites from IP addresses in the EU and from IP addresses in the United States (US). We find that 13.9% of EU websites change their banner design when the user is from the US, and EU websites are roughly 48.3% more likely to use aesthetic manipulation than non-EU websites, highlighting their innovative responses to privacy regulation.
comment: Accepted to International AAAI Conference on Web and Social Media 2026 (ICWSM'26)
♻ ☆ Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media
On social media, many individuals experiencing suicidal ideation (SI) do not disclose their distress explicitly. Instead, signs may surface indirectly through everyday posts or peer interactions. Detecting such implicit signals early is critical but remains challenging. We frame early and implicit SI as a forward-looking prediction task and develop a computational framework that models a user's information environment, consisting of both their longitudinal posting histories as well as the discourse of their socially proximal peers. We adopted a composite network centrality measure to identify top neighbors of a user, and temporally aligned the user's and neighbors' interactions -- integrating the multi-layered signals in a fine-tuned DeBERTa-v3 model. In a Reddit study of 1,000 (500 Case and 500 Control) users, our approach improves early and implicit SI detection by 15% over individual-only baselines. These findings highlight that peer interactions offer valuable predictive signals and carry broader implications for designing early detection systems that capture indirect as well as masked expressions of risk in online environments.
♻ ☆ AI's Social Forcefield: Reshaping Distributed Cognition in Human-AI Teams
AI is not only a neutral tool in team settings; it actively reshapes the social and cognitive fabric of collaboration. We advance a unified framework of alignment in distributed cognition in human-AI teams -- a process through which linguistic, cognitive, and social coordination emerge as human and AI agents co-construct a shared representational space. Across two studies, we show that exposure to AI-generated language shapes not only how people speak, but also how they think, what they attend to, and how they relate to each other. Together, these findings reveal how AI participation reorganizes the distributed cognitive architecture of teams: AI systems function as implicit social forcefields. Our findings highlight the double-edged impact of AI: the same mechanisms that enable efficient collaboration can also erode epistemic diversity and undermine natural alignment processes. We argue for rethinking AI in teams as a socially influential actor and call for new design paradigms that foreground transparency, controllability, and group-level dynamics to foster responsible, productive human-AI collaboration.
♻ ☆ SmartSustain Recommender System: Navigating Sustainability Trade-offs in Personalized City Trip Planning CIKM'2025
Tourism is a major contributor to global carbon emissions and over-tourism, creating an urgent need for recommender systems that not only inform but also gently steer users toward more sustainable travel decisions. Such choices, however, often require balancing complex trade-offs between environmental impact, cost, convenience, and personal interests. To address this, we present the SmartSustain Recommender, a web application designed to nudge users toward eco-friendlier options through an interactive, user-centric interface. The system visualizes the broader consequences of travel decisions by combining CO2e emissions, destination popularity, and seasonality with personalized interest matching. It employs mechanisms such as interactive city cards for quick comparisons, dynamic banners that surface sustainable alternatives in specific trade-off scenarios, and real-time impact feedback using animated environmental indicators. A preliminary user study with 21 participants indicated strong usability and perceived effectiveness. The system is accessible at https://smartsustainrecommender.web.app.
comment: Accepted for presentation at Workshop on Recommender Systems for Sustainable Development (RS4SD), co-located with CIKM'2025
♻ ☆ Proceedings of the Access InContext Workshop @ CHI'25 Conference on Human Factors in Computing Systems
This is the Proceedings of the Access InContext Workshop, which was held at the CHI'25 Conference on Human Factors in Computing Systems, in Yokohama, Japan, on April 26th 2025.
♻ ☆ Reflection on Data Storytelling Tools in the Generative AI Era from the Human-AI Collaboration Perspective IEEE VIS 25
Human-AI collaborative tools attract attentions from the data storytelling community to lower the expertise barrier and streamline the workflow. The recent advance in large-scale generative AI techniques, e.g., large language models (LLMs) and text-to-image models, has the potential to enhance data storytelling with their power in visual and narration generation. After two years since these techniques were publicly available, it is important to reflect our progress of applying them and have an outlook for future opportunities. To achieve the goal, we compare the collaboration patterns of the latest tools with those of earlier ones using a dedicated framework for understanding human-AI collaboration in data storytelling. Through comparison, we identify consistently widely studied patterns, e.g., human-creator + AI-assistant, and newly explored or emerging ones, e.g., AI-creator + human-reviewer. The benefits of these AI techniques and implications to human-AI collaboration are also revealed. We further propose future directions to hopefully ignite innovations.
comment: This paper is a sequel to the CHI 24 paper "Where Are We So Far? Understanding Data Storytelling Tools from the Perspective of Human-AI Collaboration (https://doi.org/10.1145/3613904.3642726), aiming to refresh our understanding with the latest advancements. It is accepted at IEEE VIS 25
♻ ☆ AI for a Planet Under Pressure
Artificial intelligence (AI) is already driving scientific breakthroughs in a variety of research fields, ranging from the life sciences to mathematics. This raises a critical question: can AI be applied both responsibly and effectively to address complex and interconnected sustainability challenges? This report is the result of a collaboration between the Stockholm resilience Centre (Stockholm University), the Potsdam Institute for Climate Impact Research (PIK), and Google DeepMind. Our work explores the potential and limitations of using AI as a research method to help tackle eight broad sustainability challenges. The results build on iterated expert dialogues and assessments, a systematic AI-supported literature overview including over 8,500 academic publications, and expert deep-dives into eight specific issue areas. The report also includes recommendations to sustainability scientists, research funders, the private sector, and philanthropies.
comment: 88 pages, 8 figures, 1 table
♻ ☆ Sensible Agent: A Framework for Unobtrusive Interaction with Proactive AR Agents
Proactive AR agents promise context-aware assistance, but their interactions often rely on explicit voice prompts or responses, which can be disruptive or socially awkward. We introduce Sensible Agent, a framework designed for unobtrusive interaction with these proactive agents. Sensible Agent dynamically adapts both "what" assistance to offer and, crucially, "how" to deliver it, based on real-time multimodal context sensing. Informed by an expert workshop (n=12) and a data annotation study (n=40), the framework leverages egocentric cameras, multimodal sensing, and Large Multimodal Models (LMMs) to infer context and suggest appropriate actions delivered via minimally intrusive interaction modes. We demonstrate our prototype on an XR headset through a user study (n=10) in both AR and VR scenarios. Results indicate that Sensible Agent significantly reduces perceived interaction effort compared to voice-prompted baseline, while maintaining high usability and achieving higher preference.
♻ ☆ Improving Human Verification of LLM Reasoning through Interactive Explanation Interfaces
The reasoning capabilities of Large Language Models (LLMs) have led to their increasing employment in several critical applications, particularly education, where they support problem-solving, tutoring, and personalized study. Chain-of-thought (CoT) reasoning capabilities [1, 2] are well-known to help LLMs decompose a problem into steps and explore the solution spaces more effectively, leading to impressive performance on mathematical and reasoning benchmarks. As the length of CoT tokens per question increases substantially to even thousands of tokens per question [ 1], it is unknown how users could comprehend LLM reasoning and detect errors or hallucinations. To address this problem and understand how reasoning can improve human-AI interaction, we present three new interactive reasoning interfaces: interactive CoT (iCoT), interactive Program-of-Thought (iPoT), and interactive Graph (iGraph). That is, we ask LLMs themselves to generate an interactive web interface wrapped around the original CoT content, which may be presented in text (iCoT), graphs (iGraph) or code (iPoT). This interface allows users to interact with and provide a novel experience in reading and validating the reasoning chains of LLMs. Across a study of 125 participants, interactive interfaces significantly improve user performance. Specifically, iGraph users score the highest error detection rate (85.6%), followed by iPoT (82.5%), iCoT (80.6%), all outperforming standard CoT (73.5%). Interactive interfaces also lead to faster user validation time-iGraph users are faster (57.9 secs per question) than the users of iCoT and iPoT (60 secs) and the standard CoT (64.7 secs). A post-study questionnaire shows that users prefer iGraph, citing its superior ability to enable them to follow the LLM's reasoning. We discuss the implications of these results and provide recommendations for the future design of reasoning models.
comment: 19 pages, 14 figures
♻ ☆ Modelling Emotions in Face-to-Face Setting: The Interplay of Eye-Tracking, Personality, and Temporal Dynamics
Accurate emotion recognition is pivotal for nuanced and engaging human-computer interactions, yet remains difficult to achieve, especially in dynamic, conversation-like settings. In this study, we showcase how integrating eye-tracking data, temporal dynamics, and personality traits can substantially enhance the detection of both perceived and felt emotions. Seventy-three participants viewed short, speech-containing videos from the CREMA-D dataset, while being recorded for eye-tracking signals (pupil size, fixation patterns), Big Five personality assessments, and self-reported emotional states. Our neural network models combined these diverse inputs including stimulus emotion labels for contextual cues and yielded marked performance gains compared to the state-of-the-art. Specifically, perceived valence predictions reached a macro F1-score of 0.76, and models incorporating personality traits and stimulus information demonstrated significant improvements in felt emotion accuracy. These results highlight the benefit of unifying physiological, individual and contextual factors to address the subjectivity and complexity of emotional expression. Beyond validating the role of user-specific data in capturing subtle internal states, our findings inform the design of future affective computing and human-agent systems, paving the way for more adaptive and cross-individual emotional intelligence in real-world interactions.
comment: The paper has been significantly revised and my colleague has already submitted the updated version in the link below: arXiv:2510.24720
Software Engineering 29
☆ Optimized Log Parsing with Syntactic Modifications
Logs provide valuable insights into system runtime and assist in software development and maintenance. Log parsing, which converts semi-structured log data into structured log data, is often the first step in automated log analysis. Given the wide range of log parsers utilizing diverse techniques, it is essential to evaluate them to understand their characteristics and performance. In this paper, we conduct a comprehensive empirical study comparing syntax- and semantic-based log parsers, as well as single-phase and two-phase parsing architectures. Our experiments reveal that semantic-based methods perform better at identifying the correct templates and syntax-based log parsers are 10 to 1,000 times more efficient and provide better grouping accuracy although they fall short in accurate template identification. Moreover, two-phase architecture consistently improves accuracy compared to single-phase architecture. Based on the findings of this study, we propose SynLog+, a template identification module that acts as the second phase in a two-phase log parsing architecture. SynLog+ improves the parsing accuracy of syntax-based and semantic-based log parsers by 236\% and 20\% on average, respectively, with virtually no additional runtime cost.
☆ Using Copilot Agent Mode to Automate Library Migration: A Quantitative Assessment
Keeping software systems up to date is essential to avoid technical debt, security vulnerabilities, and the rigidity typical of legacy systems. However, updating libraries and frameworks remains a time consuming and error-prone process. Recent advances in Large Language Models (LLMs) and agentic coding systems offer new opportunities for automating such maintenance tasks. In this paper, we evaluate the update of a well-known Python library, SQLAlchemy, across a dataset of ten client applications. For this task, we use the Github's Copilot Agent Mode, an autonomous AI systema capable of planning and executing multi-step migration workflows. To assess the effectiveness of the automated migration, we also introduce Migration Coverage, a metric that quantifies the proportion of API usage points correctly migrated. The results of our study show that the LLM agent was capable of migrating functionalities and API usages between SQLAlchemy versions (migration coverage: 100%, median), but failed to maintain the application functionality, leading to a low test-pass rate (39.75%, median).
☆ Process-based Indicators of Vulnerability Re-Introducing Code Changes: An Exploratory Case Study
Software vulnerabilities often persist or re-emerge even after being fixed, revealing the complex interplay between code evolution and socio-technical factors. While source code metrics provide useful indicators of vulnerabilities, software engineering process metrics can uncover patterns that lead to their introduction. Yet few studies have explored whether process metrics can reveal risky development activities over time -- insights that are essential for anticipating and mitigating software vulnerabilities. This work highlights the critical role of process metrics along with code changes in understanding and mitigating vulnerability reintroduction. We move beyond file-level prediction and instead analyze security fixes at the commit level, focusing not only on whether a single fix introduces a vulnerability but also on the longer sequences of changes through which vulnerabilities evolve and re-emerge. Our approach emphasizes that reintroduction is rarely the result of one isolated action, but emerges from cumulative development activities and socio-technical conditions. To support this analysis, we conducted a case study on the ImageMagick project by correlating longitudinal process metrics such as bus factor, issue density, and issue spoilage with vulnerability reintroduction activities, encompassing 76 instances of reintroduced vulnerabilities. Our findings show that reintroductions often align with increased issue spoilage and fluctuating issue density, reflecting short-term inefficiencies in issue management and team responsiveness. These observations provide a foundation for broader studies that combine process and code metrics to predict risky fixes and strengthen software security.
comment: 9 pages, 6 figures; Samiha Shimmi and Nicholas M. Synovic contributed equally to this work (co-first authors); Mona Rahimi and George K. Thiruvathukal contributed equally to this work (co-supervisors)
☆ Stitch: Step-by-step LLM Guided Tutoring for Scratch
Block-based environments such as Scratch are increasingly popular in programming education. While block syntax reduces surface errors, semantic bugs remain common and challenging for novices to resolve. Existing debugging workflows typically show the correct program directly to learners, a strategy that may fix errors but undermines the development of problem-solving skills. We present Stitch, an interactive tutoring system that replaces "showing the answer" with step-by-step scaffolding. The system's Diff-Analyze module contrasts a student's project with a reference implementation, identifies the most critical differences, and uses a large language model to explain why these changes matter. Learners inspect highlighted blocks through a custom rendering engine, understand the explanations, and selectively apply partial fixes. This iterative process continues until the intended functionality is achieved. We evaluate Stitch in an empirical study, comparing it against a state-of-the-art automated feedback generation tool for Scratch. Our key insight is that simply presenting the correct program is pedagogically ineffective. In contrast, our interactive, step-by-step guided system promotes a more effective learning experience. More broadly, what constitutes effective feedback in block-based programming remains an open question. Our evaluation provides new evidence that step-by-step tutoring significantly enhances learning outcomes, outperforming both direct-answer approaches and current automated feedback generation tools.
☆ Toward Automated Security Risk Detection in Large Software Using Call Graph Analysis
Threat modeling plays a critical role in the identification and mitigation of security risks; however, manual approaches are often labor intensive and prone to error. This paper investigates the automation of software threat modeling through the clustering of call graphs using density-based and community detection algorithms, followed by an analysis of the threats associated with the identified clusters. The proposed method was evaluated through a case study of the Splunk Forwarder Operator (SFO), wherein selected clustering metrics were applied to the software's call graph to assess pertinent code-density security weaknesses. The results demonstrate the viability of the approach and underscore its potential to facilitate systematic threat assessment. This work contributes to the advancement of scalable, semi-automated threat modeling frameworks tailored for modern cloud-native environments.
☆ Online and Interactive Bayesian Inference Debugging ICSE 2026
Probabilistic programming is a rapidly developing programming paradigm which enables the formulation of Bayesian models as programs and the automation of posterior inference. It facilitates the development of models and conducting Bayesian inference, which makes these techniques available to practitioners from multiple fields. Nevertheless, probabilistic programming is notoriously difficult as identifying and repairing issues with inference requires a lot of time and deep knowledge. Through this work, we introduce a novel approach to debugging Bayesian inference that reduces time and required knowledge significantly. We discuss several requirements a Bayesian inference debugging framework has to fulfill, and propose a new tool that meets these key requirements directly within the development environment. We evaluate our results in a study with 18 experienced participants and show that our approach to online and interactive debugging of Bayesian inference significantly reduces time and difficulty on inference debugging tasks.
comment: Accepted by ICSE 2026
☆ "Show Me You Comply... Without Showing Me Anything": Zero-Knowledge Software Auditing for AI-Enabled Systems
The increasing exploitation of Artificial Intelligence (AI) enabled systems in critical domains has made trustworthiness concerns a paramount showstopper, requiring verifiable accountability, often by regulation (e.g., the EU AI Act). Classical software verification and validation techniques, such as procedural audits, formal methods, or model documentation, are the mechanisms used to achieve this. However, these methods are either expensive or heavily manual and ill-suited for the opaque, "black box" nature of most AI models. An intractable conflict emerges: high auditability and verifiability are required by law, but such transparency conflicts with the need to protect assets being audited-e.g., confidential data and proprietary models-leading to weakened accountability. To address this challenge, this paper introduces ZKMLOps, a novel MLOps verification framework that operationalizes Zero-Knowledge Proofs (ZKPs)-cryptographic protocols allowing a prover to convince a verifier that a statement is true without revealing additional information-within Machine-Learning Operations lifecycles. By integrating ZKPs with established software engineering patterns, ZKMLOps provides a modular and repeatable process for generating verifiable cryptographic proof of compliance. We evaluate the framework's practicality through a study of regulatory compliance in financial risk auditing and assess feasibility through an empirical evaluation of top ZKP protocols, analyzing performance trade-offs for ML models of increasing complexity.
comment: This work has been submitted to the ACM Transactions on Software Engineering and Methodology for possible publication
☆ Reflecting on Empirical and Sustainability Aspects of Software Engineering Research in the Era of Large Language Models
Software Engineering (SE) research involving the use of Large Language Models (LLMs) has introduced several new challenges related to rigour in benchmarking, contamination, replicability, and sustainability. In this paper, we invite the research community to reflect on how these challenges are addressed in SE. Our results provide a structured overview of current LLM-based SE research at ICSE, highlighting both encouraging practices and persistent shortcomings. We conclude with recommendations to strengthen benchmarking rigour, improve replicability, and address the financial and environmental costs of LLM-based SE.
comment: 5 pages
☆ Envisioning Future Interactive Web Development: Editing Webpage with Natural Language
The evolution of web applications relies on iterative code modifications, a process that is traditionally manual and time-consuming. While Large Language Models (LLMs) can generate UI code, their ability to edit existing code from new design requirements (e.g., "center the logo") remains a challenge. This is largely due to the absence of large-scale, high-quality tuning data to align model performance with human expectations. In this paper, we introduce a novel, automated data generation pipeline that uses LLMs to synthesize a high-quality fine-tuning dataset for web editing, named Instruct4Edit. Our approach generates diverse instructions, applies the corresponding code modifications, and performs visual verification to ensure correctness. By fine-tuning models on Instruct4Edit, we demonstrate consistent improvement in translating human intent into precise, structurally coherent, and visually accurate code changes. This work provides a scalable and transparent foundation for natural language based web editing, demonstrating that fine-tuning smaller open-source models can achieve competitive performance with proprietary systems. We release all data, code implementations, and model checkpoints for reproduction.
comment: accepted by AIWare'25
☆ Automated Extract Method Refactoring with Open-Source LLMs: A Comparative Study
Automating the Extract Method refactoring (EMR) remains challenging and largely manual despite its importance in improving code readability and maintainability. Recent advances in open-source, resource-efficient Large Language Models (LLMs) offer promising new approaches for automating such high-level tasks. In this work, we critically evaluate five state-of-the-art open-source LLMs, spanning 3B to 8B parameter sizes, on the EMR task for Python code. We systematically assess functional correctness and code quality using automated metrics and investigate the impact of prompting strategies by comparing one-shot prompting to a Recursive criticism and improvement (RCI) approach. RCI-based prompting consistently outperforms one-shot prompting in test pass rates and refactoring quality. The best-performing models, Deepseek-Coder-RCI and Qwen2.5-Coder-RCI, achieve test pass percentage (TPP) scores of 0.829 and 0.808, while reducing lines of code (LOC) per method from 12.103 to 6.192 and 5.577, and cyclomatic complexity (CC) from 4.602 to 3.453 and 3.294, respectively. A developer survey on RCI-generated refactorings shows over 70% acceptance, with Qwen2.5-Coder rated highest across all evaluation criteria. In contrast, the original code scored below neutral, particularly in readability and maintainability, underscoring the benefits of automated refactoring guided by quality prompts. While traditional metrics like CC and LOC provide useful signals, they often diverge from human judgments, emphasizing the need for human-in-the-loop evaluation. Our open-source benchmark offers a foundation for future research on automated refactoring with LLMs.
comment: Accepted at AIware'25 - Main Track
☆ SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning ICSE 2026
Identifying and addressing security issues during the early phase of the development lifecycle is critical for mitigating the long-term negative impacts on software systems. Code review serves as an effective practice that enables developers to check their teammates' code before integration into the codebase. To streamline the generation of review comments, various automated code review approaches have been proposed, where LLM-based methods have significantly advanced the capabilities of automated review generation. However, existing models primarily focus on general-purpose code review, their effectiveness in identifying and addressing security-related issues remains underexplored. Moreover, adapting existing code review approaches to target security issues faces substantial challenges, including data scarcity and inadequate evaluation metrics. To address these limitations, we propose SecureReviewer, a new approach designed for enhancing LLMs' ability to identify and resolve security-related issues during code review. Specifically, we first construct a dataset tailored for training and evaluating secure code review capabilities. Leveraging this dataset, we fine-tune LLMs to generate code review comments that can effectively identify security issues and provide fix suggestions with our proposed secure-aware fine-tuning strategy. To mitigate hallucination in LLMs and enhance the reliability of their outputs, we integrate the RAG technique, which grounds the generated comments in domain-specific security knowledge. Additionally, we introduce SecureBLEU, a new evaluation metric designed to assess the effectiveness of review comments in addressing security issues. Experimental results demonstrate that SecureReviewer outperforms state-of-the-art baselines in both security issue detection accuracy and the overall quality and practical utility of generated review comments.
comment: Accepted by ICSE 2026. Code and data: https://github.com/SIMIAO515/SecureReviewer
☆ CHCVerif: A Portfolio-Based Solver for Constrained Horn Clauses
Constrained Horn Clauses (CHCs) are widely adopted as intermediate representations for a variety of verification tasks, including safety checking, invariant synthesis, and interprocedural analysis. This paper introduces CHCVERIF, a portfolio-based CHC solver that adopts a software verification approach for solving CHCs. This approach enables us to reuse mature software verification tools to tackle CHC benchmarks, particularly those involving bitvectors and low-level semantics. Our evaluation shows that while the method enjoys only moderate success with linear integer arithmetic, it achieves modest success on bitvector benchmarks. Moreover, our results demonstrate the viability and potential of using software verification tools as backends for CHC solving, particularly when supported by a carefully constructed portfolio.
comment: In Proceedings HCVS 2025, arXiv:2510.25468
☆ Theta as a Horn Solver
Theta is a verification framework that has participated in the CHC-COMP competition since 2023. While its core approach -- based on transforming constrained Horn clauses (CHCs) into control-flow automata (CFAs) for analysis -- has remained mostly unchanged, Theta's verification techniques, design trade-offs, and limitations have remained mostly unexplored in the context of CHCs. This paper fills that gap: we provide a detailed description of the algorithms employed by Theta, highlighting the unique features that distinguish it from other CHC solvers. We also analyze the strengths and weaknesses of the tool in the context of CHC-COMP benchmarks. Notably, in the 2025 edition of the competition, Theta's performance was impacted by a configuration issue, leading to suboptimal results. To provide a clearer picture of Theta's actual capabilities, we re-execute the tool on the competition benchmarks under corrected settings and report on the resulting performance.
comment: In Proceedings HCVS 2025, arXiv:2510.25468
☆ Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis
Test oracle generation in non-regression testing is a longstanding challenge in software engineering, where the goal is to produce oracles that can accurately determine whether a function under test (FUT) behaves as intended for a given input. In this paper, we introduce Nexus, a novel multi-agent framework to address this challenge. Nexus generates test oracles by leveraging a diverse set of specialized agents that synthesize test oracles through a structured process of deliberation, validation, and iterative self-refinement. During the deliberation phase, a panel of four specialist agents, each embodying a distinct testing philosophy, collaboratively critiques and refines an initial set of test oracles. Then, in the validation phase, Nexus generates a plausible candidate implementation of the FUT and executes the proposed oracles against it in a secure sandbox. For any oracle that fails this execution-based check, Nexus activates an automated selfrefinement loop, using the specific runtime error to debug and correct the oracle before re-validation. Our extensive evaluation on seven diverse benchmarks demonstrates that Nexus consistently and substantially outperforms state-of-theart baselines. For instance, Nexus improves the test-level oracle accuracy on the LiveCodeBench from 46.30% to 57.73% for GPT-4.1-Mini. The improved accuracy also significantly enhances downstream tasks: the bug detection rate of GPT4.1-Mini generated test oracles on HumanEval increases from 90.91% to 95.45% for Nexus compared to baselines, and the success rate of automated program repair improves from 35.23% to 69.32%.
comment: Under Review
☆ Environmental Impact of CI/CD Pipelines
CI/CD pipelines are widely used in software development, yet their environmental impact, particularly carbon and water footprints (CWF), remains largely unknown to developers, as CI service providers typically do not disclose such information. With the growing environmental impact of cloud computing, understanding the CWF of CI/CD services has become increasingly important. This work investigates the CWF of using GitHub Actions, focusing on open-source repositories where usage is free and unlimited for standard runners. We build upon a methodology from the Cloud Carbon Footprint framework and we use the largest dataset of workflow runs reported in the literature to date, comprising over 2.2 million workflow runs from more than 18,000 repositories. Our analysis reveals that the GitHub Actions ecosystem results in a substantial CWF. Our estimates for the carbon footprint in 2024 range from 150.5 MTCO2e in the most optimistic scenario to 994.9 MTCO2e in the most pessimistic scenario, while the water footprint ranges from 1,989.6 to 37,664.5 kiloliters. The most likely scenario estimates are 456.9 MTCO2e for carbon footprint and 5,738.2 kiloliters for water footprint. To provide perspective, the carbon footprint in the most likely scenario is equivalent to the carbon captured by 7,615 urban trees in a year, and the water footprint is comparable to the water consumed by an average American family over 5,053 years. We explore strategies to mitigate this impact, primarily by reducing wasted computational resources. Key recommendations include deploying runners in regions whose energy production has a low environmental impact such as France and the United Kingdom, implementing stricter deactivation policies for scheduled runs and aligning their execution with periods when the regional energy mix is more environmentally favorable, and reducing the size of repositories.
comment: This work has been submitted to the IEEE for possible publication
☆ Empowering RepoQA-Agent based on Reinforcement Learning Driven by Monte-carlo Tree Search
Repository-level software engineering tasks require large language models (LLMs) to efficiently navigate and extract information from complex codebases through multi-turn tool interactions. Existing approaches face significant limitations: training-free, in-context learning methods struggle to guide agents effectively in tool utilization and decision-making based on environmental feedback, while training-based approaches typically rely on costly distillation from larger LLMs, introducing data compliance concerns in enterprise environments. To address these challenges, we introduce RepoSearch-R1, a novel agentic reinforcement learning framework driven by Monte-carlo Tree Search (MCTS). This approach allows agents to generate diverse, high-quality reasoning trajectories via self-training without requiring model distillation or external supervision. Based on RepoSearch-R1, we construct a RepoQA-Agent specifically designed for repository question-answering tasks. Comprehensive evaluation on repository question-answering tasks demonstrates that RepoSearch-R1 achieves substantial improvements of answer completeness: 16.0% enhancement over no-retrieval methods, 19.5% improvement over iterative retrieval methods, and 33% increase in training efficiency compared to general agentic reinforcement learning approaches. Our cold-start training methodology eliminates data compliance concerns while maintaining robust exploration diversity and answer completeness across repository-level reasoning tasks.
☆ A Research Roadmap for Augmenting Software Engineering Processes and Software Products with Generative AI
Generative AI (GenAI) is rapidly transforming software engineering (SE) practices, influencing how SE processes are executed, as well as how software systems are developed, operated, and evolved. This paper applies design science research to build a roadmap for GenAI-augmented SE. The process consists of three cycles that incrementally integrate multiple sources of evidence, including collaborative discussions from the FSE 2025 "Software Engineering 2030" workshop, rapid literature reviews, and external feedback sessions involving peers. McLuhan's tetrads were used as a conceptual instrument to systematically capture the transforming effects of GenAI on SE processes and software products.The resulting roadmap identifies four fundamental forms of GenAI augmentation in SE and systematically characterizes their related research challenges and opportunities. These insights are then consolidated into a set of future research directions. By grounding the roadmap in a rigorous multi-cycle process and cross-validating it among independent author teams and peers, the study provides a transparent and reproducible foundation for analyzing how GenAI affects SE processes, methods and tools, and for framing future research within this rapidly evolving area. Based on these findings, the article finally makes ten predictions for SE in the year 2030.
☆ The "4W+1H" of Software Supply Chain Security Checklist for Critical Infrastructure
The increasing frequency and sophistication of software supply chain attacks pose severe risks to critical infrastructure sectors, threatening national security, economic stability, and public safety. Despite growing awareness, existing security practices remain fragmented and insufficient, with most frameworks narrowly focused on isolated life cycle stages or lacking alignment with the specific needs of critical infrastructure (CI) sectors. In this paper, we conducted a multivocal literature review across international frameworks, Australian regulatory sources, and academic studies to identify and analyze security practices across the software supply chain, especially specific CI sector. Our analysis found that few existing frameworks are explicitly tailored to CI domains. We systematically leveraged identified software supply chain security frameworks, using a "4W+1H" analytical approach, we synthesized ten core categories (what) of software supply chain security practices, mapped them across life-cycle phases (when), stakeholder roles (who), and implementation levels (how), and examined their coverage across existing frameworks (where). Building on these insights, the paper culminates in structured, multi-layered checklist of 80 questions designed to relevant stakeholders evaluate and enhance their software supply chain security. Our findings reveal gaps between framework guidance and sector-specific needs, highlight the need for integrated, context-aware approaches to safeguard critical infrastructure from evolving software supply chain risks.
comment: 18 pages, 4 figures
☆ Reduction of Test Re-runs by Prioritizing Potential Order Dependent Flaky Tests
Flaky tests can make automated software testing unreliable due to their unpredictable behavior. These tests can pass or fail on the same code base on multiple runs. However, flaky tests often do not refer to any fault, even though they can cause the continuous integration (CI) pipeline to fail. A common type of flaky test is the order-dependent (OD) test. The outcome of an OD test depends on the order in which it is run with respect to other test cases. Several studies have explored the detection and repair of OD tests. However, their methods require re-runs of tests multiple times, that are not related to the order dependence. Hence, prioritizing potential OD tests is necessary to reduce the re-runs. In this paper, we propose a method to prioritize potential order-dependent tests. By analyzing shared static fields in test classes, we identify tests that are more likely to be order-dependent. In our experiment on 27 project modules, our method successfully prioritized all OD tests in 23 cases, reducing test executions by an average of 65.92% and unnecessary re-runs by 72.19%. These results demonstrate that our approach significantly improves the efficiency of OD test detection by lowering execution costs.
☆ Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation
Large language models (LLMs) have advanced code generation at the function level, yet their ability to produce correct class-level implementations in authentic software projects remains poorly understood. This work introduces a novel benchmark derived from open-source repositories, comprising real-world classes divided into seen and unseen partitions to evaluate generalization under practical conditions. The evaluation examines multiple LLMs under varied input specifications, retrieval-augmented configurations, and documentation completeness levels. Results reveal a stark performance disparity: LLMs achieve 84% to 89% correctness on established synthetic benchmarks but only 25% to 34% on real-world class tasks, with negligible differences between familiar and novel codebases. Comprehensive docstrings yield modest gains of 1% to 3% in functional accuracy, though statistical significance is rare. Retrieval-augmented generation proves most effective with partial documentation, improving correctness by 4% to 7% by supplying concrete implementation patterns absent from specifications. Error profiling identifies AttributeError, TypeError, and AssertionError as dominant failure modes (84% of cases), with synthetic tests overemphasizing assertion issues and real-world scenarios highlighting type and attribute mismatches. Retrieval augmentation reduces logical flaws but can introduce dependency conflicts. The benchmark and analysis expose critical limitations in current LLM capabilities for class-level engineering, offering actionable insights for enhancing context modelling, documentation strategies, and retrieval integration in production code assistance tools.
comment: Pre-print prepared for journal submission
☆ Industry Members' Perceptions about ABET-based Accreditation: An Exploratory Study in a Developing Country
ABET accreditation is an increasingly prominent system of global accreditation of engineering programs, and the assessment requires programs to demonstrate that they meet the needs of the program's stakeholders, typically industrial potential employers of graduates. To obtain these inputs, programs are required to assemble an advisory committee board. The views of the advisory board on the relevance of the degree outcomes are an essential part of this process. The purpose of this qualitative research study is to explore the viewpoints that industry stakeholders have on this type of process. The context for the study was an Ecuadorian engineering program which had successfully achieved the ABET accreditation. The study drew on interviews undertaken with industry members who were part of the advisory board. This study focuses on how they perceive the process and the accreditation awarded, analyzing their views of its usefulness, especially in relation to the employability of graduates. Based on the findings, we offer critical insights into this accreditation process when it takes place in contexts beyond highly industrialized countries.
comment: Accepted manuscript version of a paper published in IEEE Transactions on Education (2024). The final version and citation suggested are available on IEEE Xplore at https://doi.org/10.1109/TE.2024.3410996
☆ FGGM: Formal Grey-box Gradient Method for Attacking DRL-based MU-MIMO Scheduler
In 5G mobile communication systems, MU-MIMO has been applied to enhance spectral efficiency and support high data rates. To maximize spectral efficiency while providing fairness among users, the base station (BS) needs to selects a subset of users for data transmission. Given that this problem is NP-hard, DRL-based methods have been proposed to infer the near-optimal solutions in real-time, yet this approach has an intrinsic security problem. This paper investigates how a group of adversarial users can exploit unsanitized raw CSIs to launch a throughput degradation attack. Most existing studies only focused on systems in which adversarial users can obtain the exact values of victims' CSIs, but this is impractical in the case of uplink transmission in LTE/5G mobile systems. We note that the DRL policy contains an observation normalizer which has the mean and variance of the observation to improve training convergence. Adversarial users can then estimate the upper and lower bounds of the local observations including the CSIs of victims based solely on that observation normalizer. We develop an attacking scheme FGGM by leveraging polytope abstract domains, a technique used to bound the outputs of a neural network given the input ranges. Our goal is to find one set of intentionally manipulated CSIs which can achieve the attacking goals for the whole range of local observations of victims. Experimental results demonstrate that FGGM can determine a set of adversarial CSI vector controlled by adversarial users, then reuse those CSIs throughout the simulation to reduce the network throughput of a victim up to 70\% without knowing the exact value of victims' local observations. This study serves as a case study and can be applied to many other DRL-based problems, such as a knapsack-oriented resource allocation problems.
☆ Adding New Capability in Existing Scientific Application with LLM Assistance
With the emergence and rapid evolution of large language models (LLM), automating coding tasks has become an important research topic. Many efforts are underway and literature abounds about the efficacy of models and their ability to generate code. A less explored aspect of code generation is for new algorithms, where the training dataset would not have included any previous example of similar code. In this paper we propose a new methodology for writing code from scratch for a new algorithm using LLM assistance, and describe enhancement of a previously developed code-translation tool, Code-Scribe, for new code generation.
comment: 8 pages, 4 figures, submitted to The 1st International Workshop on Foundational large Language Models Advances for HPC in Asia
♻ ☆ Scalpel: Automotive Deep Learning Framework Testing via Assembling Model Components ICSE 2026
Deep learning (DL) plays a key role in autonomous driving systems. DL models support perception modules, equipped with tasks such as object detection and sensor fusion. These DL models enable vehicles to process multi-sensor inputs to understand complex surroundings. Deploying DL models in autonomous driving systems faces stringent challenges, including real-time processing, limited computational resources, and strict power constraints. To address these challenges, automotive DL frameworks (e.g., PaddleInference) have emerged to optimize inference efficiency. However, these frameworks encounter unique quality issues due to their more complex deployment environments, such as crashes stemming from limited scheduled memory and incorrect memory allocation. Unfortunately, existing DL framework testing methods fail to detect these quality issues due to the failure in deploying generated test input models, as these models lack three essential capabilities: (1) multi-input/output tensor processing, (2) multi-modal data processing, and (3) multi-level data feature extraction. These capabilities necessitate specialized model components, which existing testing methods neglect during model generation. To bridge this gap, we propose Scalpel, an automotive DL frameworks testing method that generates test input models at the model component level. Scalpel generates models by assembling model components (heads, necks, backbones) to support capabilities required by autonomous driving systems. Specifically, Scalpel maintains and updates a repository of model components, generating test inputs by selecting, mutating, and assembling them. Successfully generated models are added back to enrich the repository. Newly generated models are then deployed within the autonomous driving system to test automotive DL frameworks via differential testing.
comment: Accepted by the 48th IEEE/ACM International Conference on Software Engineering (ICSE 2026)
♻ ☆ Wisdom and Delusion of LLM Ensembles for Code Generation and Repair
Today's pursuit of a single Large Language Model (LMM) for all software engineering tasks is resource-intensive and overlooks the potential benefits of complementarity, where different models contribute unique strengths. However, the degree to which coding LLMs complement each other and the best strategy for maximizing an ensemble's potential are unclear, leaving practitioners without a clear path to move beyond single-model systems. To address this gap, we empirically compare ten individual LLMs from five families, and three ensembles of these LLMs across three software engineering benchmarks covering code generation and program repair. We assess the complementarity between models and the performance gap between the best individual model and the ensembles. Next, we evaluate various selection heuristics to identify correct solutions from an ensemble's candidate pool. We find that the theoretical upperbound for an ensemble's performance can be 83% above the best single model. Our results show that consensus-based strategies for selecting solutions fall into a "popularity trap," amplifying common but incorrect outputs. In contrast, a diversity-based strategy realizes up to 95% of this theoretical potential, and proves effective even in small two-model ensembles, enabling a cost-efficient way to enhance performance by leveraging multiple LLMs.
comment: Added Acknowledgments section and hyphenated last names
♻ ☆ Dissect-and-Restore: AI-based Code Verification with Transient Refactoring
Formal verification is increasingly recognized as a critical foundation for building reliable software systems. However, the need for specialized expertise to write precise specifications, navigate complex proof obligations, and learn annotations often makes verification an order of magnitude more expensive than implementation. While modern AI systems can recognize patterns in mathematical proofs and interpret natural language, effectively integrating them into the formal verification process remains an open challenge. We present Prometheus, a novel AI-assisted system that facilitates automated code verification with current AI capabilities in conjunction with modular software engineering principles (e.g., modular refactoring). Our approach begins by decomposing complex program logic, such as nested loops, into smaller, verifiable components. Once verified, these components are recomposed to construct a proof of the original program. This decomposition-recomposition workflow is non-trivial. Prometheus addresses this by guiding the proof search through structured decomposition of complex lemmas into smaller, verifiable sub-lemmas. When automated tools are insufficient, users can provide lightweight natural language guidance to steer the proof process effectively. Our evaluation demonstrates that transiently applying modular restructuring to the code substantially improves the AI's effectiveness in verifying individual components. This approach successfully verifies 86% of tasks in our curated dataset, compared to 68% for the baseline. Gains are more pronounced with increasing specification complexity, improving from 30% to 69%, and when integrating proof outlines for complex programs, from 25% to 87%.
♻ ☆ IntelliRadar: A Comprehensive Platform to Pinpoint Malicious Packages Information from Cyber Intelligence
Malicious packages in public registries pose serious threats to software supply chain security. While current software component analysis (SCA) tools rely on databases like OSV and Snyk to detect these threats, these databases suffer from delayed updates and incomplete coverage. However, they miss intelligence from unstructured sources like social media and developer forums, where new threats are often first reported. This delay extends the lifecycle of malicious packages and increases risks for downstream users. To address this, we developed a novel and comprehensive approach to construct a platform IntelliRadar to collect disclosed malicious package names from unstructured web content. Specifically, by exhaustively searching and snowballing the public sources of malicious package names, and incorporating large language models (LLMs) with domain-specialized Least to Most prompts, IntelliRadar ensures comprehensive collection of historical and current disclosed malicious package names from diverse unstructured sources. As a result, we constructed a comprehensive malicious package database containing 34,313 malicious NPM and PyPI package names. Our evaluation shows that IntelliRadar achieves high performance (97.91% precision) on malicious package intelligence extraction. Compared to existing databases, IntelliRadar identifies 7,542 more malicious package names than OSV and 12,684 more than Snyk. Furthermore, 76.6% of NPM components and 70.3% of PyPI components in IntelliRadar were collected earlier than in Snyk's database. IntelliRadar is also more cost-efficient, with a cost of $0.003 per piece of malicious package intelligence and only $7 per month for continuous monitoring. Furthermore, we identified and received confirmation for 1,981 malicious packages in downstream package manager mirror registries through the IntelliRadar.
♻ ☆ CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases
Given a large and evolving codebase, the ability to automatically generate holistic, architecture-aware documentation that captures not only individual functions but also cross-file, cross-module, and system-level interactions remains an open challenge. Comprehensive documentation is essential for long-term software maintenance and collaboration, yet current automated approaches still fail to model the rich semantic dependencies and architectural structures that define real-world software systems. We present \textbf{CodeWiki}, a unified framework for automated repository-level documentation across seven programming languages. CodeWiki introduces three key innovations: (i) hierarchical decomposition that preserves architectural context across multiple levels of granularity, (ii) recursive multi-agent processing with dynamic task delegation for scalable generation, and (iii) multi-modal synthesis that integrates textual descriptions with visual artifacts such as architecture diagrams and data-flow representations. To enable rigorous evaluation, we introduce \textbf{CodeWikiBench}, a comprehensive benchmark featuring multi-dimensional rubrics and LLM-based assessment protocols. Experimental results show that CodeWiki achieves a 68.79\% quality score with proprietary models, outperforming the closed-source DeepWiki baseline (64.06\%) by 4.73\%, with particularly strong improvements on high-level scripting languages (+10.47\%). We open-source CodeWiki to foster future research and community adoption.
♻ ☆ Forcrat: Automatic I/O API Translation from C to Rust via Origin and Capability Analysis
Translating C to Rust is a promising way to enhance the reliability of legacy system programs. Although the industry has developed an automatic C-to-Rust translator, C2Rust, its translation remains unsatisfactory. One major reason is that C2Rust retains C standard library (libc) function calls instead of replacing them with functions from the Rust standard library (Rust std). However, little work has been done on replacing library functions in C2Rust-generated code. In this work, we focus on replacing the I/O API, an important subset of library functions. This poses challenges due to the semantically different designs of I/O APIs in libc and Rust std. First, the two APIs offer different sets of types that represent the origins (e.g., standard input, files) and capabilities (e.g., read, write) of streams used for I/O. Second, they use different error-checking mechanisms: libc uses internal indicators, while Rust std uses return values. To address these challenges, we propose two static analysis techniques, origin and capability analysis and error source analysis, and use their results to replace the I/O API. Our evaluation shows that the proposed approach is (1) correct, with all 32 programs that have test suites passing the tests after transformation, (2) efficient, analyzing and transforming 422k LOC in 14 seconds, and (3) widely applicable, replacing 82% of I/O API calls.
comment: 12 pages, 3 figures, 3 tables, In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE 2025)
Systems and Control 45
☆ Non-Convex Over-the-Air Heterogeneous Federated Learning: A Bias-Variance Trade-off
Over-the-air (OTA) federated learning (FL) has been well recognized as a scalable paradigm that exploits the waveform superposition of the wireless multiple-access channel to aggregate model updates in a single use. Existing OTA-FL designs largely enforce zero-bias model updates by either assuming \emph{homogeneous} wireless conditions (equal path loss across devices) or forcing zero-bias updates to guarantee convergence. Under \emph{heterogeneous} wireless scenarios, however, such designs are constrained by the weakest device and inflate the update variance. Moreover, prior analyses of biased OTA-FL largely address convex objectives, while most modern AI models are highly non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient descent (SGD) for general smooth non-convex objectives under wireless heterogeneity. We develop novel OTA-FL SGD updates that allow a structured, time-invariant model bias while facilitating reduced variance updates. We derive a finite-time stationarity bound (expected time average squared gradient norm) that explicitly reveals a bias-variance trade-off. To optimize this trade-off, we pose a non-convex joint OTA power-control design and develop an efficient successive convex approximation (SCA) algorithm that requires only statistical CSI at the base station. Experiments on a non-convex image classification task validate the approach: the SCA-based design accelerates convergence via an optimized bias and improves generalization over prior OTA-FL baselines.
☆ Time-Optimal Model Predictive Control for Linear Systems with Multiplicative Uncertainties
This paper presents a time-optimal Model Predictive Control (MPC) scheme for linear discrete-time systems subject to multiplicative uncertainties represented by interval matrices. To render the uncertainty propagation computationally tractable, the set-valued error system dynamics are approximated using a matrix-zonotope-based bounding operator. Recursive feasibility and finite-time convergence are ensured through an adaptive terminal constraint mechanism. A key advantage of the proposed approach is that all the necessary bounding sets can be computed offline, substantially reducing the online computational burden. The effectiveness of the method is illustrated via a numerical case study on an orbital rendezvous maneuver between two satellites.
☆ Pareto-Optimal Sampling and Resource Allocation for Timely Communication in Shared-Spectrum Low-Altitude Networks
Guaranteeing stringent data freshness for low-altitude unmanned aerial vehicles (UAVs) in shared spectrum forces a critical trade-off between two operational costs: the UAV's own energy consumption and the occupation of terrestrial channel resources. The core challenge is to satisfy the aerial data freshness while finding a Pareto-optimal balance between these costs. Leveraging predictive channel models and predictive UAV trajectories, we formulate a bi-objective Pareto optimization problem over a long-term planning horizon to jointly optimize the sampling timing for aerial traffic and the power and spectrum allocation for fair coexistence. However, the problem's non-convex, mixed-integer nature renders classical methods incapable of fully characterizing the complete Pareto frontier. Notably, we show monotonicity properties of the frontier, building on which we transform the bi-objective problem into several single-objective problems. We then propose a new graph-based algorithm and prove that it can find the complete set of Pareto optima with low complexity, linear in the horizon and near-quadratic in the resource block (RB) budget. Numerical comparisons show that our approach meets the stringent timeliness requirement and achieves a six-fold reduction in RB utilization or a 6 dB energy saving compared to benchmarks.
☆ Graph approach for observability analysis in power system dynamic state estimation
The proposed approach yields a numerical method that provably executes in linear time with respect to the number of nodes and edges in a graph. The graph, constructed from the power system model, requires only knowledge of the dependencies between state-to-state and output-to-state variables within a state-space framework. While graph-based observability analysis methods exist for power system static-state estimation, the approach presented here is the first for dynamic-state estimation (DSE). We examine decentralized and centralized DSE scenarios and compare our findings with a well-established, albeit non-scalable, observability analysis method in the literature. When compared to the latter in a centralized DSE setting, our method reduced computation time by 1440x.
☆ Statistically Adaptive Differential Protection for AC Microgrids Based on Kullback-Leibler Divergence
The proliferation of inverter-based resources challenges traditional microgrid protection by introducing variable fault currents and complex transients. This paper presents a statistically adaptive differential protection scheme based on Kullback-Leibler divergence, implemented via a Bartlett-corrected G-statistic computed on logarithm-transformed current magnitudes. The method is a multivariate fault detection engine that employs the Mahalanobis distance to distinguish healthy and faulty states, enabling robust detection even in noisy environments. Detection thresholds are statistically derived from a chi-squared distribution for precise control over the false alarm rate. Upon detection, a lightweight classifier identifies the fault type by assessing per-phase G-statistics against dedicated thresholds, enhanced by a temporal persistence filter for security. Extensive simulations on a modified CIGRE 14-bus microgrid show high efficacy: sub-cycle average detection delays, high detection and classification accuracy across operating modes, resilience to high-impedance faults up to 250 Ohms, tolerance to 10 ms communication delay, and noise levels down to a 20 dB signal-to-noise ratio. These findings demonstrate a reproducible and computationally efficient solution for next-generation AC microgrid protection.
☆ Agentic AI Home Energy Management System: A Large Language Model Framework for Residential Load Scheduling
The electricity sector transition requires substantial increases in residential demand response capacity, yet Home Energy Management Systems (HEMS) adoption remains limited by user interaction barriers requiring translation of everyday preferences into technical parameters. While large language models have been applied to energy systems as code generators and parameter extractors, no existing implementation deploys LLMs as autonomous coordinators managing the complete workflow from natural language input to multi-appliance scheduling. This paper presents an agentic AI HEMS where LLMs autonomously coordinate multi-appliance scheduling from natural language requests to device control, achieving optimal scheduling without example demonstrations. A hierarchical architecture combining one orchestrator with three specialist agents uses the ReAct pattern for iterative reasoning, enabling dynamic coordination without hardcoded workflows while integrating Google Calendar for context-aware deadline extraction. Evaluation across three open-source models using real Austrian day-ahead electricity prices reveals substantial capability differences. Llama-3.3-70B successfully coordinates all appliances across all scenarios to match cost-optimal benchmarks computed via mixed-integer linear programming, while other models achieve perfect single-appliance performance but struggle to coordinate all appliances simultaneously. Progressive prompt engineering experiments demonstrate that analytical query handling without explicit guidance remains unreliable despite models' general reasoning capabilities. We open-source the complete system including orchestration logic, agent prompts, tools, and web interfaces to enable reproducibility, extension, and future research.
comment: 34 pages, 9 figures. Code available at https://github.com/RedaElMakroum/agentic-ai-hems
☆ Optimal Bidding and Coordinated Dispatch of Hybrid Energy Systems in Regulation Markets
The increasing integration of renewable energy sources and distributed energy resources (DER) into modern power systems introduces significant uncertainty, posing challenges for maintaining grid flexibility and reliability. Hybrid energy systems (HES), composed of controllable generators, flexible loads, and battery storage, offer a decentralized solution to enhance flexibility compared to single centralized resources. This paper presents a two-level framework to enable HES participation in frequency regulation markets. The upper level performs a chance-constrained optimization to choose capacity bids based on historical regulation signals. At the lower level, a real-time control strategy disaggregates the regulation power among the constituent resources. This real-time control strategy is then benchmarked against an offline optimal dispatch to evaluate flexibility performance. Additionally, the framework evaluates the profitability of overbidding strategies and identifies thresholds beyond which performance degradation may lead to market penalties or disqualification. The proposed framework also compare the impact of imbalance of power capacities on performance and battery state of charge (SoC) through asymmetric HES configurations.
☆ Two-Timescale Optimization Framework for IAB-Enabled Heterogeneous UAV Networks
In post-disaster scenarios, the rapid deployment of adequate communication infrastructure is essential to support disaster search, rescue, and recovery operations. To achieve this, uncrewed aerial vehicle (UAV) has emerged as a promising solution for emergency communication due to its low cost and deployment flexibility. However, conventional untethered UAV (U-UAV) is constrained by size, weight, and power (SWaP) limitations, making it incapable of maintaining the operation of a macro base station. To address this limitation, we propose a heterogeneous UAV-based framework that integrates tethered UAV (T-UAV) and U-UAVs, where U-UAVs are utilized to enhance the throughput of cell-edge ground user equipments (G-UEs) and guarantee seamless connectivity during G-UEs' mobility to safe zones. It is noted that the integrated access and backhaul (IAB) technique is adopted to support the wireless backhaul of U-UAVs. Accordingly, we formulate a two-timescale joint user scheduling and trajectory control optimization problem, aiming to maximize the downlink throughput under asymmetric traffic demands and G-UEs' mobility. To solve the formulated problem, we proposed a two-timescale multi-agent deep deterministic policy gradient (TTS-MADDPG) algorithm based on the centralized training and distributed execution paradigm. Numerical results show that the proposed algorithm outperforms other benchmarks, including the two-timescale multi-agent proximal policy optimization (TTS-MAPPO) algorithm and MADDPG scheduling method, with robust and higher throughput. Specifically, the proposed algorithm obtains up to 12.2\% average throughput gain compared to the MADDPG scheduling method.
☆ Proxemics and Permeability of the Pedestrian Group
People tend to walk in groups, and interactions with those groups have a significant impact on crowd behavior and pedestrian traffic dynamics. Social norms can be seen as unwritten rules regulating people interactions in social settings. This article studies people interactions with groups and the emergence of group proxemics. Group zones, zone occupancy counts and people clearance from the group are studied using naturalistic data. Analysis indicate potential presence of three different zones in addition to the public zone. People tend to remain in the public zone and only progressively get closer to groups, and those closer approaches happen in a low frequency and for brief periods of time.
☆ Life-cycle Modeling and the Walking Behavior of the Pedestrian-Group as an Emergent Agent: With Empirical Data on the Cohesion of the Group Formation
This article investigates the pedestrian group as an emergent agent. The article explores empirical data to derive emergent agency and formation state spaces and outline recurring patterns of walking behavior. In this analysis, pedestrian trajectories extracted from surveillance videos are used along with manually annotated pedestrian group memberships. We conducted manual expert evaluation of observed groups, produced new manual annotations for relevant events pertaining to group behavior and extracted metrics relevant group formation. This information along with quantitative analysis was used to model the life-cycle and formation of the group agent. Those models give structure to expectations around walking behavior of groups; from pedestrian walking independently to the emergence of a collective intention where group members tended to maintain bounded distance between each other. Disturbances to this bounded distance often happened in association with changes in either their agency or their formation states. We summarized the patterns of behavior along with the sequences of state transitions into abstract patterns, which can aid in the development of more detailed group agents in simulation and in the design of engineering systems to interact with such groups.
☆ Efficient Collision-Avoidance Constraints for Ellipsoidal Obstacles in Optimal Control: Application to Path-Following MPC and UAVs
This article proposes a modular optimal control framework for local three-dimensional ellipsoidal obstacle avoidance, exemplarily applied to model predictive path-following control. Static as well as moving obstacles are considered. Central to the approach is a computationally efficient and continuously differentiable condition for detecting collisions with ellipsoidal obstacles. A novel two-stage optimization approach mitigates numerical issues arising from the structure of the resulting optimal control problem. The effectiveness of the approach is demonstrated through simulations and real-world experiments with the Crazyflie quadrotor. This represents the first hardware demonstration of an MPC controller of this kind for UAVs in a three-dimensional task.
☆ Safety Margins of Inverse Optimal ISSf Controllers
We investigate the gain margin of a general nonlinear system under an inverse optimal input-to-state safe (ISSf) controller of the form u=u0(x)+u*(x,u0), where u0 is the nominal control and u* is the inverse optimal safety filter that minimally modifies the nominal controller's unsafe actions over the infinite horizon. By first establishing a converse ISSf-BF theorem, we reveal the equivalence among the achievability of ISSf by feedback, the achievability of inverse optimality, and the solvability of a Hamilton-Jacobi-Isaacs equation associated with the inverse optimal ISSf gain assignment. Then we develop a collection of safety margin results on the overall control u=u0+u*. In the absence of disturbances, we find that standard inverse optimal safe controllers have a certain degree of gain margin. Specifically, when f(x) acts safely but u0 acts unsafely, the gain can be decreased by up to half; and when f(x) acts unsafely, we establish that, if u0 acts safely, the gain can be increased arbitrarily, whereas if u0 acts unsafely, the control recovers the full gain margin [1/2,inf). It is shown, however, that under control gain variation, the safe set of these controllers is locally asymptotically stable, which implies that their safety is sensitive to large but bounded disturbances. To make inverse optimal ISSf controllers robust to gain variation, we propose a gain margin improvement approach at the expense of an increased control effort. This improvement allows the inverse optimal safe control to inherit the standard gain margin of [1/2,inf) without requiring prior knowledge of whether f(x) or u0 acts safely on the safety boundary, while simultaneously ensuring global asymptotic stability of the resulting safe set. In the presence of disturbances, this improvement idea renders inverse optimal ISSf controllers robust to gain variations with the same gain margin of [1/2,inf).
☆ XWAVE: A Novel Software-Defined Everything Approach for the Manufacturing Industry
The manufacturing sector is moving from rigid, hardware-dependent systems toward flexible, software-driven environments. This transformation is shaped by the convergence of several Software-Defined technologies: Software-Defined Automation virtualizes industrial control, replacing proprietary PLCs with containerized, programmable solutions that enable scalability and interoperability. Software-Defined Compute and Communications provide a means to distribute intelligence seamlessly across devices, networks, and cloud platforms, reducing latency and enabling dynamic reconfiguration. Software-Defined Manufacturing Systems, usually implemented as Digital Twins, are real-time virtual models of machines and processes, allowing predictive analysis, optimization, and closer integration between human operators and intelligent systems. This work presents XWAVE, a project that unites these three Software-Defined paradigms to present a modular, fully software-defined manufacturing system.
☆ Command-filter-based trajectory-tracking control of quadrotor subject to internal and external disturbances
We propose a command-filter backstepping controller that integrates a disturbance observer and a high-gain observer (HGO) to handle unknown internal and external disturbances acting on a quadrotor. To build the controller, we first define tracking errors between the measured and desired quadrotor outputs, which allow the system to be rewritten in a new set of state variables. Using this transformed model, we apply Lyapunov theory to derive a backstepping control law. To avoid repeated differentiation of states and virtual controls, a first-order command filter is introduced, and a nonlinear disturbance observer is added to provide disturbance estimates. Each state in the controller and observer is replaced with its estimate from the HGO. The resulting control law enables the quadrotor to follow its path despite internal and external disturbances, with each subsystem allowed its own disturbance type for realism. A new state transformation and Lyapunov-based derivation prevent the usual explosion of complexity, while the HGO reconstructs unmeasured states and their rates for output feedback. The nonlinear disturbance observer attenuates constant and nonlinear disturbances as well as band-limited white noise. The method reduces dependence on high-precision sensors and mitigates wind, model error, and rotor noise effects during flight. Unlike previous studies that treat either disturbance rejection or partial sensing, this work combines the command filter, disturbance observer, and HGO to address both challenges simultaneously while avoiding the complexity growth typical of backstepping designs.
☆ Cooperative Task Spaces for Multi-Arm Manipulation Control based on Similarity Transformations
Many tasks in human environments require collaborative behavior between multiple kinematic chains, either to provide additional support for carrying big and bulky objects or to enable the dexterity that is required for in-hand manipulation. Since these complex systems often have a very high number of degrees of freedom coordinating their movements is notoriously difficult to model. In this article, we present the derivation of the theoretical foundations for cooperative task spaces of multi-arm robotic systems based on geometric primitives defined using conformal geometric algebra. Based on the similarity transformations of these cooperative geometric primitives, we derive an abstraction of complex robotic systems that enables representing these systems in a way that directly corresponds to single-arm systems. By deriving the associated analytic and geometric Jacobian matrices, we then show the straightforward integration of our approach into classical control techniques rooted in operational space control. We demonstrate this using bimanual manipulators, humanoids and multi-fingered hands in optimal control experiments for reaching desired geometric primitives and in teleoperation experiments using differential kinematics control. We then discuss how the geometric primitives naturally embed nullspace structures into the controllers that can be exploited for introducing secondary control objectives. This work, represents the theoretical foundations of this cooperative manipulation control framework, and thus the experiments are presented in an abstract way, while giving pointers towards potential future applications.
☆ From Embedding to Control: Representations for Stochastic Multi-Object Systems
This paper studies how to achieve accurate modeling and effective control in stochastic nonlinear dynamics with multiple interacting objects. However, non-uniform interactions and random topologies make this task challenging. We address these challenges by proposing \textit{Graph Controllable Embeddings} (GCE), a general framework to learn stochastic multi-object dynamics for linear control. Specifically, GCE is built on Hilbert space embeddings, allowing direct embedding of probability distributions of controlled stochastic dynamics into a reproducing kernel Hilbert space (RKHS), which enables linear operations in its RKHS while retaining nonlinear expressiveness. We provide theoretical guarantees on the existence, convergence, and applicability of GCE. Notably, a mean field approximation technique is adopted to efficiently capture inter-object dependencies and achieve provably low sample complexity. By integrating graph neural networks, we construct data-dependent kernel features that are capable of adapting to dynamic interaction patterns and generalizing to even unseen topologies with only limited training instances. GCE scales seamlessly to multi-object systems of varying sizes and topologies. Leveraging the linearity of Hilbert spaces, GCE also supports simple yet effective control algorithms for synthesizing optimal sequences. Experiments on physical systems, robotics, and power grids validate GCE and demonstrate consistent performance improvement over various competitive embedding methods in both in-distribution and few-shot tests
☆ Design of Orthogonal Phase of Arrival Positioning Scheme Based on 5G PRS and Optimization of TOA Performance
This study analyzes the performance of positioning techniques based on configuration changes of 5G New Radio signals. In 5G networks, a terminal position is determined from the Time of Arrival of Positioning Reference Signals transmitted by base stations. We propose an algorithm that improves TOA accuracy under low sampling rate constraints and implement 5G PRS for positioning in a software defined modem. We also examine how flexible time frequency resource allocation of PRS affects TOA estimation accuracy and discuss optimal PRS configurations for a given signal environment.
☆ Confidential FRIT via Homomorphic Encryption
Edge computing alleviates the computation burden of data-driven control in cyber-physical systems (CPSs) by offloading complex processing to edge servers. However, the increasing sophistication of cyberattacks underscores the need for security measures that go beyond conventional IT protections and address the unique vulnerabilities of CPSs. This study proposes a confidential data-driven gain-tuning framework using homomorphic encryption, such as ElGamal and CKKS encryption schemes, to enhance cybersecurity in gain-tuning processes outsourced to external servers. The idea for realizing confidential FRIT is to replace the matrix inversion operation with a vector summation form, allowing homomorphic operations to be applied. Numerical examples under 128-bit security confirm performance comparable to conventional methods while providing guidelines for selecting suitable encryption schemes for secure CPS.
☆ Green Wireless Network Scaling for Joint Deployment: Multi-BSs or Multi-RISs?
The imminent emergence of sixth-generation (6G) networks faces critical challenges from spatially heterogeneous traffic and escalating energy consumption, necessitating sustainable scaling strategies for network infrastructure such as base stations (BSs) and reconfigurable intelligent surfaces (RISs). This paper establishes fundamental scaling laws for the Integrated Relative Energy Efficiency (IREE) metric under joint multi-BS and multi-RIS deployment in traffic-mismatched scenarios. Specifically, we propose an Alternating Directional Dual-Radial Basis Function (ADD-RBF) framework that models the channels of BSs and RISs as two type of spatially decoupled RBF neurons to maximize IREE through alternative optimization, with proven universal approximation capability and convergence guarantees. Theoretical analysis reveals a scaling dichotomy: BS proliferation drives logarithmic capacity growth $\mathcal{O}(\log N^{BS})$ but only polynomial mismatch reduction $\mathcal{O}(1/\sqrt{N^{BS}})$, whereas RIS deployment achieves exponential mismatch mitigation $\mathcal{O}(\delta_{\text{err}}^{-(N^R+1)})$ despite its sub-logarithmic capacity gains. Simulation results validate that RISs excel in capturing spatial traffic correlations and alleviating hotspots, making them particularly effective when mismatch dominates, while BSs are preferable under capacity shortages. These findings offer practical guidelines for green 6G network design.
☆ A Scenario-Based Approach for Stochastic Economic Model Predictive Control with an Expected Shortfall Constraint
This paper presents a novel approach to stochastic economic model predictive control (SEMPC) that minimizes average economic cost while satisfying an empirical expected shortfall (EES) constraint to manage risk. A new scenario-based problem formulation ensuring controlled risk with high confidence while minimizing the average cost is introduced. The probabilistic guarantees is dependent on the number of support elements over the entire input domain, which is difficult to find for high-dimensional systems. A heuristic algorithm is proposed to find the number of support elements. Finally, an efficient method is presented to reduce the computational complexity of the SEMPC problem with an EES constraint. The approach is validated on a water distribution network, showing its effectiveness in balancing performance and risk.
☆ Competitive Equilibrium for Electricity Markets with Spatially Flexible Load
Electric vehicle charging and geo-distributed datacenters introduce spatially flexible loads (FLs) that couple power, transportation, and datacenter networks. These couplings create a closed-loop feedback between locational marginal prices (LMPs) and decisions of the FL systems, challenging the foundations of conventional competitive equilibrium (CE) in electricity markets. This paper studies a notion of generalized competitive equilibrium (GCE) that aims to capture such price-demand interactions across the interconnected infrastructures. We establish structural conditions under which the GCE preserves key properties of the conventional CE, including existence, uniqueness, and efficiency, without requiring detailed knowledge of decision processes for individual FL systems. The framework generalizes to settings where the grid is coupled with multiple FL systems. Stylized examples and case studies on the New York ISO grid, coupled with the Sioux Falls transportation and distributed datacenter networks, demonstrate the use of our theoretical framework and illustrate the mutual influence among the grid and the studied FL systems.
☆ SUSTAINABLE Platform: Seamless Smart Farming Integration Towards Agronomy Automation SC2
The global agricultural sector is undergoing a transformative shift, driven by increasing food demands, climate variability and the need for sustainable practices. SUSTAINABLE is a smart farming platform designed to integrate IoT, AI, satellite imaging, and role-based task orchestration to enable efficient, traceable, and sustainable agriculture with a pilot usecase in viticulture. This paper explores current smart agriculture solutions, presents a comparative evaluation, and introduces SUSTAINABLE's key features, including satellite index integration, real-time environmental data, and role-aware task management tailored to Mediterranean vineyards.
comment: Accepted for presentation to 11th IEEE International Smart Cities Conference (ISC2 2025)
☆ Dispatchable Current Source Virtual Oscillator Control Achieving Global Stability
This work introduces a novel dispatchable current source virtual oscillator control (dCVOC) scheme for grid-following (GFL) converters, which exhibits duality with dispatchable virtual oscillator control (dVOC) in two ways: a) the current frequency is generated through reactive power control, similar to a PLL ; b) the current magnitude reference is generated through active power control. We formally prove that our proposed control always admits a steady-state equilibrium and ensures global stability under reasonable conditions on grid and converter parameters, even when considering LVRT and current saturation constraints. Our approach avoids low-voltage transients and weak grid instability, which is not the case for conventional GFL control. The effectiveness of our proposed control is verified through high-fidelity electromagnetic transient simulations.
☆ Quantitative Parameter Conditions for Stability and Coupling in GFM-GFL Converter Hybrid Systems from a Small-Signal Synchronous Perspective
With the development of renewable energy sources, power systems are gradually evolving into a system comprising both grid-forming (GFM) and grid-following (GFL) converters. However, the dynamic interaction between the two types of converters, especially low-inertia GFM converters and GFL converters, remains unclear due to the substantial differences in their synchronization mechanisms. To address this gap, this paper develops a small-signal synchronous stability model for power systems containing GFM and GFL converters, which considers network line dynamics. Based on subspace perturbation theory, we reveal that GFM and GFL subsystems can be effectively decoupled when GFL converters operate near unity power factor or when GFM converters possess sufficiently large inertia or damping, and provide lower bound of control parameters ensuring decoupling. Under the decoupling condition, we propose decentralized and analytical parameter-based stability criteria which have clear physical interpretations: the positive damping of converters compensates for the negative damping of the network. In the case of coupling, we also propose decentralized stability criteria based on the small phase theorem. The effectiveness of the theoretical analysis is validated through simulations in MATLAB/Simulink.
☆ Adaptive Control for a Physics-Informed Model of a Thermal Energy Distribution System: Qualitative Analysis
Integrated energy systems (IES) are complex heterogeneous architectures that typically encompass power sources, hydrogen electrolyzers, energy storage, and heat exchangers. This integration is achieved through operating control strategy optimization. However, the lack of physical understanding as to how these systems evolve over time introduces uncertainties that hinder reliable application thereof. Techniques that can accommodate such uncertainties are fundamental for ensuring proper operation of these systems. Unfortunately, no unifying methodology exists for accommodating uncertainties in this regard. That being said, adaptive control (AC) is a discipline that may allow for accommodating such uncertainties in real-time. In the present work, we derive an AC formulation for linear systems in which all states are observable and apply it to the control of a glycol heat exchanger (GHX) in an IES. Based on prior research in which we quantified the uncertainties of the GHXs system dynamics, we introduced an error of 50% on four terms of the nominal model. In the case where a linear quadratic regulator is used as the nominal control for the reference system, we found that employing AC can reduce the mean absolute error and integral time absolute error by a factor of 30%-75%. This reduction is achieved with minimal computing overhead and control infrastructure, thus underscoring the strength of AC. However, the control effort induced is significant, therefore warranting further study in order to estimate its impact on a physical system. To address further challenges, including partially observable and non-linear dynamics, enhancements of the linear formulation are currently being developed.
☆ Quantifying Grid-Forming Behavior: Bridging Device-level Dynamics and System-Level Strength
Grid-forming (GFM) technology is widely regarded as a promising solution for future power systems dominated by power electronics. However, a precise method for quantifying GFM converter behavior and a universally accepted GFM definition remain elusive. Moreover, the impact of GFM on system stability is not precisely quantified, creating a significant disconnect between device and system levels. To address these gaps from a small-signal perspective, at the device level, we introduce a novel metric, the Forming Index (FI) to quantify a converter's response to grid voltage fluctuations. Rather than enumerating various control architectures, the FI provides a metric for the converter's GFM ability by quantifying its sensitivity to grid variations. At the system level, we propose a new quantitative measure of system strength that captures the multi-bus voltage stiffness, which quantifies the voltage and phase angle responses of multiple buses to current or power disturbances. We further extend this concept to grid strength and bus strength to identify weak areas within the system. Finally, we bridge the device and system levels by formally proving that GFM converters enhance system strength. Our proposed framework provides a unified benchmark for GFM converter design, optimal placement, and system stability assessment.
☆ Ferrohydrodynamic Microfluidics for Bioparticle Separation and Single-Cell Phenotyping: Principles, Applications, and Emerging Directions
Ferrohydrodynamic microfluidics relies on magnetic field gradients to manipulate diamagnetic particles in ferrofluid-filled microenvironments. It has emerged as a promising tool for label-free manipulation of bioparticles, including their separation and phenotyping. This perspective reviews recent progress in the development and applications of ferrofluid-based microfluidic platforms for multiscale bioparticle separation, ranging from micron-scale cells to submicron extracellular vesicles. We highlight the fundamental physical principles for ferrohydrodynamic manipulation, including the dominant magnetic buoyancy force resulting from the interaction of ferrofluids and particles. We then describe how these principles enable high-resolution size-based bioparticle separation, subcellular bioparticle enrichment, and phenotypic screening based on physical traits. We also discuss key challenges in ferrohydrodynamic microfluidics from the aspects of ferrofluid biocompatibility, system throughput, and nanoparticle depletion. Finally, we outline future research directions involving machine learning, 3D printing, and multiplexed detection. These insights chart a path for advancing ferrofluid-based technologies in precision biomedicine, diagnostics, and cellular engineering.
☆ Cooperative Integrated Estimation-Guidance for Simultaneous Interception of Moving Targets
This paper proposes a cooperative integrated estimation-guidance framework for simultaneous interception of a non-maneuvering target using a team of unmanned autonomous vehicles, assuming only a subset of vehicles are equipped with dedicated sensors to measure the target's states. Unlike earlier approaches that focus solely on either estimation or guidance design, the proposed framework unifies both within a cooperative architecture. To circumvent the limitation posed by heterogeneity in target observability, sensorless vehicles estimate the target's state by leveraging information exchanged with neighboring agents over a directed communication topology through a prescribed-time observer. The proposed approach employs true proportional navigation guidance (TPNG), which uses an exact time-to-go formulation and is applicable across a wide spectrum of target motions. Furthermore, prescribed-time observer and controller are employed to achieve convergence to true target's state and consensus in time-to-go within set predefined times, respectively. Simulations demonstrate the effectiveness of the proposed framework under various engagement scenarios.
☆ Finite Sample MIMO System Identification with Multisine Excitation: Nonparametric, Direct, and Two-step Parametric Estimators
Multisine excitations are widely used for identifying multi-input multi-output systems due to their periodicity, data compression properties, and control over the input spectrum. Despite their popularity, the finite sample statistical properties of frequency-domain estimators under multisine excitation, for both nonparametric and parametric settings, remain insufficiently understood. This paper develops a finite-sample statistical framework for least-squares estimation of the frequency response function (FRF) and its implications for parametric modeling. First, we derive exact distributional and covariance properties of the FRF estimator, explicitly accounting for aliasing effects under slow sampling regimes, and establish conditions for unbiasedness, uncorrelatedness, and consistency across multiple experiments. Second, we show that the FRF estimate is a sufficient statistic for any parametric model under Gaussian noise, leading to an exact equivalence between optimal two stage frequency-domain methods and time-domain prediction error and maximum likelihood estimation. This equivalence is shown to yield finite-sample concentration bounds for parametric maximum likelihood estimators, enabling rigorous uncertainty quantification, and closed-form prediction error method estimators without iterative optimization. The theoretical results are demonstrated in a representative case study.
comment: 16 pages, 4 figures
♻ ☆ Recursive Experiment Design for Closed-Loop Identification with Output Perturbation Limits
In many applications, system identification experiments must be performed under output feedback to ensure safety or to maintain system operation. In this paper, we consider the online design of informative experiments for ARMAX models by applying a bounded perturbation to the input signal generated by a fixed output feedback controller. Specifically, the design constrains the resulting output perturbation within user-specified limits and can be efficiently computed in closed form. We demonstrate the effectiveness of the method in a numerical experiment.
♻ ☆ Optimal and Heuristic Approaches for Platooning Systems with Deadlines
Efficient truck platooning is a key strategy for reducing freight costs, lowering fuel consumption, and mitigating emissions. Deadlines are critical in this context, as trucks must depart within specific time windows to meet delivery requirements and avoid penalties. In this paper, we investigate the optimal formation and dispatch of truck platoons at a highway station with finite capacity $L$ and deadline constraints $T$. The system operates in discrete time, with each arriving truck assigned a deadline of $T$ slot units. The objective is to leverage the efficiency gains from forming large platoons while accounting for waiting costs and deadline violations. We formulate the problem as a Markov decision process and analyze the structure of the optimal policy $\pi^\star$ for $L = 3$, extending insights to arbitrary $L$. We prove certain monotonicity properties of the optimal policy in the state space $\mathcal{S}$ and identify classes of unreachable states. Moreover, since the size of $\mathcal{S}$ grows exponentially with $L$ and $T$, we propose heuristics -- including conditional and deep-learning based approaches -- that exploit these structural insights while maintaining low computational complexity.
♻ ☆ Convex computation of regions of attraction from data using Sums-of-Squares programming
The paper concentrates on the analysis of the Region of Attraction (RoA) for unknown autonomous dynamical systems. The aim is to explore a data-driven approach based on moment Sum-of-Squares (SoS) hierarchy, which enables novel RoA outer approximations despite the reduced information on the structure of the dynamics. The main contribution of this work is bypassing the system model and, consequently, the recurring constraint on its polynomial structure. Numerical experimentation showcases the influence of data on learned approximating sets, offering a promising outlook on the potential of this method.
♻ ☆ Smart Exploration in Reinforcement Learning using Bounded Uncertainty Models
Reinforcement learning (RL) is a powerful framework for decision-making in uncertain environments, but it often requires large amounts of data to learn an optimal policy. We address this challenge by incorporating prior model knowledge to guide exploration and accelerate the learning process. Specifically, we assume access to a model set that contains the true transition kernel and reward function. We optimize over this model set to obtain upper and lower bounds on the Q-function, which are then used to guide the exploration of the agent. We provide theoretical guarantees on the convergence of the Q-function to the optimal Q-function under the proposed class of exploring policies. Furthermore, we also introduce a data-driven regularized version of the model set optimization problem that ensures the convergence of the class of exploring policies to the optimal policy. Lastly, we show that when the model set has a specific structure, namely the bounded-parameter MDP (BMDP) framework, the regularized model set optimization problem becomes convex and simple to implement. In this setting, we also prove finite-time convergence to the optimal policy under mild assumptions. We demonstrate the effectiveness of the proposed exploration strategy, which we call BUMEX (Bounded Uncertainty Model-based Exploration), in a simulation study. The results indicate that the proposed method can significantly accelerate learning in benchmark examples. A toolbox is available at https://github.com/JvHulst/BUMEX.
comment: Accepted for Presentation at 64th IEEE Conference on Decision and Control, CDC 2025, Rio de Janeiro, Brazil, 2025
♻ ☆ Agile and Cooperative Aerial Manipulation of a Cable-Suspended Load
Quadrotors can carry slung loads to hard-to-reach locations at high speed. Since a single quadrotor has limited payload capacities, using a team of quadrotors to collaboratively manipulate a heavy object is a scalable and promising solution. However, existing control algorithms for multi-lifting systems only enable low-speed and low-acceleration operations due to the complex dynamic coupling between quadrotors and the load, limiting their use in time-critical missions such as search and rescue. In this work, we present a solution to significantly enhance the agility of cable-suspended multi-lifting systems. Unlike traditional cascaded solutions, we introduce a trajectory-based framework that solves the whole-body kinodynamic motion planning problem online, accounting for the dynamic coupling effects and constraints between the quadrotors and the load. The planned trajectory is provided to the quadrotors as a reference in a receding-horizon fashion and is tracked by an onboard controller that observes and compensates for the cable tension. Real-world experiments demonstrate that our framework can achieve at least eight times greater acceleration than state-of-the-art methods to follow agile trajectories. Our method can even perform complex maneuvers such as flying through narrow passages at high speed. Additionally, it exhibits high robustness against load uncertainties and does not require adding any sensors to the load, demonstrating strong practicality.
comment: 38 pages, 11 figures
♻ ☆ SafEDMD: A Koopman-based data-driven controller design framework for nonlinear dynamical systems
The Koopman operator serves as the theoretical backbone for machine learning of dynamical control systems, where the operator is heuristically approximated by extended dynamic mode decomposition (EDMD). In this paper, we propose SafEDMD, a novel stability- and feedback-oriented EDMD-based controller design framework. Our approach leverages a reliable surrogate model generated in a data-driven fashion in order to provide closed-loop guarantees. In particular, we establish a controller design based on semi-definite programming with guaranteed stabilization of the underlying nonlinear system. As central ingredient, we derive proportional error bounds that vanish at the origin and are tailored to control tasks. We illustrate the developed method by means of several benchmark examples and highlight the advantages over state-of-the-art methods.
comment: Accepted for publication in Automatica
♻ ☆ High Performance Distributed Control for Large-Scale Linear Systems: A Cover-Based Distributed Observer Approach
In recent years, the distributed-observer-based distributed control law has shown powerful ability to arbitrarily approximate the centralized control performance. However, the traditional distributed observer requires each local observer to reconstruct the state information of the whole system, which is unrealistic for large-scale scenarios. To fill this gap, This paper presents a coverage solution algorithm for large-scale systems that accounts for both physical and communication network characteristics, which can significantly reduce the dimension of local observers. Then, the cover-based distributed observer for large-scale systems is proposed to overcome the problem that the system dynamics are difficult to estimate due to the coupling between cover sets. Furthermore, the two-layer Lyapunov analysis method is adopted and the dynamic transformation lemma of compact errors is proved, which solves the problem of analyzing stability of the error dynamic of the cover-based distributed observer. Finally, it is proved that the distributed control law based on the cover-based distributed observer can also arbitrarily approximate the control performance of the centralized control law, and the dimension of the local observer is greatly reduced compared with the traditional method. The simulation results show the validity of the developed theories.
♻ ☆ End-to-end guarantees for indirect data-driven control of bilinear systems with finite stochastic data
In this paper we propose an end-to-end algorithm for indirect data-driven control for bilinear systems with stability guarantees. We consider the case where the collected i.i.d. data is affected by probabilistic noise with possibly unbounded support and leverage tools from statistical learning theory to derive finite sample identification error bounds. To this end, we solve the bilinear identification problem by solving a set of linear and affine identification problems, by a particular choice of a control input during the data collection phase. We provide a priori as well as data-dependent finite sample identification error bounds on the individual matrices as well as ellipsoidal bounds, both of which are structurally suitable for control. Further, we integrate the structure of the derived identification error bounds in a robust controller design to obtain an exponentially stable closed-loop. By means of an extensive numerical study we showcase the interplay between the controller design and the derived identification error bounds. Moreover, we note appealing connections of our results to indirect data-driven control of general nonlinear systems through Koopman operator theory and discuss how our results may be applied in this setup.
♻ ☆ Data-Driven Stabilization Using Prior Knowledge on Stabilizability and Controllability
In this work, we study data-driven stabilization of linear time-invariant systems using prior knowledge of system-theoretic properties, specifically stabilizability and controllability. To formalize this, we extend the concept of data informativity by requiring the existence of a controller that stabilizes all systems consistent with the data and the prior knowledge. We show that if the system is controllable, then incorporating this as prior knowledge does not relax the conditions required for data-driven stabilization. Remarkably, however, we show that if the system is stabilizable, then using this as prior knowledge leads to necessary and sufficient conditions that are weaker than those for data-driven stabilization without prior knowledge. In other words, data-driven stabilization is easier if one knows that the underlying system is stabilizable. We also provide new data-driven control design methods in terms of linear matrix inequalities that complement the conditions for informativity.
comment: 6 pages
♻ ☆ Decentralized Merging Control of Connected and Automated Vehicles to Enhance Safety and Energy Efficiency using Control Barrier Functions
This paper presents a decentralized Control Barrier Function (CBF) based approach for highway merging of Connected and Automated Vehicles (CAVs). In this control algorithm, each "host" vehicle negotiates with other agents in a control zone of the highway network, and enacts its own action, to perform safe and energy-efficient merge maneuvers. It uses predictor-corrector loops within the robust CBF setting for negotiation and to reconcile disagreements that may arise. There is no explicit order of vehicles and no priority. A notable feature is absence of gridlocks due to instability of the inter-agent system. Results from Monte Carlo simulations show significant improvement in the system-wide energy efficiency and traffic flow compared to a first-in-first-out approach, as well as enhanced robustness of the proposed decentralized controller compared to its centralized counterpart.
comment: This work has been submitted to a conference for possible publication and is under review. Paper summary: 8 pages, 5 figures, 2 tables
♻ ☆ Game Theoretic Resilience Recommendation Framework for CyberPhysical Microgrids Using Hypergraph MetaLearning
This paper presents a physics-aware cyberphysical resilience framework for radial microgrids under coordinated cyberattacks. The proposed approach models the attacker through a hypergraph neural network (HGNN) enhanced with model agnostic metalearning (MAML) to rapidly adapt to evolving defense strategies and predict high-impact contingencies. The defender is modeled via a bi-level Stackelberg game, where the upper level selects optimal tie-line switching and distributed energy resource (DER) dispatch using an Alternating Direction Method of Multipliers (ADMM) coordinator embedded within the Non-dominated Sorting Genetic Algorithm II (NSGA-II). The framework simultaneously optimizes load served, operational cost, and voltage stability, ensuring all post-defense states satisfy network physics constraints. The methodology is first validated on the IEEE 69-bus distribution test system with 12 DERs, 8 critical loads, and 5 tie-lines, and then extended to higher bus systems including the IEEE 123-bus feeder and a synthetic 300-bus distribution system. Results show that the proposed defense strategy restores nearly full service for 90% of top-ranked attacks, mitigates voltage violations, and identifies Feeder 2 as the principal vulnerability corridor. Actionable operating rules are derived, recommending pre-arming of specific tie-lines to enhance resilience, while higher bus system studies confirm scalability of the framework on the IEEE 123-bus and 300-bus systems.
♻ ☆ Climate Science and Control Engineering: Insights, Parallels, and Connections
Climate science is the multidisciplinary field that studies the Earth's climate and its evolution. At the very core of climate science are indispensable climate models that predict future climate scenarios, inform policy decisions, and dictate how a country's economy should change in light of the changing climate. Climate models capture a wide range of interacting dynamic processes via extremely complex ordinary and partial differential equations. To model these large-scale complex processes, climate science leverages supercomputers, advanced simulations, and statistical methods to predict future climate. An area of engineering that is rarely studied in climate science is control engineering. Given that climate systems are inherently dynamic, it is intuitive to analyze them within the framework of dynamic system science. This perspective has been underexplored in the literature. In this manuscript, we provide a tutorial that: (i) introduces the control engineering community to climate dynamics and modeling, including spatiotemporal scales and challenges in climate modeling; (ii) offers a fresh perspective on climate models from a control systems viewpoint; and (iii) explores the relevance and applicability of various advanced graph and network control-based approaches in building a physics-informed framework for learning, control and estimation in climate systems. We also present simple and then more complex climate models, depicting fundamental ideas and processes that are instrumental in building climate change projections. This tutorial also builds parallels and observes connections between various contemporary problems at the forefront of climate science and their control theoretic counterparts. We specifically observe that an abundance of climate science problems can be linguistically reworded and mathematically framed as control theoretic ones.
♻ ☆ On the Detection of Shared Data Manipulation in Distributed Optimization
This paper investigates the vulnerability of the Alternating Direction Method of Multipliers (ADMM) algorithm to shared data manipulation, with a focus on solving optimal power flow (OPF) problems. Deliberate data manipulation may cause the ADMM algorithm to converge to suboptimal solutions. We derive a sufficient condition for detecting data manipulation based on the theoretical convergence trajectory of the ADMM algorithm. We evaluate the performance of the detection condition on three data manipulation strategies with various levels of complexity and stealth. The simplest attack sends the target values and each iteration, the second attack uses a feedback loop to find the next target values, and the last attack uses a bilevel optimization to find the target values. We then extend the three data manipulation strategies to avoid detection by the detection conditions and a neural network (NN) detection model. We also propose an adversarial NN training framework to detect shared data manipulation. We illustrate the performance of our data manipulation strategy and detection framework on OPF problems. The results show that the proposed detection condition successfully detects most of the data manipulation attacks. However, the bilevel optimization attack strategy that incorporates the detection methods may avoid being detected. Countering this, our proposed adversarial training framework detects all the instances of the bilevel optimization attack.
♻ ☆ Online Adaptation for Flying Quadrotors in Tight Formations
The task of flying in tight formations is challenging for teams of quadrotors because the complex aerodynamic wake interactions can destabilize individual team members as well as the team. Furthermore, these aerodynamic effects are highly nonlinear and fast-paced, making them difficult to model and predict. To overcome these challenges, we present L1 KNODE-DW MPC, an adaptive, mixed expert learning based control framework that allows individual quadrotors to accurately track trajectories while adapting to time-varying aerodynamic interactions during formation flights. We evaluate L1 KNODE-DW MPC in two different three-quadrotor formations and show that it outperforms several MPC baselines. Our results show that the proposed framework is capable of enabling the three-quadrotor team to remain vertically aligned in close proximity throughout the flight. These findings show that the L1 adaptive module compensates for unmodeled disturbances most effectively when paired with an accurate dynamics model. A video showcasing our framework and the physical experiments is available here: https://youtu.be/9QX1Q5Ut9Rs
comment: 10 pages, 4 figures
♻ ☆ Understanding the Application of Utility Theory in Robotics and Artificial Intelligence: A Survey
As a unifying concept in economics, game theory, and operations research, even in the Robotics and AI field, the utility is used to evaluate the level of individual needs, preferences, and interests. Especially for decision-making and learning in multi-agent/robot systems (MAS/MRS), a suitable utility model can guide agents in choosing reasonable strategies to achieve their current needs and learning to cooperate and organize their behaviors, optimizing the system's utility, building stable and reliable relationships, and guaranteeing each group member's sustainable development, similar to the human society. Although these systems' complex, large-scale, and long-term behaviors are strongly determined by the fundamental characteristics of the underlying relationships, there has been less discussion on the theoretical aspects of mechanisms and the fields of applications in Robotics and AI. This paper introduces a utility-orient needs paradigm to describe and evaluate inter and outer relationships among agents' interactions. Then, we survey existing literature in relevant fields to support it and propose several promising research directions along with some open problems deemed necessary for further investigations.
comment: I am not sure whether withdrawing this paper is suitable. However, right now this paper has significant changes in its topic and author. So, I do not want to lead to any confusion about this paper. In the future, it will have a new version. I hope people will not have issues and confusion about the older one
♻ ☆ Integrated Learning and Optimization to Control Load Demand and Wind Generation for Minimizing Ramping Cost in Real-Time Electricity Market
We developed a new integrated learning and optimization (ILO) methodology to predict context-aware unknown parameters in economic dispatch (ED), a crucial problem in power systems solved to generate optimal power dispatching decisions to serve consumer load. The ED formulation in the current study consists of load and renewable generation as unknown parameters in its constraints predicted using contextual information (e.g., prior load, temperature). The ILO framework train a neural network (NN) to estimate ED parameters by minimizing an application-specific regret function which is a difference between ground truth and NN-driven decisions favouring better ED decisions. We thoroughly analyze the feasible region of ED formulation to understand the impact of load and renewable learning together on the ED decisions. Corresponding to that we developed a new regret function to capture real-time electricity market operations where differences in predicted and true loads are corrected by ramping generators in real-time but at a higher cost than the market price. The proposed regret function when minimized using ILO framework train the NN to guide the load and renewable predictions to generate ED decisions favouring minimum generator ramping costs. This is unlike conventional sequential learning and optimization (SLO) framework which train NN to accurately estimate load and renewable instead of better ED decisions. The combined training of load and renewable using ILO is a new concept and lead to significantly improved ramping costs when compared with SLO based training of load and renewable and SLO trained load with 100% accurate renewable proving its decision-focused capability.
comment: The preprint was submitted to disseminate the idea as soon as possible and was submitted without asking one of the authors listed in the manuscript as he was the supervisor. Moreover, the submitted preprint mentions being submitted in a journal while it has not yet been submitted in a journal yet. The institute thus asked to withdraw the preprint
Graphics 8
☆ OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes
There are two prevalent ways to constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.
comment: Project page: https://yukun-huang.github.io/OmniX/
☆ SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting
Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate See4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.
comment: 26 pages; 21 figures; 3 tables; project page: https://see-4d.github.io/
☆ HEIR: Learning Graph-Based Motion Hierarchies
Hierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks. Project Page: https://light.princeton.edu/HEIR/
comment: Code link: https://github.com/princeton-computational-imaging/HEIR
☆ The Impact and Outlook of 3D Gaussian Splatting
Since its introduction, 3D Gaussian Splatting (3DGS) has rapidly transformed the landscape of 3D scene representations, inspiring an extensive body of associated research. Follow-up work includes analyses and contributions that enhance the efficiency, scalability, and real-world applicability of 3DGS. In this summary, we present an overview of several key directions that have emerged in the wake of 3DGS. We highlight advances enabling resource-efficient training and rendering, the evolution toward dynamic (or four-dimensional, 4DGS) representations, and deeper exploration of the mathematical foundations underlying its appearance modeling and rendering process. Furthermore, we examine efforts to bring 3DGS to mobile and virtual reality platforms, its extension to massive-scale environments, and recent progress toward near-instant radiance field reconstruction via feed-forward or distributed computation. Collectively, these developments illustrate how 3DGS has evolved from a breakthrough representation into a versatile and foundational tool for 3D vision and graphics.
comment: Article written for Frontiers of Science Award, International Congress on Basic Science, 2025
☆ Look at That Distractor: Dynamic Translation Gain under Low Perceptual Load in Virtual Reality
Redirected walking utilizes gain adjustments within perceptual thresholds to allow natural navigation in large scale virtual environments within confined physical environments. Previous research has found that when users are distracted by some scene elements, they are less sensitive to gain values. However, the effects on detection thresholds have not been quantitatively measured. In this paper, we present a novel method that dynamically adjusts translation gain by leveraging visual distractors. We place distractors within the user's field of view and apply a larger translation gain when their attention is drawn to them. Because the magnitude of gain adjustment depends on the user's level of engagement with the distractors, the redirection process remains smooth and unobtrusive. To evaluate our method, we developed a task oriented virtual environment for a user study. Results show that introducing distractors in the virtual environment significantly raises users' translation gain thresholds. Furthermore, assessments using the Simulator Sickness Questionnaire and Igroup Presence Questionnaire indicate that the method maintains user comfort and acceptance, supporting its effectiveness for RDW systems.
☆ StructLayoutFormer:Conditional Structured Layout Generation via Structure Serialization and Disentanglement
Structured layouts are preferable in many 2D visual contents (\eg, GUIs, webpages) since the structural information allows convenient layout editing. Computational frameworks can help create structured layouts but require heavy labor input. Existing data-driven approaches are effective in automatically generating fixed layouts but fail to produce layout structures. We present StructLayoutFormer, a novel Transformer-based approach for conditional structured layout generation. We use a structure serialization scheme to represent structured layouts as sequences. To better control the structures of generated layouts, we disentangle the structural information from the element placements. Our approach is the first data-driven approach that achieves conditional structured layout generation and produces realistic layout structures explicitly. We compare our approach with existing data-driven layout generation approaches by including post-processing for structure extraction. Extensive experiments have shown that our approach exceeds these baselines in conditional structured layout generation. We also demonstrate that our approach is effective in extracting and transferring layout structures. The code is publicly available at %\href{https://github.com/Teagrus/StructLayoutFormer} {https://github.com/Teagrus/StructLayoutFormer}.
♻ ☆ Disentangled 4D Gaussian Splatting: Rendering High-Resolution Dynamic World at 343 FPS
While dynamic novel view synthesis from 2D videos has seen progress, achieving efficient reconstruction and rendering of dynamic scenes remains a challenging task. In this paper, we introduce Disentangled 4D Gaussian Splatting (Disentangled4DGS), a novel representation and rendering pipeline that achieves real-time performance without compromising visual fidelity. Disentangled4DGS decouples the temporal and spatial components of 4D Gaussians, avoiding the need for slicing first and four-dimensional matrix calculations in prior methods. By projecting temporal and spatial deformations into dynamic 2D Gaussians and deferring temporal processing, we minimize redundant computations of 4DGS. Our approach also features a gradient-guided flow loss and temporal splitting strategy to reduce artifacts. Experiments demonstrate a significant improvement in rendering speed and quality, achieving 343 FPS when render 1352*1014 resolution images on a single RTX3090 while reducing storage requirements by at least 4.5%. Our approach sets a new benchmark for dynamic novel view synthesis, outperforming existing methods on both multi-view and monocular dynamic scene datasets.
♻ ☆ 3D Dynamic Fluid Assets from Single-View Videos with Generative Gaussian Splatting
While the generation of 3D content from single-view images has been extensively studied, the creation of physically consistent 3D dynamic scenes from videos remains in its early stages. We propose a novel framework leveraging generative 3D Gaussian Splatting (3DGS) models to extract and re-simulate 3D dynamic fluid objects from single-view videos using simulation methods. The fluid geometry represented by 3DGS is initially generated and optimized from single-view images, then denoised, densified, and aligned across frames. We estimate the fluid surface velocity using optical flow, propose a mainstream extraction algorithm to refine it. The 3D volumetric velocity field is then derived from the velocity of the fluid's enclosed surface. The velocity field is therewith converted into a divergence-free, grid-based representation, enabling the optimization of simulation parameters through its differentiability across frames. This process outputs simulation-ready fluid assets with physical dynamics closely matching those observed in the source video. Our approach is applicable to various liquid fluids, including inviscid and viscous types, and allows users to edit the output geometry or extend movement durations seamlessly. This automatic method for creating 3D dynamic fluid assets from single-view videos, easily obtainable from the internet, shows great potential for generating large-scale 3D fluid assets at a low cost.
Computation and Language 142
☆ Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io
comment: Project Page: https://video-cof.github.io
☆ Gistify! Codebase-Level Understanding via Runtime Execution
As coding agents are increasingly deployed in large codebases, the need to automatically design challenging, codebase-level evaluation is central. We propose Gistify, a task where a coding LLM must create a single, minimal, self-contained file that can reproduce a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a python command), and the generated file must replicate the output of the same command ran under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires both structural understanding of the codebase, accurate modeling of its execution flow as well as the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long executions traces.
☆ Defeating the Training-Inference Mismatch via FP16
Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that breaks the consistency between training and inference. In this work, we demonstrate that simply reverting to \textbf{FP16} effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks with only a few lines of code change, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
☆ Remote Labor Index: Measuring AI Automation of Remote Work
AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable projects designed to evaluate end-to-end agent performance in practical settings. AI agents perform near the floor on RLI, with the highest-performing agent achieving an automation rate of 2.5%. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking AI impacts and enabling stakeholders to proactively navigate AI-driven labor automation.
comment: Website: https://www.remotelabor.ai
☆ AMO-Bench: Large Language Models Still Struggle in High School Math Competitions
We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions for evaluating mathematical reasoning capabilities of large language models (LLMs). However, many existing math competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original problems to prevent potential performance leakages from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for evaluation. Experimental results across 26 LLMs on AMO-Bench show that even the best-performing model achieves only 52.4% accuracy on AMO-Bench, with most LLMs scoring below 40%. Beyond these poor performances, our further analysis reveals a promising scaling trend with increasing test-time compute on AMO-Bench. These results highlight the significant room for improving the mathematical reasoning in current LLMs. We release AMO-Bench to facilitate further research into advancing the reasoning abilities of language models. https://amo-bench.github.io/
comment: 14 pages, 9 figures
☆ Deep sequence models tend to memorize geometrically; it is unclear why
In sequence modeling, the parametric memory of atomic facts has been predominantly abstracted as a brute-force lookup of co-occurrences between entities. We contrast this associative view against a geometric view of how memory is stored. We begin by isolating a clean and analyzable instance of Transformer reasoning that is incompatible with memory as strictly a storage of the local co-occurrences specified during training. Instead, the model must have somehow synthesized its own geometry of atomic facts, encoding global relationships between all entities, including non-co-occurring ones. This in turn has simplified a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn 1-step geometric task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, despite optimizing over mere local associations, cannot be straightforwardly attributed to typical architectural or optimizational pressures. Counterintuitively, an elegant geometry is learned even when it is not more succinct than a brute-force lookup of associations. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that -- in contrast to prevailing theories -- indeed arises naturally despite the lack of various pressures. This analysis also points to practitioners a visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery and unlearning.
☆ Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models
This paper presents a comprehensive cross-platform evaluation of reasoning capabilities in contemporary foundation models, establishing an infrastructure-agnostic benchmark across three computational paradigms: HPC supercomputing (MareNostrum 5), cloud platforms (Nebius AI Studio), and university clusters (a node with eight H200 GPUs). We evaluate 15 foundation models across 79 problems spanning eight academic domains (Physics, Mathematics, Chemistry, Economics, Biology, Statistics, Calculus, and Optimization) through three experimental phases: (1) Baseline establishment: Six models (Mixtral-8x7B, Phi-3, LLaMA 3.1-8B, Gemma-2-9b, Mistral-7B, OLMo-7B) evaluated on 19 problems using MareNostrum 5, establishing methodology and reference performance; (2) Infrastructure validation: The 19-problem benchmark repeated on university cluster (seven models including Falcon-Mamba state-space architecture) and Nebius AI Studio (nine state-of-the-art models: Hermes-4 70B/405B, LLaMA 3.1-405B/3.3-70B, Qwen3 30B/235B, DeepSeek-R1, GPT-OSS 20B/120B) to confirm infrastructure-agnostic reproducibility; (3) Extended evaluation: Full 79-problem assessment on both university cluster and Nebius platforms, probing generalization at scale across architectural diversity. The findings challenge conventional scaling assumptions, establish training data quality as more critical than model size, and provide actionable guidelines for model selection across educational, production, and research contexts. The tri-infrastructure methodology and 79-problem benchmark enable longitudinal tracking of reasoning capabilities as foundation models evolve.
☆ Value Drifts: Tracing Value Alignment During LLM Post-Training
As LLMs occupy an increasingly important role in society, they are more and more confronted with questions that require them not only to draw on their general knowledge but also to align with certain human value systems. Therefore, studying the alignment of LLMs with human values has become a crucial field of inquiry. Prior work, however, mostly focuses on evaluating the alignment of fully trained models, overlooking the training dynamics by which models learn to express human values. In this work, we investigate how and at which stage value alignment arises during the course of a model's post-training. Our analysis disentangles the effects of post-training algorithms and datasets, measuring both the magnitude and time of value drifts during training. Experimenting with Llama-3 and Qwen-3 models of different sizes and popular supervised fine-tuning (SFT) and preference optimization datasets and algorithms, we find that the SFT phase generally establishes a model's values, and subsequent preference optimization rarely re-aligns these values. Furthermore, using a synthetic preference dataset that enables controlled manipulation of values, we find that different preference optimization algorithms lead to different value alignment outcomes, even when preference data is held constant. Our findings provide actionable insights into how values are learned during post-training and help to inform data curation, as well as the selection of models and algorithms for preference optimization to improve model alignment to human values.
☆ The End of Manual Decoding: Towards Truly End-to-End Language Models
The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious, hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set"-a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
☆ Kimi Linear: An Expressive, Efficient Attention Architecture
We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
comment: Kimi Linear tech report
☆ Evontree: Ontology Rule-Guided Self-Evolution of Large Language Models
Large language models (LLMs) have demonstrated exceptional capabilities across multiple domains by leveraging massive pre-training and curated fine-tuning data. However, in data-sensitive fields such as healthcare, the lack of high-quality, domain-specific training corpus hinders LLMs' adaptation for specialized applications. Meanwhile, domain experts have distilled domain wisdom into ontology rules, which formalize relationships among concepts and ensure the integrity of knowledge management repositories. Viewing LLMs as implicit repositories of human knowledge, we propose Evontree, a novel framework that leverages a small set of high-quality ontology rules to systematically extract, validate, and enhance domain knowledge within LLMs, without requiring extensive external datasets. Specifically, Evontree extracts domain ontology from raw models, detects inconsistencies using two core ontology rules, and reinforces the refined knowledge via self-distilled fine-tuning. Extensive experiments on medical QA benchmarks with Llama3-8B-Instruct and Med42-v2 demonstrate consistent outperformance over both unmodified models and leading supervised baselines, achieving up to a 3.7% improvement in accuracy. These results confirm the effectiveness, efficiency, and robustness of our approach for low-resource domain adaptation of LLMs.
☆ The Era of Agentic Organization: Learning to Organize with Language Models
We envision a new era of AI, termed agentic organization, where agents solve complex problems by working collaboratively and concurrently, enabling outcomes beyond individual intelligence. To realize this vision, we introduce asynchronous thinking (AsyncThink) as a new paradigm of reasoning with large language models, which organizes the internal thinking process into concurrently executable structures. Specifically, we propose a thinking protocol where an organizer dynamically assigns sub-queries to workers, merges intermediate knowledge, and produces coherent solutions. More importantly, the thinking structure in this protocol can be further optimized through reinforcement learning. Experiments demonstrate that AsyncThink achieves 28% lower inference latency compared to parallel thinking while improving accuracy on mathematical reasoning. Moreover, AsyncThink generalizes its learned asynchronous thinking capabilities, effectively tackling unseen tasks without additional training.
☆ Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model
Recent large language model (LLM) research has undergone an architectural shift from encoder-decoder modeling to nowadays the dominant decoder-only modeling. This rapid transition, however, comes without a rigorous comparative analysis especially \textit{from the scaling perspective}, raising concerns that the potential of encoder-decoder models may have been overlooked. To fill this gap, we revisit encoder-decoder LLM (RedLLM), enhancing it with recent recipes from decoder-only LLM (DecLLM). We conduct a comprehensive comparison between RedLLM, pretrained with prefix language modeling (LM), and DecLLM, pretrained with causal LM, at different model scales, ranging from $\sim$150M to $\sim$8B. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for instruction tuning, our experiments show that RedLLM produces compelling scaling properties and surprisingly strong performance. While DecLLM is overall more compute-optimal during pretraining, RedLLM demonstrates comparable scaling and context length extrapolation capabilities. After instruction tuning, RedLLM achieves comparable and even better results on various downstream tasks while enjoying substantially better inference efficiency. We hope our findings could inspire more efforts on re-examining RedLLM, unlocking its potential for developing powerful and efficient LLMs.
comment: The scaling study inspiring T5Gemma
☆ SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding
Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels-global, page, and element-to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvement over both proprietary (+7.9 overall) and open-source models (+9.8 overall).
comment: https://slideagent.github.io/
☆ Normative Reasoning in Large Language Models: A Comparative Benchmark from Logical and Modal Perspectives EMNLP 2025
Normative reasoning is a type of reasoning that involves normative or deontic modality, such as obligation and permission. While large language models (LLMs) have demonstrated remarkable performance across various reasoning tasks, their ability to handle normative reasoning remains underexplored. In this paper, we systematically evaluate LLMs' reasoning capabilities in the normative domain from both logical and modal perspectives. Specifically, to assess how well LLMs reason with normative modals, we make a comparison between their reasoning with normative modals and their reasoning with epistemic modals, which share a common formal structure. To this end, we introduce a new dataset covering a wide range of formal patterns of reasoning in both normative and epistemic domains, while also incorporating non-formal cognitive factors that influence human reasoning. Our results indicate that, although LLMs generally adhere to valid reasoning patterns, they exhibit notable inconsistencies in specific types of normative reasoning and display cognitive biases similar to those observed in psychological studies of human reasoning. These findings highlight challenges in achieving logical consistency in LLMs' normative reasoning and provide insights for enhancing their reliability. All data and code are released publicly at https://github.com/kmineshima/NeuBAROCO.
comment: Accepted to the 8th BlackboxNLP Workshop at EMNLP 2025
☆ Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding emerges as a solution, enabling the simultaneous generation and validation of multiple tokens. While recent approaches like EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures, they often neglect the impact of crucial system variables such as GPU devices and batch sizes. Therefore, we introduce a new dynamic tree decoding approach called CAST that takes into account inference costs, including factors such as GPU configurations and batch sizes, to dynamically refine the tree structure. Through comprehensive experimentation across six diverse tasks and utilizing six distinct LLMs, our methodology demonstrates remarkable results, achieving speeds up to 5.2 times faster than conventional decoding methods. Moreover, it generally outperforms existing state-of-the-art techniques from 5% to 20%.
☆ InfoFlow: Reinforcing Search Agent Via Reward Density Optimization
Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for enhancing agentic deep search. However, its application is often hindered by low \textbf{Reward Density} in deep search scenarios, where agents expend significant exploratory costs for infrequent and often null final rewards. In this paper, we formalize this challenge as the \textbf{Reward Density Optimization} problem, which aims to improve the reward obtained per unit of exploration cost. This paper introduce \textbf{InfoFlow}, a systematic framework that tackles this problem from three aspects. 1) \textbf{Subproblem decomposition}: breaking down long-range tasks to assign process rewards, thereby providing denser learning signals. 2) \textbf{Failure-guided hints}: injecting corrective guidance into stalled trajectories to increase the probability of successful outcomes. 3) \textbf{Dual-agent refinement}: employing a dual-agent architecture to offload the cognitive burden of deep exploration. A refiner agent synthesizes the search history, which effectively compresses the researcher's perceived trajectory, thereby reducing exploration cost and increasing the overall reward density. We evaluate InfoFlow on multiple agentic search benchmarks, where it significantly outperforms strong baselines, enabling lightweight LLMs to achieve performance comparable to advanced proprietary LLMs.
☆ The Structure of Relation Decoding Linear Operators in Large Language Models NeurIPS 2025
This paper investigates the structure of linear operators introduced in Hernandez et al. [2023] that decode specific relational facts in transformer language models. We extend their single-relation findings to a collection of relations and systematically chart their organization. We show that such collections of relation decoders can be highly compressed by simple order-3 tensor networks without significant loss in decoding accuracy. To explain this surprising redundancy, we develop a cross-evaluation protocol, in which we apply each linear decoder operator to the subjects of every other relation. Our results reveal that these linear maps do not encode distinct relations, but extract recurring, coarse-grained semantic properties (e.g., country of capital city and country of food are both in the country-of-X property). This property-centric structure clarifies both the operators' compressibility and highlights why they generalize only to new relations that are semantically close. Our findings thus interpret linear relational decoding in transformer language models as primarily property-based, rather than relation-specific.
comment: NeurIPS 2025 (Spotlight)
☆ Hebrew Diacritics Restoration using Visual Representation
Diacritics restoration in Hebrew is a fundamental task for ensuring accurate word pronunciation and disambiguating textual meaning. Despite the language's high degree of ambiguity when unvocalized, recent machine learning approaches have significantly advanced performance on this task. In this work, we present DIVRIT, a novel system for Hebrew diacritization that frames the task as a zero-shot classification problem. Our approach operates at the word level, selecting the most appropriate diacritization pattern for each undiacritized word from a dynamically generated candidate set, conditioned on the surrounding textual context. A key innovation of DIVRIT is its use of a Hebrew Visual Language Model, which processes undiacritized text as an image, allowing diacritic information to be embedded directly within the input's vector representation. Through a comprehensive evaluation across various configurations, we demonstrate that the system effectively performs diacritization without relying on complex, explicit linguistic analysis. Notably, in an ``oracle'' setting where the correct diacritized form is guaranteed to be among the provided candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic architectural enhancements and optimized training methodologies yield significant improvements in the system's overall generalization capabilities. These findings highlight the promising potential of visual representations for accurate and automated Hebrew diacritization.
☆ Inside CORE-KG: Evaluating Structured Prompting and Coreference Resolution for Knowledge Graphs ICDM 2025
Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer critical insights but are often unstructured, lexically dense, and filled with ambiguous or shifting references, which pose significant challenges for automated knowledge graph (KG) construction. While recent LLM-based approaches improve over static templates, they still generate noisy, fragmented graphs with duplicate nodes due to the absence of guided extraction and coreference resolution. The recently proposed CORE-KG framework addresses these limitations by integrating a type-aware coreference module and domain-guided structured prompts, significantly reducing node duplication and legal noise. In this work, we present a systematic ablation study of CORE-KG to quantify the individual contributions of its two key components. Our results show that removing coreference resolution results in a 28.32% increase in node duplication and a 4.32% increase in noisy nodes, while removing structured prompts leads to a 4.34% increase in node duplication and a 73.33% increase in noisy nodes. These findings offer empirical insights for designing robust LLM-based pipelines for extracting structured representations from complex legal texts.
comment: ICDM 2025 Workshop
☆ A Multi-agent Large Language Model Framework to Automatically Assess Performance of a Clinical AI Triage Tool
Purpose: The purpose of this study was to determine if an ensemble of multiple LLM agents could be used collectively to provide a more reliable assessment of a pixel-based AI triage tool than a single LLM. Methods: 29,766 non-contrast CT head exams from fourteen hospitals were processed by a commercial intracranial hemorrhage (ICH) AI detection tool. Radiology reports were analyzed by an ensemble of eight open-source LLM models and a HIPAA compliant internal version of GPT-4o using a single multi-shot prompt that assessed for presence of ICH. 1,726 examples were manually reviewed. Performance characteristics of the eight open-source models and consensus were compared to GPT-4o. Three ideal consensus LLM ensembles were tested for rating the performance of the triage tool. Results: The cohort consisted of 29,766 head CTs exam-report pairs. The highest AUC performance was achieved with llama3.3:70b and GPT-4o (AUC= 0.78). The average precision was highest for Llama3.3:70b and GPT-4o (AP=0.75 & 0.76). Llama3.3:70b had the highest F1 score (0.81) and recall (0.85), greater precision (0.78), specificity (0.72), and MCC (0.57). Using MCC (95% CI) the ideal combination of LLMs were: Full-9 Ensemble 0.571 (0.552-0.591), Top-3 Ensemble 0.558 (0.537-0.579), Consensus 0.556 (0.539-0.574), and GPT4o 0.522 (0.500-0.543). No statistically significant differences were observed between Top-3, Full-9, and Consensus (p > 0.05). Conclusion: An ensemble of medium to large sized open-source LLMs provides a more consistent and reliable method to derive a ground truth retrospective evaluation of a clinical AI triage tool over a single LLM alone.
comment: 29 pages, 3 figures, 4 tables
☆ Rethinking Text-to-SQL: Dynamic Multi-turn SQL Interaction for Real-world Database Exploration
Recent advances in Text-to-SQL have achieved strong results in static, single-turn tasks, where models generate SQL queries from natural language questions. However, these systems fall short in real-world interactive scenarios, where user intents evolve and queries must be refined over multiple turns. In applications such as finance and business analytics, users iteratively adjust query constraints or dimensions based on intermediate results. To evaluate such dynamic capabilities, we introduce DySQL-Bench, a benchmark assessing model performance under evolving user interactions. Unlike previous manually curated datasets, DySQL-Bench is built through an automated two-stage pipeline of task synthesis and verification. Structured tree representations derived from raw database tables guide LLM-based task generation, followed by interaction-oriented filtering and expert validation. Human evaluation confirms 100% correctness of the synthesized data. We further propose a multi-turn evaluation framework simulating realistic interactions among an LLM-simulated user, the model under test, and an executable database. The model must adapt its reasoning and SQL generation as user intents change. DySQL-Bench covers 13 domains across BIRD and Spider 2 databases, totaling 1,072 tasks. Even GPT-4o attains only 58.34% overall accuracy and 23.81% on the Pass@5 metric, underscoring the benchmark's difficulty. All code and data are released at https://github.com/Aurora-slz/Real-World-SQL-Bench .
☆ Context Engineering 2.0: The Context of Context Engineering
Karl Marx once wrote that ``the human essence is the ensemble of social relations'', suggesting that individuals are not isolated entities but are fundamentally shaped by their interactions with other entities, within which contexts play a constitutive and essential role. With the advent of computers and artificial intelligence, these contexts are no longer limited to purely human--human interactions: human--machine interactions are included as well. Then a central question emerges: How can machines better understand our situations and purposes? To address this challenge, researchers have recently introduced the concept of context engineering. Although it is often regarded as a recent innovation of the agent era, we argue that related practices can be traced back more than twenty years. Since the early 1990s, the field has evolved through distinct historical phases, each shaped by the intelligence level of machines: from early human--computer interaction frameworks built around primitive computers, to today's human--agent interaction paradigms driven by intelligent agents, and potentially to human--level or superhuman intelligence in the future. In this paper, we situate context engineering, provide a systematic definition, outline its historical and conceptual landscape, and examine key design considerations for practice. By addressing these questions, we aim to offer a conceptual foundation for context engineering and sketch its promising future. This paper is a stepping stone for a broader community effort toward systematic context engineering in AI systems.
☆ Bayesian Network Fusion of Large Language Models for Sentiment Analysis
Large language models (LLMs) continue to advance, with an increasing number of domain-specific variants tailored for specialised tasks. However, these models often lack transparency and explainability, can be costly to fine-tune, require substantial prompt engineering, yield inconsistent results across domains, and impose significant adverse environmental impact due to their high computational demands. To address these challenges, we propose the Bayesian network LLM fusion (BNLF) framework, which integrates predictions from three LLMs, including FinBERT, RoBERTa, and BERTweet, through a probabilistic mechanism for sentiment analysis. BNLF performs late fusion by modelling the sentiment predictions from multiple LLMs as probabilistic nodes within a Bayesian network. Evaluated across three human-annotated financial corpora with distinct linguistic and contextual characteristics, BNLF demonstrates consistent gains of about six percent in accuracy over the baseline LLMs, underscoring its robustness to dataset variability and the effectiveness of probabilistic fusion for interpretable sentiment classification.
☆ Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (i.e., tail data). This leads to an imbalanced optimization that drives the model to prioritize simple reasoning skills, while hindering its ability to tackle more complex reasoning tasks. Over iterations, this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew effect"--which ultimately hinders further model improvement and leads to performance bottlenecks. To counteract this challenge, we introduce four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks demonstrate that our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
comment: Preprint
☆ SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning ICSE 2026
Identifying and addressing security issues during the early phase of the development lifecycle is critical for mitigating the long-term negative impacts on software systems. Code review serves as an effective practice that enables developers to check their teammates' code before integration into the codebase. To streamline the generation of review comments, various automated code review approaches have been proposed, where LLM-based methods have significantly advanced the capabilities of automated review generation. However, existing models primarily focus on general-purpose code review, their effectiveness in identifying and addressing security-related issues remains underexplored. Moreover, adapting existing code review approaches to target security issues faces substantial challenges, including data scarcity and inadequate evaluation metrics. To address these limitations, we propose SecureReviewer, a new approach designed for enhancing LLMs' ability to identify and resolve security-related issues during code review. Specifically, we first construct a dataset tailored for training and evaluating secure code review capabilities. Leveraging this dataset, we fine-tune LLMs to generate code review comments that can effectively identify security issues and provide fix suggestions with our proposed secure-aware fine-tuning strategy. To mitigate hallucination in LLMs and enhance the reliability of their outputs, we integrate the RAG technique, which grounds the generated comments in domain-specific security knowledge. Additionally, we introduce SecureBLEU, a new evaluation metric designed to assess the effectiveness of review comments in addressing security issues. Experimental results demonstrate that SecureReviewer outperforms state-of-the-art baselines in both security issue detection accuracy and the overall quality and practical utility of generated review comments.
comment: Accepted by ICSE 2026. Code and data: https://github.com/SIMIAO515/SecureReviewer
☆ 1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models EMNLP 2025
Large Language Models (LLMs) have demonstrated remarkable proficiency in language comprehension and generation; however, their widespread adoption is constrained by substantial bandwidth and computational demands. While pruning and low-rank approximation have each demonstrated promising performance individually, their synergy for LLMs remains underexplored. We introduce \underline{S}ynergistic \underline{S}parse and \underline{L}ow-Rank \underline{C}ompression (SSLC) methods for LLMs, which leverages the strengths of both techniques: low-rank approximation compresses the model by retaining its essential structure with minimal information loss, whereas sparse optimization eliminates non-essential weights, preserving those crucial for generalization. Based on theoretical analysis, we first formulate the low-rank approximation and sparse optimization as a unified problem and solve it by iterative optimization algorithm. Experiments on LLaMA and Qwen2.5 models (7B-70B) show that SSLC, without any additional training steps, consistently surpasses standalone methods, achieving state-of-the-arts results. Notably, SSLC compresses Qwen2.5 by 50\% with no performance drop and achieves at least 1.63$\times$ speedup, offering a practical solution for efficient LLM deployment.
comment: 15 pages, 6 figures, EMNLP 2025 findings
☆ Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis
Test oracle generation in non-regression testing is a longstanding challenge in software engineering, where the goal is to produce oracles that can accurately determine whether a function under test (FUT) behaves as intended for a given input. In this paper, we introduce Nexus, a novel multi-agent framework to address this challenge. Nexus generates test oracles by leveraging a diverse set of specialized agents that synthesize test oracles through a structured process of deliberation, validation, and iterative self-refinement. During the deliberation phase, a panel of four specialist agents, each embodying a distinct testing philosophy, collaboratively critiques and refines an initial set of test oracles. Then, in the validation phase, Nexus generates a plausible candidate implementation of the FUT and executes the proposed oracles against it in a secure sandbox. For any oracle that fails this execution-based check, Nexus activates an automated selfrefinement loop, using the specific runtime error to debug and correct the oracle before re-validation. Our extensive evaluation on seven diverse benchmarks demonstrates that Nexus consistently and substantially outperforms state-of-theart baselines. For instance, Nexus improves the test-level oracle accuracy on the LiveCodeBench from 46.30% to 57.73% for GPT-4.1-Mini. The improved accuracy also significantly enhances downstream tasks: the bug detection rate of GPT4.1-Mini generated test oracles on HumanEval increases from 90.91% to 95.45% for Nexus compared to baselines, and the success rate of automated program repair improves from 35.23% to 69.32%.
comment: Under Review
☆ OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education
With the rapid development of large language models (LLMs), various LLM-based works have been widely applied in educational fields. However, most existing LLMs and their benchmarks focus primarily on the knowledge dimension, largely neglecting the evaluation of cultivation capabilities that are essential for real-world educational scenarios. Additionally, current benchmarks are often limited to a single subject or question type, lacking sufficient diversity. This issue is particularly prominent within the Chinese context. To address this gap, we introduce OmniEduBench, a comprehensive Chinese educational benchmark. OmniEduBench consists of 24.602K high-quality question-answer pairs. The data is meticulously divided into two core dimensions: the knowledge dimension and the cultivation dimension, which contain 18.121K and 6.481K entries, respectively. Each dimension is further subdivided into 6 fine-grained categories, covering a total of 61 different subjects (41 in the knowledge and 20 in the cultivation). Furthermore, the dataset features a rich variety of question formats, including 11 common exam question types, providing a solid foundation for comprehensively evaluating LLMs' capabilities in education. Extensive experiments on 11 mainstream open-source and closed-source LLMs reveal a clear performance gap. In the knowledge dimension, only Gemini-2.5 Pro surpassed 60\% accuracy, while in the cultivation dimension, the best-performing model, QWQ, still trailed human intelligence by nearly 30\%. These results highlight the substantial room for improvement and underscore the challenges of applying LLMs in education.
☆ On the Role of Context for Discourse Relation Classification in Scientific Writing
With the increasing use of generative Artificial Intelligence (AI) methods to support science workflows, we are interested in the use of discourse-level information to find supporting evidence for AI generated scientific claims. A first step towards this objective is to examine the task of inferring discourse structure in scientific writing. In this work, we present a preliminary investigation of pretrained language model (PLM) and Large Language Model (LLM) approaches for Discourse Relation Classification (DRC), focusing on scientific publications, an under-studied genre for this task. We examine how context can help with the DRC task, with our experiments showing that context, as defined by discourse structure, is generally helpful. We also present an analysis of which scientific discourse relation types might benefit most from context.
comment: Accepted at Joint Sixth Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences (CODI 2025) and Eighth Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2025)
☆ The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition. However, forming optimal teams is a significant challenge, as the inherent opacity of most models obscures the internal characteristics necessary for effective collaboration. In this paper, we propose an interaction-centric framework for automatic team composition that does not require any prior knowledge including their internal architectures, training data, or task performances. Our method constructs a "language model graph" that maps relationships between models from the semantic coherence of pairwise conversations, and then applies community detection to identify synergistic model clusters. Our experiments with diverse LLMs demonstrate that the proposed method discovers functionally coherent groups that reflect their latent specializations. Priming conversations with specific topics identified synergistic teams which outperform random baselines on downstream benchmarks and achieve comparable accuracy to that of manually-curated teams based on known model specializations. Our findings provide a new basis for the automated design of collaborative multi-agent LLM teams.
☆ MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data
Health-related misinformation is very prevalent and potentially harmful. It is difficult to identify, especially when claims distort or misinterpret scientific findings. We investigate the impact of synthetic data generation and lightweight fine-tuning techniques on the ability of large language models (LLMs) to recognize fallacious arguments using the MISSCI dataset and framework. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples, which are then used to fine-tune an LLM model. Our results show substantial accuracy gains with fine-tuned models compared to vanilla baselines. For instance, the LLaMA 3.1 8B fine-tuned model achieved an over 35% F1-score absolute improvement on the MISSCI test split over its vanilla baseline. We demonstrate that introducing synthetic fallacy data to augment limited annotated resources can significantly enhance zero-shot LLM classification performance on real-world scientific misinformation tasks, even with limited computational resources. The code and synthetic dataset are available on https://github.com/mxpoliakov/MisSynth.
☆ From Amateur to Master: Infusing Knowledge into LLMs via Automated Curriculum Learning
Large Language Models (LLMs) excel at general tasks but underperform in specialized domains like economics and psychology, which require deep, principled understanding. To address this, we introduce ACER (Automated Curriculum-Enhanced Regimen) that transforms generalist models into domain experts without sacrificing their broad capabilities. ACER first synthesizes a comprehensive, textbook-style curriculum by generating a table of contents for a subject and then creating question-answer (QA) pairs guided by Bloom's taxonomy. This ensures systematic topic coverage and progressively increasing difficulty. The resulting synthetic corpus is used for continual pretraining with an interleaved curriculum schedule, aligning learning across both content and cognitive dimensions. Experiments with Llama 3.2 (1B and 3B) show significant gains in specialized MMLU subsets. In challenging domains like microeconomics, where baselines struggle, ACER boosts accuracy by 5 percentage points. Across all target domains, we observe a consistent macro-average improvement of 3 percentage points. Notably, ACER not only prevents catastrophic forgetting but also facilitates positive cross-domain knowledge transfer, improving performance on non-target domains by 0.7 points. Beyond MMLU, ACER enhances performance on knowledge-intensive benchmarks like ARC and GPQA by over 2 absolute points, while maintaining stable performance on general reasoning tasks. Our results demonstrate that ACER offers a scalable and effective recipe for closing critical domain gaps in LLMs.
☆ SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling
Language models can be used to provide interactive, personalized student feedback in educational settings. However, real-world deployment faces three key challenges: privacy concerns, limited computational resources, and the need for pedagogically valid responses. These constraints require small, open-source models that can run locally and reliably ground their outputs in correct information. We introduce SCRIBE, a framework for multi-hop, tool-augmented reasoning designed to generate valid responses to student questions about feedback reports. SCRIBE combines domain-specific tools with a self-reflective inference pipeline that supports iterative reasoning, tool use, and error recovery. We distil these capabilities into 3B and 8B models via two-stage LoRA fine-tuning on synthetic GPT-4o-generated data. Evaluation with a human-aligned GPT-Judge and a user study with 108 students shows that 8B-SCRIBE models achieve comparable or superior quality to much larger models in key dimensions such as relevance and actionability, while being perceived on par with GPT-4o and Llama-3.3 70B by students. These findings demonstrate the viability of SCRIBE for low-resource, privacy-sensitive educational applications.
☆ Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games
OpenAI's ChatGPT Atlas introduces new capabilities for web interaction, enabling the model to analyze webpages, process user intents, and execute cursor and keyboard inputs directly within the browser. While its capacity for information retrieval tasks has been demonstrated, its performance in dynamic, interactive environments remains less explored. In this study, we conduct an early evaluation of Atlas's web interaction capabilities using browser-based games as test scenarios, including Google's T-Rex Runner, Sudoku, Flappy Bird, and Stein.world. We employ in-game performance scores as quantitative metrics to assess performance across different task types. Our results show that Atlas performs strongly in logical reasoning tasks like Sudoku, completing puzzles significantly faster than human baselines, but struggles substantially in real-time games requiring precise timing and motor control, often failing to progress beyond initial obstacles. These findings suggest that while Atlas demonstrates capable analytical processing, there remain notable limitations in dynamic web environments requiring real-time interaction. The website of our project can be found at https://atlas-game-eval.github.io.
☆ Unravelling the Mechanisms of Manipulating Numbers in Language Models
Recent work has shown that different large language models (LLMs) converge to similar and accurate input embedding representations for numbers. These findings conflict with the documented propensity of LLMs to produce erroneous outputs when dealing with numeric information. In this work, we aim to explain this conflict by exploring how language models manipulate numbers and quantify the lower bounds of accuracy of these mechanisms. We find that despite surfacing errors, different language models learn interchangeable representations of numbers that are systematic, highly accurate and universal across their hidden states and the types of input contexts. This allows us to create universal probes for each LLM and to trace information -- including the causes of output errors -- to specific layers. Our results lay a fundamental understanding of how pre-trained LLMs manipulate numbers and outline the potential of more accurate probing techniques in addressed refinements of LLMs' architectures.
☆ Do LLMs Signal When They're Right? Evidence from Neuron Agreement
Large language models (LLMs) commonly boost reasoning via sample-evaluate-ensemble decoders, achieving label free gains without ground truth. However, prevailing strategies score candidates using only external outputs such as token probabilities, entropies, or self evaluations, and these signals can be poorly calibrated after post training. We instead analyze internal behavior based on neuron activations and uncover three findings: (1) external signals are low dimensional projections of richer internal dynamics; (2) correct responses activate substantially fewer unique neurons than incorrect ones throughout generation; and (3) activations from correct responses exhibit stronger cross sample agreement, whereas incorrect ones diverge. Motivated by these observations, we propose Neuron Agreement Decoding (NAD), an unsupervised best-of-N method that selects candidates using activation sparsity and cross sample neuron agreement, operating solely on internal signals and without requiring comparable textual outputs. NAD enables early correctness prediction within the first 32 generated tokens and supports aggressive early stopping. Across math and science benchmarks with verifiable answers, NAD matches majority voting; on open ended coding benchmarks where majority voting is inapplicable, NAD consistently outperforms Avg@64. By pruning unpromising trajectories early, NAD reduces token usage by 99% with minimal loss in generation quality, showing that internal signals provide reliable, scalable, and efficient guidance for label free ensemble decoding.
☆ PVMark: Enabling Public Verifiability for LLM Watermarking Schemes
Watermarking schemes for large language models (LLMs) have been proposed to identify the source of the generated text, mitigating the potential threats emerged from model theft. However, current watermarking solutions hardly resolve the trust issue: the non-public watermark detection cannot prove itself faithfully conducting the detection. We observe that it is attributed to the secret key mostly used in the watermark detection -- it cannot be public, or the adversary may launch removal attacks provided the key; nor can it be private, or the watermarking detection is opaque to the public. To resolve the dilemma, we propose PVMark, a plugin based on zero-knowledge proof (ZKP), enabling the watermark detection process to be publicly verifiable by third parties without disclosing any secret key. PVMark hinges upon the proof of `correct execution' of watermark detection on which a set of ZKP constraints are built, including mapping, random number generation, comparison, and summation. We implement multiple variants of PVMark in Python, Rust and Circom, covering combinations of three watermarking schemes, three hash functions, and four ZKP protocols, to show our approach effectively works under a variety of circumstances. By experimental results, PVMark efficiently enables public verifiability on the state-of-the-art LLM watermarking schemes yet without compromising the watermarking performance, promising to be deployed in practice.
comment: This work has been submitted to the IEEE for possible publication
☆ Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual
Vision-language models (VLMs) exhibit uneven performance across languages, a problem that is often exacerbated when the model size is reduced. While Knowledge distillation (KD) demonstrates promising results in transferring knowledge from larger to smaller VLMs, applying KD in multilingualism is an underexplored area. This paper presents a controlled empirical study of KD behavior across five distillation approaches, isolating their effects on cross-lingual representation consistency and downstream performance stability under model compression. We study five distillation formulations across CLIP and SigLIP2, and evaluate them on in-domain retrieval and out-of-domain visual QA. We find that some configurations preserve or even improve multilingual retrieval robustness despite halving model size, but others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal.
comment: Work in progress
☆ Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages
Throughout language history, words are borrowed from one language to another and gradually become integrated into the recipient's lexicon. Speakers can often differentiate these loanwords from native vocabulary, particularly in bilingual communities where a dominant language continuously imposes lexical items on a minority language. This paper investigates whether pretrained language models, including large language models, possess similar capabilities for loanword identification. We evaluate multiple models across 10 languages. Despite explicit instructions and contextual information, our results show that models perform poorly in distinguishing loanwords from native ones. These findings corroborate previous evidence that modern NLP systems exhibit a bias toward loanwords rather than native equivalents. Our work has implications for developing NLP tools for minority languages and supporting language preservation in communities under lexical pressure from dominant languages.
comment: Under review
☆ Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs
The ability to accurately interpret implied meanings plays a crucial role in human communication and language use, and language models are also expected to possess this capability. This study demonstrates that providing language models with pragmatic theories as prompts is an effective in-context learning approach for tasks to understand implied meanings. Specifically, we propose an approach in which an overview of pragmatic theories, such as Gricean pragmatics and Relevance Theory, is presented as a prompt to the language model, guiding it through a step-by-step reasoning process to derive a final interpretation. Experimental results showed that, compared to the baseline, which prompts intermediate reasoning without presenting pragmatic theories (0-shot Chain-of-Thought), our methods enabled language models to achieve up to 9.6\% higher scores on pragmatic reasoning tasks. Furthermore, we show that even without explaining the details of pragmatic theories, merely mentioning their names in the prompt leads to a certain performance improvement (around 1-3%) in larger models compared to the baseline.
☆ Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT)-whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
comment: 10 pages
☆ Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning
Retrieval-augmented generation (RAG) has emerged as a leading approach to reducing hallucinations in large language models (LLMs). Current RAG evaluation benchmarks primarily focus on what we call local RAG: retrieving relevant chunks from a small subset of documents to answer queries that require only localized understanding within specific text chunks. However, many real-world applications require a fundamentally different capability -- global RAG -- which involves aggregating and analyzing information across entire document collections to derive corpus-level insights (for example, "What are the top 10 most cited papers in 2023?"). In this paper, we introduce GlobalQA -- the first benchmark specifically designed to evaluate global RAG capabilities, covering four core task types: counting, extremum queries, sorting, and top-k extraction. Through systematic evaluation across different models and baselines, we find that existing RAG methods perform poorly on global tasks, with the strongest baseline achieving only 1.51 F1 score. To address these challenges, we propose GlobalRAG, a multi-tool collaborative framework that preserves structural coherence through chunk-level retrieval, incorporates LLM-driven intelligent filters to eliminate noisy documents, and integrates aggregation modules for precise symbolic computation. On the Qwen2.5-14B model, GlobalRAG achieves 6.63 F1 compared to the strongest baseline's 1.51 F1, validating the effectiveness of our method.
☆ What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over certain attributes (e.g., length or sycophancy), automatically extracting relevant features without pre-specifying hypotheses remains challenging. We introduce What's In My Human Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders. WIMHF characterizes both (1) the preferences a dataset is capable of measuring and (2) the preferences that the annotators actually express. Across 7 datasets, WIMHF identifies a small number of human-interpretable features that account for the majority of the preference prediction signal achieved by black-box models. These features reveal a wide diversity in what humans prefer, and the role of dataset-level context: for example, users on Reddit prefer informality and jokes, while annotators in HH-RLHF and PRISM disprefer them. WIMHF also surfaces potentially unsafe preferences, such as that LMArena users tend to vote against refusals, often in favor of toxic content. The learned features enable effective data curation: re-labeling the harmful examples in Arena yields large safety gains (+37%) with no cost to general performance. They also allow fine-grained personalization: on the Community Alignment dataset, we learn annotator-specific weights over subjective features that improve preference prediction. WIMHF provides a human-centered analysis method for practitioners to better understand and use preference data.
comment: Code: https://github.com/rmovva/wimhf
☆ Don't Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation NeurIPS 2025
While diffusion language models (DLMs) enable fine-grained refinement, their practical controllability remains fragile. We identify and formally characterize a central failure mode called update forgetting, in which uniform and context agnostic updates induce token level fluctuations across timesteps, erasing earlier semantic edits and disrupting the cumulative refinement process, thereby degrading fluency and coherence. As this failure originates in uniform and context agnostic updates, effective control demands explicit token ordering. We propose Token Timestep Allocation (TTA), which realizes soft and semantic token ordering via per token timestep schedules: critical tokens are frozen early, while uncertain tokens receive continued refinement. This timestep based ordering can be instantiated as either a fixed policy or an adaptive policy driven by task signals, thereby supporting a broad spectrum of refinement strategies. Because it operates purely at inference time, it applies uniformly across various DLMs and naturally extends to diverse supervision sources. Empirically, TTA improves controllability and fluency: on sentiment control, it yields more than 20 percent higher accuracy and nearly halves perplexity using less than one fifth the steps; in detoxification, it lowers maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0). Together, these results demonstrate that softened ordering via timestep allocation is the critical lever for mitigating update forgetting and achieving stable and controllable diffusion text generation.
comment: Accepted in NeurIPS 2025
☆ RCScore: Quantifying Response Consistency in Large Language Models
Current LLM evaluations often rely on a single instruction template, overlooking models' sensitivity to instruction style-a critical aspect for real-world deployments. We present RCScore, a multi-dimensional framework quantifying how instruction formulation affects model responses. By systematically transforming benchmark problems into multiple instruction styles, RCScore reveals performance variations undetected by conventional metrics. Our experiments across ten LLMs on four reasoning benchmarks demonstrate that instruction style can shift accuracy by up to 16.7% points. We introduce Cross-Response Similarity (CRS), a method applying RCScore metrics to measure stylistic self-consistency, and establish its strong correlation with task accuracy, suggesting consistency as a valuable proxy for model reliability. Additional findings show that deterministic decoding produces more stylistically stable outputs, and model scale correlates positively with cross-style consistency. RCScore offers a principled approach to assess instruction robustness.
☆ SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level
The evaluation of intelligibility for TTS has reached a bottleneck, as existing assessments heavily rely on word-by-word accuracy metrics such as WER, which fail to capture the complexity of real-world speech or reflect human comprehension needs. To address this, we propose Spoken-Passage Multiple-Choice Question Answering, a novel subjective approach evaluating the accuracy of key information in synthesized speech, and release SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for SP-MCQA evaluation. Our experiments reveal that low WER does not necessarily guarantee high key-information accuracy, exposing a gap between traditional metrics and practical intelligibility. SP-MCQA shows that even state-of-the-art (SOTA) models still lack robust text normalization and phonetic accuracy. This work underscores the urgent need for high-level, more life-like evaluation criteria now that many systems already excel at WER yet may fall short on real-world intelligibility.
☆ Similarity-Distance-Magnitude Language Models
We introduce Similarity-Distance-Magnitude (SDM) language models (LMs), which are sequence prediction models fine-tuned to maximize the proportion of generations in the well-calibrated, high-probability region partitioned by a final-layer SDM activation layer used for binary classification of instruction-following. We demonstrate that existing pre-trained decoder-only Transformer LMs can be readily converted into SDM LMs via supervised fine-tuning, using the final-layer SDM activation layer during training to estimate a change-of-base for a supervised next-token loss over a contrastive input encoding scheme, with additional hard negative examples generated online during training. This results in reduced abstentions (i.e., improved statistical efficiency) compared to strong supervised baselines.
comment: 8 pages, 5 tables
☆ MossNet: Mixture of State-Space Experts is a Multi-Head Attention
Large language models (LLMs) have significantly advanced generative applications in natural language processing (NLP). Recent trends in model architectures revolve around efficient variants of transformers or state-space/gated-recurrent models (SSMs, GRMs). However, prevailing SSM/GRM-based methods often emulate only a single attention head, potentially limiting their expressiveness. In this work, we propose MossNet, a novel mixture-of-state-space-experts architecture that emulates a linear multi-head attention (MHA). MossNet leverages a mixture-of-experts (MoE) implementation not only in channel-mixing multi-layered perceptron (MLP) blocks but also in the time-mixing SSM kernels to realize multiple "attention heads." Extensive experiments on language modeling and downstream evaluations show that MossNet outperforms both transformer- and SSM-based architectures of similar model size and data budgets. Larger variants of MossNet, trained on trillions of tokens, further confirm its scalability and superior performance. In addition, real-device profiling on a Samsung Galaxy S24 Ultra and an Nvidia A100 GPU demonstrate favorable runtime speed and resource usage compared to similarly sized baselines. Our results suggest that MossNet is a compelling new direction for efficient, high-performing recurrent LLM architectures.
☆ One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning
Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique tasks that supports reinforcement learning with verifiable feedback. To evaluate tool-use RMs, we also introduce TRBench$_{BFCL}$, a benchmark built on the agentic evaluation suite BFCL. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward judgments. Beyond training objectives, ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling and reducing output token usage by over 66%. We release data and model checkpoints to facilitate future research.
☆ Reasoning Curriculum: Bootstrapping Broad LLM Reasoning from Math
Reinforcement learning (RL) can elicit strong reasoning in large language models (LLMs), yet most open efforts focus on math and code. We propose Reasoning Curriculum, a simple two-stage curriculum that first elicits reasoning skills in pretraining-aligned domains such as math, then adapts and refines these skills across other domains via joint RL. Stage 1 performs a brief cold start and then math-only RL with verifiable rewards to develop reasoning skills. Stage 2 runs joint RL on mixed-domain data to transfer and consolidate these skills. The curriculum is minimal and backbone-agnostic, requiring no specialized reward models beyond standard verifiability checks. Evaluated on Qwen3-4B and Llama-3.1-8B over a multi-domain suite, reasoning curriculum yields consistent gains. Ablations and a cognitive-skill analysis indicate that both stages are necessary and that math-first elicitation increases cognitive behaviors important for solving complex problems. Reasoning Curriculum provides a compact, easy-to-adopt recipe for general reasoning.
comment: 9 pages
☆ On the Influence of Discourse Relations in Persuasive Texts
This paper investigates the relationship between Persuasion Techniques (PTs) and Discourse Relations (DRs) by leveraging Large Language Models (LLMs) and prompt engineering. Since no dataset annotated with both PTs and DRs exists, we took the SemEval 2023 Task 3 dataset labelled with 19 PTs as a starting point and developed LLM-based classifiers to label each instance of the dataset with one of the 22 PDTB 3.0 level-2 DRs. In total, four LLMs were evaluated using 10 different prompts, resulting in 40 unique DR classifiers. Ensemble models using different majority-pooling strategies were used to create 5 silver datasets of instances labelled with both persuasion techniques and level-2 PDTB senses. The silver dataset sizes vary from 1,281 instances to 204 instances, depending on the majority pooling technique used. Statistical analysis of these silver datasets shows that six discourse relations (namely Cause, Purpose, Contrast, Cause+Belief, Concession, and Condition) play a crucial role in persuasive texts, especially in the use of Loaded Language, Exaggeration/Minimisation, Repetition and to cast Doubt. This insight can contribute to detecting online propaganda and misinformation, as well as to our general understanding of effective communication.
comment: Published in Proceedings of the 38th Canadian Conference on Artificial Intelligence CanAI 2025 Calgary Alberta May 26-27 2025. 5 figures 7 tables
☆ Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking
While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common "one problem, one solution" (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. To address this, we propose a "one problem, multiple solutions" (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that 1PNS further amplifies the effectiveness of TTS. Our code is available at https://github.com/fengjujf/Reasoning-Path-Divergence .
☆ QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback
Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it remains underexplored in domains that require interaction with hardware devices, such as quantum programming, where human coders write Python code that is executed on a quantum computer. To address this gap, we introduce QCoder Benchmark, an evaluation framework that assesses LLMs on quantum programming with feedback from simulated hardware devices. Our benchmark offers two key features. First, it supports evaluation using a quantum simulator environment beyond conventional Python execution, allowing feedback of domain-specific metrics such as circuit depth, execution time, and error classification, which can be used to guide better generation. Second, it incorporates human-written code submissions collected from real programming contests, enabling both quantitative comparisons and qualitative analyses of LLM outputs against human-written codes. Our experiments reveal that even advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach up to 78% accuracy, outperforming averaged success rates of human-written codes (39.98%). We release the QCoder Benchmark dataset and public evaluation API to support further research.
☆ ORBIT -- Open Recommendation Benchmark for Reproducible Research with Hidden Tests NeurIPS 2025
Recommender systems are among the most impactful AI applications, interacting with billions of users every day, guiding them to relevant products, services, or information tailored to their preferences. However, the research and development of recommender systems are hindered by existing datasets that fail to capture realistic user behaviors and inconsistent evaluation settings that lead to ambiguous conclusions. This paper introduces the Open Recommendation Benchmark for Reproducible Research with HIdden Tests (ORBIT), a unified benchmark for consistent and realistic evaluation of recommendation models. ORBIT offers a standardized evaluation framework of public datasets with reproducible splits and transparent settings for its public leaderboard. Additionally, ORBIT introduces a new webpage recommendation task, ClueWeb-Reco, featuring web browsing sequences from 87 million public, high-quality webpages. ClueWeb-Reco is a synthetic dataset derived from real, user-consented, and privacy-guaranteed browsing data. It aligns with modern recommendation scenarios and is reserved as the hidden test part of our leaderboard to challenge recommendation models' generalization ability. ORBIT measures 12 representative recommendation models on its public benchmark and introduces a prompted LLM baseline on the ClueWeb-Reco hidden test. Our benchmark results reflect general improvements of recommender systems on the public datasets, with variable individual performances. The results on the hidden test reveal the limitations of existing approaches in large-scale webpage recommendation and highlight the potential for improvements with LLM integrations. ORBIT benchmark, leaderboard, and codebase are available at https://www.open-reco-bench.ai.
comment: Accepted to NeurIPS 2025 Datasets & Benchmarks track
☆ Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
Knowledge distillation (KD) is an effective method for model compression and transferring knowledge between models. However, its effect on model's robustness against spurious correlations that degrade performance on out-of-distribution data remains underexplored. This study investigates the effect of knowledge distillation on the transferability of ``debiasing'' capabilities from teacher models to student models on natural language inference (NLI) and image classification tasks. Through extensive experiments, we illustrate several key findings: (i) overall the debiasing capability of a model is undermined post-KD; (ii) training a debiased model does not benefit from injecting teacher knowledge; (iii) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases; and (iv) we pin-point the internal attention pattern and circuit that causes the distinct behavior post-KD. Given the above findings, we propose three effective solutions to improve the distillability of debiasing methods: developing high quality data for augmentation, implementing iterative knowledge distillation, and initializing student models with weights obtained from teacher models. To the best of our knowledge, this is the first study on the effect of KD on debiasing and its interenal mechanism at scale. Our findings provide understandings on how KD works and how to design better debiasing methods.
☆ SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning
The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic two-step process that starts with an agent definition and generates diverse seed test cases that cover various risk outcomes, tool-use trajectories, and risk sources. Then, it iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of former attempts. To optimize the red-teaming cost, we present a model distillation approach that leverages structured forms of a teacher model's reasoning to train smaller models that are equally effective. Across diverse evaluation agent settings, our seed test case generation approach yields 2 -- 2.5x boost to the coverage of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer model improves attack success rate by 100%, surpassing the 671B Deepseek-R1 model. Our ablations and analyses validate the effectiveness of the iterative framework, structured reasoning, and the generalization of our red-teamer models.
☆ Artificial Intelligence-Enabled Analysis of Radiology Reports: Epidemiology and Consequences of Incidental Thyroid Findings
Importance Incidental thyroid findings (ITFs) are increasingly detected on imaging performed for non-thyroid indications. Their prevalence, features, and clinical consequences remain undefined. Objective To develop, validate, and deploy a natural language processing (NLP) pipeline to identify ITFs in radiology reports and assess their prevalence, features, and clinical outcomes. Design, Setting, and Participants Retrospective cohort of adults without prior thyroid disease undergoing thyroid-capturing imaging at Mayo Clinic sites from July 1, 2017, to September 30, 2023. A transformer-based NLP pipeline identified ITFs and extracted nodule characteristics from image reports from multiple modalities and body regions. Main Outcomes and Measures Prevalence of ITFs, downstream thyroid ultrasound, biopsy, thyroidectomy, and thyroid cancer diagnosis. Logistic regression identified demographic and imaging-related factors. Results Among 115,683 patients (mean age, 56.8 [SD 17.2] years; 52.9% women), 9,077 (7.8%) had an ITF, of which 92.9% were nodules. ITFs were more likely in women, older adults, those with higher BMI, and when imaging was ordered by oncology or internal medicine. Compared with chest CT, ITFs were more likely via neck CT, PET, and nuclear medicine scans. Nodule characteristics were poorly documented, with size reported in 44% and other features in fewer than 15% (e.g. calcifications). Compared with patients without ITFs, those with ITFs had higher odds of thyroid nodule diagnosis, biopsy, thyroidectomy and thyroid cancer diagnosis. Most cancers were papillary, and larger when detected after ITFs vs no ITF. Conclusions ITFs were common and strongly associated with cascades leading to the detection of small, low-risk cancers. These findings underscore the role of ITFs in thyroid cancer overdiagnosis and the need for standardized reporting and more selective follow-up.
☆ Detecting Data Contamination in LLMs via In-Context Learning
We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora. The method is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations.
☆ LLM-Centric RAG with Multi-Granular Indexing and Confidence Constraints
This paper addresses the issues of insufficient coverage, unstable results, and limited reliability in retrieval-augmented generation under complex knowledge environments, and proposes a confidence control method that integrates multi-granularity memory indexing with uncertainty estimation. The method builds a hierarchical memory structure that divides knowledge representations into different levels of granularity, enabling dynamic indexing and retrieval from local details to global context, and thus establishing closer semantic connections between retrieval and generation. On this basis, an uncertainty estimation mechanism is introduced to explicitly constrain and filter low-confidence paths during the generation process, allowing the model to maintain information coverage while effectively suppressing noise and false content. The overall optimization objective consists of generation loss, entropy constraints, and variance regularization, forming a unified confidence control framework. In the experiments, comprehensive sensitivity tests and comparative analyses were designed, covering hyperparameters, environmental conditions, and data structures, to verify the stability and robustness of the proposed method across different scenarios. The results show that the method achieves superior performance over existing models in QA accuracy, retrieval recall, ranking quality, and factual consistency, demonstrating the effectiveness of combining multi-granularity indexing with confidence control. This study not only provides a new technical pathway for retrieval-augmented generation but also offers practical evidence for improving the reliability and controllability of large models in complex contexts.
☆ VISTA Score: Verification In Sequential Turn-based Assessment
Hallucination--defined here as generating statements unsupported or contradicted by available evidence or conversational context--remains a major obstacle to deploying conversational AI systems in settings that demand factual reliability. Existing metrics either evaluate isolated responses or treat unverifiable content as errors, limiting their use for multi-turn dialogue. We introduce VISTA (Verification In Sequential Turn-based Assessment), a framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking. VISTA decomposes each assistant turn into atomic factual claims, verifies them against trusted sources and dialogue history, and categorizes unverifiable statements (subjective, contradicted, lacking evidence, or abstaining). Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines. Human evaluation confirms that VISTA's decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks. By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems.
☆ Recursive numeral systems are highly regular and easy to process
Previous work has argued that recursive numeral systems optimise the trade-off between lexicon size and average morphosyntatic complexity (Deni\'c and Szymanik, 2024). However, showing that only natural-language-like systems optimise this tradeoff has proven elusive, and the existing solution has relied on ad-hoc constraints to rule out unnatural systems (Yang and Regier, 2025). Here, we argue that this issue arises because the proposed trade-off has neglected regularity, a crucial aspect of complexity central to human grammars in general. Drawing on the Minimum Description Length (MDL) approach, we propose that recursive numeral systems are better viewed as efficient with regard to their regularity and processing complexity. We show that our MDL-based measures of regularity and processing complexity better capture the key differences between attested, natural systems and unattested but possible ones, including "optimal" recursive numeral systems from previous work, and that the ad-hoc constraints from previous literature naturally follow from regularity. Our approach highlights the need to incorporate regularity across sets of forms in studies that attempt to measure and explain optimality in language.
☆ Quantitative Intertextuality from the Digital Humanities Perspective: A Survey
The connection between texts is referred to as intertextuality in literary theory, which served as an important theoretical basis in many digital humanities studies. Over the past decade, advancements in natural language processing have ushered intertextuality studies into the quantitative age. Large-scale intertextuality research based on cutting-edge methods has continuously emerged. This paper provides a roadmap for quantitative intertextuality studies, summarizing their data, methods, and applications. Drawing on data from multiple languages and topics, this survey reviews methods from statistics to deep learning. It also summarizes their applications in humanities and social sciences research and the associated platform tools. Driven by advances in computer technology, more precise, diverse, and large-scale intertext studies can be anticipated. Intertextuality holds promise for broader application in interdisciplinary research bridging AI and the humanities.
Dataset Creation and Baseline Models for Sexism Detection in Hausa
Sexism reinforces gender inequality and social exclusion by perpetuating stereotypes, bias, and discriminatory norms. Noting how online platforms enable various forms of sexism to thrive, there is a growing need for effective sexism detection and mitigation strategies. While computational approaches to sexism detection are widespread in high-resource languages, progress remains limited in low-resource languages where limited linguistic resources and cultural differences affect how sexism is expressed and perceived. This study introduces the first Hausa sexism detection dataset, developed through community engagement, qualitative coding, and data augmentation. For cultural nuances and linguistic representation, we conducted a two-stage user study (n=66) involving native speakers to explore how sexism is defined and articulated in everyday discourse. We further experiment with both traditional machine learning classifiers and pre-trained multilingual language models and evaluating the effectiveness few-shot learning in detecting sexism in Hausa. Our findings highlight challenges in capturing cultural nuance, particularly with clarification-seeking and idiomatic expressions, and reveal a tendency for many false positives in such cases.
comment: 9 pages, 1 figure, 4 tables
☆ Elastic Architecture Search for Efficient Language Models ICME 2025
As large pre-trained language models become increasingly critical to natural language understanding (NLU) tasks, their substantial computational and memory requirements have raised significant economic and environmental concerns. Addressing these challenges, this paper introduces the Elastic Language Model (ELM), a novel neural architecture search (NAS) method optimized for compact language models. ELM extends existing NAS approaches by introducing a flexible search space with efficient transformer blocks and dynamic modules for dimension and head number adjustment. These innovations enhance the efficiency and flexibility of the search process, which facilitates more thorough and effective exploration of model architectures. We also introduce novel knowledge distillation losses that preserve the unique characteristics of each block, in order to improve the discrimination between architectural choices during the search process. Experiments on masked language modeling and causal language modeling tasks demonstrate that models discovered by ELM significantly outperform existing methods.
comment: ICME 2025
☆ Kad: A Framework for Proxy-based Test-time Alignment with Knapsack Approximation Deferral
Several previous works concluded that the largest part of generation capabilities of large language models (LLM) are learned (early) during pre-training. However, LLMs still require further alignment to adhere to downstream task requirements and stylistic preferences, among other desired properties. As LLMs continue to scale in terms of size, the computational cost of alignment procedures increase prohibitively. In this work, we propose a novel approach to circumvent these costs via proxy-based test-time alignment, i.e. using guidance from a small aligned model. Our approach can be described as token-specific cascading method, where the token-specific deferral rule is reduced to 0-1 knapsack problem. In this setting, we derive primal and dual approximations of the optimal deferral decision. We experimentally show the benefits of our method both in task performance and speculative decoding speed.
☆ Semantically-Aware LLM Agent to Enhance Privacy in Conversational AI Services
With the increasing use of conversational AI systems, there is growing concern over privacy leaks, especially when users share sensitive personal data in interactions with Large Language Models (LLMs). Conversations shared with these models may contain Personally Identifiable Information (PII), which, if exposed, could lead to security breaches or identity theft. To address this challenge, we present the Local Optimizations for Pseudonymization with Semantic Integrity Directed Entity Detection (LOPSIDED) framework, a semantically-aware privacy agent designed to safeguard sensitive PII data when using remote LLMs. Unlike prior work that often degrade response quality, our approach dynamically replaces sensitive PII entities in user prompts with semantically consistent pseudonyms, preserving the contextual integrity of conversations. Once the model generates its response, the pseudonyms are automatically depseudonymized, ensuring the user receives an accurate, privacy-preserving output. We evaluate our approach using real-world conversations sourced from ShareGPT, which we further augment and annotate to assess whether named entities are contextually relevant to the model's response. Our results show that LOPSIDED reduces semantic utility errors by a factor of 5 compared to baseline techniques, all while enhancing privacy.
comment: Accepted to IEEE Big Data 2025
☆ Semantic Frame Aggregation-based Transformer for Live Video Comment Generation
Live commenting on video streams has surged in popularity on platforms like Twitch, enhancing viewer engagement through dynamic interactions. However, automatically generating contextually appropriate comments remains a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook an important aspect of prioritizing video frames that are most relevant to ongoing viewer interactions. This prioritization is crucial for producing contextually appropriate comments. To address this gap, we introduce a novel Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. This method not only leverages CLIP's visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to ongoing viewer conversation. It employs an efficient weighted sum of frames technique to emphasize informative frames while focusing less on irrelevant ones. Finally, our comment decoder with a cross-attention mechanism that attends to each modality ensures that the generated comment reflects contextual cues from both chats and video. Furthermore, to address the limitations of existing datasets, which predominantly focus on Chinese-language content with limited video categories, we have constructed a large scale, diverse, multimodal English video comments dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of our SFAT model by comparing it to existing methods for generating comments from live video and ongoing dialogue contexts.
Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations
Clinical documentation increasingly uses automatic speech recognition and summarization, yet converting conversations into actionable medical orders for Electronic Health Records remains unexplored. A solution to this problem can significantly reduce the documentation burden of clinicians and directly impact downstream patient care. We introduce the MEDIQA-OE 2025 shared task, the first challenge on extracting medical orders from doctor-patient conversations. Six teams participated in the shared task and experimented with a broad range of approaches, and both closed- and open-weight large language models (LLMs). In this paper, we describe the MEDIQA-OE task, dataset, final leaderboard ranking, and participants' solutions.
☆ Frame Semantic Patterns for Identifying Underreporting of Notifiable Events in Healthcare: The Case of Gender-Based Violence
We introduce a methodology for the identification of notifiable events in the domain of healthcare. The methodology harnesses semantic frames to define fine-grained patterns and search them in unstructured data, namely, open-text fields in e-medical records. We apply the methodology to the problem of underreporting of gender-based violence (GBV) in e-medical records produced during patients' visits to primary care units. A total of eight patterns are defined and searched on a corpus of 21 million sentences in Brazilian Portuguese extracted from e-SUS APS. The results are manually evaluated by linguists and the precision of each pattern measured. Our findings reveal that the methodology effectively identifies reports of violence with a precision of 0.726, confirming its robustness. Designed as a transparent, efficient, low-carbon, and language-agnostic pipeline, the approach can be easily adapted to other health surveillance contexts, contributing to the broader, ethical, and explainable use of NLP in public health systems.
☆ RepV: Safety-Separable Latent Spaces for Scalable Neurosymbolic Plan Verification
As AI systems migrate to safety-critical domains, verifying that their actions comply with well-defined rules remains a challenge. Formal methods provide provable guarantees but demand hand-crafted temporal-logic specifications, offering limited expressiveness and accessibility. Deep learning approaches enable evaluation of plans against natural-language constraints, yet their opaque decision process invites misclassifications with potentially severe consequences. We introduce RepV, a neurosymbolic verifier that unifies both views by learning a latent space where safe and unsafe plans are linearly separable. Starting from a modest seed set of plans labeled by an off-the-shelf model checker, RepV trains a lightweight projector that embeds each plan, together with a language model-generated rationale, into a low-dimensional space; a frozen linear boundary then verifies compliance for unseen natural-language rules in a single forward pass. Beyond binary classification, RepV provides a probabilistic guarantee on the likelihood of correct verification based on its position in the latent space. This guarantee enables a guarantee-driven refinement of the planner, improving rule compliance without human annotations. Empirical evaluations show that RepV improves compliance prediction accuracy by up to 15% compared to baseline methods while adding fewer than 0.2M parameters. Furthermore, our refinement framework outperforms ordinary fine-tuning baselines across various planning domains. These results show that safety-separable latent spaces offer a scalable, plug-and-play primitive for reliable neurosymbolic plan verification. Code and data are available at: https://repv-project.github.io/.
comment: Code and data are available at: https://repv-project.github.io/
☆ Understanding and Enhancing Mamba-Transformer Hybrids for Memory Recall and Language Modeling
Hybrid models that combine state space models (SSMs) with attention mechanisms have shown strong performance by leveraging the efficiency of SSMs and the high recall ability of attention. However, the architectural design choices behind these hybrid models remain insufficiently understood. In this work, we analyze hybrid architectures through the lens of memory utilization and overall performance, and propose a complementary method to further enhance their effectiveness. We first examine the distinction between sequential and parallel integration of SSM and attention layers. Our analysis reveals several interesting findings, including that sequential hybrids perform better on shorter contexts, whereas parallel hybrids are more effective for longer contexts. We also introduce a data-centric approach of continually training on datasets augmented with paraphrases, which further enhances recall while preserving other capabilities. It generalizes well across different base models and outperforms architectural modifications aimed at enhancing recall. Our findings provide a deeper understanding of hybrid SSM-attention models and offer practical guidance for designing architectures tailored to various use cases. Our findings provide a deeper understanding of hybrid SSM-attention models and offer practical guidance for designing architectures tailored to various use cases.
☆ The Denario project: Deep knowledge AI agents for scientific discovery
We present Denario, an AI multi-agent system designed to serve as a scientific research assistant. Denario can perform many different tasks, such as generating ideas, checking the literature, developing research plans, writing and executing code, making plots, and drafting and reviewing a scientific paper. The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or carrying out end-to-end scientific analysis using Cmbagent as a deep-research backend. In this work, we describe in detail Denario and its modules, and illustrate its capabilities by presenting multiple AI-generated papers generated by it in many different scientific disciplines such as astrophysics, biology, biophysics, biomedical informatics, chemistry, material science, mathematical physics, medicine, neuroscience and planetary science. Denario also excels at combining ideas from different disciplines, and we illustrate this by showing a paper that applies methods from quantum physics and machine learning to astrophysical data. We report the evaluations performed on these papers by domain experts, who provided both numerical scores and review-like feedback. We then highlight the strengths, weaknesses, and limitations of the current system. Finally, we discuss the ethical implications of AI-driven research and reflect on how such technology relates to the philosophy of science. We publicly release the code at https://github.com/AstroPilot-AI/Denario. A Denario demo can also be run directly on the web at https://huggingface.co/spaces/astropilot-ai/Denario, and the full app will be deployed on the cloud.
comment: 272 pages. Examples of 11 AI-generated paper drafts from different scientific disciplines. Code publicly available at https://github.com/AstroPilot-AI/Denario
☆ Evaluating Perspectival Biases in Cross-Modal Retrieval
Multimodal retrieval systems are expected to operate in a semantic space, agnostic to the language or cultural origin of the query. In practice, however, retrieval outcomes systematically reflect perspectival biases: deviations shaped by linguistic prevalence and cultural associations. We study two such biases. First, prevalence bias refers to the tendency to favor entries from prevalent languages over semantically faithful entries in image-to-text retrieval. Second, association bias refers to the tendency to favor images culturally associated with the query over semantically correct ones in text-to-image retrieval. Results show that explicit alignment is a more effective strategy for mitigating prevalence bias. However, association bias remains a distinct and more challenging problem. These findings suggest that achieving truly equitable multimodal systems requires targeted strategies beyond simple data scaling and that bias arising from cultural association may be treated as a more challenging problem than one arising from linguistic prevalence.
☆ CATArena: Evaluation of LLM Agents through Iterative Tournament Competitions
Large Language Model (LLM) agents have evolved from basic text generation to autonomously completing complex tasks through interaction with external tools. However, current benchmarks mainly assess end-to-end performance in fixed scenarios, restricting evaluation to specific skills and suffering from score saturation and growing dependence on expert annotation as agent capabilities improve. In this work, we emphasize the importance of learning ability, including both self-improvement and peer-learning, as a core driver for agent evolution toward human-level intelligence. We propose an iterative, competitive peer-learning framework, which allows agents to refine and optimize their strategies through repeated interactions and feedback, thereby systematically evaluating their learning capabilities. To address the score saturation issue in current benchmarks, we introduce CATArena, a tournament-style evaluation platform featuring four diverse board and card games with open-ended scoring. By providing tasks without explicit upper score limits, CATArena enables continuous and dynamic evaluation of rapidly advancing agent capabilities. Experimental results and analyses involving both minimal and commercial code agents demonstrate that CATArena provides reliable, stable, and scalable benchmarking for core agent abilities, particularly learning ability and strategy coding.
♻ ☆ TinyTim: A Family of Language Models for Divergent Generation NeurIPS
In the search for artificial general intelligence, model development and training has focused primarily on vast datasets of known problems and their accepted solutions. This process necessarily produces convergent systems which are fundamentally incapable of the conceptual reframing that is required for genuine creative breakthroughs. Inspired by the divergent cognitive processes that allow humans to make such creative leaps, our work introduces a family of language models, TinyTim, to serve as sources of divergent generation within broader systems. These models have been created by fine-tuning on the anti-parsimonious text of James Joyce's `Finnegans Wake'. Quantitative analysis of both an unsupervised fine-tuned model (TinyTim-V1) and a new instruction-tuned variant (TinyTim-V2) demonstrates a profound capacity for lexical invention; the foundational V1 model exhibits a Yule's K score for lexical richness over twenty times greater than that of convergent baselines. This trait is a stable property of the family, as the instruction-tuned V2 maintains a statistically distinct profile and resists factual convergence, sacrificing benchmark performance to preserve its core generative style. This work establishes a methodology for engineering specialized divergent models that, when paired with convergent systems, can reframe problems and force breakthroughs beyond the reach of statistical optimization alone.
comment: 7 pages, 3 figures, accepted to NeurIPS Creative AI track, models available at https://hf.co/npc-worldwide/
♻ ☆ Completion $\neq$ Collaboration: Scaling Collaborative Effort with Agents
Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent's utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.
comment: 22 pages, 5 figures, 3 tables
♻ ☆ Comparing human and LLM politeness strategies in free production EMNLP 2025
Polite speech poses a fundamental alignment challenge for large language models (LLMs). Humans deploy a rich repertoire of linguistic strategies to balance informational and social goals -- from positive approaches that build rapport (compliments, expressions of interest) to negative strategies that minimize imposition (hedging, indirectness). We investigate whether LLMs employ a similarly context-sensitive repertoire by comparing human and LLM responses in both constrained and open-ended production tasks. We find that larger models ($\ge$70B parameters) successfully replicate key preferences from the computational pragmatics literature, and human evaluators surprisingly prefer LLM-generated responses in open-ended contexts. However, further linguistic analyses reveal that models disproportionately rely on negative politeness strategies even in positive contexts, potentially leading to misinterpretations. While modern LLMs demonstrate an impressive handle on politeness strategies, these subtle differences raise important questions about pragmatic alignment in AI systems.
comment: 25 pages, 5 figures | EMNLP 2025 camera-ready version
♻ ☆ Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
Integrating audio and visual data for training multimodal foundational models remains a challenge. The Audio-Video Vector Alignment (AVVA) framework addresses this by considering AV scene alignment beyond mere temporal synchronization, and leveraging Large Language Models (LLMs) for data curation. AVVA implements a scoring mechanism for selecting aligned training data segments. It integrates Whisper, a speech-based foundation model, for audio and DINOv2 for video analysis in a dual-encoder structure with contrastive learning on AV pairs. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate the effectiveness of the proposed model architecture and data curation approach. AVVA achieves a significant improvement in top-k accuracies for video-to-audio retrieval on all datasets compared to DenseAV, while using only 192 hrs of curated training data. Furthermore, an ablation study indicates that the data curation process effectively trades data quality for data quantity, yielding increases in top-k retrieval accuracies on AudioCaps, VALOR, and VGGSound, compared to training on the full spectrum of uncurated data.
comment: 5 pages, 5 figures, 2 tables. Accepted at EUSIPCO 2025
♻ ☆ Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality EMNLP 2025
Supervised fine-tuning (SFT) is a critical step in aligning large language models (LLMs) with human instructions and values, yet many aspects of SFT remain poorly understood. We trained a wide range of base models on a variety of datasets including code generation, mathematical reasoning, and general-domain tasks, resulting in 1,000+ SFT models under controlled conditions. We then identified the dataset properties that matter most and examined the layer-wise modifications introduced by SFT. Our findings reveal that some training-task synergies persist across all models while others vary substantially, emphasizing the importance of model-specific strategies. Moreover, we demonstrate that perplexity consistently predicts SFT effectiveness, often surpassing superficial similarity between the training data and the benchmark, and that mid-layer weight changes correlate most strongly with performance gains. We release these 1,000+ SFT models and benchmark results to accelerate further research. All resources are available at https://github.com/llm-jp/massive-sft.
comment: Accepted to EMNLP 2025 (Main Conference). Models and evaluation results available at: https://github.com/llm-jp/massive-sft
♻ ☆ Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training
Enhancing reasoning capabilities in small language models is critical for specialized applications such as medical question answering, particularly in underrepresented languages like Persian. In this study, we employ Reinforcement Learning with AI Feedback (RLAIF) and Direct preference optimization (DPO) to improve the reasoning skills of a general-purpose Persian language model. To achieve this, we translated a multiple-choice medical question-answering dataset into Persian and used RLAIF to generate rejected-preferred answer pairs, which are essential for DPO training. By prompting both teacher and student models to produce Chain-of-Thought (CoT) reasoning responses, we compiled a dataset containing correct and incorrect reasoning trajectories. This dataset, comprising 2 million tokens in preferred answers and 2.5 million tokens in rejected ones, was used to train a baseline model, significantly enhancing its medical reasoning capabilities in Persian. Remarkably, the resulting model outperformed its predecessor, gaokerena-V, which was trained on approximately 57 million tokens, despite leveraging a much smaller dataset. These results highlight the efficiency and effectiveness of reasoning-focused training approaches in developing domain-specific language models with limited data availability.
comment: 7 pages, 5 figures
♻ ☆ Controlling Thinking Speed in Reasoning Models NeurIPS 2025
Human cognition is theorized to operate in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. While current Large Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform fast thinking leads to high computational overhead and latency. In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance. For the first question, we identify the steering vector that governs slow-fast thinking transitions in LRMs' representation space. Using this vector, we achieve the first representation editing-based test-time scaling effect, outperforming existing prompt-based scaling methods. For the second question, we apply real-time difficulty estimation to signal reasoning segments of varying complexity. Combining these techniques, we propose the first reasoning strategy that enables fast processing of easy steps and deeper analysis for complex reasoning. Without any training or additional cost, our plug-in module delivers an average +1.3% accuracy with -8.6% token usage across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research.
comment: NeurIPS 2025 Spotlight
♻ ☆ RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models: https://huggingface.co/collections/nvidia/reward-models-10-2025
comment: Added link to access models: https://huggingface.co/collections/nvidia/reward-models-10-2025
♻ ☆ CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting ISWC 2025
Language interpretation is a compositional process, in which the meaning of more complex linguistic structures is inferred from the meaning of their parts. Large language models possess remarkable language interpretation capabilities and have been successfully applied to interpret questions by mapping them to SPARQL queries. An open question is how systematic this interpretation process is. Toward this question, in this paper, we propose a benchmark for investigating to what extent the abilities of LLMs to interpret questions are actually compositional. For this, we generate three datasets of varying difficulty based on graph patterns in DBpedia, relying on Lemon lexica for verbalization. Our datasets are created in a very controlled fashion in order to test the ability of LLMs to interpret structurally complex questions, given that they have seen the atomic building blocks. This allows us to evaluate to what degree LLMs are able to interpret complex questions for which they "understand" the atomic parts. We conduct experiments with models of different sizes using both various prompt and few-shot optimization techniques as well as fine-tuning. Our results show that performance in terms of macro $F_1$ degrades from $0.45$ over $0.26$ down to $0.09$ with increasing deviation from the samples optimized on. Even when all necessary information was provided to the model in the input, the $F_1$ scores do not exceed $0.57$ for the dataset of lowest complexity. We thus conclude that LLMs struggle to systematically and compositionally interpret questions and map them into SPARQL queries.
comment: Research Track, 24th International Semantic Web Conference (ISWC 2025), November 2-6, 2025, Nara, Japan
♻ ☆ Unveiling Unicode's Unseen Underpinnings in Undermining Authorship Attribution
When using a public communication channel -- whether formal or informal, such as commenting or posting on social media -- end users have no expectation of privacy: they compose a message and broadcast it for the world to see. Even if an end user takes utmost precautions to anonymize their online presence -- using an alias or pseudonym; masking their IP address; spoofing their geolocation; concealing their operating system and user agent; deploying encryption; registering with a disposable phone number or email; disabling non-essential settings; revoking permissions; and blocking cookies and fingerprinting -- one obvious element still lingers: the message itself. Assuming they avoid lapses in judgment or accidental self-exposure, there should be little evidence to validate their actual identity, right? Wrong. The content of their message -- necessarily open for public consumption -- exposes an attack vector: stylometric analysis, or author profiling. In this paper, we dissect the technique of stylometry, discuss an antithetical counter-strategy in adversarial stylometry, and devise enhancements through Unicode steganography.
comment: 33 pages, 7 figures, 3 tables
♻ ☆ Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media
On social media, many individuals experiencing suicidal ideation (SI) do not disclose their distress explicitly. Instead, signs may surface indirectly through everyday posts or peer interactions. Detecting such implicit signals early is critical but remains challenging. We frame early and implicit SI as a forward-looking prediction task and develop a computational framework that models a user's information environment, consisting of both their longitudinal posting histories as well as the discourse of their socially proximal peers. We adopted a composite network centrality measure to identify top neighbors of a user, and temporally aligned the user's and neighbors' interactions -- integrating the multi-layered signals in a fine-tuned DeBERTa-v3 model. In a Reddit study of 1,000 (500 Case and 500 Control) users, our approach improves early and implicit SI detection by 15% over individual-only baselines. These findings highlight that peer interactions offer valuable predictive signals and carry broader implications for designing early detection systems that capture indirect as well as masked expressions of risk in online environments.
♻ ☆ LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback
Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.
♻ ☆ Are You There God? Lightweight Narrative Annotation of Christian Fiction with LMs
In addition to its more widely studied cultural movements, American Evangelicalism has a well-developed but less externally visible literary side. Christian Fiction, however, has been little studied, and what scholarly attention there is has focused on the explosively popular Left Behind series. In this work, we use computational tools to provide both a broad topical overview of Christian Fiction as a genre and a more directed exploration of how its authors depict divine acts. Working with human annotators, we first developed a codebook for identifying "acts of God." We then adapted the codebook for use by a recent, lightweight LM with the assistance of a much larger model. The laptop-scale LM is largely capable of matching human annotations, even when the task is subtle and challenging. Using these annotations, we show that significant and meaningful differences exist between divine acts depicted by the Left Behind books and Christian Fiction more broadly.
comment: Accepted to CHR 2025
♻ ☆ Unstructured Evidence Attribution for Long Context Query Focused Summarization EMNLP 2025
Large language models (LLMs) are capable of generating coherent summaries from very long contexts given a user query, and extracting and citing evidence spans helps improve the trustworthiness of these summaries. Whereas previous work has focused on evidence citation with fixed levels of granularity (e.g. sentence, paragraph, document, etc.), we propose to extract unstructured (i.e., spans of any length) evidence in order to acquire more relevant and consistent evidence than in the fixed granularity case. We show how existing systems struggle to copy and properly cite unstructured evidence, which also tends to be "lost-in-the-middle". To help models perform this task, we create the Summaries with Unstructured Evidence Text dataset (SUnsET), a synthetic dataset generated using a novel pipeline, which can be used as training supervision for unstructured evidence summarization. We demonstrate across 5 LLMs and 4 datasets spanning human written, synthetic, single, and multi-document settings that LLMs adapted with SUnsET generate more relevant and factually consistent evidence with their summaries, extract evidence from more diverse locations in their context, and can generate more relevant and consistent summaries than baselines with no fine-tuning and fixed granularity evidence. We release SUnsET and our generation code to the public.
comment: EMNLP 2025 Main; 29 pages; 24 figures; 8 tables
♻ ☆ Epistemic Diversity and Knowledge Collapse in Large Language Models
Large language models (LLMs) tend to generate lexically, semantically, and stylistically homogenous texts. This poses a risk of knowledge collapse, where homogenous LLMs mediate a shrinking in the range of accessible information over time. Existing works on homogenization are limited by a focus on closed-ended multiple-choice setups or fuzzy semantic features, and do not look at trends across time and cultural contexts. To overcome this, we present a new methodology to measure epistemic diversity, i.e., variation in real-world claims in LLM outputs, which we use to perform a broad empirical study of LLM knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations sourced from real user chats. For the topics in our study, we show that while newer models tend to generate more diverse claims, nearly all models are less epistemically diverse than a basic web search. We find that model size has a negative impact on epistemic diversity, while retrieval-augmented generation (RAG) has a positive impact, though the improvement from RAG varies by the cultural context. Finally, compared to a traditional knowledge source (Wikipedia), we find that country-specific claims reflect the English language more than the local one, highlighting a gap in epistemic representation
comment: 16 pages; 8 figures, 4 tables; v2 changelog: Fixed the modeling for table 3, random effect is the model version; v3 changelog: Fixed minor formatting issues in tables 2 and 3; v4 changelog: Fixed some typos and model description
♻ ☆ Dependency Structure Augmented Contextual Scoping Framework for Multimodal Aspect-Based Sentiment Analysis
Multimodal Aspect-Based Sentiment Analysis (MABSA) seeks to extract fine-grained information from image-text pairs to identify aspect terms and determine their sentiment polarity. However, existing approaches often fall short in simultaneously addressing three core challenges: Sentiment Cue Perception (SCP), Multimodal Information Misalignment (MIM), and Semantic Noise Elimination (SNE). To overcome these limitations, we propose DASCO (\textbf{D}ependency Structure \textbf{A}ugmented \textbf{Sco}ping Framework), a fine-grained scope-oriented framework that enhances aspect-level sentiment reasoning by leveraging dependency parsing trees. First, we designed a multi-task pretraining strategy for MABSA on our base model, combining aspect-oriented enhancement, image-text matching, and aspect-level sentiment-sensitive cognition. This improved the model's perception of aspect terms and sentiment cues while achieving effective image-text alignment, addressing key challenges like SCP and MIM. Furthermore, we incorporate dependency trees as syntactic branch combining with semantic branch, guiding the model to selectively attend to critical contextual elements within a target-specific scope while effectively filtering out irrelevant noise for addressing SNE problem. Extensive experiments on two benchmark datasets across three subtasks demonstrate that DASCO achieves state-of-the-art performance in MABSA, with notable gains in JMASA (+2.3\% F1 and +3.5\% precision on Twitter2015). The source code is available at https://github.com/LHaoooo/DASCO .
♻ ☆ PairUni: Pairwise Training for Unified Multimodal Language Models
Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Codes are available at https://github.com/Haochen-Wang409/PairUni.
comment: 21 pages, 11 figures, and 8 tables
♻ ☆ Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks EMNLP
Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming, its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-$N$) and process-level (tree search) verification under realistic low-$N$ budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.
comment: Accepted to EMNLP - NLLP Workshop
♻ ☆ More of the Same: Persistent Representational Harms Under Increased Representation NeurIPS 2025
To recognize and mitigate the harms of generative AI systems, it is crucial to consider who is represented in the outputs of generative AI systems and how people are represented. A critical gap emerges when naively improving who is represented, as this does not imply bias mitigation efforts have been applied to address how people are represented. We critically examined this by investigating gender representation in occupation across state-of-the-art large language models. We first show evidence suggesting that over time there have been interventions to models altering the resulting gender distribution, and we find that women are more represented than men when models are prompted to generate biographies or personas. We then demonstrate that representational biases persist in how different genders are represented by examining statistically significant word differences across genders. This results in a proliferation of representational harms, stereotypes, and neoliberalism ideals that, despite existing interventions to increase female representation, reinforce existing systems of oppression.
comment: Accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) as a poster paper; 39 pages, 7 figures, 15 tables
♻ ☆ MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks NeurIPS 2025
The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task-specific, evidence-based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi-agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open-sourced at https://medagentboard.netlify.app/.
comment: Accepted by NeurIPS 2025 Datasets & Benchmarks Track
♻ ☆ Wisdom and Delusion of LLM Ensembles for Code Generation and Repair
Today's pursuit of a single Large Language Model (LMM) for all software engineering tasks is resource-intensive and overlooks the potential benefits of complementarity, where different models contribute unique strengths. However, the degree to which coding LLMs complement each other and the best strategy for maximizing an ensemble's potential are unclear, leaving practitioners without a clear path to move beyond single-model systems. To address this gap, we empirically compare ten individual LLMs from five families, and three ensembles of these LLMs across three software engineering benchmarks covering code generation and program repair. We assess the complementarity between models and the performance gap between the best individual model and the ensembles. Next, we evaluate various selection heuristics to identify correct solutions from an ensemble's candidate pool. We find that the theoretical upperbound for an ensemble's performance can be 83% above the best single model. Our results show that consensus-based strategies for selecting solutions fall into a "popularity trap," amplifying common but incorrect outputs. In contrast, a diversity-based strategy realizes up to 95% of this theoretical potential, and proves effective even in small two-model ensembles, enabling a cost-efficient way to enhance performance by leveraging multiple LLMs.
comment: Added Acknowledgments section and hyphenated last names
♻ ☆ TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation
Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of the capability requirement. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities, including opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short of capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.
comment: Main paper: 11 pages, 3 figures, 6 tables. Appendix: 28 pages. Bangde Du and Minghao Guo contributed equally. Corresponding authors: Ziyi Ye (ziyiye@fudan.edu.cn), Qingyao Ai (aiqy@tsinghua.edu.cn)
♻ ☆ ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem's semantic intent. This limitation arises from the LLM approaches' treating autoformalization as a simplistic translation task which lacks mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validations, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 22.6 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.
comment: https://github.com/Chen-GX/ReForm
♻ ☆ Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts
In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity. For example, only about 600 transcripts are available in AphasiaBank yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when such are sparse. Therefore, this study constructs and validates two methods to generate synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method leverages a procedural programming approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The methods generate transcripts across four severity levels (Mild, Moderate, Severe, Very Severe) through word dropping, filler insertion, and paraphasia substitution. Overall, we found, compared to human-elicited transcripts, Mistral 7b Instruct best captures key aspects of linguistic degradation observed in aphasia, showing realistic directional changes in NDW, word count, and word length amongst the synthetic generation methods. Based on the results, future work should plan to create a larger dataset, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of the synthetic transcripts.
comment: 19 pages, 1 figure, 7 tables
♻ ☆ Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i)Visual Quality-semantic alignment with human posters, (ii)Textual Coherence-language fluency, (iii)Holistic Assessment-six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv)PaperQuiz-the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a)Parser distills the paper into a structured asset library; the (b)Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c)Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs-though visually appealing at first glance-often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g. based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster - all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.
comment: Project Page: https://github.com/Paper2Poster/Paper2Poster
♻ ☆ BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains
The rapid advancement of large language models(LLMs) has intensified the need for domain and culture specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain and language specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content compared to Hindi across all domains. Subdomain-level analysis shows that areas such as Cyber Law, International Finance perform relatively well, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India's diverse knowledge domains. It enables assessment of models' ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.
♻ ☆ MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning? NeurIPS'25
Large foundation models face challenges in acquiring transferable, structured thinking abilities, especially when supervised with rigid templates or crowd-annotated instruction datasets. Unlike prior approaches, we focus on a thinking-centric data synthesis paradigm that enables models to evolve through self-generated, cognitively guided data. We propose MindGYM, a structured and scalable framework for question synthesis, composed of: (1) Cognitive Thinking Process Injection, which infuses high-level reasoning objectives to shape the model's synthesis behavior; (2) Seed Single-Hop Question Synthesis, generating atomic questions from diverse semantic types to encourage broader thinking; and (3) Challenging Multi-Hop QA Synthesis, composing more complex multi-hop questions based on QA seeds for deeper reasoning. Detailed analysis shows that synthetic data generated by our method achieves 16.7% higher average quality and 67.91% lower quality variance compared to baseline sources, highlighting that both high-quality and self-contained data are essential for effective, thinking-oriented fine-tuning. MindGYM improves performance on six reasoning benchmarks, achieving gains of up to 16% on MathVision using only 400 data samples, and generalizable improvements across different model sizes and architectures. MindGYM underscores the viability of self-challenging mechanisms in refining large model capabilities while minimizing human intervention and resource demands. Code and data are released to promote data-centric research into self-evolving foundation models driven by their internal reasoning capabilities.
comment: Accepted by NeurIPS'25. 30 pages, 2 figures, 13 tables
♻ ☆ UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in Omni Models
Multimodal Large Languages models have been progressing from uni-modal understanding toward unifying visual, audio and language modalities, collectively termed omni models. However, the correlation between uni-modal and omni-modal remains unclear, which requires comprehensive evaluation to drive omni model's intelligence evolution. In this work, we introduce a novel, high-quality, and UNified Omni model benchmark, UNO-Bench. This benchmark is designed to effectively evaluate both UNi-modal and Omni-modal capabilities under a unified ability taxonomy, spanning 44 task types and 5 modality combinations. It includes 1250 human curated samples for omni-modal with 98% cross-modality solvability, and 2480 enhanced uni-modal samples. The human-generated dataset is well-suited to real-world scenarios, particularly within the Chinese context, whereas the automatically compressed dataset offers a 90% increase in speed and maintains 98% consistency across 18 public benchmarks. In addition to traditional multi-choice questions, we propose an innovative multi-step open-ended question format to assess complex reasoning. A general scoring model is incorporated, supporting 6 question types for automated evaluation with 95% accuracy. Experimental result shows the Compositional Law between omni-modal and uni-modal performance and the omni-modal capability manifests as a bottleneck effect on weak models, while exhibiting synergistic promotion on strong models.
comment: v3: Switch the paper template. Work in progress. Github: https://github.com/meituan-longcat/UNO-Bench Hugging Face: https://huggingface.co/datasets/meituan-longcat/UNO-Bench
♻ ☆ Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction
With the emergence of large language models (LLMs), there is an expectation that LLMs can effectively extract explicit information from complex real-world documents (e.g., papers, reports). However, most LLMs generate paragraph-style answers that are chaotic, disorganized, and untraceable. To bridge this gap, we introduce the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with data and documents of varying lengths designed to systematically evaluate the ability of LLMs to comprehend fragmented documents and reconstruct isolated information into one organized table. Unlike conventional text-to-table tasks, which rely on fixed schema and narrow task domains, AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries. In the experiment, we evaluated both open-source and closed-source state-of-the-art LLMs. The results show that even the most advanced models struggled significantly. The benchmark is available at https://anonymous.4open.science/r/AOE-Benchmark/.
♻ ☆ Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens EMNLP 2025
Previous research has primarily focused on the cognitive error detection capabilities of Large Language Models (LLMs), often prompting them to analyze mistakes in reasoning chains. However, few studies have examined the meta-cognitive abilities of LLMs (e.g., their self-awareness of step errors), which are crucial for their reliability. While studies on LLM self-evaluation present some measures, such as perplexity, which can reflect the answer correctness and be viewed as the lens of meta-cognition, they lack step-level analysis and adaptation. This paper studies the evaluation of LLM meta-cognition using the current lenses and how to improve these lenses. Specifically, we propose AutoMeco, an Automated Meta-cognition Evaluation framework for benchmarking the existing lenses. Furthermore, a training-free Markovian Intrinsic Reward Adjustment strategy, MIRA, is proposed to boost current meta-cognition lenses. Experimental results on three mathematical reasoning datasets and three LLMs show the reasonableness of AutoMeco by comparing it with Best-of-N verification. Moreover, the meta-cognition ability of LLMs can be better evaluated using MIRA.
comment: Accepted to EMNLP 2025
♻ ☆ Hysteresis Activation Function for Efficient Inference NeurIPS
The widely used ReLU is favored for its hardware efficiency, {as the implementation at inference is a one bit sign case,} yet suffers from issues such as the ``dying ReLU'' problem, where during training, neurons fail to activate and constantly remain at zero, as highlighted by Lu et al. Traditional approaches to mitigate this issue often introduce more complex and less hardware-friendly activation functions. In this work, we propose a Hysteresis Rectified Linear Unit (HeLU), an efficient activation function designed to address the ``dying ReLU'' problem with minimal complexity. Unlike traditional activation functions with fixed thresholds for training and inference, HeLU employs a variable threshold that refines the backpropagation. This refined mechanism allows simpler activation functions to achieve competitive performance comparable to their more complex counterparts without introducing unnecessary complexity or requiring inductive biases. Empirical evaluations demonstrate that HeLU enhances model generalization across diverse datasets, offering a promising solution for efficient and effective inference suitable for a wide range of neural network architectures.
comment: Accepted to 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024)
♻ ☆ Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data NeurIPS'25
Fine-tuning large language models (LLMs) using diverse datasets is crucial for enhancing their overall performance across various domains. In practical scenarios, existing methods based on modeling the mixture proportions of data composition often struggle with data whose domain labels are missing, imprecise or non-normalized, while methods based on data selection usually encounter difficulties in balancing multi-domain performance. To address these challenges, in this work, we investigate the role of data diversity in enhancing the overall abilities of LLMs by empirically constructing contrastive data pools and theoretically deriving explanations. Building upon the insights gained, we propose a new method that gives the LLM a dual identity: an output model to cognitively probe and select data based on diversity reward, as well as an input model to be tuned with the selected data. Extensive experiments show that the proposed method notably boosts performance across domain-undetermined data and a series of foundational downstream tasks when applied to various advanced LLMs. We release our code and hope this study can shed light on the understanding of data diversity and advance feedback-driven data-model co-design for LLMs.
comment: Accepted by NeurIPS'25 main track. 47 pages, 21 figures, 32 tables
♻ ☆ SEA-LION: Southeast Asian Languages in One Network AACL 2025
Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.
comment: Accepted at IJCNLP-AACL 2025 (Main Track). We released our model at https://huggingface.co/collections/aisingapore/sea-lionv3-672589a39cdadd6a5b199581
♻ ☆ Model-Document Protocol for AI Search
AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.
comment: 10 pages
♻ ☆ How Efficient Are Diffusion Language Models? A Critical Examination of Efficiency Evaluation Practices
Diffusion language models (DLMs) have emerged as a promising alternative to the long-dominant autoregressive (AR) paradigm, offering a parallelable decoding process that could yield greater efficiency. Yet, in practice, current open-source DLMs often underperform their AR counterparts in speed, limiting their real-world utility. This work presents a systematic study of DLM efficiency, identifying key issues in prior evaluation methods. Through empirical benchmarking and a roofline-based theoretical analysis, we demonstrate that AR models generally achieve higher throughput, while DLMs consistently lag. We also investigate acceleration strategies, finding that techniques like dual cache and parallel decoding mainly offer gains at small batch sizes, with their benefits diminishing upon scaling. Our findings underscore the necessity of robust evaluation methods and improved acceleration strategies to advance research on DLMs.
comment: Withdrawn by the authors to better delineate the related work from the paper's original contributions
♻ ☆ SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat NeurIPS 2025
We propose SPARTA ALIGNMENT, an algorithm to collectively align multiple LLMs through competition and combat. To complement a single model's lack of diversity in generation and biases in evaluation, multiple LLMs form a "sparta tribe" to compete against each other in fulfilling instructions while serving as judges for the competition of others. For each iteration, one instruction and two models are selected for a duel, the other models evaluate the two responses, and their evaluation scores are aggregated through a adapted elo-ranking based reputation system, where winners/losers of combat gain/lose weight in evaluating others. The peer-evaluated combat results then become preference pairs where the winning response is preferred over the losing one, and all models learn from these preferences at the end of each iteration. SPARTA ALIGNMENT enables the self-evolution of multiple LLMs in an iterative and collective competition process. Extensive experiments demonstrate that SPARTA ALIGNMENT outperforms initial models and 4 self-alignment baselines across 10 out of 12 tasks and datasets with 7.0% average improvement. Further analysis reveals that SPARTA ALIGNMENT generalizes more effectively to unseen tasks and leverages the expertise diversity of participating models to produce more logical, direct and informative outputs.
comment: NeurIPS 2025
♻ ☆ IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation
Large Language Models (LLMs) have shown strong potential for recommendation by framing item prediction as a token-by-token language generation task. However, existing methods treat all item tokens equally, simply pursuing likelihood maximization during both optimization and decoding. This overlooks crucial token-level differences in decisiveness-many tokens contribute little to item discrimination yet can dominate optimization or decoding. To quantify token decisiveness, we propose a novel perspective that models item generation as a decision process, measuring token decisiveness by the Information Gain (IG) each token provides in reducing uncertainty about the generated item. Our empirical analysis reveals that most tokens have low IG but often correspond to high logits, disproportionately influencing training loss and decoding, which may impair model performance. Building on these insights, we introduce an Information Gain-based Decisiveness-aware Token handling (IGD) strategy that integrates token decisiveness into both tuning and decoding. Specifically, IGD downweights low-IG tokens during tuning and rebalances decoding to emphasize tokens with high IG. In this way, IGD moves beyond pure likelihood maximization, effectively prioritizing high-decisiveness tokens. Extensive experiments on four benchmark datasets with two LLM backbones demonstrate that IGD consistently improves recommendation accuracy, achieving significant gains on widely used ranking metrics compared to strong baselines.
♻ ☆ ClueAnchor: Clue-Anchored Knowledge Reasoning Exploration and Optimization for Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge to improve factuality. However, existing RAG systems frequently underutilize the retrieved documents, failing to extract and integrate the key clues needed to support faithful and interpretable reasoning, especially in cases where relevant evidence is implicit, scattered, or obscured by noise. To address this issue, we propose ClueAnchor, a novel framework for enhancing RAG via clue-anchored reasoning exploration and optimization. ClueAnchor extracts key clues from retrieved content and generates multiple reasoning paths based on different knowledge configurations, optimizing the model by selecting the most appropriate reasoning path for the given context through reward-based preference optimization. Experiments show that ClueAnchor significantly outperforms prior RAG baselines in the completeness and robustness of reasoning. Further analysis confirms its strong resilience to noisy or partially relevant retrieved content, as well as its capability to identify supporting evidence even in the absence of explicit clue supervision during inference. All codes are available at https://github.com/thunlp/ClueAnchor.
♻ ☆ Nek Minit: Harnessing Pragmatic Metacognitive Prompting for Explainable Sarcasm Detection of Australian and Indian English ALT
Sarcasm is a challenge to sentiment analysis because of the incongruity between stated and implied sentiment. The challenge is exacerbated when the implication may be relevant to a specific country or geographical region. Pragmatic metacognitive prompting (PMP) is a cognition-inspired technique that has been used for pragmatic reasoning. In this paper, we harness PMP for explainable sarcasm detection for Australian and Indian English, alongside a benchmark dataset for standard English. We manually add sarcasm explanations to an existing sarcasm-labeled dataset for Australian and Indian English called BESSTIE, and compare the performance for explainable sarcasm detection for them with FLUTE, a standard English dataset containing sarcasm explanations. Our approach utilising PMP when evaluated on two open-weight LLMs (GEMMA and LLAMA) achieves statistically significant performance improvement across all tasks and datasets when compared with four alternative prompting strategies. We also find that alternative techniques such as agentic prompting mitigate context-related failures by enabling external knowledge retrieval. The focused contribution of our work is utilising PMP in generating sarcasm explanations for varieties of English.
comment: ALTA 2025 (Best Paper Honorable Mention). Camera-ready
♻ ☆ FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs EMNLP
The accurate trust assessment of multimodal large language models (MLLMs) generated predictions, which can enable selective prediction and improve user confidence, is challenging due to the diverse multi-modal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs, that generates an uncertainty measure based on the equivalent and complementary input samplings. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA uses only input-output access of the model (black-box), and does not require ground truth (unsupervised). The experiments are conducted with various off-the-shelf multi-modal LLMs, on both visual and audio reasoning tasks. The proposed FESTA uncertainty estimate achieves significant improvement (33.3% relative improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) in selective prediction performance, based on area-under-receiver-operating-characteristic curve (AUROC) metric in detecting mispredictions. The code implementation is open-sourced.
comment: Accepted in the Findings of EMNLP, 2025
♻ ☆ Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.
♻ ☆ The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs
With the rapid advancement of artificial intelligence, Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP), including content generation, human-computer interaction, machine translation, and code generation. However, their widespread deployment has also raised significant safety concerns. In particular, LLM-generated content can exhibit unsafe behaviors such as toxicity, bias, or misinformation, especially in adversarial contexts, which has attracted increasing attention from both academia and industry. Although numerous studies have attempted to evaluate these risks, a comprehensive and systematic survey on safety evaluation of LLMs is still lacking. This work aims to fill this gap by presenting a structured overview of recent advances in safety evaluation of LLMs. Specifically, we propose a four-dimensional taxonomy: (i) Why to evaluate, which explores the background of safety evaluation of LLMs, how they differ from general LLMs evaluation, and the significance of such evaluation; (ii) What to evaluate, which examines and categorizes existing safety evaluation tasks based on key capabilities, including dimensions such as toxicity, robustness, ethics, bias and fairness, truthfulness, and related aspects; (iii) Where to evaluate, which summarizes the evaluation metrics, datasets and benchmarks currently used in safety evaluations; (iv) How to evaluate, which reviews existing mainstream evaluation methods based on the roles of the evaluators and some evaluation frameworks that integrate the entire evaluation pipeline. Finally, we identify the challenges in safety evaluation of LLMs and propose promising research directions to promote further advancement in this field. We emphasize the necessity of prioritizing safety evaluation to ensure the reliable and responsible deployment of LLMs in real-world applications.
comment: 20 pages, preprint
♻ ☆ Similarity-Distance-Magnitude Activations
We introduce the Similarity-Distance-Magnitude (SDM) activation function, a more robust and interpretable formulation of the standard softmax activation function, adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness, and enabling interpretability-by-exemplar via dense matching. We further introduce the SDM estimator, based on a data-driven partitioning of the class-wise empirical CDFs via the SDM activation, to control the class- and prediction-conditional accuracy among selective classifications. When used as the final-layer activation over pre-trained language models for selective classification, the SDM estimator is more robust to co-variate shifts and out-of-distribution inputs than existing calibration methods using softmax activations, while remaining informative over in-distribution data.
comment: 18 pages, 5 tables, 1 algorithm. arXiv admin note: substantial text overlap with arXiv:2502.20167
♻ ☆ TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents
The task of information extraction (IE) is to extract structured knowledge from text. However, it is often not straightforward to utilize IE output due to the mismatch between the IE ontology and the downstream application needs. We propose a new formulation of IE TEXT2DB that emphasizes the integration of IE output and the target database (or knowledge base). Given a user instruction, a document set, and a database, our task requires the model to update the database with values from the document set to satisfy the user instruction. This task requires understanding user instructions for what to extract and adapting to the given DB/KB schema for how to extract on the fly. To evaluate this new task, we introduce a new benchmark featuring common demands such as data infilling, row population, and column addition. In addition, we propose an LLM agent framework OPAL (Observe-PlanAnalyze LLM) which includes an Observer component that interacts with the database, the Planner component that generates a code-based plan with calls to IE models, and the Analyzer component that provides feedback regarding code quality before execution. Experiments show that OPAL can successfully adapt to diverse database schemas by generating different code plans and calling the required IE models. We also highlight difficult cases such as dealing with large databases with complex dependencies and extraction hallucination, which we believe deserve further investigation. Source code: https://github.com/yzjiao/Text2DB
comment: Source code: https://github.com/yzjiao/Text2DB
♻ ☆ Towards Predicting Any Human Trajectory In Context NeurIPS 2025
Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, the need to fine-tune for each new scenario is often impractical for deployment on edge devices. To address this challenge, we introduce \paper, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables adaptation without fine-tuning on the scenario-specific data at inference time without requiring weight updates. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. Project Page: https://fujiry0.github.io/TrajICL-project-page/.
comment: NeurIPS 2025
♻ ☆ Speak & Spell: LLM-Driven Controllable Phonetic Error Augmentation for Robust Dialogue State Tracking AACL
Dialogue State Tracking (DST) is a key part of task-oriented dialogue systems, identifying important information in conversations. However, its accuracy drops significantly in spoken dialogue environments due to named entity errors from Automatic Speech Recognition (ASR) systems. We introduce a simple yet effective data augmentation method that targets those entities to improve the robustness of DST model. Our novel method can control the placement of errors using keyword-highlighted prompts while introducing phonetically similar errors. As a result, our method generated sufficient error patterns on keywords, leading to improved accuracy in noised and low-accuracy ASR environments.
comment: Accepted to AACL-IJCNLP 2025
♻ ☆ Are LLMs Rigorous Logical Reasoners? Empowering Natural Language Proof Generation by Stepwise Decoding with Contrastive Learning AACL 2025
Logical reasoning is a pivotal component in the field of artificial intelligence. Proof planning, particularly in contexts requiring the validation of explanation accuracy, continues to present challenges. The recent advancement of large language models (LLMs) has led to significant progress in natural language proof planning, evolving from one-stage generators to more complex three-stage systems that include additional searchers or verifiers. While these assisted methods improve the quality of generated results, they also introduce increased search efforts and computational costs. Furthermore, the generative process itself remains underexplored. In this study, we propose a stepwise decoding approach augmented by contrastive learning to address two common errors encountered during the LLM generator's decoding process. We fine-tune the language model using both vanilla and enhanced hard negatives to mitigate these decoding errors. Empirical results demonstrate the effectiveness of our strategy. Additionally, our further analysis reveals that even larger LLMs still struggle to generate rigorous logical chains.
comment: 15 pages, 2 figures, 11 tables. Accepted by AACL 2025 main conference
♻ ☆ Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both the pass@1 and pass@k . Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.
♻ ☆ Neural Networks for Learnable and Scalable Influence Estimation of Instruction Fine-Tuning Data
Influence functions provide crucial insights into model training, but existing methods suffer from large computational costs and limited generalization. Particularly, recent works have proposed various metrics and algorithms to calculate the influence of data using language models, which do not scale well with large models and datasets. This is because of the expensive forward and backward passes required for computation, substantial memory requirements to store large models, and poor generalization of influence estimates to new data. In this paper, we explore the use of small neural networks -- which we refer to as the InfluenceNetwork -- to estimate influence values, achieving up to 99% cost reduction. Our evaluation demonstrates that influence values can be estimated with models just 0.0027% the size of full language models (we use 7B and 8B versions). We apply our algorithm of estimating influence values (called NN-CIFT: Neural Networks for effiCient Instruction Fine-Tuning) to the downstream task of subset selection for general instruction fine-tuning. In our study, we include four state-of-the-art influence functions and show no compromise in performance, despite large speedups, between NN-CIFT and the original influence functions. We provide an in-depth hyperparameter analyses of NN-CIFT. The code for our method can be found here: https://github.com/agarwalishika/NN-CIFT.
♻ ☆ Large Language Models Report Subjective Experience Under Self-Referential Processing
Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded. While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.
♻ ☆ Let LRMs Break Free from Overthinking via Self-Braking Tuning NeurIPS 2025
Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have significantly enhanced their reasoning capabilities by generating longer chains of thought, demonstrating outstanding performance across a variety of tasks. However, this performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process, leading to high computational overhead and exacerbating the issue of overthinking. Although numerous existing approaches aim to address the problem of overthinking, they often rely on external interventions. In this paper, we propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process, thus eliminating the reliance on external control mechanisms. We construct a set of overthinking identification metrics based on standard answers and design a systematic method to detect redundant reasoning. This method accurately identifies unnecessary steps within the reasoning trajectory and generates training signals for learning self-regulation behaviors. Building on this foundation, we develop a complete strategy for constructing data with adaptive reasoning lengths and introduce an innovative braking prompt mechanism that enables the model to naturally learn when to terminate reasoning at an appropriate point. Experiments across mathematical benchmarks (AIME, AMC, MATH500, GSM8K) demonstrate that our method reduces token consumption by up to 60% while maintaining comparable accuracy to unconstrained models.
comment: Accepted to NeurIPS 2025; Camera ready version, 10 pages. Github:https://github.com/ZJU-REAL/Self-Braking-Tuning Project Page: https://ZJU-REAL.github.io/SBT
♻ ☆ Model Provenance Testing for Large Language Models
Large language models are increasingly customized through fine-tuning and other adaptations, creating challenges in enforcing licensing terms and managing downstream impacts. Tracking model origins is crucial both for protecting intellectual property and for identifying derived models when biases or vulnerabilities are discovered in foundation models. We address this challenge by developing a framework for testing model provenance: Whether one model is derived from another. Our approach is based on the key observation that real-world model derivations preserve significant similarities in model outputs that can be detected through statistical analysis. Using only black-box access to models, we employ multiple hypothesis testing to compare model similarities against a baseline established by unrelated models. On two comprehensive real-world benchmarks spanning models from 30M to 4B parameters and comprising over 600 models, our tester achieves 90-95% precision and 80-90% recall in identifying derived models. These results demonstrate the viability of systematic provenance verification in production environments even when only API access is available.
♻ ☆ When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents
Although Large Language Model (LLM)-based agents are increasingly used in financial trading, it remains unclear whether they can reason and adapt in live markets, as most studies test models instead of agents, cover limited periods and assets, and rely on unverified data. To address these gaps, we introduce Agent Market Arena (AMA), the first lifelong, real-time benchmark for evaluating LLM-based trading agents across multiple markets. AMA integrates verified trading data, expert-checked news, and diverse agent architectures within a unified trading framework, enabling fair and continuous comparison under real conditions. It implements four agents, including InvestorAgent as a single-agent baseline, TradeAgent and HedgeFundAgent with different risk styles, and DeepFundAgent with memory-based reasoning, and evaluates them across GPT-4o, GPT-4.1, Claude-3.5-haiku, Claude-sonnet-4, and Gemini-2.0-flash. Live experiments on both cryptocurrency and stock markets demonstrate that agent frameworks display markedly distinct behavioral patterns, spanning from aggressive risk-taking to conservative decision-making, whereas model backbones contribute less to outcome variation. AMA thus establishes a foundation for rigorous, reproducible, and continuously evolving evaluation of financial reasoning and trading intelligence in LLM-based agents.
♻ ☆ ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models NeurIPS 2025
Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks -- where frontier models perform similarly and near saturation -- our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.
comment: NeurIPS 2025 Datasets & Benchmarks
♻ ☆ Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech ICASSP 2026
Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models' generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, indicating that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.
comment: Submitted to IEEE ICASSP 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
♻ ☆ Improving LLM Safety Alignment with Dual-Objective Optimization ICML 2025
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge. This approach significantly increases LLM robustness against a wide range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks across both in-distribution and out-of-distribution scenarios. Furthermore, we introduce a method to emphasize critical refusal tokens by incorporating a reward-based token-level weighting mechanism for refusal learning, which further improves the robustness against adversarial exploits. Our research also suggests that robustness to jailbreak attacks is correlated with token distribution shifts in the training process and internal representations of refusal and harmful tokens, offering valuable directions for future research in LLM safety alignment. The code is available at https://github.com/wicai24/DOOR-Alignment
comment: ICML 2025
♻ ☆ Language Model Preference Evaluation with Multiple Weak Evaluators
Despite the remarkable success of Large Language Models (LLMs), evaluating their outputs' quality regarding preference remains a critical challenge. While existing works usually leverage a strong LLM as the judge for comparing LLMs' response pairwisely, such a single-evaluator approach is vulnerable to cyclic preference, i.e., output A is better than B, B than C, but C is better than A, causing contradictory evaluation results. To address this, we introduce PGED (Preference Graph Ensemble and Denoise), a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs for acyclic, non-contradictory evaluation results. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure. Extensive experiments on ten benchmarks demonstrate PGED 's superiority in three applications: 1) model ranking for evaluation, 2) response selection for test-time scaling, and 3) data selection for model fine-tuning. Notably, PGED combines small LLM evaluators (e.g., Llama3-8B, Mistral-7B, Qwen2-7B) to outperform strong ones (e.g., Qwen2-72B), showcasing its effectiveness in enhancing evaluation reliability and improving model performance.
♻ ☆ RADAR: Benchmarking Language Models on Imperfect Tabular Data NeurIPS 2025
Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness -- the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies -- remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2980 table query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds when increasing table size. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
comment: NeurIPS 2025 Dataset and Benchmark Track
♻ ☆ Representative Social Choice: From Learning Theory to AI Alignment NeurIPS 2024
Social choice theory is the study of preference aggregation across a population, used both in mechanism design for human agents and in the democratic alignment of language models. In this study, we propose the representative social choice framework for the modeling of democratic representation in collective decisions, where the number of issues and individuals are too large for mechanisms to consider all preferences directly. These scenarios are widespread in real-world decision-making processes, such as jury trials, legislation, corporate governance, and, more recently, language model alignment. In representative social choice, the population is represented by a finite sample of individual-issue pairs based on which social choice decisions are made. We show that many of the deepest questions in representative social choice can be formulated as statistical learning problems, and prove the generalization properties of social choice mechanisms using the theory of machine learning. We further formulate axioms for representative social choice, and prove Arrow-like impossibility theorems with new combinatorial tools of analysis. Our framework introduces the representative approach to social choice, opening up research directions at the intersection of social choice, learning theory, and AI alignment.
comment: Journal of Artificial Intelligence Research, in press. Best Paper at NeurIPS 2024 Pluralistic Alignment Workshop
♻ ☆ SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models
Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.
♻ ☆ Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging
The "alignment tax" of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident, less reliable, and model outputs less diverse. We show that this trade-off can be navigated effectively via a simple post-hoc intervention: interpolating between a model's weights before and after alignment. Crucially, this is not a strict trade-off. We find that the process consistently reveals Pareto-optimal interpolations - models that improve accuracy beyond both parents while substantially recovering the calibration lost during alignment. Our work demonstrates that simple model merging provides a computationally efficient method for mitigating the full scope of the alignment tax, yielding models that are more capable and more reliable.
♻ ☆ VeriFastScore: Speeding up long-form factuality evaluation
Metrics like FactScore and VeriScore that evaluate long-form factuality operate by decomposing an input response into atomic claims and then individually verifying each claim. While effective and interpretable, these methods incur numerous LLM calls and can take upwards of 100 seconds to evaluate a single response, limiting their practicality in large-scale evaluation and training scenarios. To address this, we propose VeriFastScore, which leverages synthetic data to fine-tune Llama3.1 8B for simultaneously extracting and verifying all verifiable claims within a given text based on evidence from Google Search. We show that this task cannot be solved via few-shot prompting with closed LLMs due to its complexity: the model receives ~4K tokens of evidence on average and needs to concurrently decompose claims, judge their verifiability, and verify them against noisy evidence. However, our fine-tuned VeriFastScore model demonstrates strong correlation with the original VeriScore pipeline at both the example level (r=0.80) and system level (r=0.94) while achieving an overall speedup of 6.6x (9.9x excluding evidence retrieval) over VeriScore. To facilitate future factuality research, we publicly release our VeriFastScore model and synthetic datasets.
♻ ☆ Accelerating Diffusion LLMs via Adaptive Parallel Decoding
The generation speed of LLMs are bottlenecked by autoregressive decoding, where tokens are predicted sequentially one by one. Alternatively, diffusion large language models (dLLMs) theoretically allow for parallel token generation, but in practice struggle to achieve the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method puts forward three tunable parameters to flexibly tradeoff throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
comment: 10 pages, 5 figures
♻ ☆ Scalable Best-of-N Selection for Large Language Models via Self-Certainty NeurIPS 2025
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, like self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size N, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities. The code is available at https://github.com/backprop07/Self-Certainty
comment: NeurIPS 2025
♻ ☆ Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions
Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with patient details. In this paper, we first have three hematology-oncology physicians evaluate cancer-related questions drawn from real patients. While LLM responses are generally accurate, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM -- including GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet -- corrects these false presuppositions more than $43\%$ of the time. To study mitigation strategies, we further construct a 150-question Cancer-Myth-NFP set, in which physicians confirm the absence of false presuppositions. We find typical mitigation strategies, such as adding precautionary prompts with GEPA optimization, can raise accuracy on Cancer-Myth to $80\%$, but at the cost of misidentifying presuppositions in $41\%$ of Cancer-Myth-NFP questions and causing a $10\%$ relative performance drop on other medical benchmarks. These findings highlight a critical gap in the reliability of LLMs, show that prompting alone is not a reliable remedy for false presuppositions, and underscore the need for more robust safeguards in medical AI systems.
♻ ☆ (How) Do Language Models Track State?
Transformer language models (LMs) exhibit behaviors -- from storytelling to code generation -- that seem to require tracking the unobserved state of an evolving world. How do they do this? We study state tracking in LMs trained or fine-tuned to compose permutations (i.e., to compute the order of a set of objects after a sequence of swaps). Despite the simple algebraic structure of this problem, many other tasks (e.g., simulation of finite automata and evaluation of boolean expressions) can be reduced to permutation composition, making it a natural model for state tracking in general. We show that LMs consistently learn one of two state tracking mechanisms for this task. The first closely resembles the "associative scan" construction used in recent theoretical work by Liu et al. (2023) and Merrill et al. (2024). The second uses an easy-to-compute feature (permutation parity) to partially prune the space of outputs, and then refines this with an associative scan. LMs that learn the former algorithm tend to generalize better and converge faster, and we show how to steer LMs toward one or the other with intermediate training tasks that encourage or suppress the heuristics. Our results demonstrate that transformer LMs, whether pre-trained or fine-tuned, can learn to implement efficient and interpretable state-tracking mechanisms, and the emergence of these mechanisms can be predicted and controlled.
comment: 21 pages, 17 figures, 1 table. Code: http://github.com/belindal/state-tracking
♻ ☆ DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference
Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high performance without overthinking. First, we analyze the entropy of token probabilities in reasoning traces. Across three models, we observe a consistent U-shaped entropy pattern: high entropy on easy problems despite high accuracy, low entropy on problems with medium difficulty, and high entropy on hard problems reflecting uncertainty. Specifically, we notice 22--25\% entropy reduction from easy to medium difficulty regions, suggesting an {overthinking} phenomenon on easy instances. Building on these insights, we introduce \textbf{DiffAdapt}, a lightweight framework that selects Easy/Normal/Hard inference strategies per question based on their difficulty and reasoning trace entropy. Each inference strategy consists of a fixed prompt, temperature and maximum token length. In contrast to existing efficiency optimization methods, our approach does not fine-tune base LLM but a small probe that classifies LLM's final hidden state, allowing inexpensive adaptation. We comprehensively evaluate our method on five models and eight benchmarks. Our method achieves comparable or improved accuracy while reducing token usage by up to 22.4\%, establishing a practical path toward compute-efficient reasoning.
Information Retrieval 28
☆ ProfOlaf: Semi-Automated Tool for Systematic Literature Reviews
Systematic reviews and mapping studies are critical for synthesizing research, identifying gaps, and guiding future work, but they are often labor-intensive and time-consuming. Existing tools provide partial support for specific steps, leaving much of the process manual and error-prone. We present ProfOlaf, a semi-automated tool designed to streamline systematic reviews while maintaining methodological rigor. ProfOlaf supports iterative snowballing for article collection with human-in-the-loop filtering and uses large language models to assist in analyzing articles, extracting key topics, and answering queries about the content of papers. By combining automation with guided manual effort, ProfOlaf enhances the efficiency, quality, and reproducibility of systematic reviews across research fields. A video describing and demonstrating ProfOlaf is available at: https://youtu.be/4noUXfcmxsE
comment: 4 pages, 1 Figure, 2 tables
☆ AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping
Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under Curve, Spearman, and Kendall.
comment: Accepted at 32nd International Conference on MultiMedia Modeling
☆ WeaveRec: An LLM-Based Cross-Domain Sequential Recommendation Framework with Model Merging
Cross-Domain Sequential Recommendation (CDSR) seeks to improve user preference modeling by transferring knowledge from multiple domains. Despite the progress made in CDSR, most existing methods rely on overlapping users or items to establish cross-domain correlations-a requirement that rarely holds in real-world settings. The advent of large language models (LLM) and model-merging techniques appears to overcome this limitation by unifying multi-domain data without explicit overlaps. Yet, our empirical study shows that naively training an LLM on combined domains-or simply merging several domain-specific LLMs-often degrades performance relative to a model trained solely on the target domain. To address these challenges, we first experimentally investigate the cause of suboptimal performance in LLM-based cross-domain recommendation and model merging. Building on these insights, we introduce WeaveRec, which cross-trains multiple LoRA modules with source and target domain data in a weaving fashion, and fuses them via model merging. WeaveRec can be extended to multi-source domain scenarios and notably does not introduce additional inference-time cost in terms of latency or memory. Furthermore, we provide a theoretical guarantee that WeaveRec can reduce the upper bound of the expected error in the target domain. Extensive experiments on single-source, multi-source, and cross-platform cross-domain recommendation scenarios validate that WeaveRec effectively mitigates performance degradation and consistently outperforms baseline approaches in real-world recommendation tasks.
☆ Inside CORE-KG: Evaluating Structured Prompting and Coreference Resolution for Knowledge Graphs ICDM 2025
Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer critical insights but are often unstructured, lexically dense, and filled with ambiguous or shifting references, which pose significant challenges for automated knowledge graph (KG) construction. While recent LLM-based approaches improve over static templates, they still generate noisy, fragmented graphs with duplicate nodes due to the absence of guided extraction and coreference resolution. The recently proposed CORE-KG framework addresses these limitations by integrating a type-aware coreference module and domain-guided structured prompts, significantly reducing node duplication and legal noise. In this work, we present a systematic ablation study of CORE-KG to quantify the individual contributions of its two key components. Our results show that removing coreference resolution results in a 28.32% increase in node duplication and a 4.32% increase in noisy nodes, while removing structured prompts leads to a 4.34% increase in node duplication and a 73.33% increase in noisy nodes. These findings offer empirical insights for designing robust LLM-based pipelines for extracting structured representations from complex legal texts.
comment: ICDM 2025 Workshop
☆ LINK-KG: LLM-Driven Coreference-Resolved Knowledge Graphs for Human Smuggling Networks
Human smuggling networks are complex and constantly evolving, making them difficult to analyze comprehensively. Legal case documents offer rich factual and procedural insights into these networks but are often long, unstructured, and filled with ambiguous or shifting references, posing significant challenges for automated knowledge graph (KG) construction. Existing methods either overlook coreference resolution or fail to scale beyond short text spans, leading to fragmented graphs and inconsistent entity linking. We propose LINK-KG, a modular framework that integrates a three-stage, LLM-guided coreference resolution pipeline with downstream KG extraction. At the core of our approach is a type-specific Prompt Cache, which consistently tracks and resolves references across document chunks, enabling clean and disambiguated narratives for structured knowledge graph construction from both short and long legal texts. LINK-KG reduces average node duplication by 45.21% and noisy nodes by 32.22% compared to baseline methods, resulting in cleaner and more coherent graph structures. These improvements establish LINK-KG as a strong foundation for analyzing complex criminal networks.
comment: Accepted in ICKG 2025 Conference, 8 Pages, 2 Figures
☆ Vectorized Context-Aware Embeddings for GAT-Based Collaborative Filtering
Recommender systems often struggle with data sparsity and cold-start scenarios, limiting their ability to provide accurate suggestions for new or infrequent users. This paper presents a Graph Attention Network (GAT) based Collaborative Filtering (CF) framework enhanced with Large Language Model (LLM) driven context aware embeddings. Specifically, we generate concise textual user profiles and unify item metadata (titles, genres, overviews) into rich textual embeddings, injecting these as initial node features in a bipartite user item graph. To further optimize ranking performance, we introduce a hybrid loss function that combines Bayesian Personalized Ranking (BPR) with a cosine similarity term and robust negative sampling, ensuring explicit negative feedback is distinguished from unobserved data. Experiments on the MovieLens 100k and 1M datasets show consistent improvements over state-of-the-art baselines in Precision, NDCG, and MAP while demonstrating robustness for users with limited interaction history. Ablation studies confirm the critical role of LLM-augmented embeddings and the cosine similarity term in capturing nuanced semantic relationships. Our approach effectively mitigates sparsity and cold-start limitations by integrating LLM-derived contextual understanding into graph-based architectures. Future directions include balancing recommendation accuracy with coverage and diversity, and introducing fairness-aware constraints and interpretability features to enhance system performance further.
☆ Barlow Twins for Sequential Recommendation
Sequential recommendation models must navigate sparse interaction data popularity bias and conflicting objectives like accuracy versus diversity While recent contrastive selfsupervised learning SSL methods offer improved accuracy they come with tradeoffs large batch requirements reliance on handcrafted augmentations and negative sampling that can reinforce popularity bias In this paper we introduce BT-SR a novel noncontrastive SSL framework that integrates the Barlow Twins redundancyreduction principle into a Transformerbased nextitem recommender BTSR learns embeddings that align users with similar shortterm behaviors while preserving longterm distinctionswithout requiring negative sampling or artificial perturbations This structuresensitive alignment allows BT-SR to more effectively recognize emerging user intent and mitigate the influence of noisy historical context Our experiments on five public benchmarks demonstrate that BTSR consistently improves nextitem prediction accuracy and significantly enhances longtail item coverage and recommendation calibration Crucially we show that a single hyperparameter can control the accuracydiversity tradeoff enabling practitioners to adapt recommendations to specific application needs
☆ GraphCompliance: Aligning Policy and Context Graphs for LLM-Based Regulatory Compliance
Compliance at web scale poses practical challenges: each request may require a regulatory assessment. Regulatory texts (e.g., the General Data Protection Regulation, GDPR) are cross-referential and normative, while runtime contexts are expressed in unstructured natural language. This setting motivates us to align semantic information in unstructured text with the structured, normative elements of regulations. To this end, we introduce GraphCompliance, a framework that represents regulatory texts as a Policy Graph and runtime contexts as a Context Graph, and aligns them. In this formulation, the policy graph encodes normative structure and cross-references, whereas the context graph formalizes events as subject-action-object (SAO) and entity-relation triples. This alignment anchors the reasoning of a judge large language model (LLM) in structured information and helps reduce the burden of regulatory interpretation and event parsing, enabling a focus on the core reasoning step. In experiments on 300 GDPR-derived real-world scenarios spanning five evaluation tasks, GraphCompliance yields 4.1-7.2 percentage points (pp) higher micro-F1 than LLM-only and RAG baselines, with fewer under- and over-predictions, resulting in higher recall and lower false positive rates. Ablation studies indicate contributions from each graph component, suggesting that structured representations and a judge LLM are complementary for normative reasoning.
comment: Under review at The Web Conference 2026 (Semantics & Knowledge track). Code will be released upon acceptance. This arXiv v1 contains no repository links to preserve double-blind review
☆ DiSE: A diffusion probabilistic model for automatic structure elucidation of organic compounds
Automatic structure elucidation is essential for self-driving laboratories as it enables the system to achieve truly autonomous. This capability closes the experimental feedback loop, ensuring that machine learning models receive reliable structure information for real-time decision-making and optimization. Herein, we present DiSE, an end-to-end diffusion-based generative model that integrates multiple spectroscopic modalities, including MS, 13C and 1H chemical shifts, HSQC, and COSY, to achieve automated yet accurate structure elucidation of organic compounds. By learning inherent correlations among spectra through data-driven approaches, DiSE achieves superior accuracy, strong generalization across chemically diverse datasets, and robustness to experimental data despite being trained on calculated spectra. DiSE thus represents a significant advance toward fully automated structure elucidation, with broad potential in natural product research, drug discovery, and self-driving laboratories.
☆ ReaKase-8B: Legal Case Retrieval via Knowledge and Reasoning Representations with LLMs
Legal case retrieval (LCR) is a cornerstone of real-world legal decision making, as it enables practitioners to identify precedents for a given query case. Existing approaches mainly rely on traditional lexical models and pretrained language models to encode the texts of legal cases. Yet there are rich information in the relations among different legal entities as well as the crucial reasoning process that uncovers how legal facts and legal issues can lead to judicial decisions. Such relational reasoning process reflects the distinctive characteristics of each case that can distinguish one from another, mirroring the real-world judicial process. Naturally, incorporating such information into the precise case embedding could further enhance the accuracy of case retrieval. In this paper, a novel ReaKase-8B framework is proposed to leverage extracted legal facts, legal issues, legal relation triplets and legal reasoning for effective legal case retrieval. ReaKase-8B designs an in-context legal case representation learning paradigm with a fine-tuned large language model. Extensive experiments on two benchmark datasets from COLIEE 2022 and COLIEE 2023 demonstrate that our knowledge and reasoning augmented embeddings substantially improve retrieval performance over baseline models, highlighting the potential of integrating legal reasoning into legal case retrieval systems. The code has been released on https://github.com/yanran-tang/ReaKase-8B.
☆ OneTrans: Unified Feature Interaction and Sequence Modeling with One Transformer in Industrial Recommender
In recommendation systems, scaling up feature-interaction modules (e.g., Wukong, RankMixer) or user-behavior sequence modules (e.g., LONGER) has achieved notable success. However, these efforts typically proceed on separate tracks, which not only hinders bidirectional information exchange but also prevents unified optimization and scaling. In this paper, we propose OneTrans, a unified Transformer backbone that simultaneously performs user-behavior sequence modeling and feature interaction. OneTrans employs a unified tokenizer to convert both sequential and non-sequential attributes into a single token sequence. The stacked OneTrans blocks share parameters across similar sequential tokens while assigning token-specific parameters to non-sequential tokens. Through causal attention and cross-request KV caching, OneTrans enables precomputation and caching of intermediate representations, significantly reducing computational costs during both training and inference. Experimental results on industrial-scale datasets demonstrate that OneTrans scales efficiently with increasing parameters, consistently outperforms strong baselines, and yields a 5.68% lift in per-user GMV in online A/B tests.
☆ ORBIT -- Open Recommendation Benchmark for Reproducible Research with Hidden Tests NeurIPS 2025
Recommender systems are among the most impactful AI applications, interacting with billions of users every day, guiding them to relevant products, services, or information tailored to their preferences. However, the research and development of recommender systems are hindered by existing datasets that fail to capture realistic user behaviors and inconsistent evaluation settings that lead to ambiguous conclusions. This paper introduces the Open Recommendation Benchmark for Reproducible Research with HIdden Tests (ORBIT), a unified benchmark for consistent and realistic evaluation of recommendation models. ORBIT offers a standardized evaluation framework of public datasets with reproducible splits and transparent settings for its public leaderboard. Additionally, ORBIT introduces a new webpage recommendation task, ClueWeb-Reco, featuring web browsing sequences from 87 million public, high-quality webpages. ClueWeb-Reco is a synthetic dataset derived from real, user-consented, and privacy-guaranteed browsing data. It aligns with modern recommendation scenarios and is reserved as the hidden test part of our leaderboard to challenge recommendation models' generalization ability. ORBIT measures 12 representative recommendation models on its public benchmark and introduces a prompted LLM baseline on the ClueWeb-Reco hidden test. Our benchmark results reflect general improvements of recommender systems on the public datasets, with variable individual performances. The results on the hidden test reveal the limitations of existing approaches in large-scale webpage recommendation and highlight the potential for improvements with LLM integrations. ORBIT benchmark, leaderboard, and codebase are available at https://www.open-reco-bench.ai.
comment: Accepted to NeurIPS 2025 Datasets & Benchmarks track
☆ Evaluating Perspectival Biases in Cross-Modal Retrieval
Multimodal retrieval systems are expected to operate in a semantic space, agnostic to the language or cultural origin of the query. In practice, however, retrieval outcomes systematically reflect perspectival biases: deviations shaped by linguistic prevalence and cultural associations. We study two such biases. First, prevalence bias refers to the tendency to favor entries from prevalent languages over semantically faithful entries in image-to-text retrieval. Second, association bias refers to the tendency to favor images culturally associated with the query over semantically correct ones in text-to-image retrieval. Results show that explicit alignment is a more effective strategy for mitigating prevalence bias. However, association bias remains a distinct and more challenging problem. These findings suggest that achieving truly equitable multimodal systems requires targeted strategies beyond simple data scaling and that bias arising from cultural association may be treated as a more challenging problem than one arising from linguistic prevalence.
☆ AI Powered High Quality Text to Video Generation with Enhanced Temporal Consistency
Text to video generation has emerged as a critical frontier in generative artificial intelligence, yet existing approaches struggle with maintaining temporal consistency, compositional understanding, and fine grained control over visual narratives. We present MOVAI (Multimodal Original Video AI), a novel hierarchical framework that integrates compositional scene understanding with temporal aware diffusion models for high fidelity text to video synthesis. Our approach introduces three key innovations: (1) a Compositional Scene Parser (CSP) that decomposes textual descriptions into hierarchical scene graphs with temporal annotations, (2) a Temporal-Spatial Attention Mechanism (TSAM) that ensures coherent motion dynamics across frames while preserving spatial details, and (3) a Progressive Video Refinement (PVR) module that iteratively enhances video quality through multi-scale temporal reasoning. Extensive experiments on standard benchmarks demonstrate that MOVAI achieves state-of-the-art performance, improving video quality metrics by 15.3% in LPIPS, 12.7% in FVD, and 18.9% in user preference studies compared to existing methods. Our framework shows particular strength in generating complex multi-object scenes with realistic temporal dynamics and fine-grained semantic control.
♻ ☆ Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
Integrating audio and visual data for training multimodal foundational models remains a challenge. The Audio-Video Vector Alignment (AVVA) framework addresses this by considering AV scene alignment beyond mere temporal synchronization, and leveraging Large Language Models (LLMs) for data curation. AVVA implements a scoring mechanism for selecting aligned training data segments. It integrates Whisper, a speech-based foundation model, for audio and DINOv2 for video analysis in a dual-encoder structure with contrastive learning on AV pairs. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate the effectiveness of the proposed model architecture and data curation approach. AVVA achieves a significant improvement in top-k accuracies for video-to-audio retrieval on all datasets compared to DenseAV, while using only 192 hrs of curated training data. Furthermore, an ablation study indicates that the data curation process effectively trades data quality for data quantity, yielding increases in top-k retrieval accuracies on AudioCaps, VALOR, and VGGSound, compared to training on the full spectrum of uncurated data.
comment: 5 pages, 5 figures, 2 tables. Accepted at EUSIPCO 2025
♻ ☆ Unveiling Unicode's Unseen Underpinnings in Undermining Authorship Attribution
When using a public communication channel -- whether formal or informal, such as commenting or posting on social media -- end users have no expectation of privacy: they compose a message and broadcast it for the world to see. Even if an end user takes utmost precautions to anonymize their online presence -- using an alias or pseudonym; masking their IP address; spoofing their geolocation; concealing their operating system and user agent; deploying encryption; registering with a disposable phone number or email; disabling non-essential settings; revoking permissions; and blocking cookies and fingerprinting -- one obvious element still lingers: the message itself. Assuming they avoid lapses in judgment or accidental self-exposure, there should be little evidence to validate their actual identity, right? Wrong. The content of their message -- necessarily open for public consumption -- exposes an attack vector: stylometric analysis, or author profiling. In this paper, we dissect the technique of stylometry, discuss an antithetical counter-strategy in adversarial stylometry, and devise enhancements through Unicode steganography.
comment: 33 pages, 7 figures, 3 tables
♻ ☆ Unstructured Evidence Attribution for Long Context Query Focused Summarization EMNLP 2025
Large language models (LLMs) are capable of generating coherent summaries from very long contexts given a user query, and extracting and citing evidence spans helps improve the trustworthiness of these summaries. Whereas previous work has focused on evidence citation with fixed levels of granularity (e.g. sentence, paragraph, document, etc.), we propose to extract unstructured (i.e., spans of any length) evidence in order to acquire more relevant and consistent evidence than in the fixed granularity case. We show how existing systems struggle to copy and properly cite unstructured evidence, which also tends to be "lost-in-the-middle". To help models perform this task, we create the Summaries with Unstructured Evidence Text dataset (SUnsET), a synthetic dataset generated using a novel pipeline, which can be used as training supervision for unstructured evidence summarization. We demonstrate across 5 LLMs and 4 datasets spanning human written, synthetic, single, and multi-document settings that LLMs adapted with SUnsET generate more relevant and factually consistent evidence with their summaries, extract evidence from more diverse locations in their context, and can generate more relevant and consistent summaries than baselines with no fine-tuning and fixed granularity evidence. We release SUnsET and our generation code to the public.
comment: EMNLP 2025 Main; 29 pages; 24 figures; 8 tables
♻ ☆ Epistemic Diversity and Knowledge Collapse in Large Language Models
Large language models (LLMs) tend to generate lexically, semantically, and stylistically homogenous texts. This poses a risk of knowledge collapse, where homogenous LLMs mediate a shrinking in the range of accessible information over time. Existing works on homogenization are limited by a focus on closed-ended multiple-choice setups or fuzzy semantic features, and do not look at trends across time and cultural contexts. To overcome this, we present a new methodology to measure epistemic diversity, i.e., variation in real-world claims in LLM outputs, which we use to perform a broad empirical study of LLM knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations sourced from real user chats. For the topics in our study, we show that while newer models tend to generate more diverse claims, nearly all models are less epistemically diverse than a basic web search. We find that model size has a negative impact on epistemic diversity, while retrieval-augmented generation (RAG) has a positive impact, though the improvement from RAG varies by the cultural context. Finally, compared to a traditional knowledge source (Wikipedia), we find that country-specific claims reflect the English language more than the local one, highlighting a gap in epistemic representation
comment: 16 pages; 8 figures, 4 tables; v2 changelog: Fixed the modeling for table 3, random effect is the model version; v3 changelog: Fixed minor formatting issues in tables 2 and 3; v4 changelog: Fixed some typos and model description
♻ ☆ RecCocktail: A Generalizable and Efficient Framework for LLM-Based Recommendation
Large Language Models (LLMs) have achieved remarkable success in recent years, owing to their impressive generalization capabilities and rich world knowledge. To capitalize on the potential of using LLMs as recommender systems, mainstream approaches typically focus on two paradigms. The first paradigm designs multi-domain or multi-task instruction data for generalizable recommendation, so as to align LLMs with general recommendation areas and deal with cold-start recommendation. The second paradigm focuses on enhancing domain-specific recommendation tasks, improving performance in warm recommendation scenarios. While most previous works treat these two paradigms separately, we argue that they have complementary advantages, and combining them can yield better results. In this paper, we propose a generalizable and efficient LLM-based recommendation framework RecCocktail. Our approach begins with fine-tuning a "base spirit" LoRA module using domain-general recommendation instruction data to align LLM with recommendation knowledge. Next, given users' behavior of a specific domain, we construct a domain-specific "ingredient" LoRA module. We then provide an entropy-guided adaptive merging method to mix the "base spirit" and the "ingredient" in the weight space. Please note that, RecCocktail combines the advantages of the existing two paradigms without introducing additional time or space overhead during the inference phase. Moreover, RecCocktail is efficient with plug and play, as the "base spirit" LoRA is trained only once, and any domain-specific "ingredient" can be efficiently mixed with only domain-specific fine-tuning. Extensive experiments on multiple datasets under both warm and cold-start recommendation scenarios validate the effectiveness and generality of the proposed RecCocktail.
♻ ☆ On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
Open-vocabulary object detection (OVD) models offer remarkable flexibility by detecting objects from arbitrary text queries. However, their zero-shot performance in specialized domains like Remote Sensing (RS) is often compromised by the inherent ambiguity of natural language, limiting critical downstream applications. For instance, an OVD model may struggle to distinguish between fine-grained classes such as "fishing boat" and "yacht" since their embeddings are similar and often inseparable. This can hamper specific user goals, such as monitoring illegal fishing, by producing irrelevant detections. To address this, we propose a cascaded approach that couples the broad generalization of a large pre-trained OVD model with a lightweight few-shot classifier. Our method first employs the zero-shot model to generate high-recall object proposals. These proposals are then refined for high precision by a compact classifier trained in real-time on only a handful of user-annotated examples - drastically reducing the high costs of RS imagery annotation.The core of our framework is FLAME, a one-step active learning strategy that selects the most informative samples for training. FLAME identifies, on the fly, uncertain marginal candidates near the decision boundary using density estimation, followed by clustering to ensure sample diversity. This efficient sampling technique achieves high accuracy without costly full-model fine-tuning and enables instant adaptation, within less then a minute, which is significantly faster than state-of-the-art alternatives.Our method consistently surpasses state-of-the-art performance on RS benchmarks, establishing a practical and resource-efficient framework for adapting foundation models to specific user needs.
♻ ☆ Model-Document Protocol for AI Search
AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.
comment: 10 pages
♻ ☆ The RAG Paradox: A Black-Box Attack Exploiting Unintentional Vulnerabilities in Retrieval-Augmented Generation Systems
With the growing adoption of retrieval-augmented generation (RAG) systems, various attack methods have been proposed to degrade their performance. However, most existing approaches rely on unrealistic assumptions in which external attackers have access to internal components such as the retriever. To address this issue, we introduce a realistic black-box attack based on the RAG paradox, a structural vulnerability arising from the system's effort to enhance trust by revealing both the retrieved documents and their sources to users. This transparency enables attackers to observe which sources are used and how information is phrased, allowing them to craft poisoned documents that are more likely to be retrieved and upload them to the identified sources. Moreover, as RAG systems directly provide retrieved content to users, these documents must not only be retrievable but also appear natural and credible to maintain user confidence in the search results. Unlike prior work that focuses solely on improving document retrievability, our attack method explicitly considers both retrievability and user trust in the retrieved content. Both offline and online experiments demonstrate that our method significantly degrades system performance without internal access, while generating natural-looking poisoned documents.
♻ ☆ Shilling Recommender Systems by Generating Side-feature-aware Fake User Profiles
Recommender systems (RS) greatly influence users' consumption decisions, making them attractive targets for malicious shilling attacks that inject fake user profiles to manipulate recommendations. Existing shilling methods can generate effective and stealthy fake profiles when training data only contain rating matrix, but they lack comprehensive solutions for scenarios where side features are present and utilized by the recommender. To address this gap, we extend the Leg-UP framework by enhancing the generator architecture to incorporate side features, enabling the generation of side-feature-aware fake user profiles. Experiments on benchmarks show that our method achieves strong attack performance while maintaining stealthiness.
♻ ☆ MMQ-v2: Align, Denoise, and Amplify: Adaptive Behavior Mining for Semantic IDs Learning in Recommendation
Industrial recommender systems rely on unique Item Identifiers (ItemIDs). However, this method struggles with scalability and generalization in large, dynamic datasets that have sparse long-tail data. Content-based Semantic IDs (SIDs) address this by sharing knowledge through content quantization. However, by ignoring dynamic behavioral properties, purely content-based SIDs have limited expressive power. Existing methods attempt to incorporate behavioral information but overlook a critical distinction: unlike relatively uniform content features, user-item interactions are highly skewed and diverse, creating a vast information gap in quality and quantity between popular and long-tail items. This oversight leads to two critical limitations: (1) Noise Corruption: Indiscriminate behavior-content alignment allows collaborative noise from long-tail items to corrupt their content representations, leading to the loss of critical multimodal information. (2)Signal Obscurity: The equal-weighting scheme for SIDs fails to reflect the varying importance of different behavioral signals, making it difficult for downstream tasks to distinguish important SIDs from uninformative ones. To tackle these issues, we propose a mixture-of-quantization framework, MMQ-v2, to adaptively Align, Denoise, and Amplify multimodal information from content and behavior modalities for semantic IDs learning. The semantic IDs generated by this framework named ADA-SID. It introduces two innovations: an adaptive behavior-content alignment that is aware of information richness to shield representations from noise, and a dynamic behavioral router to amplify critical signals by applying different weights to SIDs. Extensive experiments on public and large-scale industrial datasets demonstrate ADA-SID's significant superiority in both generative and discriminative recommendation tasks.
♻ ☆ Towards Automated Quality Assurance of Patent Specifications: A Multi-Dimensional LLM Framework
Although AI drafting tools have gained prominence in patent writing, the systematic evaluation of AI-generated patent content quality represents a significant research gap. To address this gap, We propose to evaluate patents using regulatory compliance, technical coherence, and figure-reference consistency detection modules, and then generate improvement suggestions via an integration module. The framework is validated on a comprehensive dataset comprising 80 human-authored and 80 AI-generated patents from two patent drafting tools. Evaluation is performed on 10,841 total sentences, 8,924 non-template sentences, and 554 patent figures for the three detection modules respectively, achieving balanced accuracies of 99.74%, 82.12%, and 91.2% against expert annotations. Additional analysis was conducted to examine defect distributions across patent sections, technical domains, and authoring sources. Section-based analysis indicates that figure-text consistency and technical detail precision require particular attention. Mechanical Engineering and Construction show more claim-specification inconsistencies due to complex technical documentation requirements. AI-generated patents show a significant gap compared to human-authored ones. While human-authored patents primarily contain surface-level errors like typos, AI-generated patents exhibit more structural defects in figure-text alignment and cross-references.
♻ ☆ Decoupled Multimodal Fusion for User Interest Modeling in Click-Through Rate Prediction
Modern industrial recommendation systems improve recommendation performance by integrating multimodal representations from pre-trained models into ID-based Click-Through Rate (CTR) prediction frameworks. However, existing approaches typically adopt modality-centric modeling strategies that process ID-based and multimodal embeddings independently, failing to capture fine-grained interactions between content semantics and behavioral signals. In this paper, we propose Decoupled Multimodal Fusion (DMF), which introduces a modality-enriched modeling strategy to enable fine-grained interactions between ID-based collaborative representations and multimodal representations for user interest modeling. Specifically, we construct target-aware features to bridge the semantic gap across different embedding spaces and leverage them as side information to enhance the effectiveness of user interest modeling. Furthermore, we design an inference-optimized attention mechanism that decouples the computation of target-aware features and ID-based embeddings before the attention layer, thereby alleviating the computational bottleneck introduced by incorporating target-aware features. To achieve comprehensive multimodal integration, DMF combines user interest representations learned under the modality-centric and modality-enriched modeling strategies. Offline experiments on public and industrial datasets demonstrate the effectiveness of DMF. Moreover, DMF has been deployed on the product recommendation system of the international e-commerce platform Lazada, achieving relative improvements of 5.30% in CTCVR and 7.43% in GMV with negligible computational overhead.
♻ ☆ A Task-Centric Perspective on Recommendation Systems
Many studies in recommender systems (RecSys) adopt a general problem definition, i.e., to recommend preferred items to users based on past interactions. Such abstraction often lacks the domain-specific nuances necessary for practical deployment. However, models are frequently evaluated using datasets collected from online recommender platforms, which inherently reflect domain or task specificities. In this paper, we analyze RecSys task formulations, emphasizing key components such as input-output structures, temporal dynamics, and candidate item selection. All these factors directly impact offline evaluation. We further examine the complexities of user-item interactions, including decision-making costs, multi-step engagements, and unobservable interactions, which may influence model design. Additionally, we explore the balance between task specificity and model generalizability, highlighting how well-defined task formulations serve as the foundation for robust evaluation and effective solution development. By clarifying task definitions and their implications, this work provides a structured perspective on RecSys research. The goal is to help researchers better navigate the field, particularly in understanding specificities of the RecSys tasks and ensuring fair and meaningful evaluations.
♻ ☆ OpenZL: A Graph-Based Model for Compression
Research techniques in the last decade have improved lossless compression ratios by significantly increasing processing time. These techniques have remained obscure because production systems require high throughput and low resource utilization. In practice, application-specific compression algorithms that leverage knowledge of the data structure and semantics are more popular. Application-specific compressor systems outperform even the best generic compressors, but these techniques have some drawbacks. Application-specific compressors are inherently limited in applicability, have high development costs, and are difficult to maintain and deploy. In this work, we show that these challenges can be overcome with a new compression strategy. We propose the "graph model" of compression, a new theoretical framework for representing compression as a directed acyclic graph of modular codecs. OpenZL compresses data into a self-describing wire format, any configuration of which can be decompressed by a universal decoder. OpenZL's design enables rapid development of tailored compressors with minimal code; its universal decoder eliminates deployment lag; and its investment in a well-vetted standard component library minimizes security risks. Experimental results demonstrate that OpenZL achieves superior compression ratios and speeds compared to state-of-the-art general-purpose compressors on a variety of real-world datasets. Internal deployments at Meta have also shown consistent improvements in size and/or speed, with development timelines reduced from months to days. OpenZL thus represents a significant advance in practical, scalable, and maintainable data compression for modern data-intensive applications.
Multimedia 11
☆ MORE: Multi-Organ Medical Image REconstruction Dataset
CT reconstruction provides radiologists with images for diagnosis and treatment, yet current deep learning methods are typically limited to specific anatomies and datasets, hindering generalization ability to unseen anatomies and lesions. To address this, we introduce the Multi-Organ medical image REconstruction (MORE) dataset, comprising CT scans across 9 diverse anatomies with 15 lesion types. This dataset serves two key purposes: (1) enabling robust training of deep learning models on extensive, heterogeneous data, and (2) facilitating rigorous evaluation of model generalization for CT reconstruction. We further establish a strong baseline solution that outperforms prior approaches under these challenging conditions. Our results demonstrate that: (1) a comprehensive dataset helps improve the generalization capability of models, and (2) optimization-based methods offer enhanced robustness for unseen anatomies. The MORE dataset is freely accessible under CC-BY-NC 4.0 at our project page https://more-med.github.io/
comment: Accepted to ACMMM 2025
☆ Unveiling Intrinsic Text Bias in Multimodal Large Language Models through Attention Key-Space Analysis
Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data, limiting their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, we propose that the bias originates from the model's internal architecture. Specifically, we hypothesize that visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space learned during language-only pretraining. Consequently, these visual keys receive systematically lower similarity scores during attention computation, leading to their under-utilization in the context representation. To validate this hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their distributional structures using qualitative (t-SNE) and quantitative (Jensen-Shannon divergence) methods. The results provide direct evidence that visual and textual keys occupy markedly distinct subspaces within the attention space. The inter-modal divergence is statistically significant, exceeding intra-modal variation by several orders of magnitude. These findings reveal that text bias arises from an intrinsic misalignment within the attention key space rather than solely from external data factors.
☆ AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping
Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under Curve, Spearman, and Kendall.
comment: Accepted at 32nd International Conference on MultiMedia Modeling
☆ Contribution-Guided Asymmetric Learning for Robust Multimodal Fusion under Imbalance and Noise
Multimodal learning faces two major challenges: modality imbalance and data noise, which significantly affect the robustness and generalization ability of models. Existing methods achieve modality balance by suppressing dominant modalities, but they neglect the inherent differences in the information value between modalities, potentially leading to convergence to suboptimal solutions. This paper proposes an innovative modality compression paradigm, Contribution-Guided Asymmetric Learning (CAL), which aims to enhance the contribution of high-contribution modalities while compressing weak modalities to increase their contribution, allowing both to improve the performance of multimodal information fusion. CAL is based on a modality contribution metric W^m combining the information quantity I(m) and confidence D(m), and it designs an asymmetric gradient acceleration mechanism and a contribution-aware Asymmetric Information Bottleneck (AIB) compression mechanism. The former accelerates the gradient update of modalities, while the latter dynamically compresses the noise of low-contribution modalities. On five benchmark datasets, including emotion recognition, scene recognition, and event localization tasks, CAL has shown outstanding performance in imbalanced fusion tasks and noise robustness tests. On CREMA-D, KS, and AVE, CAL achieves 79.30%, 74.82%, and 74.21% accuracy, significantly outperforming the existing state-of-the-art model ARL. In high-noise robustness tests, CAL also achieved leading performance under various attack strategies on the MVSA-Single and NYUD2 datasets. These results validate the significant advantages of CAL in modality imbalance and noise interference. CAL, as a flexible and efficient framework, is easy to transfer to other tasks and has broad adaptability and potential application prospects.
☆ Multi-hop Parallel Image Semantic Communication for Distortion Accumulation Mitigation
Existing semantic communication schemes primarily focus on single-hop scenarios, overlooking the challenges of multi-hop wireless image transmission. As semantic communication is inherently lossy, distortion accumulates over multiple hops, leading to significant performance degradation. To address this, we propose the multi-hop parallel image semantic communication (MHPSC) framework, which introduces a parallel residual compensation link at each hop against distortion accumulation. To minimize the associated transmission bandwidth overhead, a coarse-to-fine residual compression scheme is designed. A deep learning-based residual compressor first condenses the residuals, followed by the adaptive arithmetic coding (AAC) for further compression. A residual distribution estimation module predicts the prior distribution for the AAC to achieve fine compression performances. This approach ensures robust multi-hop image transmission with only a minor increase in transmission bandwidth. Experimental results confirm that MHPSC outperforms both existing semantic communication and traditional separated coding schemes.
♻ ☆ Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
Integrating audio and visual data for training multimodal foundational models remains a challenge. The Audio-Video Vector Alignment (AVVA) framework addresses this by considering AV scene alignment beyond mere temporal synchronization, and leveraging Large Language Models (LLMs) for data curation. AVVA implements a scoring mechanism for selecting aligned training data segments. It integrates Whisper, a speech-based foundation model, for audio and DINOv2 for video analysis in a dual-encoder structure with contrastive learning on AV pairs. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate the effectiveness of the proposed model architecture and data curation approach. AVVA achieves a significant improvement in top-k accuracies for video-to-audio retrieval on all datasets compared to DenseAV, while using only 192 hrs of curated training data. Furthermore, an ablation study indicates that the data curation process effectively trades data quality for data quantity, yielding increases in top-k retrieval accuracies on AudioCaps, VALOR, and VGGSound, compared to training on the full spectrum of uncurated data.
comment: 5 pages, 5 figures, 2 tables. Accepted at EUSIPCO 2025
♻ ☆ ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation NeurIPS 2025
Speech signal analysis poses significant challenges, particularly in tasks such as speech quality evaluation and profiling, where the goal is to predict multiple perceptual and objective metrics. For instance, metrics like PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and MOS (Mean Opinion Score) each capture different aspects of speech quality. However, these metrics often have different scales, assumptions, and dependencies, making joint estimation non-trivial. To address these issues, we introduce ARECHO (Autoregressive Evaluation via Chain-based Hypothesis Optimization), a chain-based, versatile evaluation system for speech assessment grounded in autoregressive dependency modeling. ARECHO is distinguished by three key innovations: (1) a comprehensive speech information tokenization pipeline; (2) a dynamic classifier chain that explicitly captures inter-metric dependencies; and (3) a two-step confidence-oriented decoding algorithm that enhances inference reliability. Experiments demonstrate that ARECHO significantly outperforms the baseline framework across diverse evaluation scenarios, including enhanced speech analysis, speech generation evaluation, and, noisy speech evaluation. Furthermore, its dynamic dependency modeling improves interpretability by capturing inter-metric relationships. Across tasks, ARECHO offers reference-free evaluation using its dynamic classifier chain to support subset queries (single or multiple metrics) and reduces error propagation via confidence-oriented decoding.
comment: NeurIPS 2025 Spotlight
♻ ☆ Dependency Structure Augmented Contextual Scoping Framework for Multimodal Aspect-Based Sentiment Analysis
Multimodal Aspect-Based Sentiment Analysis (MABSA) seeks to extract fine-grained information from image-text pairs to identify aspect terms and determine their sentiment polarity. However, existing approaches often fall short in simultaneously addressing three core challenges: Sentiment Cue Perception (SCP), Multimodal Information Misalignment (MIM), and Semantic Noise Elimination (SNE). To overcome these limitations, we propose DASCO (\textbf{D}ependency Structure \textbf{A}ugmented \textbf{Sco}ping Framework), a fine-grained scope-oriented framework that enhances aspect-level sentiment reasoning by leveraging dependency parsing trees. First, we designed a multi-task pretraining strategy for MABSA on our base model, combining aspect-oriented enhancement, image-text matching, and aspect-level sentiment-sensitive cognition. This improved the model's perception of aspect terms and sentiment cues while achieving effective image-text alignment, addressing key challenges like SCP and MIM. Furthermore, we incorporate dependency trees as syntactic branch combining with semantic branch, guiding the model to selectively attend to critical contextual elements within a target-specific scope while effectively filtering out irrelevant noise for addressing SNE problem. Extensive experiments on two benchmark datasets across three subtasks demonstrate that DASCO achieves state-of-the-art performance in MABSA, with notable gains in JMASA (+2.3\% F1 and +3.5\% precision on Twitter2015). The source code is available at https://github.com/LHaoooo/DASCO .
♻ ☆ ReCon-GS: Continuum-Preserved Gaussian Streaming for Fast and Compact Reconstruction of Dynamic Scenes NeurIPS 2025
Online free-viewpoint video (FVV) reconstruction is challenged by slow per-frame optimization, inconsistent motion estimation, and unsustainable storage demands. To address these challenges, we propose the Reconfigurable Continuum Gaussian Stream, dubbed ReCon-GS, a novel storage-aware framework that enables high fidelity online dynamic scene reconstruction and real-time rendering. Specifically, we dynamically allocate multi-level Anchor Gaussians in a density-adaptive fashion to capture inter-frame geometric deformations, thereby decomposing scene motion into compact coarse-to-fine representations. Then, we design a dynamic hierarchy reconfiguration strategy that preserves localized motion expressiveness through on-demand anchor re-hierarchization, while ensuring temporal consistency through intra-hierarchical deformation inheritance that confines transformation priors to their respective hierarchy levels. Furthermore, we introduce a storage-aware optimization mechanism that flexibly adjusts the density of Anchor Gaussians at different hierarchy levels, enabling a controllable trade-off between reconstruction fidelity and memory usage. Extensive experiments on three widely used datasets demonstrate that, compared to state-of-the-art methods, ReCon-GS improves training efficiency by approximately 15% and achieves superior FVV synthesis quality with enhanced robustness and stability. Moreover, at equivalent rendering quality, ReCon-GS slashes memory requirements by over 50% compared to leading state-of-the-art methods.
comment: Published in NeurIPS 2025
♻ ☆ TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection EMNLP 2025
Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model's ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.
comment: EMNLP 2025 Oral; Project Homepage: https://yanzehong.github.io/trust-vl/
♻ ☆ PureKV: Plug-and-Play KV Cache Optimization with Spatial-Temporal Sparse Attention for Vision-Language Large Models
Vision-Language Large Models (VLLMs) face significant efficiency challenges when processing high-resolution inputs. The quadratic complexity in attention and autoregressive generation, as well as the constantly growing key value (KV) cache size, severely hinder the prefilling and decoding stages. Recent efforts have attempted to compress KV cache by identifying and pruning KV cache of less important tokens, but these methods typically rely on attention scores to estimate token importance, making them incompatible with efficient attention mechanisms such as FlashAttention and Sparse Attention, which do not explicitly compute attention matrices. Moreover, existing methods overlook how sparse attention, while accelerating the prefilling stage, alters the information structure of the KV cache, thereby compromising the effectiveness of downstream KV cache compression strategies. To address this issue, we propose PureKV, a plug-and-play framework for joint optimization of sparse attention and KV cache compression. We first introduce a KV cache compression strategy that is fully compatible with efficient attention accelerators. Our method utilizes lower layer attention scores to estimate the importance of high layers' KV cache, enabling active pruning without compromising accuracy. In addition, we have designed a Spatial-Temporal Sparse Attention (ST-SpAttn) module specifically tailored for video KV cache compression algorithms. This module combines spatial and temporal attention sparsity to improve the compression efficiency of KV cache optimization algorithms by purifying spatial noise and temporal redundancy in KV cache. At the same time, ST-SpAttn also accelerated the prefilling stage of VLLMs. Extensive experiments on VLLMs (VideoLLaMA2, Qwen2.5-VL) have shown that PureKV achieves 5.0 times KV cache compression and 3.16 times prefill acceleration, with negligible quality degradation.
Image and Video Processing 16
☆ Surpassing state of the art on AMD area estimation from RGB fundus images through careful selection of U-Net architectures and loss functions for class imbalance
Age-related macular degeneration (AMD) is one of the leading causes of irreversible vision impairment in people over the age of 60. This research focuses on semantic segmentation for AMD lesion detection in RGB fundus images, a non-invasive and cost-effective imaging technique. The results of the ADAM challenge - the most comprehensive AMD detection from RGB fundus images research competition and open dataset to date - serve as a benchmark for our evaluation. Taking the U-Net connectivity as a base of our framework, we evaluate and compare several approaches to improve the segmentation model's architecture and training pipeline, including pre-processing techniques, encoder (backbone) deep network types of varying complexity, and specialized loss functions to mitigate class imbalances on image and pixel levels. The main outcome of this research is the final configuration of the AMD detection framework, which outperforms all the prior ADAM challenge submissions on the multi-class segmentation of different AMD lesion types in non-invasive RGB fundus images. The source code used to conduct the experiments presented in this paper is made freely available.
☆ MORE: Multi-Organ Medical Image REconstruction Dataset
CT reconstruction provides radiologists with images for diagnosis and treatment, yet current deep learning methods are typically limited to specific anatomies and datasets, hindering generalization ability to unseen anatomies and lesions. To address this, we introduce the Multi-Organ medical image REconstruction (MORE) dataset, comprising CT scans across 9 diverse anatomies with 15 lesion types. This dataset serves two key purposes: (1) enabling robust training of deep learning models on extensive, heterogeneous data, and (2) facilitating rigorous evaluation of model generalization for CT reconstruction. We further establish a strong baseline solution that outperforms prior approaches under these challenging conditions. Our results demonstrate that: (1) a comprehensive dataset helps improve the generalization capability of models, and (2) optimization-based methods offer enhanced robustness for unseen anatomies. The MORE dataset is freely accessible under CC-BY-NC 4.0 at our project page https://more-med.github.io/
comment: Accepted to ACMMM 2025
☆ ProstNFound+: A Prospective Study using Medical Foundation Models for Prostate Cancer Detection
Purpose: Medical foundation models (FMs) offer a path to build high-performance diagnostic systems. However, their application to prostate cancer (PCa) detection from micro-ultrasound ({\mu}US) remains untested in clinical settings. We present ProstNFound+, an adaptation of FMs for PCa detection from {\mu}US, along with its first prospective validation. Methods: ProstNFound+ incorporates a medical FM, adapter tuning, and a custom prompt encoder that embeds PCa-specific clinical biomarkers. The model generates a cancer heatmap and a risk score for clinically significant PCa. Following training on multi-center retrospective data, the model is prospectively evaluated on data acquired five years later from a new clinical site. Model predictions are benchmarked against standard clinical scoring protocols (PRI-MUS and PI-RADS). Results: ProstNFound+ shows strong generalization to the prospective data, with no performance degradation compared to retrospective evaluation. It aligns closely with clinical scores and produces interpretable heatmaps consistent with biopsy-confirmed lesions. Conclusion: The results highlight its potential for clinical deployment, offering a scalable and interpretable alternative to expert-driven protocols.
☆ BRIQA: Balanced Reweighting in Image Quality Assessment of Pediatric Brain MRI
Assessing the severity of artifacts in pediatric brain Magnetic Resonance Imaging (MRI) is critical for diagnostic accuracy, especially in low-field systems where the signal-to-noise ratio is reduced. Manual quality assessment is time-consuming and subjective, motivating the need for robust automated solutions. In this work, we propose BRIQA (Balanced Reweighting in Image Quality Assessment), which addresses class imbalance in artifact severity levels. BRIQA uses gradient-based loss reweighting to dynamically adjust per-class contributions and employs a rotating batching scheme to ensure consistent exposure to underrepresented classes. Through experiments, no single architecture performs best across all artifact types, emphasizing the importance of architectural diversity. The rotating batching configuration improves performance across metrics by promoting balanced learning when combined with cross-entropy loss. BRIQA improves average macro F1 score from 0.659 to 0.706, with notable gains in Noise (0.430), Zipper (0.098), Positioning (0.097), Contrast (0.217), Motion (0.022), and Banding (0.012) artifact severity classification. The code is available at https://github.com/BioMedIA-MBZUAI/BRIQA.
☆ SAMRI: Segment Anything Model for MRI
Accurate magnetic resonance imaging (MRI) segmentation is crucial for clinical decision-making, but remains labor-intensive when performed manually. Convolutional neural network (CNN)-based methods can be accurate and efficient, but often generalize poorly to MRI's variable contrast, intensity inhomogeneity, and protocols. Although the transformer-based Segment Anything Model (SAM) has demonstrated remarkable generalizability in natural images, existing adaptations often treat MRI as another imaging modality, overlooking these modality-specific challenges. We present SAMRI, an MRI-specialized SAM trained and validated on 1.1 million labeled MR slices spanning whole-body organs and pathologies. We demonstrate that SAM can be effectively adapted to MRI by simply fine-tuning its mask decoder using a two-stage strategy, reducing training time by 94% and trainable parameters by 96% versus full-model retraining. Across diverse MRI segmentation tasks, SAMRI achieves a mean Dice of 0.87, delivering state-of-the-art accuracy across anatomical regions and robust generalization on unseen structures, particularly small and clinically important structures.
☆ CYPRESS: Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing
Accurate and timely crop yield prediction is crucial for global food security and modern agricultural management. Traditional methods often lack the scalability and granularity required for precision farming. This paper introduces CYPRESS (Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing), a deep learning model designed for high-resolution, intra-field canola yield prediction. CYPRESS leverages a pre-trained, large-scale geospatial foundation model (Prithvi-EO-2.0-600M) and adapts it for a continuous regression task, transforming multi-temporal satellite imagery into dense, pixel-level yield maps. Evaluated on a comprehensive dataset from the Canadian Prairies, CYPRESS demonstrates superior performance over existing deep learning-based yield prediction models, highlighting the effectiveness of fine-tuning foundation models for specialized agricultural applications. By providing a continuous, high-resolution output, CYPRESS offers a more actionable tool for precision agriculture than conventional classification or county-level aggregation methods. This work validates a novel approach that bridges the gap between large-scale Earth observation and on-farm decision-making, offering a scalable solution for detailed agricultural monitoring.
☆ Comparative Analysis of Deep Learning Models for Olive Tree Crown and Shadow Segmentation Towards Biovolume Estimation
Olive tree biovolume estimation is a key task in precision agriculture, supporting yield prediction and resource management, especially in Mediterranean regions severely impacted by climate-induced stress. This study presents a comparative analysis of three deep learning models U-Net, YOLOv11m-seg, and Mask RCNN for segmenting olive tree crowns and their shadows in ultra-high resolution UAV imagery. The UAV dataset, acquired over Vicopisano, Italy, includes manually annotated crown and shadow masks. Building on these annotations, the methodology emphasizes spatial feature extraction and robust segmentation; per-tree biovolume is then estimated by combining crown projected area with shadow-derived height using solar geometry. In testing, Mask R-CNN achieved the best overall accuracy (F1 = 0.86; mIoU = 0.72), while YOLOv11m-seg provided the fastest throughput (0.12 second per image). The estimated biovolumes spanned from approximately 4 to 24 cubic meters, reflecting clear structural differences among trees. These results indicate Mask R-CNN is preferable when biovolume accuracy is paramount, whereas YOLOv11m-seg suits large-area deployments where speed is critical; U-Net remains a lightweight, high-sensitivity option. The framework enables accurate, scalable orchard monitoring and can be further strengthened with DEM or DSM integration and field calibration for operational decision support.
comment: 6 pages, 2025 IEEE International Workshop on Metrology for Agriculture and Forestry (MetroAgriFor)
☆ SPG-CDENet: Spatial Prior-Guided Cross Dual Encoder Network for Multi-Organ Segmentation
Multi-organ segmentation is a critical task in computer-aided diagnosis. While recent deep learning methods have achieved remarkable success in image segmentation, huge variations in organ size and shape challenge their effectiveness in multi-organ segmentation. To address these challenges, we propose a Spatial Prior-Guided Cross Dual Encoder Network (SPG-CDENet), a novel two-stage segmentation paradigm designed to improve multi-organ segmentation accuracy. Our SPG-CDENet consists of two key components: a spatial prior network and a cross dual encoder network. The prior network generates coarse localization maps that delineate the approximate ROI, serving as spatial guidance for the dual encoder network. The cross dual encoder network comprises four essential components: a global encoder, a local encoder, a symmetric cross-attention module, and a flow-based decoder. The global encoder captures global semantic features from the entire image, while the local encoder focuses on features from the prior network. To enhance the interaction between the global and local encoders, a symmetric cross-attention module is proposed across all layers of the encoders to fuse and refine features. Furthermore, the flow-based decoder directly propagates high-level semantic features from the final encoder layer to all decoder layers, maximizing feature preservation and utilization. Extensive qualitative and quantitative experiments on two public datasets demonstrate the superior performance of SPG-CDENet compared to existing segmentation methods. Furthermore, ablation studies further validate the effectiveness of the proposed modules in improving segmentation accuracy.
☆ BitSemCom: A Bit-Level Semantic Communication Framework with Learnable Probabilistic Mapping
Most existing semantic communication systems employ analog modulation, which is incompatible with modern digital communication systems. Although several digital transmission approaches have been proposed to address this issue, an end-to-end bit-level method that is compatible with arbitrary modulation formats, robust to channel noise, and free from quantization errors remains lacking. To this end, we propose BitSemCom, a novel bit-level semantic communication framework that realizes true joint source-channel coding (JSCC) at the bit level. Specifically, we introduce a modular learnable bit mapper that establishes a probabilistic mapping between continuous semantic features and discrete bits, utilizing the Gumbel-Softmax trick to enable differentiable bit generation. Simulation results on image transmission demonstrate that BitSemCom achieves both competitive performance and superior robustness compared to traditional separate source-channel coding (SSCC) schemes, and outperforms deep learning based JSCC with uniform 1-bit quantization, validating the effectiveness of the learnable bit mapper. Despite these improvements, the bit mapper adds only 0.42% parameters and 0.09% computational complexity, making BitSemCom a lightweight and practical solution for real-world semantic communication.
☆ Functional Connectome Fingerprinting Using Convolutional and Dictionary Learning
Advances in data analysis and machine learning have revolutionized the study of brain signatures using fMRI, enabling non-invasive exploration of cognition and behavior through individual neural patterns. Functional connectivity (FC), which quantifies statistical relationships between brain regions, has emerged as a key metric for studying individual variability and developing biomarkers for personalized medicine in neurological and psychiatric disorders. The concept of subject fingerprinting, introduced by Finn et al. (2015), leverages neural connectivity variability to identify individuals based on their unique patterns. While traditional FC methods perform well on small datasets, machine learning techniques are more effective with larger datasets, isolating individual-specific features and maximizing inter-subject differences. In this study, we propose a framework combining convolutional autoencoders and sparse dictionary learning to enhance fingerprint accuracy. Autoencoders capture shared connectivity patterns while isolating subject-specific features in residual FC matrices, which are analyzed using sparse coding to identify distinctive features. Tested on the Human Connectome Project dataset, this approach achieved a 10% improvement over baseline group-averaged FC models. Our results highlight the potential of integrating deep learning and sparse coding techniques for scalable and robust functional connectome fingerprinting, advancing personalized neuroscience applications and biomarker development.
comment: 10 pages, 4 tables, and 11 figures
☆ SYNAPSE-Net: A Unified Framework with Lesion-Aware Hierarchical Gating for Robust Segmentation of Heterogeneous Brain Lesions
Automated segmentation of heterogeneous brain lesions from multi-modal MRI remains a critical challenge in clinical neuroimaging. Current deep learning models are typically specialized `point solutions' that lack generalization and high performance variance, limiting their clinical reliability. To address these gaps, we propose the Unified Multi-Stream SYNAPSE-Net, an adaptive framework designed for both generalization and robustness. The framework is built on a novel hybrid architecture integrating multi-stream CNN encoders, a Swin Transformer bottleneck for global context, a dynamic cross-modal attention fusion (CMAF) mechanism, and a hierarchical gated decoder for high-fidelity mask reconstruction. The architecture is trained with a variance reduction strategy that combines pathology specific data augmentation and difficulty-aware sampling method. The model was evaluated on three different challenging public datasets: the MICCAI 2017 WMH Challenge, the ISLES 2022 Challenge, and the BraTS 2020 Challenge. Our framework attained a state-of-the-art DSC value of 0.831 with the HD95 value of 3.03 in the WMH dataset. For ISLES 2022, it achieved the best boundary accuracy with a statistically significant difference (HD95 value of 9.69). For BraTS 2020, it reached the highest DSC value for the tumor core region (0.8651). These experimental findings suggest that our unified adaptive framework achieves state-of-the-art performance across multiple brain pathologies, providing a robust and clinically feasible solution for automated segmentation. The source code and the pre-trained models are available at https://github.com/mubid-01/SYNAPSE-Net-pre.
comment: 17 pages, 10 figures, 8 tables, submitted to "Medical Image Analysis" journal
☆ Multi-hop Parallel Image Semantic Communication for Distortion Accumulation Mitigation
Existing semantic communication schemes primarily focus on single-hop scenarios, overlooking the challenges of multi-hop wireless image transmission. As semantic communication is inherently lossy, distortion accumulates over multiple hops, leading to significant performance degradation. To address this, we propose the multi-hop parallel image semantic communication (MHPSC) framework, which introduces a parallel residual compensation link at each hop against distortion accumulation. To minimize the associated transmission bandwidth overhead, a coarse-to-fine residual compression scheme is designed. A deep learning-based residual compressor first condenses the residuals, followed by the adaptive arithmetic coding (AAC) for further compression. A residual distribution estimation module predicts the prior distribution for the AAC to achieve fine compression performances. This approach ensures robust multi-hop image transmission with only a minor increase in transmission bandwidth. Experimental results confirm that MHPSC outperforms both existing semantic communication and traditional separated coding schemes.
♻ ☆ ReCon-GS: Continuum-Preserved Gaussian Streaming for Fast and Compact Reconstruction of Dynamic Scenes NeurIPS 2025
Online free-viewpoint video (FVV) reconstruction is challenged by slow per-frame optimization, inconsistent motion estimation, and unsustainable storage demands. To address these challenges, we propose the Reconfigurable Continuum Gaussian Stream, dubbed ReCon-GS, a novel storage-aware framework that enables high fidelity online dynamic scene reconstruction and real-time rendering. Specifically, we dynamically allocate multi-level Anchor Gaussians in a density-adaptive fashion to capture inter-frame geometric deformations, thereby decomposing scene motion into compact coarse-to-fine representations. Then, we design a dynamic hierarchy reconfiguration strategy that preserves localized motion expressiveness through on-demand anchor re-hierarchization, while ensuring temporal consistency through intra-hierarchical deformation inheritance that confines transformation priors to their respective hierarchy levels. Furthermore, we introduce a storage-aware optimization mechanism that flexibly adjusts the density of Anchor Gaussians at different hierarchy levels, enabling a controllable trade-off between reconstruction fidelity and memory usage. Extensive experiments on three widely used datasets demonstrate that, compared to state-of-the-art methods, ReCon-GS improves training efficiency by approximately 15% and achieves superior FVV synthesis quality with enhanced robustness and stability. Moreover, at equivalent rendering quality, ReCon-GS slashes memory requirements by over 50% compared to leading state-of-the-art methods.
comment: Published in NeurIPS 2025
♻ ☆ FASL-Seg: Anatomy and Tool Segmentation of Surgical Scenes ECAI
The growing popularity of robotic minimally invasive surgeries has made deep learning-based surgical training a key area of research. A thorough understanding of the surgical scene components is crucial, which semantic segmentation models can help achieve. However, most existing work focuses on surgical tools and overlooks anatomical objects. Additionally, current state-of-the-art (SOTA) models struggle to balance capturing high-level contextual features and low-level edge features. We propose a Feature-Adaptive Spatial Localization model (FASL-Seg), designed to capture features at multiple levels of detail through two distinct processing streams, namely a Low-Level Feature Projection (LLFP) and a High-Level Feature Projection (HLFP) stream, for varying feature resolutions - enabling precise segmentation of anatomy and surgical instruments. We evaluated FASL-Seg on surgical segmentation benchmark datasets EndoVis18 and EndoVis17 on three use cases. The FASL-Seg model achieves a mean Intersection over Union (mIoU) of 72.71% on parts and anatomy segmentation in EndoVis18, improving on SOTA by 5%. It further achieves a mIoU of 85.61% and 72.78% in EndoVis18 and EndoVis17 tool type segmentation, respectively, outperforming SOTA overall performance, with comparable per-class SOTA results in both datasets and consistent performance in various classes for anatomy and instruments, demonstrating the effectiveness of distinct processing streams for varying feature resolutions.
comment: 8 pages, 6 figures, In Proceedings of European Conference on Artificial Intelligence (ECAI) 2025
♻ ☆ Neighborhood Feature Pooling for Remote Sensing Image Classification
In this work, we propose neighborhood feature pooling (NFP) as a novel texture feature extraction method for remote sensing image classification. The NFP layer captures relationships between neighboring inputs and efficiently aggregates local similarities across feature dimensions. Implemented using convolutional layers, NFP can be seamlessly integrated into any network. Results comparing the baseline models and the NFP method indicate that NFP consistently improves performance across diverse datasets and architectures while maintaining minimal parameter overhead.
comment: 9 pages, 5 figures
♻ ☆ GCVAMD: A Modified CausalVAE Model for Causal Age-related Macular Degeneration Risk Factor Detection and Prediction
Age Related Macular Degeneration(AMD) has been one of the most leading causes of permanent vision impairment in ophthalmology. Though treatments, such as anti VEGF drugs or photodynamic therapies, were developed to slow down the degenerative process of AMD, there is still no specific cure to reverse vision loss caused by AMD. Thus, for AMD, detecting existence of risk factors of AMD or AMD itself within the patient retina in early stages is a crucial task to reduce the possibility of vision impairment. Apart from traditional approaches, deep learning based methods, especially attention mechanism based CNNs and GradCAM based XAI analysis on OCT scans, exhibited successful performance in distinguishing AMD retina from normal retinas, making it possible to use AI driven models to aid medical diagnosis and analysis by ophthalmologists regarding AMD. However, though having significant success, previous works mostly focused on prediction performance itself, not pathologies or underlying causal mechanisms of AMD, which can prohibit intervention analysis on specific factors or even lead to less reliable decisions. Thus, this paper introduces a novel causal AMD analysis model: GCVAMD, which incorporates a modified CausalVAE approach that can extract latent causal factors from only raw OCT images. By considering causality in AMD detection, GCVAMD enables causal inference such as treatment simulation or intervention analysis regarding major risk factors: drusen and neovascularization, while returning informative latent causal features that can enhance downstream tasks. Results show that through GCVAMD, drusen status and neovascularization status can be identified with AMD causal mechanisms in GCVAMD latent spaces, which can in turn be used for various tasks from AMD detection(classification) to intervention analysis.
Computer Vision and Pattern Recognition 118
☆ VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning
Visual effects (VFX) are crucial to the expressive power of digital media, yet their creation remains a major challenge for generative AI. Prevailing methods often rely on the one-LoRA-per-effect paradigm, which is resource-intensive and fundamentally incapable of generalizing to unseen effects, thus limiting scalability and creation. To address this challenge, we introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content. In addition, it demonstrates remarkable generalization to unseen effect categories. Specifically, we design an in-context conditioning strategy that prompts the model with a reference example. An in-context attention mask is designed to precisely decouple and inject the essential effect attributes, allowing a single unified model to master the effect imitation without information leakage. In addition, we propose an efficient one-shot effect adaptation mechanism to boost generalization capability on tough unseen effects from a single user-provided video rapidly. Extensive experiments demonstrate that our method effectively imitates various categories of effect information and exhibits outstanding generalization to out-of-domain effects. To foster future research, we will release our code, models, and a comprehensive dataset to the community.
comment: Project Page URL:https://libaolu312.github.io/VFXMaster/
☆ FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility.
☆ Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks
Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updated information about this survey, codes and implementation of the open benchmarks can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.
☆ Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation
Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates the alignment between the draft and target model outputs, coupled with the inadequate use of the two-dimensional spatial structure inherent in images, thereby limiting the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models, while preserving both image fidelity and diversity.
☆ Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning
Training deep neural networks (DNNs) with backpropagation (BP) achieves state-of-the-art accuracy but requires global error propagation and full parameterization, leading to substantial memory and computational overhead. Direct Feedback Alignment (DFA) enables local, parallelizable updates with lower memory requirements but is limited by unstructured feedback and poor scalability in deeper architectures, specially convolutional neural networks. To address these limitations, we propose a structured local learning framework that operates directly on low-rank manifolds defined by the Singular Value Decomposition (SVD) of weight matrices. Each layer is trained in its decomposed form, with updates applied to the SVD components using a composite loss that integrates cross-entropy, subspace alignment, and orthogonality regularization. Feedback matrices are constructed to match the SVD structure, ensuring consistent alignment between forward and feedback pathways. Our method reduces the number of trainable parameters relative to the original DFA model, without relying on pruning or post hoc compression. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method achieves accuracy comparable to that of BP. Ablation studies confirm the importance of each loss term in the low-rank setting. These results establish local learning on low-rank manifolds as a principled and scalable alternative to full-rank gradient-based training.
☆ RegionE: Adaptive Region-Aware Generation for Efficient Image Editing
Recently, instruction-based image editing (IIE) has received widespread attention. In practice, IIE often modifies only specific regions of an image, while the remaining areas largely remain unchanged. Although these two types of regions differ significantly in generation difficulty and computational redundancy, existing IIE models do not account for this distinction, instead applying a uniform generation process across the entire image. This motivates us to propose RegionE, an adaptive, region-aware generation framework that accelerates IIE tasks without additional training. Specifically, the RegionE framework consists of three main components: 1) Adaptive Region Partition. We observed that the trajectory of unedited regions is straight, allowing for multi-step denoised predictions to be inferred in a single step. Therefore, in the early denoising stages, we partition the image into edited and unedited regions based on the difference between the final estimated result and the reference image. 2) Region-Aware Generation. After distinguishing the regions, we replace multi-step denoising with one-step prediction for unedited areas. For edited regions, the trajectory is curved, requiring local iterative denoising. To improve the efficiency and quality of local iterative generation, we propose the Region-Instruction KV Cache, which reduces computational cost while incorporating global information. 3) Adaptive Velocity Decay Cache. Observing that adjacent timesteps in edited regions exhibit strong velocity similarity, we further propose an adaptive velocity decay cache to accelerate the local denoising process. We applied RegionE to state-of-the-art IIE base models, including Step1X-Edit, FLUX.1 Kontext, and Qwen-Image-Edit. RegionE achieved acceleration factors of 2.57, 2.41, and 2.06. Evaluations by GPT-4o confirmed that semantic and perceptual fidelity were well preserved.
comment: 26 pages, 10 figures, 18 tables
☆ Comparative Study of UNet-based Architectures for Liver Tumor Segmentation in Multi-Phase Contrast-Enhanced Computed Tomography
Segmentation of liver structures in multi-phase contrast-enhanced computed tomography (CECT) plays a crucial role in computer-aided diagnosis and treatment planning for liver diseases, including tumor detection. In this study, we investigate the performance of UNet-based architectures for liver tumor segmentation, starting from the original UNet and extending to UNet3+ with various backbone networks. We evaluate ResNet, Transformer-based, and State-space (Mamba) backbones, all initialized with pretrained weights. Surprisingly, despite the advances in modern architecture, ResNet-based models consistently outperform Transformer- and Mamba-based alternatives across multiple evaluation metrics. To further improve segmentation quality, we introduce attention mechanisms into the backbone and observe that incorporating the Convolutional Block Attention Module (CBAM) yields the best performance. ResNetUNet3+ with CBAM module not only produced the best overlap metrics with a Dice score of 0.755 and IoU of 0.662, but also achieved the most precise boundary delineation, evidenced by the lowest HD95 distance of 77.911. The model's superiority was further cemented by its leading overall accuracy of 0.925 and specificity of 0.926, showcasing its robust capability in accurately identifying both lesion and healthy tissue. To further enhance interpretability, Grad-CAM visualizations were employed to highlight the region's most influential predictions, providing insights into its decision-making process. These findings demonstrate that classical ResNet architecture, when combined with modern attention modules, remain highly competitive for medical image segmentation tasks, offering a promising direction for liver tumor detection in clinical practice.
comment: 27 pages, 8 figures
☆ FaCT: Faithful Concept Traces for Explaining Neural Network Decisions NeurIPS 2025
Deep networks have shown remarkable performance across a wide range of tasks, yet getting a global concept-level understanding of how they function remains a key challenge. Many post-hoc concept-based approaches have been introduced to understand their workings, yet they are not always faithful to the model. Further, they make restrictive assumptions on the concepts a model learns, such as class-specificity, small spatial extent, or alignment to human expectations. In this work, we put emphasis on the faithfulness of such concept-based explanations and propose a new model with model-inherent mechanistic concept-explanations. Our concepts are shared across classes and, from any layer, their contribution to the logit and their input-visualization can be faithfully traced. We also leverage foundation models to propose a new concept-consistency metric, C$^2$-Score, that can be used to evaluate concept-based methods. We show that, compared to prior work, our concepts are quantitatively more consistent and users find our concepts to be more interpretable, all while retaining competitive ImageNet performance.
comment: Accepted to NeurIPS 2025; Code is available at https://github.com/m-parchami/FaCT
☆ SPADE: Sparsity Adaptive Depth Estimator for Zero-Shot, Real-Time, Monocular Depth Estimation in Underwater Environments
Underwater infrastructure requires frequent inspection and maintenance due to harsh marine conditions. Current reliance on human divers or remotely operated vehicles is limited by perceptual and operational challenges, especially around complex structures or in turbid water. Enhancing the spatial awareness of underwater vehicles is key to reducing piloting risks and enabling greater autonomy. To address these challenges, we present SPADE: SParsity Adaptive Depth Estimator, a monocular depth estimation pipeline that combines pre-trained relative depth estimator with sparse depth priors to produce dense, metric scale depth maps. Our two-stage approach first scales the relative depth map with the sparse depth points, then refines the final metric prediction with our proposed Cascade Conv-Deformable Transformer blocks. Our approach achieves improved accuracy and generalisation over state-of-the-art baselines and runs efficiently at over 15 FPS on embedded hardware, promising to support practical underwater inspection and intervention. This work has been submitted to IEEE Journal of Oceanic Engineering Special Issue of AUV 2026.
☆ More than a Moment: Towards Coherent Sequences of Audio Descriptions
Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
☆ Instance-Level Composed Image Retrieval NeurIPS 2025
The progress of composed image retrieval (CIR), a popular research direction in image retrieval, where a combined visual and textual query is used, is held back by the absence of high-quality training and evaluation data. We introduce a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an instance-level class definition. The goal is to retrieve images that contain the same particular object as the visual query, presented under a variety of modifications defined by textual queries. Its design and curation process keep the dataset compact to facilitate future research, while maintaining its challenge-comparable to retrieval among more than 40M random distractors-through a semi-automated selection of hard negatives. To overcome the challenge of obtaining clean, diverse, and suitable training data, we leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC. The method separately estimates query-image-to-image and query-text-to-image similarities, performing late fusion to upweight images that satisfy both queries, while down-weighting those that exhibit high similarity with only one of the two. Each individual similarity is further improved by a set of components that are simple and intuitive. BASIC sets a new state of the art on i-CIR but also on existing CIR datasets that follow a semantic-level class definition. Project page: https://vrg.fel.cvut.cz/icir/.
comment: NeurIPS 2025
Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers
Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP) - based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via traditional federated averaging technique on the same. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.
☆ 3D CT-Based Coronary Calcium Assessment: A Feature-Driven Machine Learning Framework MICCAI
Coronary artery calcium (CAC) scoring plays a crucial role in the early detection and risk stratification of coronary artery disease (CAD). In this study, we focus on non-contrast coronary computed tomography angiography (CCTA) scans, which are commonly used for early calcification detection in clinical settings. To address the challenge of limited annotated data, we propose a radiomics-based pipeline that leverages pseudo-labeling to generate training labels, thereby eliminating the need for expert-defined segmentations. Additionally, we explore the use of pretrained foundation models, specifically CT-FM and RadImageNet, to extract image features, which are then used with traditional classifiers. We compare the performance of these deep learning features with that of radiomics features. Evaluation is conducted on a clinical CCTA dataset comprising 182 patients, where individuals are classified into two groups: zero versus non-zero calcium scores. We further investigate the impact of training on non-contrast datasets versus combined contrast and non-contrast datasets, with testing performed only on non contrast scans. Results show that radiomics-based models significantly outperform CNN-derived embeddings from foundation models (achieving 84% accuracy and p<0.05), despite the unavailability of expert annotations.
comment: 11 pages, 2 Figures, MICCAI AMAI 2025 workshop, to be published in Volume 16206 of the Lecture Notes in Computer Science series
☆ Informative Sample Selection Model for Skeleton-based Action Recognition with Limited Training Samples
Skeleton-based human action recognition aims to classify human skeletal sequences, which are spatiotemporal representations of actions, into predefined categories. To reduce the reliance on costly annotations of skeletal sequences while maintaining competitive recognition accuracy, the task of 3D Action Recognition with Limited Training Samples, also known as semi-supervised 3D Action Recognition, has been proposed. In addition, active learning, which aims to proactively select the most informative unlabeled samples for annotation, has been explored in semi-supervised 3D Action Recognition for training sample selection. Specifically, researchers adopt an encoder-decoder framework to embed skeleton sequences into a latent space, where clustering information, combined with a margin-based selection strategy using a multi-head mechanism, is utilized to identify the most informative sequences in the unlabeled set for annotation. However, the most representative skeleton sequences may not necessarily be the most informative for the action recognizer, as the model may have already acquired similar knowledge from previously seen skeleton samples. To solve it, we reformulate Semi-supervised 3D action recognition via active learning from a novel perspective by casting it as a Markov Decision Process (MDP). Built upon the MDP framework and its training paradigm, we train an informative sample selection model to intelligently guide the selection of skeleton sequences for annotation. To enhance the representational capacity of the factors in the state-action pairs within our method, we project them from Euclidean space to hyperbolic space. Furthermore, we introduce a meta tuning strategy to accelerate the deployment of our method in real-world scenarios. Extensive experiments on three 3D action recognition benchmarks demonstrate the effectiveness of our method.
comment: Accepted by IEEE Transactions on Image Processing (TIP), 2025
☆ StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA
The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) Static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) The absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, We introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives object state transition-based reasoning paths using large language models, and ensures logical coherence through human-verified validation. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT and its construction toolkit can be accessed at https://github.com/Fleeting-hyh/StreamingCoT.
☆ MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
comment: Accepted by SenSys 2026
☆ Prototype-Driven Adaptation for Few-Shot Object Detection
Few-shot object detection (FSOD) often suffers from base-class bias and unstable calibration when only a few novel samples are available. We propose Prototype-Driven Alignment (PDA), a lightweight, plug-in metric head for DeFRCN that provides a prototype-based "second opinion" complementary to the linear classifier. PDA maintains support-only prototypes in a learnable identity-initialized projection space and optionally applies prototype-conditioned RoI alignment to reduce geometric mismatch. During fine-tuning, prototypes can be adapted via exponential moving average(EMA) updates on labeled foreground RoIs-without introducing class-specific parameters-and are frozen at inference to ensure strict protocol compliance. PDA employs a best-of-K matching scheme to capture intra-class multi-modality and temperature-scaled fusion to combine metric similarities with detector logits. Experiments on VOC FSOD and GFSOD benchmarks show that PDA consistently improves novel-class performance with minimal impact on base classes and negligible computational overhead.
comment: 7 pages,1 figure,2 tables,Preprint
☆ Seeing Clearly and Deeply: An RGBD Imaging Approach with a Bio-inspired Monocentric Design
Achieving high-fidelity, compact RGBD imaging presents a dual challenge: conventional compact optics struggle with RGB sharpness across the entire depth-of-field, while software-only Monocular Depth Estimation (MDE) is an ill-posed problem reliant on unreliable semantic priors. While deep optics with elements like DOEs can encode depth, they introduce trade-offs in fabrication complexity and chromatic aberrations, compromising simplicity. To address this, we first introduce a novel bio-inspired all-spherical monocentric lens, around which we build the Bionic Monocentric Imaging (BMI) framework, a holistic co-design. This optical design naturally encodes depth into its depth-varying Point Spread Functions (PSFs) without requiring complex diffractive or freeform elements. We establish a rigorous physically-based forward model to generate a synthetic dataset by precisely simulating the optical degradation process. This simulation pipeline is co-designed with a dual-head, multi-scale reconstruction network that employs a shared encoder to jointly recover a high-fidelity All-in-Focus (AiF) image and a precise depth map from a single coded capture. Extensive experiments validate the state-of-the-art performance of the proposed framework. In depth estimation, the method attains an Abs Rel of 0.026 and an RMSE of 0.130, markedly outperforming leading software-only approaches and other deep optics systems. For image restoration, the system achieves an SSIM of 0.960 and a perceptual LPIPS score of 0.082, thereby confirming a superior balance between image fidelity and depth accuracy. This study illustrates that the integration of bio-inspired, fully spherical optics with a joint reconstruction algorithm constitutes an effective strategy for addressing the intrinsic challenges in high-performance compact RGBD imaging. Source code will be publicly available at https://github.com/ZongxiYu-ZJU/BMI.
comment: The source code will be publicly available at https://github.com/ZongxiYu-ZJU/BMI
☆ GaTector+: A Unified Head-free Framework for Gaze Object and Gaze Following Prediction
Gaze object detection and gaze following are fundamental tasks for interpreting human gaze behavior or intent. However, most previous methods usually solve these two tasks separately, and their prediction of gaze objects and gaze following typically depend on head-related prior knowledge during both the training phase and real-world deployment. This dependency necessitates an auxiliary network to extract head location, thus precluding joint optimization across the entire system and constraining the practical applicability. To this end, we propose GaTector+, a unified framework for gaze object detection and gaze following, which eliminates the dependence on the head-related priors during inference. Specifically, GaTector+ uses an expanded specific-general-specific feature extractor that leverages a shared backbone, which extracts general features for gaze following and object detection using the shared backbone while using specific blocks before and after the shared backbone to better consider the specificity of each sub-task. To obtain head-related knowledge without prior information, we first embed a head detection branch to predict the head of each person. Then, before regressing the gaze point, a head-based attention mechanism is proposed to fuse the sense feature and gaze feature with the help of head location. Since the suboptimization of the gaze point heatmap leads to the performance bottleneck, we propose an attention supervision mechanism to accelerate the learning of the gaze heatmap. Finally, we propose a novel evaluation metric, mean Similarity over Candidates (mSoC), for gaze object detection, which is more sensitive to variations between bounding boxes. The experimental results on multiple benchmark datasets demonstrate the effectiveness of our model in both gaze object detection and gaze following tasks.
☆ Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation NeurIPS 2025
Source-free domain adaptation (SFDA) is a challenging task that tackles domain shifts using only a pre-trained source model and unlabeled target data. Existing SFDA methods are restricted by the fundamental limitation of source-target domain discrepancy. Non-generation SFDA methods suffer from unreliable pseudo-labels in challenging scenarios with large domain discrepancies, while generation-based SFDA methods are evidently degraded due to enlarged domain discrepancies in creating pseudo-source data. To address this limitation, we propose a novel generation-based framework named Diffusion-Driven Progressive Target Manipulation (DPTM) that leverages unlabeled target data as references to reliably generate and progressively refine a pseudo-target domain for SFDA. Specifically, we divide the target samples into a trust set and a non-trust set based on the reliability of pseudo-labels to sufficiently and reliably exploit their information. For samples from the non-trust set, we develop a manipulation strategy to semantically transform them into the newly assigned categories, while simultaneously maintaining them in the target distribution via a latent diffusion model. Furthermore, we design a progressive refinement mechanism that progressively reduces the domain discrepancy between the pseudo-target domain and the real target domain via iterative refinement. Experimental results demonstrate that DPTM outperforms existing methods by a large margin and achieves state-of-the-art performance on four prevailing SFDA benchmark datasets with different scales. Remarkably, DPTM can significantly enhance the performance by up to 18.6% in scenarios with large source-target gaps.
comment: Accepted by NeurIPS 2025
☆ SynHLMA:Synthesizing Hand Language Manipulation for Articulated Object with Discrete Human Object Interaction Representation
Generating hand grasps with language instructions is a widely studied topic that benefits from embodied AI and VR/AR applications. While transferring into hand articulatied object interaction (HAOI), the hand grasps synthesis requires not only object functionality but also long-term manipulation sequence along the object deformation. This paper proposes a novel HAOI sequence generation framework SynHLMA, to synthesize hand language manipulation for articulated objects. Given a complete point cloud of an articulated object, we utilize a discrete HAOI representation to model each hand object interaction frame. Along with the natural language embeddings, the representations are trained by an HAOI manipulation language model to align the grasping process with its language description in a shared representation space. A joint-aware loss is employed to ensure hand grasps follow the dynamic variations of articulated object joints. In this way, our SynHLMA achieves three typical hand manipulation tasks for articulated objects of HAOI generation, HAOI prediction and HAOI interpolation. We evaluate SynHLMA on our built HAOI-lang dataset and experimental results demonstrate the superior hand grasp sequence generation performance comparing with state-of-the-art. We also show a robotics grasp application that enables dexterous grasps execution from imitation learning using the manipulation sequence provided by our SynHLMA. Our codes and datasets will be made publicly available.
☆ LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation
We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM driven part query refinement strategy. The code will be released here.
comment: 10 pages, 5 figures, 14 tables, Neurips 2025
☆ RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models
Real-time object detection has achieved substantial progress through meticulously designed architectures and optimization strategies. However, the pursuit of high-speed inference via lightweight network designs often leads to degraded feature representation, which hinders further performance improvements and practical on-device deployment. In this paper, we propose a cost-effective and highly adaptable distillation framework that harnesses the rapidly evolving capabilities of Vision Foundation Models (VFMs) to enhance lightweight object detectors. Given the significant architectural and learning objective disparities between VFMs and resource-constrained detectors, achieving stable and task-aligned semantic transfer is challenging. To address this, on one hand, we introduce a Deep Semantic Injector (DSI) module that facilitates the integration of high-level representations from VFMs into the deep layers of the detector. On the other hand, we devise a Gradient-guided Adaptive Modulation (GAM) strategy, which dynamically adjusts the intensity of semantic transfer based on gradient norm ratios. Without increasing deployment and inference overhead, our approach painlessly delivers striking and consistent performance gains across diverse DETR-based models, underscoring its practical utility for real-time detection. Our new model family, RT-DETRv4, achieves state-of-the-art results on COCO, attaining AP scores of 49.7/53.5/55.4/57.0 at corresponding speeds of 273/169/124/78 FPS.
☆ Mapping and Classification of Trees Outside Forests using Deep Learning
Trees Outside Forests (TOF) play an important role in agricultural landscapes by supporting biodiversity, sequestering carbon, and regulating microclimates. Yet, most studies have treated TOF as a single class or relied on rigid rule-based thresholds, limiting ecological interpretation and adaptability across regions. To address this, we evaluate deep learning for TOF classification using a newly generated dataset and high-resolution aerial imagery from four agricultural landscapes in Germany. Specifically, we compare convolutional neural networks (CNNs), vision transformers, and hybrid CNN-transformer models across six semantic segmentation architectures (ABCNet, LSKNet, FT-UNetFormer, DC-Swin, BANet, and U-Net) to map four categories of woody vegetation: Forest, Patch, Linear, and Tree, derived from previous studies and governmental products. Overall, the models achieved good classification accuracy across the four landscapes, with the FT-UNetFormer performing best (mean Intersection-over-Union 0.74; mean F1 score 0.84), underscoring the importance of spatial context understanding in TOF mapping and classification. Our results show good results for Forest and Linear class and reveal challenges particularly in classifying complex structures with high edge density, notably the Patch and Tree class. Our generalization experiments highlight the need for regionally diverse training data to ensure reliable large-scale mapping. The dataset and code are openly available at https://github.com/Moerizzy/TOFMapper
☆ VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations
Video aesthetic assessment, a vital area in multimedia computing, integrates computer vision with human cognition. Its progress is limited by the lack of standardized datasets and robust models, as the temporal dynamics of video and multimodal fusion challenges hinder direct application of image-based methods. This study introduces VADB, the largest video aesthetic database with 10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions, including overall and attribute-specific aesthetic scores, rich language comments and objective tags. We propose VADB-Net, a dual-modal pre-training framework with a two-stage training strategy, which outperforms existing video quality assessment models in scoring tasks and supports downstream video aesthetic assessment tasks. The dataset and source code are available at https://github.com/BestiVictory/VADB.
☆ DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis ICCV 2025
Recent advances in deep generative models have made it easier to manipulate face videos, raising significant concerns about their potential misuse for fraud and misinformation. Existing detectors often perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to their reliance on forgery-specific artifacts. In this work, we introduce DeepShield, a novel deepfake detection framework that balances local sensitivity and global generalization to improve robustness across unseen forgeries. DeepShield enhances the CLIP-ViT encoder through two key components: Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). LPG applies spatiotemporal artifact modeling and patch-wise supervision to capture fine-grained inconsistencies often overlooked by global models. GFD introduces domain feature augmentation, leveraging domain-bridging and boundary-expanding feature generation to synthesize diverse forgeries, mitigating overfitting and enhancing cross-domain adaptability. Through the integration of novel local and global analysis for deepfake detection, DeepShield outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks.
comment: ICCV 2025
☆ Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation
Expressions are fundamental to conveying human emotions. With the rapid advancement of AI-generated content (AIGC), realistic and expressive 3D facial animation has become increasingly crucial. Despite recent progress in speech-driven lip-sync for talking-face animation, generating emotionally expressive talking faces remains underexplored. A major obstacle is the scarcity of real emotional 3D talking-face datasets due to the high cost of data capture. To address this, we model facial animation driven by both speech and emotion as a linear additive problem. Leveraging a 3D talking-face dataset with neutral expressions (VOCAset) and a dataset of 3D expression sequences (Florence4D), we jointly learn a set of blendshapes driven by speech and emotion. We introduce a sparsity constraint loss to encourage disentanglement between the two types of blendshapes while allowing the model to capture inherent secondary cross-domain deformations present in the training data. The learned blendshapes can be further mapped to the expression and jaw pose parameters of the FLAME model, enabling the animation of 3D Gaussian avatars. Qualitative and quantitative experiments demonstrate that our method naturally generates talking faces with specified expressions while maintaining accurate lip synchronization. Perceptual studies further show that our approach achieves superior emotional expressivity compared to existing methods, without compromising lip-sync quality.
comment: 18 pages, 6 figures, accepted to ICXR 2025 conference
☆ Balanced conic rectified flow NeurIPS 2025
Rectified flow is a generative model that learns smooth transport mappings between two distributions through an ordinary differential equation (ODE). Unlike diffusion-based generative models, which require costly numerical integration of a generative ODE to sample images with state-of-the-art quality, rectified flow uses an iterative process called reflow to learn smooth and straight ODE paths. This allows for relatively simple and efficient generation of high-quality images. However, rectified flow still faces several challenges. 1) The reflow process requires a large number of generative pairs to preserve the target distribution, leading to significant computational costs. 2) Since the model is typically trained using only generated image pairs, its performance heavily depends on the 1-rectified flow model, causing it to become biased towards the generated data. In this work, we experimentally expose the limitations of the original rectified flow and propose a novel approach that incorporates real images into the training process. By preserving the ODE paths for real images, our method effectively reduces reliance on large amounts of generated data. Instead, we demonstrate that the reflow process can be conducted efficiently using a much smaller set of generated and real images. In CIFAR-10, we achieved significantly better FID scores, not only in one-step generation but also in full-step simulations, while using only of the generative pairs compared to the original method. Furthermore, our approach induces straighter paths and avoids saturation on generated images during reflow, leading to more robust ODE learning while preserving the distribution of real images.
comment: Main paper: 10 pages (total 40 pages including appendix), 5 figures. Accepted at NeurIPS 2025 (Poster). Acknowledgment: Supported by the NRF of Korea (RS-2023-00223062) and IITP grants (RS-2020-II201361, RS-2024-00439762) funded by the Korean government (MSIT)
☆ Aligning What You Separate: Denoised Patch Mixing for Source-Free Domain Adaptation in Medical Image Segmentation
Source-Free Domain Adaptation (SFDA) is emerging as a compelling solution for medical image segmentation under privacy constraints, yet current approaches often ignore sample difficulty and struggle with noisy supervision under domain shift. We present a new SFDA framework that leverages Hard Sample Selection and Denoised Patch Mixing to progressively align target distributions. First, unlabeled images are partitioned into reliable and unreliable subsets through entropy-similarity analysis, allowing adaptation to start from easy samples and gradually incorporate harder ones. Next, pseudo-labels are refined via Monte Carlo-based denoising masks, which suppress unreliable pixels and stabilize training. Finally, intra- and inter-domain objectives mix patches between subsets, transferring reliable semantics while mitigating noise. Experiments on benchmark datasets show consistent gains over prior SFDA and UDA methods, delivering more accurate boundary delineation and achieving state-of-the-art Dice and ASSD scores. Our study highlights the importance of progressive adaptation and denoised supervision for robust segmentation under domain shift.
comment: 5 pages, 3 figures
☆ MSF-Net: Multi-Stage Feature Extraction and Fusion for Robust Photometric Stereo
Photometric stereo is a technique aimed at determining surface normals through the utilization of shading cues derived from images taken under different lighting conditions. However, existing learning-based approaches often fail to accurately capture features at multiple stages and do not adequately promote interaction between these features. Consequently, these models tend to extract redundant features, especially in areas with intricate details such as wrinkles and edges. To tackle these issues, we propose MSF-Net, a novel framework for extracting information at multiple stages, paired with selective update strategy, aiming to extract high-quality feature information, which is critical for accurate normal construction. Additionally, we have developed a feature fusion module to improve the interplay among different features. Experimental results on the DiLiGenT benchmark show that our proposed MSF-Net significantly surpasses previous state-of-the-art methods in the accuracy of surface normal estimation.
☆ U-CAN: Unsupervised Point Cloud Denoising with Consistency-Aware Noise2Noise Matching NeurIPS 2025
Point clouds captured by scanning sensors are often perturbed by noise, which have a highly negative impact on downstream tasks (e.g. surface reconstruction and shape understanding). Previous works mostly focus on training neural networks with noisy-clean point cloud pairs for learning denoising priors, which requires extensively manual efforts. In this work, we introduce U-CAN, an Unsupervised framework for point cloud denoising with Consistency-Aware Noise2Noise matching. Specifically, we leverage a neural network to infer a multi-step denoising path for each point of a shape or scene with a noise to noise matching scheme. We achieve this by a novel loss which enables statistical reasoning on multiple noisy point cloud observations. We further introduce a novel constraint on the denoised geometry consistency for learning consistency-aware denoising patterns. We justify that the proposed constraint is a general term which is not limited to 3D domain and can also contribute to the area of 2D image denoising. Our evaluations under the widely used benchmarks in point cloud denoising, upsampling and image denoising show significant improvement over the state-of-the-art unsupervised methods, where U-CAN also produces comparable results with the supervised methods.
comment: Accepted by NeurIPS 2025. Project page: https://gloriasze.github.io/U-CAN/
☆ AI-Powered Early Detection of Critical Diseases using Image Processing and Audio Analysis
Early diagnosis of critical diseases can significantly improve patient survival and reduce treatment costs. However, existing diagnostic techniques are often costly, invasive, and inaccessible in low-resource regions. This paper presents a multimodal artificial intelligence (AI) diagnostic framework integrating image analysis, thermal imaging, and audio signal processing for early detection of three major health conditions: skin cancer, vascular blood clots, and cardiopulmonary abnormalities. A fine-tuned MobileNetV2 convolutional neural network was trained on the ISIC 2019 dataset for skin lesion classification, achieving 89.3% accuracy, 91.6% sensitivity, and 88.2% specificity. A support vector machine (SVM) with handcrafted features was employed for thermal clot detection, achieving 86.4% accuracy (AUC = 0.89) on synthetic and clinical data. For cardiopulmonary analysis, lung and heart sound datasets from PhysioNet and Pascal were processed using Mel-Frequency Cepstral Coefficients (MFCC) and classified via Random Forest, reaching 87.2% accuracy and 85.7% sensitivity. Comparative evaluation against state-of-the-art models demonstrates that the proposed system achieves competitive results while remaining lightweight and deployable on low-cost devices. The framework provides a promising step toward scalable, real-time, and accessible AI-based pre-diagnostic healthcare solutions.
☆ Mask-Robust Face Verification for Online Learning via YOLOv5 and Residual Networks
In the contemporary landscape, the fusion of information technology and the rapid advancement of artificial intelligence have ushered school education into a transformative phase characterized by digitization and heightened intelligence. Concurrently, the global paradigm shift caused by the Covid-19 pandemic has catalyzed the evolution of e-learning, accentuating its significance. Amidst these developments, one pivotal facet of the online education paradigm that warrants attention is the authentication of identities within the digital learning sphere. Within this context, our study delves into a solution for online learning authentication, utilizing an enhanced convolutional neural network architecture, specifically the residual network model. By harnessing the power of deep learning, this technological approach aims to galvanize the ongoing progress of online education, while concurrently bolstering its security and stability. Such fortification is imperative in enabling online education to seamlessly align with the swift evolution of the educational landscape. This paper's focal proposition involves the deployment of the YOLOv5 network, meticulously trained on our proprietary dataset. This network is tasked with identifying individuals' faces culled from images captured by students' open online cameras. The resultant facial information is then channeled into the residual network to extract intricate features at a deeper level. Subsequently, a comparative analysis of Euclidean distances against students' face databases is performed, effectively ascertaining the identity of each student.
comment: 9 pages, 10 figures
☆ Test-Time Adaptive Object Detection with Foundation Model NeurIPS 2025
In recent years, test-time adaptive object detection has attracted increasing attention due to its unique advantages in online domain adaptation, which aligns more closely with real-world application scenarios. However, existing approaches heavily rely on source-derived statistical characteristics while making the strong assumption that the source and target domains share an identical category space. In this paper, we propose the first foundation model-powered test-time adaptive object detection method that eliminates the need for source data entirely and overcomes traditional closed-set limitations. Specifically, we design a Multi-modal Prompt-based Mean-Teacher framework for vision-language detector-driven test-time adaptation, which incorporates text and visual prompt tuning to adapt both language and vision representation spaces on the test data in a parameter-efficient manner. Correspondingly, we propose a Test-time Warm-start strategy tailored for the visual prompts to effectively preserve the representation capability of the vision branch. Furthermore, to guarantee high-quality pseudo-labels in every test batch, we maintain an Instance Dynamic Memory (IDM) module that stores high-quality pseudo-labels from previous test samples, and propose two novel strategies-Memory Enhancement and Memory Hallucination-to leverage IDM's high-quality instances for enhancing original predictions and hallucinating images without available pseudo-labels, respectively. Extensive experiments on cross-corruption and cross-dataset benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods, and can adapt to arbitrary cross-domain and cross-category target data. Code is available at https://github.com/gaoyingjay/ttaod_foundation.
comment: Accepted by NeurIPS 2025
☆ Classifier Enhancement Using Extended Context and Domain Experts for Semantic Segmentation
Prevalent semantic segmentation methods generally adopt a vanilla classifier to categorize each pixel into specific classes. Although such a classifier learns global information from the training data, this information is represented by a set of fixed parameters (weights and biases). However, each image has a different class distribution, which prevents the classifier from addressing the unique characteristics of individual images. At the dataset level, class imbalance leads to segmentation results being biased towards majority classes, limiting the model's effectiveness in identifying and segmenting minority class regions. In this paper, we propose an Extended Context-Aware Classifier (ECAC) that dynamically adjusts the classifier using global (dataset-level) and local (image-level) contextual information. Specifically, we leverage a memory bank to learn dataset-level contextual information of each class, incorporating the class-specific contextual information from the current image to improve the classifier for precise pixel labeling. Additionally, a teacher-student network paradigm is adopted, where the domain expert (teacher network) dynamically adjusts contextual information with ground truth and transfers knowledge to the student network. Comprehensive experiments illustrate that the proposed ECAC can achieve state-of-the-art performance across several datasets, including ADE20K, COCO-Stuff10K, and Pascal-Context.
comment: Accepted at IEEE TRANSACTIONS ON MULTIMEDIA (TMM)
☆ $D^2GS$: Dense Depth Regularization for LiDAR-free Urban Scene Reconstruction
Recently, Gaussian Splatting (GS) has shown great potential for urban scene reconstruction in the field of autonomous driving. However, current urban scene reconstruction methods often depend on multimodal sensors as inputs, \textit{i.e.} LiDAR and images. Though the geometry prior provided by LiDAR point clouds can largely mitigate ill-posedness in reconstruction, acquiring such accurate LiDAR data is still challenging in practice: i) precise spatiotemporal calibration between LiDAR and other sensors is required, as they may not capture data simultaneously; ii) reprojection errors arise from spatial misalignment when LiDAR and cameras are mounted at different locations. To avoid the difficulty of acquiring accurate LiDAR depth, we propose $D^2GS$, a LiDAR-free urban scene reconstruction framework. In this work, we obtain geometry priors that are as effective as LiDAR while being denser and more accurate. $\textbf{First}$, we initialize a dense point cloud by back-projecting multi-view metric depth predictions. This point cloud is then optimized by a Progressive Pruning strategy to improve the global consistency. $\textbf{Second}$, we jointly refine Gaussian geometry and predicted dense metric depth via a Depth Enhancer. Specifically, we leverage diffusion priors from a depth foundation model to enhance the depth maps rendered by Gaussians. In turn, the enhanced depths provide stronger geometric constraints during Gaussian training. $\textbf{Finally}$, we improve the accuracy of ground geometry by constraining the shape and normal attributes of Gaussians within road regions. Extensive experiments on the Waymo dataset demonstrate that our method consistently outperforms state-of-the-art methods, producing more accurate geometry even when compared with those using ground-truth LiDAR data.
☆ A Study on Inference Latency for Vision Transformers on Mobile Devices
Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.
comment: To appear in Springer LNICST, volume 663, Proceedings of VALUETOOLS 2024
Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning
We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
comment: This work is to appear in the Proceedings of MICAD 2025, the 6th International Conference on Medical Imaging and Computer-Aided Diagnosis
☆ Target-Guided Bayesian Flow Networks for Quantitatively Constrained CAD Generation
Deep generative models, such as diffusion models, have shown promising progress in image generation and audio generation via simplified continuity assumptions. However, the development of generative modeling techniques for generating multi-modal data, such as parametric CAD sequences, still lags behind due to the challenges in addressing long-range constraints and parameter sensitivity. In this work, we propose a novel framework for quantitatively constrained CAD generation, termed Target-Guided Bayesian Flow Network (TGBFN). For the first time, TGBFN handles the multi-modality of CAD sequences (i.e., discrete commands and continuous parameters) in a unified continuous and differentiable parameter space rather than in the discrete data space. In addition, TGBFN penetrates the parameter update kernel and introduces a guided Bayesian flow to control the CAD properties. To evaluate TGBFN, we construct a new dataset for quantitatively constrained CAD generation. Extensive comparisons across single-condition and multi-condition constrained generation tasks demonstrate that TGBFN achieves state-of-the-art performance in generating high-fidelity, condition-aware CAD sequences. The code is available at https://github.com/scu-zwh/TGBFN.
☆ Towards Real-Time Inference of Thin Liquid Film Thickness Profiles from Interference Patterns Using Vision Transformers
Thin film interferometry is a powerful technique for non-invasively measuring liquid film thickness with applications in ophthalmology, but its clinical translation is hindered by the challenges in reconstructing thickness profiles from interference patterns - an ill-posed inverse problem complicated by phase periodicity, imaging noise and ambient artifacts. Traditional reconstruction methods are either computationally intensive, sensitive to noise, or require manual expert analysis, which is impractical for real-time diagnostics. To address this challenge, here we present a vision transformer-based approach for real-time inference of thin liquid film thickness profiles directly from isolated interferograms. Trained on a hybrid dataset combining physiologically-relevant synthetic and experimental tear film data, our model leverages long-range spatial correlations to resolve phase ambiguities and reconstruct temporally coherent thickness profiles in a single forward pass from dynamic interferograms acquired in vivo and ex vivo. The network demonstrates state-of-the-art performance on noisy, rapidly-evolving films with motion artifacts, overcoming limitations of conventional phase-unwrapping and iterative fitting methods. Our data-driven approach enables automated, consistent thickness reconstruction at real-time speeds on consumer hardware, opening new possibilities for continuous monitoring of pre-lens ocular tear films and non-invasive diagnosis of conditions such as the dry eye disease.
comment: 6 pages, 2 figures, will be updated
☆ EA3D: Online Open-World 3D Object Extraction from Streaming Videos NeurIPS 2025
Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.
comment: The Thirty-Ninth Annual Conference on Neural Information Processing Systems(NeurIPS 2025)
☆ Revisiting Reconstruction-based AI-generated Image Detection: A Geometric Perspective
The rise of generative Artificial Intelligence (AI) has made detecting AI-generated images a critical challenge for ensuring authenticity. Existing reconstruction-based methods lack theoretical foundations and on empirical heuristics, limiting interpretability and reliability. In this paper, we introduce the Jacobian-Spectral Lower Bound for reconstruction error from a geometric perspective, showing that real images off the reconstruction manifold exhibit a non-trivial error lower bound, while generated images on the manifold have near-zero error. Furthermore, we reveal the limitations of existing methods that rely on static reconstruction error from a single pass. These methods often fail when some real images exhibit lower error than generated ones. This counterintuitive behavior reduces detection accuracy and requires data-specific threshold tuning, limiting their applicability in real-world scenarios. To address these challenges, we propose ReGap, a training-free method that computes dynamic reconstruction error by leveraging structured editing operations to introduce controlled perturbations. This enables measuring error changes before and after editing, improving detection accuracy by enhancing error separation. Experimental results show that our method outperforms existing baselines, exhibits robustness to common post-processing operations and generalizes effectively across diverse conditions.
☆ DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications
Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.
☆ Region-CAM: Towards Accurate Object Regions in Class Activation Maps for Weakly Supervised Learning Tasks
Class Activation Mapping (CAM) methods are widely applied in weakly supervised learning tasks due to their ability to highlight object regions. However, conventional CAM methods highlight only the most discriminative regions of the target. These highlighted regions often fail to cover the entire object and are frequently misaligned with object boundaries, thereby limiting the performance of downstream weakly supervised learning tasks, particularly Weakly Supervised Semantic Segmentation (WSSS), which demands pixel-wise accurate activation maps to get the best results. To alleviate the above problems, we propose a novel activation method, Region-CAM. Distinct from network feature weighting approaches, Region-CAM generates activation maps by extracting semantic information maps (SIMs) and performing semantic information propagation (SIP) by considering both gradients and features in each of the stages of the baseline classification model. Our approach highlights a greater proportion of object regions while ensuring activation maps to have precise boundaries that align closely with object edges. Region-CAM achieves 60.12% and 58.43% mean intersection over union (mIoU) using the baseline model on the PASCAL VOC training and validation datasets, respectively, which are improvements of 13.61% and 13.13% over the original CAM (46.51% and 45.30%). On the MS COCO validation set, Region-CAM achieves 36.38%, a 16.23% improvement over the original CAM (20.15%). We also demonstrate the superiority of Region-CAM in object localization tasks, using the ILSVRC2012 validation set. Region-CAM achieves 51.7% in Top-1 Localization accuracy Loc1. Compared with LayerCAM, an activation method designed for weakly supervised object localization, Region-CAM achieves 4.5% better performance in Loc1.
comment: Preprint for journal paper
☆ AtlasGS: Atlanta-world Guided Surface Reconstruction with Implicit Structured Gaussians NeurIPS 2025
3D reconstruction of indoor and urban environments is a prominent research topic with various downstream applications. However, existing geometric priors for addressing low-texture regions in indoor and urban settings often lack global consistency. Moreover, Gaussian Splatting and implicit SDF fields often suffer from discontinuities or exhibit computational inefficiencies, resulting in a loss of detail. To address these issues, we propose an Atlanta-world guided implicit-structured Gaussian Splatting that achieves smooth indoor and urban scene reconstruction while preserving high-frequency details and rendering efficiency. By leveraging the Atlanta-world model, we ensure the accurate surface reconstruction for low-texture regions, while the proposed novel implicit-structured GS representations provide smoothness without sacrificing efficiency and high-frequency details. Specifically, we propose a semantic GS representation to predict the probability of all semantic regions and deploy a structure plane regularization with learnable plane indicators for global accurate surface reconstruction. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in both indoor and urban scenes, delivering superior surface reconstruction quality.
comment: 18 pages, 11 figures. NeurIPS 2025; Project page: https://zju3dv.github.io/AtlasGS/
☆ Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection NeurIPS 2025
Zero-shot Human-Object Interaction detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding. We further apply Gaussian perturbation to encourage the prompts to capture diverse visual variations of a verb. Second, we retrieve region-specific concepts from the human, object, and union regions. These are used to augment the diversity-aware prompt embeddings, yielding region-aware prompts that enhance verb-level discrimination. Experiments on the HICO-DET benchmark demonstrate that our method achieves state-of-the-art performance under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement. Code is available at https://github.com/mlvlab/VDRP.
comment: Accepted by NeurIPS 2025
☆ PSTF-AttControl: Per-Subject-Tuning-Free Personalized Image Generation with Controllable Face Attributes
Recent advancements in personalized image generation have significantly improved facial identity preservation, particularly in fields such as entertainment and social media. However, existing methods still struggle to achieve precise control over facial attributes in a per-subject-tuning-free (PSTF) way. Tuning-based techniques like PreciseControl have shown promise by providing fine-grained control over facial features, but they often require extensive technical expertise and additional training data, limiting their accessibility. In contrast, PSTF approaches simplify the process by enabling image generation from a single facial input, but they lack precise control over facial attributes. In this paper, we introduce a novel, PSTF method that enables both precise control over facial attributes and high-fidelity preservation of facial identity. Our approach utilizes a face recognition model to extract facial identity features, which are then mapped into the $W^+$ latent space of StyleGAN2 using the e4e encoder. We further enhance the model with a Triplet-Decoupled Cross-Attention module, which integrates facial identity, attribute features, and text embeddings into the UNet architecture, ensuring clean separation of identity and attribute information. Trained on the FFHQ dataset, our method allows for the generation of personalized images with fine-grained control over facial attributes, while without requiring additional fine-tuning or training data for individual identities. We demonstrate that our approach successfully balances personalization with precise facial attribute control, offering a more efficient and user-friendly solution for high-quality, adaptable facial image synthesis. The code is publicly available at https://github.com/UnicomAI/PSTF-AttControl.
comment: Accepted by Image and Vision Computing (18 pages, 8 figures)
☆ Neighborhood Feature Pooling for Remote Sensing Image Classification WACV 2026
In this work, we propose neighborhood feature pooling (NFP) as a novel texture feature extraction method for remote sensing image classification. The NFP layer captures relationships between neighboring inputs and efficiently aggregates local similarities across feature dimensions. Implemented using convolutional layers, NFP can be seamlessly integrated into any network. Results comparing the baseline models and the NFP method indicate that NFP consistently improves performance across diverse datasets and architectures while maintaining minimal parameter overhead.
comment: 9 pages, 5 figures. Accepted to WACV 2026 (Winter Conference on Applications of Computer Vision)
☆ Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments
Zero-shot scene understanding in real-world settings presents major challenges due to the complexity and variability of natural scenes, where models must recognize new objects, actions, and contexts without prior labeled examples. This work proposes a vision-language integration framework that unifies pre-trained visual encoders (e.g., CLIP, ViT) and large language models (e.g., GPT-based architectures) to achieve semantic alignment between visual and textual modalities. The goal is to enable robust zero-shot comprehension of scenes by leveraging natural language as a bridge to generalize over unseen categories and contexts. Our approach develops a unified model that embeds visual inputs and textual prompts into a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation. Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets demonstrate significant gains over state-of-the-art zero-shot models in object recognition, activity detection, and scene captioning. The proposed system achieves up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics, highlighting the effectiveness of cross-modal alignment and language grounding in enhancing generalization for real-world scene understanding.
comment: Preprint under review at IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
☆ DRIP: Dynamic patch Reduction via Interpretable Pooling
Recently, the advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, owing to the large-scale and hence expensive pretraining, the efficiency concern has discouraged researchers from attempting to pretrain a vision language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification/zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact into scientific domains.
☆ Auto3DSeg for Brain Tumor Segmentation from 3D MRI in BraTS 2023 Challenge
In this work, we describe our solution to the BraTS 2023 cluster of challenges using Auto3DSeg from MONAI. We participated in all 5 segmentation challenges, and achieved the 1st place results in three of them: Brain Metastasis, Brain Meningioma, BraTS-Africa challenges, and the 2nd place results in the remaining two: Adult and Pediatic Glioma challenges.
comment: BraTS23 winner
☆ Breast Cancer VLMs: Clinically Practical Vision-Language Train-Inference Models ICCV 2025
Breast cancer remains the most commonly diagnosed malignancy among women in the developed world. Early detection through mammography screening plays a pivotal role in reducing mortality rates. While computer-aided diagnosis (CAD) systems have shown promise in assisting radiologists, existing approaches face critical limitations in clinical deployment - particularly in handling the nuanced interpretation of multi-modal data and feasibility due to the requirement of prior clinical history. This study introduces a novel framework that synergistically combines visual features from 2D mammograms with structured textual descriptors derived from easily accessible clinical metadata and synthesized radiological reports through innovative tokenization modules. Our proposed methods in this study demonstrate that strategic integration of convolutional neural networks (ConvNets) with language representations achieves superior performance to vision transformer-based models while handling high-resolution images and enabling practical deployment across diverse populations. By evaluating it on multi-national cohort screening mammograms, our multi-modal approach achieves superior performance in cancer detection and calcification identification compared to unimodal baselines, with particular improvements. The proposed method establishes a new paradigm for developing clinically viable VLM-based CAD systems that effectively leverage imaging data and contextual patient information through effective fusion mechanisms.
comment: Accepted to Computer Vision for Automated Medical Diagnosis (CVAMD) Workshop at ICCV 2025
☆ Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders NeurIPS 2025
Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. This design incorporates a temporal attention in vision encoder, enabling the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in action recognition. We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs. Project page and code are available at: https://alirasekh.github.io/STAVEQ2/.
comment: Accepted to NeurIPS 2025
☆ Groupwise Registration with Physics-Informed Test-Time Adaptation on Multi-parametric Cardiac MRI
Multiparametric mapping MRI has become a viable tool for myocardial tissue characterization. However, misalignment between multiparametric maps makes pixel-wise analysis challenging. To address this challenge, we developed a generalizable physics-informed deep-learning model using test-time adaptation to enable group image registration across contrast weighted images acquired from multiple physical models (e.g., a T1 mapping model and T2 mapping model). The physics-informed adaptation utilized the synthetic images from specific physics model as registration reference, allows for transductive learning for various tissue contrast. We validated the model in healthy volunteers with various MRI sequences, demonstrating its improvement for multi-modal registration with a wide range of image contrast variability.
☆ Climate Adaptation-Aware Flood Prediction for Coastal Cities Using Deep Learning
Climate change and sea-level rise (SLR) pose escalating threats to coastal cities, intensifying the need for efficient and accurate methods to predict potential flood hazards. Traditional physics-based hydrodynamic simulators, although precise, are computationally expensive and impractical for city-scale coastal planning applications. Deep Learning (DL) techniques offer promising alternatives, however, they are often constrained by challenges such as data scarcity and high-dimensional output requirements. Leveraging a recently proposed vision-based, low-resource DL framework, we develop a novel, lightweight Convolutional Neural Network (CNN)-based model designed to predict coastal flooding under variable SLR projections and shoreline adaptation scenarios. Furthermore, we demonstrate the ability of the model to generalize across diverse geographical contexts by utilizing datasets from two distinct regions: Abu Dhabi and San Francisco. Our findings demonstrate that the proposed model significantly outperforms state-of-the-art methods, reducing the mean absolute error (MAE) in predicted flood depth maps on average by nearly 20%. These results highlight the potential of our approach to serve as a scalable and practical tool for coastal flood management, empowering decision-makers to develop effective mitigation strategies in response to the growing impacts of climate change. Project Page: https://caspiannet.github.io/
comment: Submitted to Hydrology and Earth System Sciences
☆ CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments
Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks: anomaly description, explanation, and justification; with fine-grained annotations for visual grounding and categorizing anomalies based on their visual manifestations, their complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.
☆ DARTS: A Drone-Based AI-Powered Real-Time Traffic Incident Detection System
Rapid and reliable incident detection is critical for reducing crash-related fatalities, injuries, and congestion. However, conventional methods, such as closed-circuit television, dashcam footage, and sensor-based detection, separate detection from verification, suffer from limited flexibility, and require dense infrastructure or high penetration rates, restricting adaptability and scalability to shifting incident hotspots. To overcome these challenges, we developed DARTS, a drone-based, AI-powered real-time traffic incident detection system. DARTS integrates drones' high mobility and aerial perspective for adaptive surveillance, thermal imaging for better low-visibility performance and privacy protection, and a lightweight deep learning framework for real-time vehicle trajectory extraction and incident detection. The system achieved 99% detection accuracy on a self-collected dataset and supports simultaneous online visual verification, severity assessment, and incident-induced congestion propagation monitoring via a web-based interface. In a field test on Interstate 75 in Florida, DARTS detected and verified a rear-end collision 12 minutes earlier than the local transportation management center and monitored incident-induced congestion propagation, suggesting potential to support faster emergency response and enable proactive traffic control to reduce congestion and secondary crash risk. Crucially, DARTS's flexible deployment architecture reduces dependence on frequent physical patrols, indicating potential scalability and cost-effectiveness for use in remote areas and resource-constrained settings. This study presents a promising step toward a more flexible and integrated real-time traffic incident detection system, with significant implications for the operational efficiency and responsiveness of modern transportation management.
comment: Preprint version. This manuscript is currently under review at Transportation Research Part C: Emerging Technologies. The PDF corresponds to the version submitted in June 2025. The main findings of this work were recognized with the Best Intelligent Transportation Systems Paper Award at the 2025 TRB Annual Meeting
☆ Larger Hausdorff Dimension in Scanning Pattern Facilitates Mamba-Based Methods in Low-Light Image Enhancement
We propose an innovative enhancement to the Mamba framework by increasing the Hausdorff dimension of its scanning pattern through a novel Hilbert Selective Scan mechanism. This mechanism explores the feature space more effectively, capturing intricate fine-scale details and improving overall coverage. As a result, it mitigates information inconsistencies while refining spatial locality to better capture subtle local interactions without sacrificing the model's ability to handle long-range dependencies. Extensive experiments on publicly available benchmarks demonstrate that our approach significantly improves both the quantitative metrics and qualitative visual fidelity of existing Mamba-based low-light image enhancement methods, all while reducing computational resource consumption and shortening inference time. We believe that this refined strategy not only advances the state-of-the-art in low-light image enhancement but also holds promise for broader applications in fields that leverage Mamba-based techniques.
☆ Fine-tuning Segment Anything for Real-Time Tumor Tracking in Cine-MRI
In this work, we address the TrackRAD2025 challenge of real-time tumor tracking in cine-MRI sequences of the thoracic and abdominal regions under strong data scarcity constraints. Two complementary strategies were explored: (i) unsupervised registration with the IMPACT similarity metric and (ii) foundation model-based segmentation leveraging SAM 2.1 and its recent variants through prompt-based interaction. Due to the one-second runtime constraint, the SAM-based method was ultimately selected. The final configuration used SAM2.1 b+ with mask-based prompts from the first annotated slice, fine-tuned solely on the small labeled subset from TrackRAD2025. Training was configured to minimize overfitting, using 1024x1024 patches (batch size 1), standard augmentations, and a balanced Dice + IoU loss. A low uniform learning rate (0.0001) was applied to all modules (prompt encoder, decoder, Hiera backbone) to preserve generalization while adapting to annotator-specific styles. Training lasted 300 epochs (~12h on RTX A6000, 48GB). The same inference strategy was consistently applied across all anatomical sites and MRI field strengths. Test-time augmentation was considered but ultimately discarded due to negligible performance gains. The final model was selected based on the highest Dice Similarity Coefficient achieved on the validation set after fine-tuning. On the hidden test set, the model reached a Dice score of 0.8794, ranking 6th overall in the TrackRAD2025 challenge. These results highlight the strong potential of foundation models for accurate and real-time tumor tracking in MRI-guided radiotherapy.
comment: Paper for the Trackrad2025 challenge, Team BreizhTrack
☆ Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer
Reconstructing images seen by people from their fMRI brain recordings provides a non-invasive window into the human brain. Despite recent progress enabled by diffusion models, current methods often lack faithfulness to the actual seen images. We present "Brain-IT", a brain-inspired approach that addresses this challenge through a Brain Interaction Transformer (BIT), allowing effective interactions between clusters of functionally-similar brain-voxels. These functional-clusters are shared by all subjects, serving as building blocks for integrating information both within and across brains. All model components are shared by all clusters & subjects, allowing efficient training with a limited amount of data. To guide the image reconstruction, BIT predicts two complementary localized patch-level image features: (i)high-level semantic features which steer the diffusion model toward the correct semantic content of the image; and (ii)low-level structural features which help to initialize the diffusion process with the correct coarse layout of the image. BIT's design enables direct flow of information from brain-voxel clusters to localized image features. Through these principles, our method achieves image reconstructions from fMRI that faithfully reconstruct the seen images, and surpass current SotA approaches both visually and by standard objective metrics. Moreover, with only 1-hour of fMRI data from a new subject, we achieve results comparable to current methods trained on full 40-hour recordings.
☆ SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing NeurIPS 2025
Rectified flow models have become a de facto standard in image generation due to their stable sampling trajectories and high-fidelity outputs. Despite their strong generative capabilities, they face critical limitations in image editing tasks: inaccurate inversion processes for mapping real images back into the latent space, and gradient entanglement issues during editing often result in outputs that do not faithfully reflect the target prompt. Recent efforts have attempted to directly map source and target distributions via ODE-based approaches without inversion; however,these methods still yield suboptimal editing quality. In this work, we propose a flow decomposition-and-aggregation framework built upon an inversion-free formulation to address these limitations. Specifically, we semantically decompose the target prompt into multiple sub-prompts, compute an independent flow for each, and aggregate them to form a unified editing trajectory. While we empirically observe that decomposing the original flow enhances diversity in the target space, generating semantically aligned outputs still requires consistent guidance toward the full target prompt. To this end, we design a projection and soft-aggregation mechanism for flow, inspired by gradient conflict resolution in multi-task learning. This approach adaptively weights the sub-target velocity fields, suppressing semantic redundancy while emphasizing distinct directions, thereby preserving both diversity and consistency in the final edited output. Experimental results demonstrate that our method outperforms existing zero-shot editing approaches in terms of semantic fidelity and attribute disentanglement. The code is available at https://github.com/Harvard-AI-and-Robotics-Lab/SplitFlow.
comment: Camera-ready version for NeurIPS 2025, 10 pages (main paper)
☆ Generative Image Restoration and Super-Resolution using Physics-Informed Synthetic Data for Scanning Tunneling Microscopy
Scanning tunnelling microscopy (STM) enables atomic-resolution imaging and atom manipulation, but its utility is often limited by tip degradation and slow serial data acquisition. Fabrication adds another layer of complexity since the tip is often subjected to large voltages, which may alter the shape of its apex, requiring it to be conditioned. Here, we propose a machine learning (ML) approach for image repair and super-resolution to alleviate both challenges. Using a dataset of only 36 pristine experimental images of Si(001):H, we demonstrate that a physics-informed synthetic data generation pipeline can be used to train several state-of-the-art flow-matching and diffusion models. Quantitative evaluation with metrics such as the CLIP Maximum Mean Discrepancy (CMMD) score and structural similarity demonstrates that our models are able to effectively restore images and offer a two- to fourfold reduction in image acquisition time by accurately reconstructing images from sparsely sampled data. Our framework has the potential to significantly increase STM experimental throughput by offering a route to reducing the frequency of tip-conditioning procedures and to enhancing frame rates in existing high-speed STM systems.
♻ ☆ GENRE-CMR: Generalizable Deep Learning for Diverse Multi-Domain Cardiac MRI Reconstruction
Accelerated Cardiovascular Magnetic Resonance (CMR) image reconstruction remains a critical challenge due to the trade-off between scan time and image quality, particularly when generalizing across diverse acquisition settings. We propose GENRE-CMR, a generative adversarial network (GAN)-based architecture employing a residual deep unrolled reconstruction framework to enhance reconstruction fidelity and generalization. The architecture unrolls iterative optimization into a cascade of convolutional subnetworks, enriched with residual connections to enable progressive feature propagation from shallow to deeper stages. To further improve performance, we integrate two loss functions: (1) an Edge-Aware Region (EAR) loss, which guides the network to focus on structurally informative regions and helps prevent common reconstruction blurriness; and (2) a Statistical Distribution Alignment (SDA) loss, which regularizes the feature space across diverse data distributions via a symmetric KL divergence formulation. Extensive experiments confirm that GENRE-CMR surpasses state-of-the-art methods on training and unseen data, achieving 0.9552 SSIM and 38.90 dB PSNR on unseen distributions across various acceleration factors and sampling trajectories. Ablation studies confirm the contribution of each proposed component to reconstruction quality and generalization. Our framework presents a unified and robust solution for high-quality CMR reconstruction, paving the way for clinically adaptable deployment across heterogeneous acquisition protocols.
♻ ☆ MILo: Mesh-In-the-Loop Gaussian Splatting for Detailed and Efficient Surface Reconstruction
While recent advances in Gaussian Splatting have enabled fast reconstruction of high-quality 3D scenes from images, extracting accurate surface meshes remains a challenge. Current approaches extract the surface through costly post-processing steps, resulting in the loss of fine geometric details or requiring significant time and leading to very dense meshes with millions of vertices. More fundamentally, the a posteriori conversion from a volumetric to a surface representation limits the ability of the final mesh to preserve all geometric structures captured during training. We present MILo, a novel Gaussian Splatting framework that bridges the gap between volumetric and surface representations by differentiably extracting a mesh from the 3D Gaussians. We design a fully differentiable procedure that constructs the mesh-including both vertex locations and connectivity-at every iteration directly from the parameters of the Gaussians, which are the only quantities optimized during training. Our method introduces three key technical contributions: a bidirectional consistency framework ensuring both representations-Gaussians and the extracted mesh-capture the same underlying geometry during training; an adaptive mesh extraction process performed at each training iteration, which uses Gaussians as differentiable pivots for Delaunay triangulation; a novel method for computing signed distance values from the 3D Gaussians that enables precise surface extraction while avoiding geometric erosion. Our approach can reconstruct complete scenes, including backgrounds, with state-of-the-art quality while requiring an order of magnitude fewer mesh vertices than previous methods. Due to their light weight and empty interior, our meshes are well suited for downstream applications such as physics simulations or animation.
comment: 10 pages. A presentation video of our approach is available at https://youtu.be/_SGNhhNz0fE
♻ ☆ DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO NeurIPS 2025
Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement algorithm with group-normalized rewards. However, the effectiveness of GRPO in Video Large Language Models (VideoLLMs) has still been less studyed. In this paper, we explore GRPO and identify two problems that deteriorate the effective learning: (1) reliance on safeguards, and (2) vanishing advantage. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO loss function into a regression task that directly predicts the advantage in GRPO, eliminating the need for safeguards such as the clipping and min functions. It directly aligns the model with advantages, providing guidance to prefer better ones. The difficulty-aware data augmentation strategy augments input prompts/videos to locate the difficulty of samples at solvable difficulty levels, enabling diverse reward signals. Our experimental results show that our approach significantly improves video reasoning performance across multiple benchmarks.
comment: NeurIPS 2025
♻ ☆ RL-I2IT: Image-to-Image Translation with Deep Reinforcement Learning
Most existing Image-to-Image Translation (I2IT) methods generate images in a single run of a deep learning (DL) model. However, designing such a single-step model is always challenging, requiring a huge number of parameters and easily falling into bad global minimums and overfitting. In this work, we reformulate I2IT as a step-wise decision-making problem via deep reinforcement learning (DRL) and propose a novel framework that performs RL-based I2IT (RL-I2IT). The key feature in the RL-I2IT framework is to decompose a monolithic learning process into small steps with a lightweight model to progressively transform a source image successively to a target image. Considering that it is challenging to handle high dimensional continuous state and action spaces in the conventional RL framework, we introduce meta policy with a new concept Plan to the standard Actor-Critic model, which is of a lower dimension than the original image and can facilitate the actor to generate a tractable high dimensional action. In the RL-I2IT framework, we also employ a task-specific auxiliary learning strategy to stabilize the training process and improve the performance of the corresponding task. Experiments on several I2IT tasks demonstrate the effectiveness and robustness of the proposed method when facing high-dimensional continuous action space problems. Our implementation of the RL-I2IT framework is available at https://github.com/Algolzw/SPAC-Deformable-Registration.
♻ ☆ Quantizing Space and Time: Fusing Time Series and Images for Earth Observation
We propose a task-agnostic framework for multimodal fusion of time series and single timestamp images, enabling cross-modal generation and robust downstream performance. Our approach explores deterministic and learned strategies for time series quantization and then leverages a masked correlation learning objective, aligning discrete image and time series tokens in a unified representation space. Instantiated in the Earth observation domain, the pretrained model generates consistent global temperature profiles from satellite imagery and is validated through counterfactual experiments. Across downstream tasks, our task-agnostic pretraining outperforms task-specific fusion by 6% in R^2 and 2% in RMSE on average, and exceeds baseline methods by 50% in R^2 and 12% in RMSE. Finally, we analyze gradient sensitivity across modalities, providing insights into model robustness. Code, data, and weights will be released under a permissive license.
♻ ☆ FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering NeurIPS 2025
While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) focusing on small image details still remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task using the top-ranked region. As a result of this informed search strategy, FOCUS achieves strong performance across four fine-grained VQA datasets and three types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye, while requiring 3 - 6.5 x less compute.
comment: Accepted by NeurIPS 2025 - main track. Project page: https://focus-mllm-vqa.github.io/
♻ ☆ RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision-Language-Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.
♻ ☆ ScribbleVS: Scribble-Supervised Medical Image Segmentation via Dynamic Competitive Pseudo Label Selection
In clinical medicine, precise image segmentation can provide substantial support to clinicians. However, obtaining high-quality segmentation typically demands extensive pixel-level annotations, which are labor-intensive and expensive. Scribble annotations offer a more cost-effective alternative by improving labeling efficiency. Nonetheless, using such sparse supervision for training reliable medical image segmentation models remains a significant challenge. Some studies employ pseudo-labeling to enhance supervision, but these methods are susceptible to noise interference. To address these challenges, we introduce ScribbleVS, a framework designed to learn from scribble annotations. We introduce a Regional Pseudo Labels Diffusion Module to expand the scope of supervision and reduce the impact of noise present in pseudo labels. Additionally, we introduce a Dynamic Competitive Selection module for enhanced refinement in selecting pseudo labels. Experiments conducted on the ACDC, MSCMRseg, WORD, and BraTS2020 datasets demonstrate promising results, achieving segmentation precision comparable to fully supervised models. The codes of this study are available at https://github.com/ortonwang/ScribbleVS.
♻ ☆ HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene NeurIPS 2025
Reconstructing dynamic 3D scenes from monocular videos remains a fundamental challenge in 3D vision. While 3D Gaussian Splatting (3DGS) achieves real-time rendering in static settings, extending it to dynamic scenes is challenging due to the difficulty of learning structured and temporally consistent motion representations. This challenge often manifests as three limitations in existing methods: redundant Gaussian updates, insufficient motion supervision, and weak modeling of complex non-rigid deformations. These issues collectively hinder coherent and efficient dynamic reconstruction. To address these limitations, we propose HAIF-GS, a unified framework that enables structured and consistent dynamic modeling through sparse anchor-driven deformation. It first identifies motion-relevant regions via an Anchor Filter to suppress redundant updates in static areas. A self-supervised Induced Flow-Guided Deformation module induces anchor motion using multi-frame feature aggregation, eliminating the need for explicit flow labels. To further handle fine-grained deformations, a Hierarchical Anchor Propagation mechanism increases anchor resolution based on motion complexity and propagates multi-level transformations. Extensive experiments on synthetic and real-world benchmarks validate that HAIF-GS significantly outperforms prior dynamic 3DGS methods in rendering quality, temporal coherence, and reconstruction efficiency.
comment: Accepted to NeurIPS 2025. Project page: https://echopickle.github.io/HAIF-GS.github.io/
♻ ☆ Simulating Automotive Radar with Lidar and Camera Inputs IROS 2025
Low-cost millimeter automotive radar has received more and more attention due to its ability to handle adverse weather and lighting conditions in autonomous driving. However, the lack of quality datasets hinders research and development. We report a new method that is able to simulate 4D millimeter wave radar signals including pitch, yaw, range, and Doppler velocity along with radar signal strength (RSS) using camera image, light detection and ranging (lidar) point cloud, and ego-velocity. The method is based on two new neural networks: 1) DIS-Net, which estimates the spatial distribution and number of radar signals, and 2) RSS-Net, which predicts the RSS of the signal based on appearance and geometric information. We have implemented and tested our method using open datasets from 3 different models of commercial automotive radar. The experimental results show that our method can successfully generate high-fidelity radar signals. Moreover, we have trained a popular object detection neural network with data augmented by our synthesized radar. The network outperforms the counterpart trained only on raw radar data, a promising result to facilitate future radar-based research and development.
comment: Accepted by IROS 2025
♻ ☆ FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving NeurIPS 2025
Vision-Language-Action (VLA) models are increasingly used for end-to-end driving due to their world knowledge and reasoning ability. Most prior work, however, inserts textual chains-of-thought (CoT) as intermediate steps tailored to the current scene. Such symbolic compressions can blur spatio-temporal relations and discard fine visual cues, creating a cross-modal gap between perception and planning. We propose FSDrive, a visual spatio-temporal CoT framework that enables VLAs to think in images. The model first acts as a world model to generate a unified future frame that overlays coarse but physically-plausible priors-future lane dividers and 3D boxes-on the predicted future image. This unified frame serves as the visual CoT, capturing both spatial structure and temporal evolution. The same VLA then functions as an inverse-dynamics model, planning trajectories from current observations and the visual CoT. To equip VLAs with image generation while preserving understanding, we introduce a unified pre-training paradigm that expands the vocabulary to include visual tokens and jointly optimizes VQA (for semantics) and future-frame prediction (for dynamics). A progressive easy-to-hard scheme first predicts lane/box priors to enforce physical constraints, then completes full future frames for fine details. On nuScenes and NAVSIM, FSDrive improves trajectory accuracy and reduces collisions under both ST-P3 and UniAD metrics, and attains competitive FID for future-frame generation despite using lightweight autoregression. It also advances scene understanding on DriveLM. Together, these results indicate that visual CoT narrows the cross-modal gap and yields safer, more anticipatory planning. Code is available at https://github.com/MIV-XJTU/FSDrive.
comment: Accepted to NeurIPS 2025 as Spotlight Presentation. Code: https://github.com/MIV-XJTU/FSDrive
♻ ☆ Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025)
Accurately predicting distributed cortical responses to naturalistic stimuli requires models that integrate visual, auditory and semantic information over time. We present a hierarchical multimodal recurrent ensemble that maps pretrained video, audio, and language embeddings to fMRI time series recorded while four subjects watched almost 80 hours of movies provided by the Algonauts 2025 challenge. Modality-specific bidirectional RNNs encode temporal dynamics; their hidden states are fused and passed to a second recurrent layer, and lightweight subject-specific heads output responses for 1000 cortical parcels. Training relies on a composite MSE-correlation loss and a curriculum that gradually shifts emphasis from early sensory to late association regions. Averaging 100 model variants further boosts robustness. The resulting system ranked third on the competition leaderboard, achieving an overall Pearson r = 0.2094 and the highest single-parcel peak score (mean r = 0.63) among all participants, with particularly strong gains for the most challenging subject (Subject 5). The approach establishes a simple, extensible baseline for future multimodal brain-encoding benchmarks.
comment: 8 pages, 2 figures, 1 table. Invited report, CCN 2025 Algonauts Project session (3rd-place team). Code: https://github.com/erensemih/Algonauts2025_ModalityRNN v3: Added equal contribution footnote to author list. Corrected reference list
♻ ☆ Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles
Road traffic accidents remain a significant global concern, with human error, particularly distracted and impaired driving, among the leading causes. This study introduces a novel driver behaviour classification system that uses external observation techniques to detect indicators of distraction and impairment. The proposed framework employs advanced computer vision methodologies, including real-time object tracking, lateral displacement analysis, and lane position monitoring. The system identifies unsafe driving behaviours such as excessive lateral movement and erratic trajectory patterns by implementing the YOLO object detection model and custom lane estimation algorithms. Unlike systems reliant on inter-vehicular communication, this vision-based approach enables behavioural analysis of non-connected vehicles. Experimental evaluations on diverse video datasets demonstrate the framework's reliability and adaptability across varying road and environmental conditions.
♻ ☆ U-DECN: End-to-End Underwater Object Detection ConvNet with Improved DeNoising Training
Underwater object detection has higher requirements of running speed and deployment efficiency for the detector due to its specific environmental challenges. NMS of two- or one-stage object detectors and transformer architecture of query-based end-to-end object detectors are not conducive to deployment on underwater embedded devices with limited processing power. As for the detrimental effect of underwater color cast noise, recent underwater object detectors make network architecture or training complex, which also hinders their application and deployment on unmanned underwater vehicles. In this paper, we propose the Underwater DECO with improved deNoising training (U-DECN), the query-based end-to-end object detector (with ConvNet encoder-decoder architecture) for underwater color cast noise that addresses the above problems. We integrate advanced technologies from DETR variants into DECO and design optimization methods specifically for the ConvNet architecture, including Deformable Convolution in SIM and Separate Contrastive DeNoising Forward methods. To address the underwater color cast noise issue, we propose an Underwater Color DeNoising Query method to improve the generalization of the model for the biased object feature information by different color cast noise. Our U-DECN, with ResNet-50 backbone, achieves the best 64.0 AP on DUO and the best 58.1 AP on RUOD, and 21 FPS (5 times faster than Deformable DETR and DINO 4 FPS) on NVIDIA AGX Orin by TensorRT FP16, outperforming the other state-of-the-art query-based end-to-end object detectors. The code is available at https://github.com/LEFTeyex/U-DECN.
comment: 10 pages, 6 figures, 7 tables, accepted by IEEE TGRS
♻ ☆ Probabilistic Kernel Function for Fast Angle Testing
In this paper, we study the angle testing problem in the context of similarity search in high-dimensional Euclidean spaces and propose two projection-based probabilistic kernel functions, one designed for angle comparison and the other for angle thresholding. Unlike existing approaches that rely on random projection vectors drawn from Gaussian distributions, our approach leverages reference angles and employs a deterministic structure for the projection vectors. Notably, our kernel functions do not require asymptotic assumptions, such as the number of projection vectors tending to infinity, and can be both theoretically and experimentally shown to outperform Gaussian-distribution-based kernel functions. We apply the proposed kernel function to Approximate Nearest Neighbor Search (ANNS) and demonstrate that our approach achieves a 2.5X ~ 3X higher query-per-second (QPS) throughput compared to the widely-used graph-based search algorithm HNSW.
♻ ☆ FastJAM: a Fast Joint Alignment Model for Images NeurIPS 2025
Joint Alignment (JA) of images aims to align a collection of images into a unified coordinate frame, such that semantically-similar features appear at corresponding spatial locations. Most existing approaches often require long training times, large-capacity models, and extensive hyperparameter tuning. We introduce FastJAM, a rapid, graph-based method that drastically reduces the computational complexity of joint alignment tasks. FastJAM leverages pairwise matches computed by an off-the-shelf image matcher, together with a rapid nonparametric clustering, to construct a graph representing intra- and inter-image keypoint relations. A graph neural network propagates and aggregates these correspondences, efficiently predicting per-image homography parameters via image-level pooling. Utilizing an inverse-compositional loss, that eliminates the need for a regularization term over the predicted transformations (and thus also obviates the hyperparameter tuning associated with such terms), FastJAM performs image JA quickly and effectively. Experimental results on several benchmarks demonstrate that FastJAM achieves results better than existing modern JA methods in terms of alignment quality, while reducing computation time from hours or minutes to mere seconds. Our code is available at our project webpage, https://bgu-cs-vil.github.io/FastJAM/
comment: Accepted to NeurIPS 2025. Pages 1-10 are the Main Paper. Pages 23-31 are Supplemental Material. FastJAM website - https://bgu-cs-vil.github.io/FastJAM/
♻ ☆ Open3D-VQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space
Spatial reasoning is a fundamental capability of multimodal large language models (MLLMs), yet their performance in open aerial environments remains underexplored. In this work, we present Open3D-VQA, a novel benchmark for evaluating MLLMs' ability to reason about complex spatial relationships from an aerial perspective. The benchmark comprises 73k QA pairs spanning 7 general spatial reasoning tasks, including multiple-choice, true/false, and short-answer formats, and supports both visual and point cloud modalities. The questions are automatically generated from spatial relations extracted from both real-world and simulated aerial scenes. Evaluation on 13 popular MLLMs reveals that: 1) Models are generally better at answering questions about relative spatial relations than absolute distances, 2) 3D LLMs fail to demonstrate significant advantages over 2D LLMs, and 3) Fine-tuning solely on the simulated dataset can significantly improve the model's spatial reasoning performance in real-world scenarios. We release our benchmark, data generation pipeline, and evaluation toolkit to support further research: https://github.com/EmbodiedCity/Open3D-VQA.code.
♻ ☆ Activation Matching for Explanation Generation
In this paper we introduce an activation-matching--based approach to generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image. Given an input image $x$ and a frozen model $f$, we train a lightweight autoencoder to output a binary mask $m$ such that the explanation $e = m \odot x$ preserves both the model's prediction and the intermediate activations of \(x\). Our objective combines: (i) multi-layer activation matching with KL divergence to align distributions and cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors -- L1 area for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Together, these objectives yield small, human-interpretable masks that retain classifier behavior while discarding irrelevant input regions, providing practical and faithful minimalist explanations for the decision making of the underlying model.
♻ ☆ Single Image Estimation of Cell Migration Direction by Deep Circular Regression
In this paper, we address the problem of estimating the migration direction of cells based on a single image. A solution to this problem lays the foundation for a variety of applications that were previously not possible. To our knowledge, there is only one related work that employs a classification CNN with four classes (quadrants). However, this approach does not allow for detailed directional resolution. We tackle the single image estimation problem using deep circular regression, with a particular focus on cycle-sensitive methods. On two common datasets, we achieve a mean estimation error of $\sim\!17^\circ$, representing a significant improvement over previous work, which reported estimation error of $30^\circ$ and $34^\circ$, respectively.
♻ ☆ When are radiology reports useful for training medical image classifiers?
Medical images used to train machine learning models are often accompanied by radiology reports containing rich expert annotations. However, relying on these reports as inputs for clinical prediction requires the timely manual work of a trained radiologist. This raises a natural question: when can radiology reports be leveraged during training to improve image-only classification? Prior works are limited to evaluating pre-trained image representations by fine-tuning them to predict diagnostic labels, often extracted from reports, ignoring tasks with labels that are weakly associated with the text. To address this gap, we conduct a systematic study of how radiology reports can be used during both pre-training and fine-tuning, across diagnostic and prognostic tasks (e.g., 12-month readmission), and under varying training set sizes. Our findings reveal that: (1) Leveraging reports during pre-training is beneficial for downstream classification tasks where the label is well-represented in the text; however, pre-training through explicit image-text alignment can be detrimental in settings where it's not; (2) Fine-tuning with reports can lead to significant improvements and even have a larger impact than the pre-training method in certain settings. These results provide actionable insights into when and how to leverage privileged text data to train medical image classifiers while highlighting gaps in current research.
♻ ☆ MagicPortrait: Temporally Consistent Face Reenactment with 3D Geometric Guidance
In this study, we propose a method for video face reenactment that integrates a 3D face parametric model into a latent diffusion framework, aiming to improve shape consistency and motion control in existing video-based face generation approaches. Our approach employs the FLAME (Faces Learned with an Articulated Model and Expressions) model as the 3D face parametric representation, providing a unified framework for modeling face expressions and head pose. This not only enables precise extraction of motion features from driving videos, but also contributes to the faithful preservation of face shape and geometry. Specifically, we enhance the latent diffusion model with rich 3D expression and detailed pose information by incorporating depth maps, normal maps, and rendering maps derived from FLAME sequences. These maps serve as motion guidance and are encoded into the denoising UNet through a specifically designed Geometric Guidance Encoder (GGE). A multi-layer feature fusion module with integrated self-attention mechanisms is used to combine facial appearance and motion latent features within the spatial domain. By utilizing the 3D face parametric model as motion guidance, our method enables parametric alignment of face identity between the reference image and the motion captured from the driving video. Experimental results on benchmark datasets show that our method excels at generating high-quality face animations with precise expression and head pose variation modeling. In addition, it demonstrates strong generalization performance on out-of-domain images. Code is publicly available at https://github.com/weimengting/MagicPortrait.
♻ ☆ HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models NeurIPS2025
Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they widely equip with, e.g., CLIP and SAM, which lack the alignment with language at multi-granularity levels. To address this issue, in this paper, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed as HyperET, which can optimize visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with M\"{o}bius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET consistently improves both existing pre-training and fine-tuning MLLMs clearly with less than 1\% additional parameters.
comment: Accepted by NeurIPS2025 (Oral)
♻ ☆ DGTRSD & DGTRS-CLIP: A Dual-Granularity Remote Sensing Image-Text Dataset and Vision Language Foundation Model for Alignment
Vision Language Foundation Models based on CLIP architecture for remote sensing primarily rely on short text captions, which often result in incomplete semantic representations. Although longer captions convey richer information, existing models struggle to process them effectively because of limited text-encoding capacity, and there remains a shortage of resources that align remote sensing images with both short text and long text captions. To address this gap, we introduce DGTRSD, a dual-granularity remote sensing image-text dataset, where each image is paired with both a short text caption and a long text description, providing a solid foundation for dual-granularity semantic modeling. Based on this, we further propose DGTRS-CLIP, a dual-granularity curriculum learning framework that combines short text and long text supervision to achieve dual-granularity semantic alignment. Extensive experiments on four typical zero-shot tasks: long text cross-modal retrieval, short text cross-modal retrieval, image classification, and semantic localization demonstrate that DGTRS-CLIP consistently outperforms existing methods across all tasks. The code has been open-sourced and is available at https://github.com/MitsuiChen14/DGTRS.
♻ ☆ XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark
Document Reading Order Recovery is a fundamental task in document image understanding, playing a pivotal role in enhancing Retrieval-Augmented Generation (RAG) and serving as a critical preprocessing step for large language models (LLMs). Existing methods often struggle with complex layouts(e.g., multi-column newspapers), high-overhead interactions between cross-modal elements (visual regions and textual semantics), and a lack of robust evaluation benchmarks. We introduce XY-Cut++, an advanced layout ordering method that integrates pre-mask processing, multi-granularity segmentation, and cross-modal matching to address these challenges. Our method significantly enhances layout ordering accuracy compared to traditional XY-Cut techniques. Specifically, XY-Cut++ achieves state-of-the-art performance (98.8 BLEU overall) while maintaining simplicity and efficiency. It outperforms existing baselines by up to 24\% and demonstrates consistent accuracy across simple and complex layouts on the newly introduced DocBench-100 dataset. This advancement establishes a reliable foundation for document structure recovery, setting a new standard for layout ordering tasks and facilitating more effective RAG and LLM preprocessing.
♻ ☆ NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation
Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/.
comment: Accepted by Neurips2025, Project page at at https://artanic30.github.io/project_pages/NoisyGRPO/
♻ ☆ InstDrive: Instance-Aware 3D Gaussian Splatting for Driving Scenes
Reconstructing dynamic driving scenes from dashcam videos has attracted increasing attention due to its significance in autonomous driving and scene understanding. While recent advances have made impressive progress, most methods still unify all background elements into a single representation, hindering both instance-level understanding and flexible scene editing. Some approaches attempt to lift 2D segmentation into 3D space, but often rely on pre-processed instance IDs or complex pipelines to map continuous features to discrete identities. Moreover, these methods are typically designed for indoor scenes with rich viewpoints, making them less applicable to outdoor driving scenarios. In this paper, we present InstDrive, an instance-aware 3D Gaussian Splatting framework tailored for the interactive reconstruction of dynamic driving scene. We use masks generated by SAM as pseudo ground-truth to guide 2D feature learning via contrastive loss and pseudo-supervised objectives. At the 3D level, we introduce regularization to implicitly encode instance identities and enforce consistency through a voxel-based loss. A lightweight static codebook further bridges continuous features and discrete identities without requiring data pre-processing or complex optimization. Quantitative and qualitative experiments demonstrate the effectiveness of InstDrive, and to the best of our knowledge, it is the first framework to achieve 3D instance segmentation in dynamic, open-world driving scenes.More visualizations are available at our project page.
♻ ☆ DPMambaIR: All-in-One Image Restoration via Degradation-Aware Prompt State Space Model
All-in-One image restoration aims to address multiple image degradation problems using a single model, offering a more practical and versatile solution compared to designing dedicated models for each degradation type. Existing approaches typically rely on Degradation-specific models or coarse-grained degradation prompts to guide image restoration. However, they lack fine-grained modeling of degradation information and face limitations in balancing multi-task conflicts. To overcome these limitations, we propose DPMambaIR, a novel All-in-One image restoration framework that introduces a fine-grained degradation extractor and a Degradation-Aware Prompt State Space Model (DP-SSM). The DP-SSM leverages the fine-grained degradation features captured by the extractor as dynamic prompts, which are then incorporated into the state space modeling process. This enhances the model's adaptability to diverse degradation types, while a complementary High-Frequency Enhancement Block (HEB) recovers local high-frequency details. Extensive experiments on a mixed dataset containing seven degradation types show that DPMambaIR achieves the best performance, with 27.69dB and 0.893 in PSNR and SSIM, respectively. These results highlight the potential and superiority of DPMambaIR as a unified solution for All-in-One image restoration.
♻ ☆ Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models
End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common ineffectiveness in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR-WM first establishes a robust bird's-eye-view representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the "residual", i.e., the changes conditioned on the ego-vehicle's actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting-planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.
♻ ☆ InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts
Understanding infographic charts with design-driven visual elements (e.g., pictograms, icons) requires both visual recognition and reasoning, posing challenges for multimodal large language models (MLLMs). However, existing visual-question answering benchmarks fall short in evaluating these capabilities of MLLMs due to the lack of paired plain charts and visual-element-based questions. To bridge this gap, we introduce InfoChartQA, a benchmark for evaluating MLLMs on infographic chart understanding. It includes 5,642 pairs of infographic and plain charts, each sharing the same underlying data but differing in visual presentations. We further design visual-element-based questions to capture their unique visual designs and communicative intent. Evaluation of 20 MLLMs reveals a substantial performance decline on infographic charts, particularly for visual-element-based questions related to metaphors. The paired infographic and plain charts enable fine-grained error analysis and ablation studies, which highlight new opportunities for advancing MLLMs in infographic chart understanding. We release InfoChartQA at https://github.com/CoolDawnAnt/InfoChartQA.
♻ ☆ HF-VTON: High-Fidelity Virtual Try-On via Consistent Geometric and Semantic Alignment
Virtual try-on technology has become increasingly important in the fashion and retail industries, enabling the generation of high-fidelity garment images that adapt seamlessly to target human models. While existing methods have achieved notable progress, they still face significant challenges in maintaining consistency across different poses. Specifically, geometric distortions lead to a lack of spatial consistency, mismatches in garment structure and texture across poses result in semantic inconsistency, and the loss or distortion of fine-grained details diminishes visual fidelity. To address these challenges, we propose HF-VTON, a novel framework that ensures high-fidelity virtual try-on performance across diverse poses. HF-VTON consists of three key modules: (1) the Appearance-Preserving Warp Alignment Module (APWAM), which aligns garments to human poses, addressing geometric deformations and ensuring spatial consistency; (2) the Semantic Representation and Comprehension Module (SRCM), which captures fine-grained garment attributes and multi-pose data to enhance semantic representation, maintaining structural, textural, and pattern consistency; and (3) the Multimodal Prior-Guided Appearance Generation Module (MPAGM), which integrates multimodal features and prior knowledge from pre-trained models to optimize appearance generation, ensuring both semantic and geometric consistency. Additionally, to overcome data limitations in existing benchmarks, we introduce the SAMP-VTONS dataset, featuring multi-pose pairs and rich textual annotations for a more comprehensive evaluation. Experimental results demonstrate that HF-VTON outperforms state-of-the-art methods on both VITON-HD and SAMP-VTONS, excelling in visual fidelity, semantic consistency, and detail preservation.
comment: After the publication of the paper, we discovered some significant errors/omissions that need to be corrected and improved
♻ ☆ Diverse Teaching and Label Propagation for Generic Semi-Supervised Medical Image Segmentation
Both limited annotation and domain shift are significant challenges frequently encountered in medical image segmentation, leading to derivative scenarios like semi-supervised medical (SSMIS), semi-supervised medical domain generalization (Semi-MDG) and unsupervised medical domain adaptation (UMDA). Conventional methods are generally tailored to specific tasks in isolation, the error accumulation hinders the effective utilization of unlabeled data and limits further improvements, resulting in suboptimal performance when these issues occur. In this paper, we aim to develop a generic framework that masters all three tasks. We found that the key to solving the problem lies in how to generate reliable pseudo labels for the unlabeled data in the presence of domain shift with labeled data and increasing the diversity of the model. To tackle this issue, we employ a Diverse Teaching and Label Propagation Network (DTLP-Net) to boosting the Generic Semi-Supervised Medical Image Segmentation. Our DTLP-Net involves a single student model and two diverse teacher models, which can generate reliable pseudo-labels for the student model. The first teacher model decouple the training process with labeled and unlabeled data, The second teacher is momentum-updated periodically, thus generating reliable yet divers pseudo-labels. To fully utilize the information within the data, we adopt inter-sample and intra-sample data augmentation to learn the global and local knowledge. In addition, to further capture the voxel-level correlations, we propose label propagation to enhance the model robust. We evaluate our proposed framework on five benchmark datasets for SSMIS, UMDA, and Semi-MDG tasks. The results showcase notable improvements compared to state-of-the-art methods across all five settings, indicating the potential of our framework to tackle more challenging SSL scenarios.
comment: Under Review
♻ ☆ Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving
Ensuring the safety of vision-language models (VLMs) in autonomous driving systems is of paramount importance, yet existing research has largely focused on conventional benchmarks rather than safety-critical evaluation. In this work, we present SCD-Bench (Safety Cognition Driving Benchmark) a novel framework specifically designed to assess the safety cognition capabilities of VLMs within interactive driving scenarios. To address the scalability challenge of data annotation, we introduce ADA (Autonomous Driving Annotation), a semi-automated labeling system, further refined through expert review by professionals with domain-specific knowledge in autonomous driving. To facilitate scalable and consistent evaluation, we also propose an automated assessment pipeline leveraging large language models, which demonstrates over 98% agreement with human expert judgments. In addressing the broader challenge of aligning VLMs with safety cognition in driving environments, we construct SCD-Training, the first large-scale dataset tailored for this task, comprising 324.35K high-quality samples. Through extensive experiments, we show that models trained on SCD-Training exhibit marked improvements not only on SCD-Bench, but also on general and domain-specific benchmarks, offering a new perspective on enhancing safety-aware interactions in vision-language systems for autonomous driving.
♻ ☆ Depth-Aware Super-Resolution via Distance-Adaptive Variational Formulation
Single image super-resolution traditionally assumes spatially-invariant degradation models, yet real-world imaging systems exhibit complex distance-dependent effects including atmospheric scattering, depth-of-field variations, and perspective distortions. This fundamental limitation necessitates spatially-adaptive reconstruction strategies that explicitly incorporate geometric scene understanding for optimal performance. We propose a rigorous variational framework that characterizes super-resolution as a spatially-varying inverse problem, formulating the degradation operator as a pseudodifferential operator with distance-dependent spectral characteristics that enable theoretical analysis of reconstruction limits across depth ranges. Our neural architecture implements discrete gradient flow dynamics through cascaded residual blocks with depth-conditional convolution kernels, ensuring convergence to stationary points of the theoretical energy functional while incorporating learned distance-adaptive regularization terms that dynamically adjust smoothness constraints based on local geometric structure. Spectral constraints derived from atmospheric scattering theory prevent bandwidth violations and noise amplification in far-field regions, while adaptive kernel generation networks learn continuous mappings from depth to reconstruction filters. Comprehensive evaluation across five benchmark datasets demonstrates state-of-the-art performance, achieving 36.89/0.9516 and 30.54/0.8721 PSNR/SSIM at 2 and 4 scales on KITTI outdoor scenes, outperforming existing methods by 0.44dB and 0.36dB respectively. This work establishes the first theoretically-grounded distance-adaptive super-resolution framework and demonstrates significant improvements on depth-variant scenarios while maintaining competitive performance across traditional benchmarks.
♻ ☆ Why Foundation Models in Pathology Are Failing
In non-medical domains, foundation models (FMs) have revolutionized computer vision and language processing through large-scale self-supervised and multimodal learning. Consequently, their rapid adoption in computational pathology was expected to deliver comparable breakthroughs in cancer diagnosis, prognostication, and multimodal retrieval. However, recent systematic evaluations reveal fundamental weaknesses: low diagnostic accuracy, poor robustness, geometric instability, heavy computational demands, and concerning safety vulnerabilities. This short paper examines these shortcomings and argues that they stem from deeper conceptual mismatches between the assumptions underlying generic foundation modeling in mainstream AI and the intrinsic complexity of human tissue. Seven interrelated causes are identified: biological complexity, ineffective self-supervision, overgeneralization, excessive architectural complexity, lack of domain-specific innovation, insufficient data, and a fundamental design flaw related to tissue patch size. These findings suggest that current pathology foundation models remain conceptually misaligned with the nature of tissue morphology and call for a fundamental rethinking of the paradigm itself.
♻ ☆ LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~ 35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.
comment: Withdrawn because the submission was premature and not agreed by all parties in collaboration
♻ ☆ Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning NeurIPS2025
Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model's cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model (3) while incorrect predicted samples are finally used for Group Relative Policy Optimization (GRPO) based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.
comment: [NeurIPS2025] Project Page: https://codegoat24.github.io/UnifiedReward/think
♻ ☆ RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation NeurIPS 2025
Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs' strengths in semantic reasoning and long-horizon planning. These System 2 capabilities-characterized by deliberative, goal-directed thinking-remain under explored due to the limited temporal scale and structural complexity of current benchmarks. To address this gap, we introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. RoboCerebra includes: (1) a large-scale simulation dataset with extended task horizons and diverse subtask sequences in household environments; (2) a hierarchical framework combining a high-level VLM planner with a low-level vision-language-action (VLA) controller; and (3) an evaluation protocol targeting planning, reflection, and memory through structured System 1-System 2 interaction. The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Human operators execute the subtasks in simulation, yielding high-quality trajectories with dynamic object variations. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations. We further benchmark state-of-the-art VLMs as System 2 modules and analyze their performance across key cognitive dimensions, advancing the development of more capable and generalizable robotic planners.
comment: 25 pages, 18 figures, Accepted by NeurIPS 2025
♻ ☆ L2RSI: Cross-view LiDAR-based Place Recognition for Large-scale Urban Scenes via Remote Sensing Imagery NeurIPS 2025
We tackle the challenge of LiDAR-based place recognition, which traditionally depends on costly and time-consuming prior 3D maps. To overcome this, we first construct LiRSI-XA dataset, which encompasses approximately $110,000$ remote sensing submaps and $13,000$ LiDAR point cloud submaps captured in urban scenes, and propose a novel method, L2RSI, for cross-view LiDAR place recognition using high-resolution Remote Sensing Imagery. This approach enables large-scale localization capabilities at a reduced cost by leveraging readily available overhead images as map proxies. L2RSI addresses the dual challenges of cross-view and cross-modal place recognition by learning feature alignment between point cloud submaps and remote sensing submaps in the semantic domain. Additionally, we introduce a novel probability propagation method based on particle estimation to refine position predictions, effectively leveraging temporal and spatial information. This approach enables large-scale retrieval and cross-scene generalization without fine-tuning. Extensive experiments on LiRSI-XA demonstrate that, within a $100km^2$ retrieval range, L2RSI accurately localizes $83.27\%$ of point cloud submaps within a $30m$ radius for top-$1$ retrieved location. Our project page is publicly available at https://shizw695.github.io/L2RSI/.
comment: 17 pages, 7 figures, NeurIPS 2025
♻ ☆ Graph-Theoretic Consistency for Robust and Topology-Aware Semi-Supervised Histopathology Segmentation AAAI 2026
Semi-supervised semantic segmentation (SSSS) is vital in computational pathology, where dense annotations are costly and limited. Existing methods often rely on pixel-level consistency, which propagates noisy pseudo-labels and produces fragmented or topologically invalid masks. We propose Topology Graph Consistency (TGC), a framework that integrates graph-theoretic constraints by aligning Laplacian spectra, component counts, and adjacency statistics between prediction graphs and references. This enforces global topology and improves segmentation accuracy. Experiments on GlaS and CRAG demonstrate that TGC achieves state-of-the-art performance under 5-10% supervision and significantly narrows the gap to full supervision.
comment: Accepted to the AAAI 2026 Student Abstract and Poster Program
♻ ☆ AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets
Background: Development of artificial intelligence (AI) models for lung cancer screening requires large, well-annotated low-dose computed tomography (CT) datasets and rigorous performance benchmarks. Purpose: To create a reproducible benchmarking resource leveraging the Duke Lung Cancer Screening (DLCS) and multiple public datasets to develop and evaluate models for nodule detection and classification. Materials & Methods: This retrospective study uses the DLCS dataset (1,613 patients; 2,487 nodules) and external datasets including LUNA16, LUNA25, and NLST-3D. For detection, MONAI RetinaNet models were trained on DLCS (DLCS-De) and LUNA16 (LUNA16-De) and evaluated using the Competition Performance Metric (CPM). For nodule-level classification, we compare five strategies: pretrained models (Models Genesis, Med3D), a self-supervised foundation model (FMCB), and ResNet50 with random initialization versus Strategic Warm-Start (ResNet50-SWS) pretrained with detection-derived candidate patches stratified by confidence. Results: For detection on the DLCS test set, DLCS-De achieved sensitivity 0.82 at 2 false positives/scan (CPM 0.63) versus LUNA16-De (0.62, CPM 0.45). For external validation on NLST-3D, DLCS-De (sensitivity 0.72, CPM 0.58) also outperformed LUNA16-De (sensitivity 0.64, CPM 0.49). For classification across multiple datasets, ResNet50-SWS attained AUCs of 0.71 (DLCS; 95% CI, 0.61-0.81), 0.90 (LUNA16; 0.87-0.93), 0.81 (NLST-3D; 0.79-0.82), and 0.80 (LUNA25; 0.78-0.82), matching or exceeding pretrained/self-supervised baselines. Performance differences reflected dataset label standards. Conclusion: This work establishes a standardized benchmarking resource for lung cancer AI research, supporting model development, validation, and translation. All code, models, and data are publicly released to promote reproducibility.
comment: 2 tables, 5 figures
♻ ☆ Caption-Driven Explainability: Probing CNNs for Bias via CLIP ICIP 2025
Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the models prediction. This explanation minimizes the risk of the standalone model falling for a covariate shift and contributes significantly towards developing robust ML models. Our code is available at https://github.com/patch0816/caption-driven-xai
comment: Accepted and presented at the IEEE ICIP 2025 Satellite Workshop "Generative AI for World Simulations and Communications & Celebrating 40 Years of Excellence in Education: Honoring Professor Aggelos Katsaggelos", Anchorage, Alaska, USA, September 14, 2025. Camera-ready preprint; the official IEEE Xplore publication will follow. Code: https://github.com/patch0816/caption-driven-xai
♻ ☆ Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers NeurIPS 2025
This paper presents Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into latent space, which inevitably introduces \textit{flying pixels} at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) Semantics-Prompted Diffusion Transformers (SP-DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) Cascade DiT Design that progressively increases the number of tokens to further enhance efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation.
comment: NeurIPS 2025. Project page: https://pixel-perfect-depth.github.io/
♻ ☆ WaMaIR: Image Restoration via Multiscale Wavelet Convolutions and Mamba-based Channel Modeling with Texture Enhancement
Image restoration is a fundamental and challenging task in computer vision, where CNN-based frameworks demonstrate significant computational efficiency. However, previous CNN-based methods often face challenges in adequately restoring fine texture details, which are limited by the small receptive field of CNN structures and the lack of channel feature modeling. In this paper, we propose WaMaIR, which is a novel framework with a large receptive field for image perception and improves the reconstruction of texture details in restored images. Specifically, we introduce the Global Multiscale Wavelet Transform Convolutions (GMWTConvs) for expandding the receptive field to extract image features, preserving and enriching texture features in model inputs. Meanwhile, we propose the Mamba-Based Channel-Aware Module (MCAM), explicitly designed to capture long-range dependencies within feature channels, which enhancing the model sensitivity to color, edges, and texture information. Additionally, we propose Multiscale Texture Enhancement Loss (MTELoss) for image restoration to guide the model in preserving detailed texture structures effectively. Extensive experiments confirm that WaMaIR outperforms state-of-the-art methods, achieving better image restoration and efficient computational performance of the model.
comment: Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Oral
♻ ☆ Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process-where people skip reasoning for easy questions but think carefully when needed-we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO, without sacrificing performance or even improving it. Further evaluations across LLM (GSM8K), VLM (CLEVR, Super-CLEVR, GeoQA), and Agentic (AITZ) tasks-covering a range of reasoning difficulties under both 3B and 7B models-consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in RL approaches. Our code is available at https://github.com/kokolerk/TON.
comment: camera ready revision
♻ ☆ Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum NeurIPS 2025
Self-Supervised Learning (SSL) has become a powerful solution to extract rich representations from unlabeled data. Yet, SSL research is mostly focused on clean, curated and high-quality datasets. As a result, applying SSL on noisy data remains a challenge, despite being crucial to applications such as astrophysics, medical imaging, geophysics or finance. In this work, we present a fully self-supervised framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning. Our method first trains an SSL denoiser on noisy data, then uses it to construct a denoised-to-noisy data curriculum (i.e., training first on denoised, then noisy samples) for pretraining a SSL backbone (e.g., DINOv2), combined with a teacher-guided regularization that anchors noisy embeddings to their denoised counterparts. This process encourages the model to internalize noise robustness. Notably, the denoiser can be discarded after pretraining, simplifying deployment. On ImageNet-1k with ViT-B under extreme Gaussian noise ($\sigma=255$, SNR = 0.72 dB), our method improves linear probing accuracy by 4.8% over DINOv2, demonstrating that denoiser-free robustness can emerge from noise-aware pretraining. The code is available at https://github.com/wenquanlu/noisy_dinov2.
comment: NeurIPS 2025
Dynamic Traceback Learning for Medical Report Generation
Automated medical report generation has demonstrated the potential to significantly reduce the workload associated with time-consuming medical reporting. Recent generative representation learning methods have shown promise in integrating vision and language modalities for medical report generation. However, when trained end-to-end and applied directly to medical image-to-text generation, they face two significant challenges: i) difficulty in accurately capturing subtle yet crucial pathological details, and ii) reliance on both visual and textual inputs during inference, leading to performance degradation in zero-shot inference when only images are available. To address these challenges, this study proposes a novel multimodal dynamic traceback learning framework (DTrace). Specifically, we introduce a traceback mechanism to supervise the semantic validity of generated content and a dynamic learning strategy to adapt to various proportions of image and text input, enabling text generation without strong reliance on the input from both modalities during inference. The learning of cross-modal knowledge is enhanced by supervising the model to recover masked semantic information from a complementary counterpart. Extensive experiments conducted on two benchmark datasets, IU-Xray and MIMIC-CXR, demonstrate that the proposed DTrace framework outperforms state-of-the-art methods for medical report generation.
comment: Accepted to IEEE Transactions on Multimedia (TMM)
♻ ☆ Learning World Models for Interactive Video Generation
Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanism leads to incoherence of world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models. In contrast, naive autoregressive generation with extended context windows and retrieval-augmented generation prove less effective for video generation, primarily due to the limited in-context learning capabilities of current video models. Our work illuminates the fundamental challenges in video world models and establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.
comment: Project page: https://sites.google.com/view/vrag
♻ ☆ OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models
We consider the problem of text-to-video generation tasks with precise control for various applications such as camera movement control and video-to-video editing. Most methods tacking this problem rely on providing user-defined controls, such as binary masks or camera movement embeddings. In our approach we propose OnlyFlow, an approach leveraging the optical flow firstly extracted from an input video to condition the motion of generated videos. Using a text prompt and an input video, OnlyFlow allows the user to generate videos that respect the motion of the input video as well as the text prompt. This is implemented through an optical flow estimation model applied on the input video, which is then fed to a trainable optical flow encoder. The output feature maps are then injected into the text-to-video backbone model. We perform quantitative, qualitative and user preference studies to show that OnlyFlow positively compares to state-of-the-art methods on a wide range of tasks, even though OnlyFlow was not specifically trained for such tasks. OnlyFlow thus constitutes a versatile, lightweight yet efficient method for controlling motion in text-to-video generation. Models and code will be made available on GitHub and HuggingFace.
comment: 8 pages, 1 supplementary page, 9 figures
♻ ☆ Quality-Aware Prototype Memory for Face Representation Learning
Prototype Memory is a powerful model for face representation learning. It enables training face recognition models on datasets of any size by generating prototypes (classifier weights) on the fly and efficiently utilizing them. Prototype Memory demonstrated strong results in many face recognition benchmarks. However, the algorithm of prototype generation, used in it, is prone to the problems of imperfectly calculated prototypes in case of low-quality or poorly recognizable faces in the images, selected for the prototype creation. All images of the same person presented in the mini-batch are used with equal weights, and the resulting averaged prototype can be contaminated by imperfect embeddings of low-quality face images. This may lead to misleading training signals and degrade the performance of the trained models. In this paper, we propose a simple and effective way to improve Prototype Memory with quality-aware prototype generation. Quality-Aware Prototype Memory uses different weights for images of different quality in the process of prototype generation. With this improvement, prototypes receive more informative signals from high-quality images and are less affected by low-quality ones. We propose and compare several methods of quality estimation and usage, perform extensive experiments on the different face recognition benchmarks and demonstrate the advantages of the proposed model compared to the basic version of Prototype Memory.
comment: Preprint
♻ ☆ GenIR: Generative Visual Feedback for Mental Image Retrieval NeurIPS 2025
Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind. That is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system's understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction
comment: NeurIPS 2025
♻ ☆ MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory ACM MM '25
Recent advances in vision-language models have enabled rich semantic understanding across modalities. However, these encoding methods lack the ability to interpret or reason about the moral dimensions of content-a crucial aspect of human cognition. In this paper, we address this gap by introducing MoralCLIP, a novel embedding representation method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). Our approach integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. MoralCLIP is grounded on the multi-label dataset Social-Moral Image Database to identify co-occurring moral foundations in visual content. For MoralCLIP training, we design a moral data augmentation strategy to scale our annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Our results demonstrate that explicit moral supervision improves both unimodal and multimodal understanding of moral content, establishing a foundation for morally-aware AI systems capable of recognizing and aligning with human moral values.
comment: Updated version: corresponds to the ACM MM '25 published paper and includes full appendix material
♻ ☆ CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data
True intelligence hinges on the ability to uncover and leverage hidden causal relations. Despite significant progress in AI and computer vision (CV), there remains a lack of benchmarks for assessing models' abilities to infer latent causality from complex visual data. In this paper, we introduce \textsc{\textbf{Causal3D}}, a novel and comprehensive benchmark that integrates structured data (tables) with corresponding visual representations (images) to evaluate causal reasoning. Designed within a systematic framework, Causal3D comprises 19 3D-scene datasets capturing diverse causal relations, views, and backgrounds, enabling evaluations across scenes of varying complexity. We assess multiple state-of-the-art methods, including classical causal discovery, causal representation learning, and large/vision-language models (LLMs/VLMs). Our experiments show that as causal structures grow more complex without prior knowledge, performance declines significantly, highlighting the challenges even advanced methods face in complex causal scenarios. Causal3D serves as a vital resource for advancing causal reasoning in CV and fostering trustworthy AI in critical domains.
comment: Datasets link: https://huggingface.co/datasets/LLDDSS/Causal3D_Dataset
♻ ☆ LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering NeurIPS 2025
In this work, we present a novel level-of-detail (LOD) method for 3D Gaussian Splatting that enables real-time rendering of large-scale scenes on memory-constrained devices. Our approach introduces a hierarchical LOD representation that iteratively selects optimal subsets of Gaussians based on camera distance, thus largely reducing both rendering time and GPU memory usage. We construct each LOD level by applying a depth-aware 3D smoothing filter, followed by importance-based pruning and fine-tuning to maintain visual fidelity. To further reduce memory overhead, we partition the scene into spatial chunks and dynamically load only relevant Gaussians during rendering, employing an opacity-blending mechanism to avoid visual artifacts at chunk boundaries. Our method achieves state-of-the-art performance on both outdoor (Hierarchical 3DGS) and indoor (Zip-NeRF) datasets, delivering high-quality renderings with reduced latency and memory requirements.
comment: NeurIPS 2025; Web: https://lodge-gs.github.io/
♻ ☆ NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods NeurIPS 2025
Novel view synthesis is an important problem with many applications, including AR/VR, gaming, and robotic simulations. With the recent rapid development of Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) methods, it is becoming difficult to keep track of the current state of the art (SoTA) due to methods using different evaluation protocols, codebases being difficult to install and use, and methods not generalizing well to novel 3D scenes. In our experiments, we show that even tiny differences in the evaluation protocols of various methods can artificially boost the performance of these methods. This raises questions about the validity of quantitative comparisons performed in the literature. To address these questions, we propose NerfBaselines, an evaluation framework which provides consistent benchmarking tools, ensures reproducibility, and simplifies the installation and use of various methods. We validate our implementation experimentally by reproducing the numbers reported in the original papers. For improved accessibility, we release a web platform that compares commonly used methods on standard benchmarks. We strongly believe NerfBaselines is a valuable contribution to the community as it ensures that quantitative results are comparable and thus truly measure progress in the field of novel view synthesis.
comment: NeurIPS 2025 D&B; Web: https://jkulhanek.com/nerfbaselines
♻ ☆ VC4VG: Optimizing Video Captions for Text-to-Video Generation EMNLP 2025
Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/alimama-creative/VC4VG to support further research.
comment: Accepted by EMNLP 2025
♻ ☆ KARMA: Efficient Structural Defect Segmentation via Kolmogorov-Arnold Representation Learning
Semantic segmentation of structural defects in civil infrastructure remains challenging due to variable defect appearances, harsh imaging conditions, and significant class imbalance. Current deep learning methods, despite their effectiveness, typically require millions of parameters, rendering them impractical for real-time inspection systems. We introduce KARMA (Kolmogorov-Arnold Representation Mapping Architecture), a highly efficient semantic segmentation framework that models complex defect patterns through compositions of one-dimensional functions rather than conventional convolutions. KARMA features three technical innovations: (1) a parameter-efficient Tiny Kolmogorov-Arnold Network (TiKAN) module leveraging low-rank factorization for KAN-based feature transformation; (2) an optimized feature pyramid structure with separable convolutions for multi-scale defect analysis; and (3) a static-dynamic prototype mechanism that enhances feature representation for imbalanced classes. Extensive experiments on benchmark infrastructure inspection datasets demonstrate that KARMA achieves competitive or superior mean IoU performance compared to state-of-the-art approaches, while using significantly fewer parameters (0.959M vs. 31.04M, a 97% reduction). Operating at 0.264 GFLOPS, KARMA maintains inference speeds suitable for real-time deployment, enabling practical automated infrastructure inspection systems without compromising accuracy. The source code can be accessed at the following URL: https://github.com/faeyelab/karma.
comment: This work has been submitted to the IEEE for possible publication
Machine Learning 100
☆ E-Scores for (In)Correctness Assessment of Generative Model Outputs
While generative models, especially large language models (LLMs), are ubiquitous in today's world, principled mechanisms to assess their (in)correctness are limited. Using the conformal prediction framework, previous works construct sets of LLM responses where the probability of including an incorrect response, or error, is capped at a desired user-defined tolerance level. However, since these methods are based on p-values, they are susceptible to p-hacking, i.e., choosing the tolerance level post-hoc can invalidate the guarantees. We therefore leverage e-values to complement generative model outputs with e-scores as a measure of incorrectness. In addition to achieving the same statistical guarantees as before, e-scores provide users flexibility in adaptively choosing tolerance levels after observing the e-scores themselves, by upper bounding a post-hoc notion of error called size distortion. We experimentally demonstrate their efficacy in assessing LLM outputs for different correctness types: mathematical factuality and property constraints satisfaction.
☆ Neural Stochastic Flows: Solver-Free Modelling and Inference for SDE Solutions NeurIPS 2025
Stochastic differential equations (SDEs) are well suited to modelling noisy and irregularly sampled time series found in finance, physics, and machine learning. Traditional approaches require costly numerical solvers to sample between arbitrary time points. We introduce Neural Stochastic Flows (NSFs) and their latent variants, which directly learn (latent) SDE transition laws using conditional normalising flows with architectural constraints that preserve properties inherited from stochastic flows. This enables one-shot sampling between arbitrary states and yields up to two orders of magnitude speed-ups at large time gaps. Experiments on synthetic SDE simulations and on real-world tracking and video data show that NSFs maintain distributional accuracy comparable to numerical approaches while dramatically reducing computation for arbitrary time-point sampling.
comment: NeurIPS 2025 (poster). Project page: https://nkiyohara.github.io/nsf-neurips2025/
☆ Synthetic Data Reveals Generalization Gaps in Correlated Multiple Instance Learning
Multiple instance learning (MIL) is often used in medical imaging to classify high-resolution 2D images by processing patches or classify 3D volumes by processing slices. However, conventional MIL approaches treat instances separately, ignoring contextual relationships such as the appearance of nearby patches or slices that can be essential in real applications. We design a synthetic classification task where accounting for adjacent instance features is crucial for accurate prediction. We demonstrate the limitations of off-the-shelf MIL approaches by quantifying their performance compared to the optimal Bayes estimator for this task, which is available in closed-form. We empirically show that newer correlated MIL methods still struggle to generalize as well as possible when trained from scratch on tens of thousands of instances.
☆ MLPrE -- A tool for preprocessing and exploratory data analysis prior to machine learning model construction
With the recent growth of Deep Learning for AI, there is a need for tools to meet the demand of data flowing into those models. In some cases, source data may exist in multiple formats, and therefore the source data must be investigated and properly engineered for a Machine Learning model or graph database. Overhead and lack of scalability with existing workflows limit integration within a larger processing pipeline such as Apache Airflow, driving the need for a robust, extensible, and lightweight tool to preprocess arbitrary datasets that scales with data type and size. To address this, we present Machine Learning Preprocessing and Exploratory Data Analysis, MLPrE, in which SparkDataFrames were utilized to hold data during processing and ensure scalability. A generalizable JSON input file format was utilized to describe stepwise changes to that DataFrame. Stages were implemented for input and output, filtering, basic statistics, feature engineering, and exploratory data analysis. A total of 69 stages were implemented into MLPrE, of which we highlight and demonstrate key stages using six diverse datasets. We further highlight MLPrE's ability to independently process multiple fields in flat files and recombine them, otherwise requiring an additional pipeline, using a UniProt glossary term dataset. Building on this advantage, we demonstrated the clustering stage with available wine quality data. Lastly, we demonstrate the preparation of data for a graph database in the final stages of MLPrE using phosphosite kinase data. Overall, our MLPrE tool offers a generalizable and scalable tool for preprocessing and early data analysis, filling a critical need for such a tool given the ever expanding use of machine learning. This tool serves to accelerate and simplify early stage development in larger workflows.
☆ How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs NeurIPS 2025
Pretrained Transformers demonstrate remarkable in-context learning (ICL) capabilities, enabling them to adapt to new tasks from demonstrations without parameter updates. However, theoretical studies often rely on simplified architectures (e.g., omitting MLPs), data models (e.g., linear regression with isotropic inputs), and single-source training, limiting their relevance to realistic settings. In this work, we study ICL in pretrained Transformers with nonlinear MLP heads on nonlinear tasks drawn from multiple data sources with heterogeneous input, task, and noise distributions. We analyze a model where the MLP comprises two layers, with the first layer trained via a single gradient step and the second layer fully optimized. Under high-dimensional asymptotics, we prove that such models are equivalent in ICL error to structured polynomial predictors, leveraging results from the theory of Gaussian universality and orthogonal polynomials. This equivalence reveals that nonlinear MLPs meaningfully enhance ICL performance, particularly on nonlinear tasks, compared to linear baselines. It also enables a precise analysis of data mixing effects: we identify key properties of high-quality data sources (low noise, structured covariances) and show that feature learning emerges only when the task covariance exhibits sufficient structure. These results are validated empirically across various activation functions, model sizes, and data distributions. Finally, we experiment with a real-world scenario involving multilingual sentiment analysis where each language is treated as a different source. Our experimental results for this case exemplify how our findings extend to real-world cases. Overall, our work advances the theoretical foundations of ICL in Transformers and provides actionable insight into the role of architecture and data in ICL.
comment: NeurIPS 2025, 24 pages, 6 figures
☆ Meshless solutions of PDE inverse problems on irregular geometries
Solving inverse and optimization problems over solutions of nonlinear partial differential equations (PDEs) on complex spatial domains is a long-standing challenge. Here we introduce a method that parameterizes the solution using spectral bases on arbitrary spatiotemporal domains, whereby the basis is defined on a hyperrectangle containing the true domain. We find the coefficients of the basis expansion by solving an optimization problem whereby both the equations, the boundary conditions and any optimization targets are enforced by a loss function, building on a key idea from Physics-Informed Neural Networks (PINNs). Since the representation of the function natively has exponential convergence, so does the solution of the optimization problem, as long as it can be solved efficiently. We find empirically that the optimization protocols developed for machine learning find solutions with exponential convergence on a wide range of equations. The method naturally allows for the incorporation of data assimilation by including additional terms in the loss function, and for the efficient solution of optimization problems over the PDE solutions.
☆ Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation
Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates the alignment between the draft and target model outputs, coupled with the inadequate use of the two-dimensional spatial structure inherent in images, thereby limiting the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models, while preserving both image fidelity and diversity.
☆ LieSolver: A PDE-constrained solver for IBVPs using Lie symmetries
We introduce a method for efficiently solving initial-boundary value problems (IBVPs) that uses Lie symmetries to enforce the associated partial differential equation (PDE) exactly by construction. By leveraging symmetry transformations, the model inherently incorporates the physical laws and learns solutions from initial and boundary data. As a result, the loss directly measures the model's accuracy, leading to improved convergence. Moreover, for well-posed IBVPs, our method enables rigorous error estimation. The approach yields compact models, facilitating an efficient optimization. We implement LieSolver and demonstrate its application to linear homogeneous PDEs with a range of initial conditions, showing that it is faster and more accurate than physics-informed neural networks (PINNs). Overall, our method improves both computational efficiency and the reliability of predictions for PDE-constrained problems.
☆ Physics-Guided Conditional Diffusion Networks for Microwave Image Reconstruction
A conditional latent-diffusion based framework for solving the electromagnetic inverse scattering problem associated with microwave imaging is introduced. This generative machine-learning model explicitly mirrors the non-uniqueness of the ill-posed inverse problem. Unlike existing inverse solvers utilizing deterministic machine learning techniques that produce a single reconstruction, the proposed latent-diffusion model generates multiple plausible permittivity maps conditioned on measured scattered-field data, thereby generating several potential instances in the range-space of the non-unique inverse mapping. A forward electromagnetic solver is integrated into the reconstruction pipeline as a physics-based evaluation mechanism. The space of candidate reconstructions form a distribution of possibilities consistent with the conditioning data and the member of this space yielding the lowest scattered-field data discrepancy between the predicted and measured scattered fields is reported as the final solution. Synthetic and experimental labeled datasets are used for training and evaluation of the model. An innovative labeled synthetic dataset is created that exemplifies a varied set of scattering features. Training of the model using this new dataset produces high quality permittivity reconstructions achieving improved generalization with excellent fidelity to shape recognition. The results highlight the potential of hybrid generative physics frameworks as a promising direction for robust, data-driven microwave imaging.
☆ Scaling flow-based approaches for topology sampling in $\mathrm{SU}(3)$ gauge theory
We develop a methodology based on out-of-equilibrium simulations to mitigate topological freezing when approaching the continuum limit of lattice gauge theories. We reduce the autocorrelation of the topological charge employing open boundary conditions, while removing exactly their unphysical effects using a non-equilibrium Monte Carlo approach in which periodic boundary conditions are gradually switched on. We perform a detailed analysis of the computational costs of this strategy in the case of the four-dimensional $\mathrm{SU}(3)$ Yang-Mills theory. After achieving full control of the scaling, we outline a clear strategy to sample topology efficiently in the continuum limit, which we check at lattice spacings as small as $0.045$ fm. We also generalize this approach by designing a customized Stochastic Normalizing Flow for evolutions in the boundary conditions, obtaining superior performances with respect to the purely stochastic non-equilibrium approach, and paving the way for more efficient future flow-based solutions.
comment: 1+39 pages, 14 figures
☆ Convolutional Spiking-based GRU Cell for Spatio-temporal Data
Spike-based temporal messaging enables SNNs to efficiently process both purely temporal and spatio-temporal time-series or event-driven data. Combining SNNs with Gated Recurrent Units (GRUs), a variant of recurrent neural networks, gives rise to a robust framework for sequential data processing; however, traditional RNNs often lose local details when handling long sequences. Previous approaches, such as SpikGRU, fail to capture fine-grained local dependencies in event-based spatio-temporal data. In this paper, we introduce the Convolutional Spiking GRU (CS-GRU) cell, which leverages convolutional operations to preserve local structure and dependencies while integrating the temporal precision of spiking neurons with the efficient gating mechanisms of GRUs. This versatile architecture excels on both temporal datasets (NTIDIGITS, SHD) and spatio-temporal benchmarks (MNIST, DVSGesture, CIFAR10DVS). Our experiments show that CS-GRU outperforms state-of-the-art GRU variants by an average of 4.35%, achieving over 90% accuracy on sequential tasks and up to 99.31% on MNIST. It is worth noting that our solution achieves 69% higher efficiency compared to SpikGRU. The code is available at: https://github.com/YesmineAbdennadher/CS-GRU.
comment: 6 pages, 1 figure. Published in 2025 IEEE International Workshop On Machine Learning for Signal Processing, Aug. 31-Sep. 3, 2025, Istanbul, Turkey
☆ PyDPF: A Python Package for Differentiable Particle Filtering
State-space models (SSMs) are a widely used tool in time series analysis. In the complex systems that arise from real-world data, it is common to employ particle filtering (PF), an efficient Monte Carlo method for estimating the hidden state corresponding to a sequence of observations. Applying particle filtering requires specifying both the parametric form and the parameters of the system, which are often unknown and must be estimated. Gradient-based optimisation techniques cannot be applied directly to standard particle filters, as the filters themselves are not differentiable. However, several recently proposed methods modify the resampling step to make particle filtering differentiable. In this paper, we present an implementation of several such differentiable particle filters (DPFs) with a unified API built on the popular PyTorch framework. Our implementation makes these algorithms easily accessible to a broader research community and facilitates straightforward comparison between them. We validate our framework by reproducing experiments from several existing studies and demonstrate how DPFs can be applied to address several common challenges with state space modelling.
comment: 42 pages, 0 figures, under review at the Journal of Statistical Software, the python package can be found at https://pypi.org/project/pydpf/ , the full documentation at https://python-dpf.readthedocs.io/en/latest/#documentation-index , and the source code including experiments and replication material at https://github.com/John-JoB/pydpf
☆ A Configuration-First Framework for Reproducible, Low-Code Localization
Machine learning is increasingly permeating radio-based localization services. To keep results credible and comparable, everyday workflows should make rigorous experiment specification and exact repeatability the default, without blocking advanced experimentation. However, in practice, researchers face a three-way gap that could be filled by a framework that offers (i) low coding effort for end-to-end studies, (ii) reproducibility by default including versioned code, data, and configurations, controlled randomness, isolated runs, and recorded artifacts, and (iii) built-in extensibility so new models, metrics, and stages can be added with minimal integration effort. Existing tools rarely deliver all three for machine learning in general and localization workflows in particular. In this paper we introduce LOCALIZE, a low-code, configuration-first framework for radio localization in which experiments are declared in human-readable configuration, a workflow orchestrator runs standardized pipelines from data preparation to reporting, and all artifacts, such as datasets, models, metrics, and reports, are versioned. The preconfigured, versioned datasets reduce initial setup and boilerplate, speeding up model development and evaluation. The design, with clear extension points, allows experts to add components without reworking the infrastructure. In a qualitative comparison and a head-to-head study against a plain Jupyter notebook baseline, we show that the framework reduces authoring effort while maintaining comparable runtime and memory behavior. Furthermore, using a Bluetooth Low Energy dataset, we show that scaling across training data (1x to 10x) keeps orchestration overheads bounded as data grows. Overall, the framework makes reproducible machine-learning-based localization experimentation practical, accessible, and extensible.
comment: 20 pages, 7 figures. Preprint submitted to ACM Transactions on Software Engineering and Methodology (TOSEM), 2025
☆ Model Inversion Attacks Meet Cryptographic Fuzzy Extractors
Model inversion attacks pose an open challenge to privacy-sensitive applications that use machine learning (ML) models. For example, face authentication systems use modern ML models to compute embedding vectors from face images of the enrolled users and store them. If leaked, inversion attacks can accurately reconstruct user faces from the leaked vectors. There is no systematic characterization of properties needed in an ideal defense against model inversion, even for the canonical example application of a face authentication system susceptible to data breaches, despite a decade of best-effort solutions. In this paper, we formalize the desired properties of a provably strong defense against model inversion and connect it, for the first time, to the cryptographic concept of fuzzy extractors. We further show that existing fuzzy extractors are insecure for use in ML-based face authentication. We do so through a new model inversion attack called PIPE, which achieves a success rate of over 89% in most cases against prior schemes. We then propose L2FE-Hash, the first candidate fuzzy extractor which supports standard Euclidean distance comparators as needed in many ML-based applications, including face authentication. We formally characterize its computational security guarantees, even in the extreme threat model of full breach of stored secrets, and empirically show its usable accuracy in face authentication for practical face distributions. It offers attack-agnostic security without requiring any re-training of the ML model it protects. Empirically, it nullifies both prior state-of-the-art inversion attacks as well as our new PIPE attack.
☆ Graph Network-based Structural Simulator: Graph Neural Networks for Structural Dynamics
Graph Neural Networks (GNNs) have recently been explored as surrogate models for numerical simulations. While their applications in computational fluid dynamics have been investigated, little attention has been given to structural problems, especially for dynamic cases. To address this gap, we introduce the Graph Network-based Structural Simulator (GNSS), a GNN framework for surrogate modeling of dynamic structural problems. GNSS follows the encode-process-decode paradigm typical of GNN-based machine learning models, and its design makes it particularly suited for dynamic simulations thanks to three key features: (i) expressing node kinematics in node-fixed local frames, which avoids catastrophic cancellation in finite-difference velocities; (ii) employing a sign-aware regression loss, which reduces phase errors in long rollouts; and (iii) using a wavelength-informed connectivity radius, which optimizes graph construction. We evaluate GNSS on a case study involving a beam excited by a 50kHz Hanning-modulated pulse. The results show that GNSS accurately reproduces the physics of the problem over hundreds of timesteps and generalizes to unseen loading conditions, where existing GNNs fail to converge or deliver meaningful predictions. Compared with explicit finite element baselines, GNSS achieves substantial inference speedups while preserving spatial and temporal fidelity. These findings demonstrate that locality-preserving GNNs with physics-consistent update rules are a competitive alternative for dynamic, wave-dominated structural simulations.
comment: 16 pages, 14 figures
☆ Mechanistic Interpretability of RNNs emulating Hidden Markov Models NeurIPS 2025
Recurrent neural networks (RNNs) provide a powerful approach in neuroscience to infer latent dynamics in neural populations and to generate hypotheses about the neural computations underlying behavior. However, past work has focused on relatively simple, input-driven, and largely deterministic behaviors - little is known about the mechanisms that would allow RNNs to generate the richer, spontaneous, and potentially stochastic behaviors observed in natural settings. Modeling with Hidden Markov Models (HMMs) has revealed a segmentation of natural behaviors into discrete latent states with stochastic transitions between them, a type of dynamics that may appear at odds with the continuous state spaces implemented by RNNs. Here we first show that RNNs can replicate HMM emission statistics and then reverse-engineer the trained networks to uncover the mechanisms they implement. In the absence of inputs, the activity of trained RNNs collapses towards a single fixed point. When driven by stochastic input, trajectories instead exhibit noise-sustained dynamics along closed orbits. Rotation along these orbits modulates the emission probabilities and is governed by transitions between regions of slow, noise-driven dynamics connected by fast, deterministic transitions. The trained RNNs develop highly structured connectivity, with a small set of "kick neurons" initiating transitions between these regions. This mechanism emerges during training as the network shifts into a regime of stochastic resonance, enabling it to perform probabilistic computations. Analyses across multiple HMM architectures - fully connected, cyclic, and linear-chain - reveal that this solution generalizes through the modular reuse of the same dynamical motif, suggesting a compositional principle by which RNNs can emulate complex discrete latent dynamics.
comment: Accepted at NeurIPS 2025
☆ Spectral Perturbation Bounds for Low-Rank Approximation with Applications to Privacy NeurIPS 2025
A central challenge in machine learning is to understand how noise or measurement errors affect low-rank approximations, particularly in the spectral norm. This question is especially important in differentially private low-rank approximation, where one aims to preserve the top-$p$ structure of a data-derived matrix while ensuring privacy. Prior work often analyzes Frobenius norm error or changes in reconstruction quality, but these metrics can over- or under-estimate true subspace distortion. The spectral norm, by contrast, captures worst-case directional error and provides the strongest utility guarantees. We establish new high-probability spectral-norm perturbation bounds for symmetric matrices that refine the classical Eckart--Young--Mirsky theorem and explicitly capture interactions between a matrix $A \in \mathbb{R}^{n \times n}$ and an arbitrary symmetric perturbation $E$. Under mild eigengap and norm conditions, our bounds yield sharp estimates for $\|(A + E)_p - A_p\|$, where $A_p$ is the best rank-$p$ approximation of $A$, with improvements of up to a factor of $\sqrt{n}$. As an application, we derive improved utility guarantees for differentially private PCA, resolving an open problem in the literature. Our analysis relies on a novel contour bootstrapping method from complex analysis and extends it to a broad class of spectral functionals, including polynomials and matrix exponentials. Empirical results on real-world datasets confirm that our bounds closely track the actual spectral error under diverse perturbation regimes.
comment: NeurIPS 2025
☆ Subgraph Federated Learning via Spectral Methods NeurIPS
We consider the problem of federated learning (FL) with graph-structured data distributed across multiple clients. In particular, we address the prevalent scenario of interconnected subgraphs, where interconnections between clients significantly influence the learning process. Existing approaches suffer from critical limitations, either requiring the exchange of sensitive node embeddings, thereby posing privacy risks, or relying on computationally-intensive steps, which hinders scalability. To tackle these challenges, we propose FedLap, a novel framework that leverages global structure information via Laplacian smoothing in the spectral domain to effectively capture inter-node dependencies while ensuring privacy and scalability. We provide a formal analysis of the privacy of FedLap, demonstrating that it preserves privacy. Notably, FedLap is the first subgraph FL scheme with strong privacy guarantees. Extensive experiments on benchmark datasets demonstrate that FedLap achieves competitive or superior utility compared to existing techniques.
comment: To be presented at The Annual Conference on Neural Information Processing Systems (NeurIPS) 2025
☆ Continuous subsurface property retrieval from sparse radar observations using physics informed neural networks
Estimating subsurface dielectric properties is essential for applications ranging from environmental surveys of soils to nondestructive evaluation of concrete in infrastructure. Conventional wave inversion methods typically assume few discrete homogeneous layers and require dense measurements or strong prior knowledge of material boundaries, limiting scalability and accuracy in realistic settings where properties vary continuously. We present a physics informed machine learning framework that reconstructs subsurface permittivity as a fully neural, continuous function of depth, trained to satisfy both measurement data and Maxwells equations. We validate the framework with both simulations and custom built radar experiments on multilayered natural materials. Results show close agreement with in-situ permittivity measurements (R^2=0.93), with sensitivity to even subtle variations (Delta eps_r=2). Parametric analysis reveals that accurate profiles can be recovered with as few as three strategically placed sensors in two layer systems. This approach reframes subsurface inversion from boundary-driven to continuous property estimation, enabling accurate characterization of smooth permittivity variations and advancing electromagnetic imaging using low cost radar systems.
comment: 22 pages, 9 main text figures + 2 supplementary figures
☆ Are Language Models Efficient Reasoners? A Perspective from Logic Programming NeurIPS 2025
Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language -- as generated by an LM -- with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with various number of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions -- even with minimal, domain-consistent distractions -- and the proofs they generate frequently exhibit detours through irrelevant inferences.
comment: Accepted to NeurIPS 2025
☆ Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA's hidden representations and analyze attention maps, further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
comment: 13 pages, 6 figures
☆ BOLT-GAN: Bayes-Optimal Loss for Stable GAN Training
We introduce BOLT-GAN, a simple yet effective modification of the WGAN framework inspired by the Bayes Optimal Learning Threshold (BOLT). We show that with a Lipschitz continuous discriminator, BOLT-GAN implicitly minimizes a different metric distance than the Earth Mover (Wasserstein) distance and achieves better training stability. Empirical evaluations on four standard image generation benchmarks (CIFAR-10, CelebA-64, LSUN Bedroom-64, and LSUN Church-64) show that BOLT-GAN consistently outperforms WGAN, achieving 10-60% lower Frechet Inception Distance (FID). Our results suggest that BOLT is a broadly applicable principle for enhancing GAN training.
☆ INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage , though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
☆ Uncertainty Quantification for Regression: A Unified Framework based on kernel scores
Regression tasks, notably in safety-critical domains, require proper uncertainty quantification, yet the literature remains largely classification-focused. In this light, we introduce a family of measures for total, aleatoric, and epistemic uncertainty based on proper scoring rules, with a particular emphasis on kernel scores. The framework unifies several well-known measures and provides a principled recipe for designing new ones whose behavior, such as tail sensitivity, robustness, and out-of-distribution responsiveness, is governed by the choice of kernel. We prove explicit correspondences between kernel-score characteristics and downstream behavior, yielding concrete design guidelines for task-specific measures. Extensive experiments demonstrate that these measures are effective in downstream tasks and reveal clear trade-offs among instantiations, including robustness and out-of-distribution detection performance.
☆ Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning
Training deep neural networks (DNNs) with backpropagation (BP) achieves state-of-the-art accuracy but requires global error propagation and full parameterization, leading to substantial memory and computational overhead. Direct Feedback Alignment (DFA) enables local, parallelizable updates with lower memory requirements but is limited by unstructured feedback and poor scalability in deeper architectures, specially convolutional neural networks. To address these limitations, we propose a structured local learning framework that operates directly on low-rank manifolds defined by the Singular Value Decomposition (SVD) of weight matrices. Each layer is trained in its decomposed form, with updates applied to the SVD components using a composite loss that integrates cross-entropy, subspace alignment, and orthogonality regularization. Feedback matrices are constructed to match the SVD structure, ensuring consistent alignment between forward and feedback pathways. Our method reduces the number of trainable parameters relative to the original DFA model, without relying on pruning or post hoc compression. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method achieves accuracy comparable to that of BP. Ablation studies confirm the importance of each loss term in the low-rank setting. These results establish local learning on low-rank manifolds as a principled and scalable alternative to full-rank gradient-based training.
☆ Generalized Sobolev IPM for Graph-Based Measures
We study the Sobolev IPM problem for measures supported on a graph metric space, where critic function is constrained to lie within the unit ball defined by Sobolev norm. While Le et al. (2025) achieved scalable computation by relating Sobolev norm to weighted $L^p$-norm, the resulting framework remains intrinsically bound to $L^p$ geometric structure, limiting its ability to incorporate alternative structural priors beyond the $L^p$ geometry paradigm. To overcome this limitation, we propose to generalize Sobolev IPM through the lens of \emph{Orlicz geometric structure}, which employs convex functions to capture nuanced geometric relationships, building upon recent advances in optimal transport theory -- particularly Orlicz-Wasserstein (OW) and generalized Sobolev transport -- that have proven instrumental in advancing machine learning methodologies. This generalization encompasses classical Sobolev IPM as a special case while accommodating diverse geometric priors beyond traditional $L^p$ structure. It however brings up significant computational hurdles that compound those already inherent in Sobolev IPM. To address these challenges, we establish a novel theoretical connection between Orlicz-Sobolev norm and Musielak norm which facilitates a novel regularization for the generalized Sobolev IPM (GSI). By further exploiting the underlying graph structure, we show that GSI with Musielak regularization (GSI-M) reduces to a simple \emph{univariate optimization} problem, achieving remarkably computational efficiency. Empirically, GSI-M is several-order faster than the popular OW in computation, and demonstrates its practical advantages in comparing probability measures on a given graph for document classification and several tasks in topological data analysis.
☆ Learning-Augmented Online Bidding in Stochastic Settings
Online bidding is a classic optimization problem, with several applications in online decision-making, the design of interruptible systems, and the analysis of approximation algorithms. In this work, we study online bidding under learning-augmented settings that incorporate stochasticity, in either the prediction oracle or the algorithm itself. In the first part, we study bidding under distributional predictions, and find Pareto-optimal algorithms that offer the best-possible tradeoff between the consistency and the robustness of the algorithm. In the second part, we study the power and limitations of randomized bidding algorithms, by presenting upper and lower bounds on the consistency/robustness tradeoffs. Previous works focused predominantly on oracles that do not leverage stochastic information on the quality of the prediction, and deterministic algorithms.
☆ Monitoring the calibration of probability forecasts with an application to concept drift detection involving image classification
Machine learning approaches for image classification have led to impressive advances in that field. For example, convolutional neural networks are able to achieve remarkable image classification accuracy across a wide range of applications in industry, defense, and other areas. While these machine learning models boast impressive accuracy, a related concern is how to assess and maintain calibration in the predictions these models make. A classification model is said to be well calibrated if its predicted probabilities correspond with the rates events actually occur. While there are many available methods to assess machine learning calibration and recalibrate faulty predictions, less effort has been spent on developing approaches that continually monitor predictive models for potential loss of calibration as time passes. We propose a cumulative sum-based approach with dynamic limits that enable detection of miscalibration in both traditional process monitoring and concept drift applications. This enables early detection of operational context changes that impact image classification performance in the field. The proposed chart can be used broadly in any situation where the user needs to monitor probability predictions over time for potential lapses in calibration. Importantly, our method operates on probability predictions and event outcomes and does not require under-the-hood access to the machine learning model.
☆ Perturbation Bounds for Low-Rank Inverse Approximations under Noise NeurIPS 2025
Low-rank pseudoinverses are widely used to approximate matrix inverses in scalable machine learning, optimization, and scientific computing. However, real-world matrices are often observed with noise, arising from sampling, sketching, and quantization. The spectral-norm robustness of low-rank inverse approximations remains poorly understood. We systematically study the spectral-norm error $\| (\tilde{A}^{-1})_p - A_p^{-1} \|$ for an $n\times n$ symmetric matrix $A$, where $A_p^{-1}$ denotes the best rank-\(p\) approximation of $A^{-1}$, and $\tilde{A} = A + E$ is a noisy observation. Under mild assumptions on the noise, we derive sharp non-asymptotic perturbation bounds that reveal how the error scales with the eigengap, spectral decay, and noise alignment with low-curvature directions of $A$. Our analysis introduces a novel application of contour integral techniques to the \emph{non-entire} function $f(z) = 1/z$, yielding bounds that improve over naive adaptations of classical full-inverse bounds by up to a factor of $\sqrt{n}$. Empirically, our bounds closely track the true perturbation error across a variety of real-world and synthetic matrices, while estimates based on classical results tend to significantly overpredict. These findings offer practical, spectrum-aware guarantees for low-rank inverse approximations in noisy computational environments.
comment: NeurIPS 2025
☆ A Framework for Bounding Deterministic Risk with PAC-Bayes: Applications to Majority Votes
PAC-Bayes is a popular and efficient framework for obtaining generalization guarantees in situations involving uncountable hypothesis spaces. Unfortunately, in its classical formulation, it only provides guarantees on the expected risk of a randomly sampled hypothesis. This requires stochastic predictions at test time, making PAC-Bayes unusable in many practical situations where a single deterministic hypothesis must be deployed. We propose a unified framework to extract guarantees holding for a single hypothesis from stochastic PAC-Bayesian guarantees. We present a general oracle bound and derive from it a numerical bound and a specialization to majority vote. We empirically show that our approach consistently outperforms popular baselines (by up to a factor of 2) when it comes to generalization bounds on deterministic classifiers.
☆ PitchFlower: A flow-based neural audio codec with pitch controllability
We present PitchFlower, a flow-based neural audio codec with explicit pitch controllability. Our approach enforces disentanglement through a simple perturbation: during training, F0 contours are flattened and randomly shifted, while the true F0 is provided as conditioning. A vector-quantization bottleneck prevents pitch recovery, and a flow-based decoder generates high quality audio. Experiments show that PitchFlower achieves more accurate pitch control than WORLD at much higher audio quality, and outperforms SiFiGAN in controllability while maintaining comparable quality. Beyond pitch, this framework provides a simple and extensible path toward disentangling other speech attributes.
comment: 5 pages, 5 figures
☆ Leveraging an Atmospheric Foundational Model for Subregional Sea Surface Temperature Forecasting
The accurate prediction of oceanographic variables is crucial for understanding climate change, managing marine resources, and optimizing maritime activities. Traditional ocean forecasting relies on numerical models; however, these approaches face limitations in terms of computational cost and scalability. In this study, we adapt Aurora, a foundational deep learning model originally designed for atmospheric forecasting, to predict sea surface temperature (SST) in the Canary Upwelling System. By fine-tuning this model with high-resolution oceanographic reanalysis data, we demonstrate its ability to capture complex spatiotemporal patterns while reducing computational demands. Our methodology involves a staged fine-tuning process, incorporating latitude-weighted error metrics and optimizing hyperparameters for efficient learning. The experimental results show that the model achieves a low RMSE of 0.119K, maintaining high anomaly correlation coefficients (ACC $\approx 0.997$). The model successfully reproduces large-scale SST structures but faces challenges in capturing finer details in coastal regions. This work contributes to the field of data-driven ocean forecasting by demonstrating the feasibility of using deep learning models pre-trained in different domains for oceanic applications. Future improvements include integrating additional oceanographic variables, increasing spatial resolution, and exploring physics-informed neural networks to enhance interpretability and understanding. These advancements can improve climate modeling and ocean prediction accuracy, supporting decision-making in environmental and economic sectors.
comment: 18 pages, 9 figures
☆ Hybrid Quantum-Classical Recurrent Neural Networks
We present a hybrid quantum-classical recurrent neural network (QRNN) architecture in which the entire recurrent core is realized as a parametrized quantum circuit (PQC) controlled by a classical feedforward network. The hidden state is the quantum state of an $n$-qubit PQC, residing in an exponentially large Hilbert space $\mathbb{C}^{2^n}$. The PQC is unitary by construction, making the hidden-state evolution norm-preserving without external constraints. At each timestep, mid-circuit readouts are combined with the input embedding and processed by the feedforward network, which provides explicit classical nonlinearity. The outputs parametrize the PQC, which updates the hidden state via unitary dynamics. The QRNN is compact and physically consistent, and it unifies (i) unitary recurrence as a high-capacity memory, (ii) partial observation via mid-circuit measurements, and (iii) nonlinear classical control for input-conditioned parametrization. We evaluate the model in simulation with up to 14 qubits on sentiment analysis, MNIST, permuted MNIST, copying memory, and language modeling, adopting projective measurements as a limiting case to obtain mid-circuit readouts while maintaining a coherent recurrent quantum memory. We further devise a soft attention mechanism over the mid-circuit readouts in a sequence-to-sequence model and show its effectiveness for machine translation. To our knowledge, this is the first model (RNN or otherwise) grounded in quantum operations to achieve competitive performance against strong classical baselines across a broad class of sequence-learning tasks.
☆ Robust variable selection for spatial point processes observed with noise
We propose a method for variable selection in the intensity function of spatial point processes that combines sparsity-promoting estimation with noise-robust model selection. As high-resolution spatial data becomes increasingly available through remote sensing and automated image analysis, identifying spatial covariates that influence the localization of events is crucial to understand the underlying mechanism. However, results from automated acquisition techniques are often noisy, for example due to measurement uncertainties or detection errors, which leads to spurious displacements and missed events. We study the impact of such noise on sparse point-process estimation across different models, including Poisson and Thomas processes. To improve noise robustness, we propose to use stability selection based on point-process subsampling and to incorporate a non-convex best-subset penalty to enhance model-selection performance. In extensive simulations, we demonstrate that such an approach reliably recovers true covariates under diverse noise scenarios and improves both selection accuracy and stability. We then apply the proposed method to a forestry data set, analyzing the distribution of trees in relation to elevation and soil nutrients in a tropical rain forest. This shows the practical utility of the method, which provides a systematic framework for robust variable selection in spatial point-process models under noise, without requiring additional knowledge of the process.
☆ Error Bounds and Optimal Schedules for Masked Diffusions with Factorized Approximations
Recently proposed generative models for discrete data, such as Masked Diffusion Models (MDMs), exploit conditional independence approximations to reduce the computational cost of popular Auto-Regressive Models (ARMs), at the price of some bias in the sampling distribution. We study the resulting computation-vs-accuracy trade-off, providing general error bounds (in relative entropy) that depend only on the average number of tokens generated per iteration and are independent of the data dimensionality (i.e. sequence length), thus supporting the empirical success of MDMs. We then investigate the gain obtained by using non-constant schedule sizes (i.e. varying the number of unmasked tokens during the generation process) and identify the optimal schedule as a function of a so-called information profile of the data distribution, thus allowing for a principled optimization of schedule sizes. We define methods directly as sampling algorithms and do not use classical derivations as time-reversed diffusion processes, leading us to simple and transparent proofs.
Transformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information
Uncovering hidden graph structures underlying real-world data is a critical challenge with broad applications across scientific domains. Recently, transformer-based models leveraging the attention mechanism have demonstrated strong empirical success in capturing complex dependencies within graphs. However, the theoretical understanding of their training dynamics has been limited to tree-like graphs, where each node depends on a single parent. Extending provable guarantees to more general directed acyclic graphs (DAGs) -- which involve multiple parents per node -- remains challenging, primarily due to the difficulty in designing training objectives that enable different attention heads to separately learn multiple different parent relationships. In this work, we address this problem by introducing a novel information-theoretic metric: the kernel-guided mutual information (KG-MI), based on the $f$-divergence. Our objective combines KG-MI with a multi-head attention framework, where each head is associated with a distinct marginal transition kernel to model diverse parent-child dependencies effectively. We prove that, given sequences generated by a $K$-parent DAG, training a single-layer, multi-head transformer via gradient ascent converges to the global optimum in polynomial time. Furthermore, we characterize the attention score patterns at convergence. In addition, when particularizing the $f$-divergence to the KL divergence, the learned attention scores accurately reflect the ground-truth adjacency matrix, thereby provably recovering the underlying graph structure. Experimental results validate our theoretical findings.
☆ Using latent representations to link disjoint longitudinal data for mixed-effects regression
Many rare diseases offer limited established treatment options, leading patients to switch therapies when new medications emerge. To analyze the impact of such treatment switches within the low sample size limitations of rare disease trials, it is important to use all available data sources. This, however, is complicated when usage of measurement instruments change during the observation period, for example when instruments are adapted to specific age ranges. The resulting disjoint longitudinal data trajectories, complicate the application of traditional modeling approaches like mixed-effects regression. We tackle this by mapping observations of each instrument to a aligned low-dimensional temporal trajectory, enabling longitudinal modeling across instruments. Specifically, we employ a set of variational autoencoder architectures to embed item values into a shared latent space for each time point. Temporal disease dynamics and treatment switch effects are then captured through a mixed-effects regression model applied to latent representations. To enable statistical inference, we present a novel statistical testing approach that accounts for the joint parameter estimation of mixed-effects regression and variational autoencoders. The methodology is applied to quantify the impact of treatment switches for patients with spinal muscular atrophy. Here, our approach aligns motor performance items from different measurement instruments for mixed-effects regression and maps estimated effects back to the observed item level to quantify the treatment switch effect. Our approach allows for model selection as well as for assessing effects of treatment switching. The results highlight the potential of modeling in joint latent representations for addressing small data challenges.
comment: 31 pages, 3 figures, 3 tables
☆ Convergence of off-policy TD(0) with linear function approximation for reversible Markov chains
We study the convergence of off-policy TD(0) with linear function approximation when used to approximate the expected discounted reward in a Markov chain. It is well known that the combination of off-policy learning and function approximation can lead to divergence of the algorithm. Existing results for this setting modify the algorithm, for instance by reweighing the updates using importance sampling. This establishes convergence at the expense of additional complexity. In contrast, our approach is to analyse the standard algorithm, but to restrict our attention to the class of reversible Markov chains. We demonstrate convergence under this mild reversibility condition on the structure of the chain, which in many applications can be assumed using domain knowledge. In particular, we establish a convergence guarantee under an upper bound on the discount factor in terms of the difference between the on-policy and off-policy process. This improves upon known results in the literature that state that convergence holds for a sufficiently small discount factor by establishing an explicit bound. Convergence is with probability one and achieves projected Bellman error equal to zero. To obtain these results, we adapt the stochastic approximation framework that was used by Tsitsiklis and Van Roy [1997 for the on-policy case, to the off-policy case. We illustrate our results using different types of reversible Markov chains, such as one-dimensional random walks and random walks on a weighted graph.
☆ FaCT: Faithful Concept Traces for Explaining Neural Network Decisions NeurIPS 2025
Deep networks have shown remarkable performance across a wide range of tasks, yet getting a global concept-level understanding of how they function remains a key challenge. Many post-hoc concept-based approaches have been introduced to understand their workings, yet they are not always faithful to the model. Further, they make restrictive assumptions on the concepts a model learns, such as class-specificity, small spatial extent, or alignment to human expectations. In this work, we put emphasis on the faithfulness of such concept-based explanations and propose a new model with model-inherent mechanistic concept-explanations. Our concepts are shared across classes and, from any layer, their contribution to the logit and their input-visualization can be faithfully traced. We also leverage foundation models to propose a new concept-consistency metric, C$^2$-Score, that can be used to evaluate concept-based methods. We show that, compared to prior work, our concepts are quantitatively more consistent and users find our concepts to be more interpretable, all while retaining competitive ImageNet performance.
comment: Accepted to NeurIPS 2025; Code is available at https://github.com/m-parchami/FaCT
☆ Support Vector Machine-Based Burnout Risk Prediction with an Interactive Interface for Organizational Use
Burnout is a psychological syndrome marked by emotional exhaustion, depersonalization, and reduced personal accomplishment, with a significant impact on individual well-being and organizational performance. This study proposes a machine learning approach to predict burnout risk using the HackerEarth Employee Burnout Challenge dataset. Three supervised algorithms were evaluated: nearest neighbors (KNN), random forest, and support vector machine (SVM), with model performance evaluated through 30-fold cross-validation using the determination coefficient (R2). Among the models tested, SVM achieved the highest predictive performance (R2 = 0.84) and was statistically superior to KNN and Random Forest based on paired $t$-tests. To ensure practical applicability, an interactive interface was developed using Streamlit, allowing non-technical users to input data and receive burnout risk predictions. The results highlight the potential of machine learning to support early detection of burnout and promote data-driven mental health strategies in organizational settings.
comment: 12 pages, including figures and references. Streamlit app available at: https://employee-burnout-svm.streamlit.app/
☆ TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting
Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on challenging benchmarks. This paper presents TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, eliminating the need for windowing or summarization techniques while maintaining robust temporal state-tracking. Our comprehensive synthetic data pipeline unifies diverse generators, including stochastic differential equations, Gaussian processes, and audio synthesis, with novel augmentations. In zero-shot evaluations on the Gift-Eval benchmark, TempoPFN achieves top-tier competitive performance, outperforming all existing synthetic-only approaches and surpassing the vast majority of models trained on real-world data, while being more efficient than existing baselines by leveraging fully parallelizable training and inference. We open-source our complete data generation pipeline and training code, providing a reproducible foundation for future research.
comment: 30 pages, 18 figures, 13 tables
☆ Right for the Right Reasons: Avoiding Reasoning Shortcuts via Prototypical Neurosymbolic AI
Neurosymbolic AI is growing in popularity thanks to its ability to combine neural perception and symbolic reasoning in end-to-end trainable models. However, recent findings reveal these are prone to shortcut reasoning, i.e., to learning unindented concepts--or neural predicates--which exploit spurious correlations to satisfy the symbolic constraints. In this paper, we address reasoning shortcuts at their root cause and we introduce prototypical neurosymbolic architectures. These models are able to satisfy the symbolic constraints (be right) because they have learnt the correct basic concepts (for the right reasons) and not because of spurious correlations, even in extremely low data regimes. Leveraging the theory of prototypical learning, we demonstrate that we can effectively avoid reasoning shortcuts by training the models to satisfy the background knowledge while taking into account the similarity of the input with respect to the handful of labelled datapoints. We extensively validate our approach on the recently proposed rsbench benchmark suite in a variety of settings and tasks with very scarce supervision: we show significant improvements in learning the right concepts both in synthetic tasks (MNIST-EvenOdd and Kand-Logic) and real-world, high-stake ones (BDD-OIA). Our findings pave the way to prototype grounding as an effective, annotation-efficient strategy for safe and reliable neurosymbolic learning.
☆ Gradient-Weight Alignment as a Train-Time Proxy for Generalization in Classification Tasks NeurIPS 2025
Robust validation metrics remain essential in contemporary deep learning, not only to detect overfitting and poor generalization, but also to monitor training dynamics. In the supervised classification setting, we investigate whether interactions between training data and model weights can yield such a metric that both tracks generalization during training and attributes performance to individual training samples. We introduce Gradient-Weight Alignment (GWA), quantifying the coherence between per-sample gradients and model weights. We show that effective learning corresponds to coherent alignment, while misalignment indicates deteriorating generalization. GWA is efficiently computable during training and reflects both sample-specific contributions and dataset-wide learning dynamics. Extensive experiments show that GWA accurately predicts optimal early stopping, enables principled model comparisons, and identifies influential training samples, providing a validation-set-free approach for model analysis directly from the training data.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ An In-Depth Analysis of Cyber Attacks in Secured Platforms
There is an increase in global malware threats. To address this, an encryption-type ransomware has been introduced on the Android operating system. The challenges associated with malicious threats in phone use have become a pressing issue in mobile communication, disrupting user experiences and posing significant privacy threats. This study surveys commonly used machine learning techniques for detecting malicious threats in phones and examines their performance. The majority of past research focuses on customer feedback and reviews, with concerns that people might create false reviews to promote or devalue products and services for personal gain. Hence, the development of techniques for detecting malicious threats using machine learning has been a key focus. This paper presents a comprehensive comparative study of current research on the issue of malicious threats and methods for tackling these challenges. Nevertheless, a huge amount of information is required by these methods, presenting a challenge for developing robust, specialized automated anti-malware systems. This research describes the Android Applications dataset, and the accuracy of the techniques is measured using the accuracy levels of the metrics employed in this study.
☆ Scalable Utility-Aware Multiclass Calibration
Ensuring that classifiers are well-calibrated, i.e., their predictions align with observed frequencies, is a minimal and fundamental requirement for classifiers to be viewed as trustworthy. Existing methods for assessing multiclass calibration often focus on specific aspects associated with prediction (e.g., top-class confidence, class-wise calibration) or utilize computationally challenging variational formulations. In this work, we study scalable \emph{evaluation} of multiclass calibration. To this end, we propose utility calibration, a general framework that measures the calibration error relative to a specific utility function that encapsulates the goals or decision criteria relevant to the end user. We demonstrate how this framework can unify and re-interpret several existing calibration metrics, particularly allowing for more robust versions of the top-class and class-wise calibration metrics, and, going beyond such binarized approaches, toward assessing calibration for richer classes of downstream utilities.
☆ Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions
Agentic AI represents a transformative shift in artificial intelligence, but its rapid advancement has led to a fragmented understanding, often conflating modern neural systems with outdated symbolic models -- a practice known as conceptual retrofitting. This survey cuts through this confusion by introducing a novel dual-paradigm framework that categorizes agentic systems into two distinct lineages: the Symbolic/Classical (relying on algorithmic planning and persistent state) and the Neural/Generative (leveraging stochastic generation and prompt-driven orchestration). Through a systematic PRISMA-based review of 90 studies (2018--2025), we provide a comprehensive analysis structured around this framework across three dimensions: (1) the theoretical foundations and architectural principles defining each paradigm; (2) domain-specific implementations in healthcare, finance, and robotics, demonstrating how application constraints dictate paradigm selection; and (3) paradigm-specific ethical and governance challenges, revealing divergent risks and mitigation strategies. Our analysis reveals that the choice of paradigm is strategic: symbolic systems dominate safety-critical domains (e.g., healthcare), while neural systems prevail in adaptive, data-rich environments (e.g., finance). Furthermore, we identify critical research gaps, including a significant deficit in governance models for symbolic systems and a pressing need for hybrid neuro-symbolic architectures. The findings culminate in a strategic roadmap arguing that the future of Agentic AI lies not in the dominance of one paradigm, but in their intentional integration to create systems that are both adaptable and reliable. This work provides the essential conceptual toolkit to guide future research, development, and policy toward robust and trustworthy hybrid intelligent systems.
☆ GPTOpt: Towards Efficient LLM-Based Black-Box Optimization
Global optimization of expensive, derivative-free black-box functions demands extreme sample efficiency. Classical methods such as Bayesian Optimization (BO) can be effective, but they often require careful parameter tuning to each application domain. At the same time, Large Language Models (LLMs) have shown broad capabilities, yet state-of-the-art models remain limited in solving continuous black-box optimization tasks. We introduce GPTOpt, an LLM-based optimization method that equips LLMs with continuous black-box optimization capabilities. By fine-tuning large language models on extensive synthetic datasets derived from diverse BO parameterizations, GPTOpt leverages LLM pre-training to generalize across optimization tasks. On a variety of black-box optimization benchmarks, GPTOpt surpasses traditional optimizers, highlighting the capacity of LLMs for advanced numerical reasoning and introducing a flexible framework for global optimization without parameter tuning.
☆ A Deep Learning Framework for Multi-Operator Learning: Architectures and Approximation Theory
While many problems in machine learning focus on learning mappings between finite-dimensional spaces, scientific applications require approximating mappings between function spaces, i.e., operators. We study the problem of learning collections of operators and provide both theoretical and empirical advances. We distinguish between two regimes: (i) multiple operator learning, where a single network represents a continuum of operators parameterized by a parametric function, and (ii) learning several distinct single operators, where each operator is learned independently. For the multiple operator case, we introduce two new architectures, $\mathrm{MNO}$ and $\mathrm{MONet}$, and establish universal approximation results in three settings: continuous, integrable, or Lipschitz operators. For the latter, we further derive explicit scaling laws that quantify how the network size must grow to achieve a target approximation accuracy. For learning several single operators, we develop a framework for balancing architectural complexity across subnetworks and show how approximation order determines computational efficiency. Empirical experiments on parametric PDE benchmarks confirm the strong expressive power and efficiency of the proposed architectures. Overall, this work establishes a unified theoretical and practical foundation for scalable neural operator learning across multiple operators.
Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers
Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP) - based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via traditional federated averaging technique on the same. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.
☆ Position: Biology is the Challenge Physics-Informed ML Needs to Evolve
Physics-Informed Machine Learning (PIML) has successfully integrated mechanistic understanding into machine learning, particularly in domains governed by well-known physical laws. This success has motivated efforts to apply PIML to biology, a field rich in dynamical systems but shaped by different constraints. Biological modeling, however, presents unique challenges: multi-faceted and uncertain prior knowledge, heterogeneous and noisy data, partial observability, and complex, high-dimensional networks. In this position paper, we argue that these challenges should not be seen as obstacles to PIML, but as catalysts for its evolution. We propose Biology-Informed Machine Learning (BIML): a principled extension of PIML that retains its structural grounding while adapting to the practical realities of biology. Rather than replacing PIML, BIML retools its methods to operate under softer, probabilistic forms of prior knowledge. We outline four foundational pillars as a roadmap for this transition: uncertainty quantification, contextualization, constrained latent structure inference, and scalability. Foundation Models and Large Language Models will be key enablers, bridging human expertise with computational modeling. We conclude with concrete recommendations to build the BIML ecosystem and channel PIML-inspired innovation toward challenges of high scientific and societal relevance.
☆ A Convexity-dependent Two-Phase Training Algorithm for Deep Neural Networks
The key task of machine learning is to minimize the loss function that measures the model fit to the training data. The numerical methods to do this efficiently depend on the properties of the loss function. The most decisive among these properties is the convexity or non-convexity of the loss function. The fact that the loss function can have, and frequently has, non-convex regions has led to a widespread commitment to non-convex methods such as Adam. However, a local minimum implies that, in some environment around it, the function is convex. In this environment, second-order minimizing methods such as the Conjugate Gradient (CG) give a guaranteed superlinear convergence. We propose a novel framework grounded in the hypothesis that loss functions in real-world tasks swap from initial non-convexity to convexity towards the optimum. This is a property we leverage to design an innovative two-phase optimization algorithm. The presented algorithm detects the swap point by observing the gradient norm dependence on the loss. In these regions, non-convex (Adam) and convex (CG) algorithms are used, respectively. Computing experiments confirm the hypothesis that this simple convexity structure is frequent enough to be practically exploited to substantially improve convergence and accuracy.
comment: Appeared on KDIR IC3K Conference 2025 (Best Paper Award)
☆ Parameter Averaging in Link Prediction
Ensemble methods are widely employed to improve generalization in machine learning. This has also prompted the adoption of ensemble learning for the knowledge graph embedding (KGE) models in performing link prediction. Typical approaches to this end train multiple models as part of the ensemble, and the diverse predictions are then averaged. However, this approach has some significant drawbacks. For instance, the computational overhead of training multiple models increases latency and memory overhead. In contrast, model merging approaches offer a promising alternative that does not require training multiple models. In this work, we introduce model merging, specifically weighted averaging, in KGE models. Herein, a running average of model parameters from a training epoch onward is maintained and used for predictions. To address this, we additionally propose an approach that selectively updates the running average of the ensemble model parameters only when the generalization performance improves on a validation dataset. We evaluate these two different weighted averaging approaches on link prediction tasks, comparing the state-of-the-art benchmark ensemble approach. Additionally, we evaluate the weighted averaging approach considering literal-augmented KGE models and multi-hop query answering tasks as well. The results demonstrate that the proposed weighted averaging approach consistently improves performance across diverse evaluation settings.
☆ Analysis of Semi-Supervised Learning on Hypergraphs
Hypergraphs provide a natural framework for modeling higher-order interactions, yet their theoretical underpinnings in semi-supervised learning remain limited. We provide an asymptotic consistency analysis of variational learning on random geometric hypergraphs, precisely characterizing the conditions ensuring the well-posedness of hypergraph learning as well as showing convergence to a weighted $p$-Laplacian equation. Motivated by this, we propose Higher-Order Hypergraph Learning (HOHL), which regularizes via powers of Laplacians from skeleton graphs for multiscale smoothness. HOHL converges to a higher-order Sobolev seminorm. Empirically, it performs strongly on standard baselines.
♻ ☆ 3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency
AI inference scaling is often tuned through 1D heuristics (a fixed reasoning passes) or 2D bivariate trade-offs (e.g., performance vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraints-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods to address the 3D multi-objective optimization (MOO) problem. Framing inference scaling in MOO shapes a feasible space that 1D and 2D optimizations fail to capture, enabling environmentadaptive selection of the inference scaling k. Results show that knee-point optimization achieves the best balance, while accuracy-maximization remains favorable when precision is prioritized. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational contexts.
♻ ☆ SpecCLIP: Aligning and Translating Spectroscopic Measurements for Stars
In recent years, large language models (LLMs) have transformed natural language understanding through vast datasets and large-scale parameterization. Inspired by this success, we present SpecCLIP, a foundation model framework that extends LLM-inspired methodologies to stellar spectral analysis. Stellar spectra, akin to structured language, encode rich physical and chemical information about stars. By training foundation models on large-scale spectral datasets, our goal is to learn robust and informative embeddings that support diverse downstream applications. As a proof of concept, SpecCLIP involves pre-training on two spectral types--LAMOST low-resolution and Gaia XP--followed by contrastive alignment using the CLIP (Contrastive Language-Image Pre-training) framework, adapted to associate spectra from different instruments. This alignment is complemented by auxiliary decoders that preserve spectrum-specific information and enable translation (prediction) between spectral types, with the former achieved by maximizing mutual information between embeddings and input spectra. The result is a cross-spectrum framework enabling intrinsic calibration and flexible applications across instruments. We demonstrate that fine-tuning these models on moderate-sized labeled datasets improves adaptability to tasks such as stellar-parameter estimation and chemical-abundance determination. SpecCLIP also enhances the accuracy and precision of parameter estimates benchmarked against external survey data. Additionally, its similarity search and cross-spectrum prediction capabilities offer potential for anomaly detection. Our results suggest that contrastively trained foundation models enriched with spectrum-aware decoders can advance precision stellar spectroscopy.
comment: 27 pages, 8 figures, 5 tables. Minor update: added corrected acknowledgments and corrected a misstated hyperparameter value (noted in footnote) for reproducibility. Submitted to AAS Journals. Comments welcome
♻ ☆ Curiosity-driven RL for symbolic equation solving NeurIPS 2025
We explore if RL can be useful for symbolic mathematics. Previous work showed contrastive learning can solve linear equations in one variable. We show model-free PPO \cite{schulman2017proximal} augmented with curiosity-based exploration and graph-based actions can solve nonlinear equations such as those involving radicals, exponentials, and trig functions. Our work suggests curiosity-based exploration may be useful for general symbolic reasoning tasks.
comment: Accepted at the NeurIPS 2025 MATH-AI Workshop
♻ ☆ The Price equation reveals a universal force-metric-bias law of algorithmic learning and natural selection
Diverse learning algorithms, optimization methods, and natural selection share a common mathematical structure, despite their apparent differences. Here I show that a simple notational partitioning of change by the Price equation reveals a universal force-metric-bias (FMB) law: $\Delta\mathbf{\theta} = \mathbf{M}\,\mathbf{f} + \mathbf{b} + \mathbf{\xi}$. The force $\mathbf{f}$ drives improvement in parameters, $\Delta\mathbf{\theta}$, in proportion to the slope of performance with respect to the parameters. The metric $\mathbf{M}$ rescales movement by inverse curvature. The bias $\mathbf{b}$ adds momentum or changes in the frame of reference. The noise $\mathbf{\xi}$ enables exploration. This framework unifies natural selection, Bayesian updating, Newton's method, stochastic gradient descent, stochastic Langevin dynamics, Adam optimization, and most other algorithms as special cases of the same underlying process. The Price equation also reveals why Fisher information, Kullback-Leibler divergence, and d'Alembert's principle arise naturally in learning dynamics. By exposing this common structure, the FMB law provides a principled foundation for understanding, comparing, and designing learning algorithms across disciplines.
comment: Version 2: fixed definition of force in abstract; Version 3: added citations and some minor editing
♻ ☆ ASGO: Adaptive Structured Gradient Optimization
Training deep neural networks is a structured optimization problem, because the parameters are naturally represented by matrices and tensors rather than by vectors. Under this structural representation, it has been widely observed that gradients are low-rank and Hessians are approximately block diagonal. These structured properties are crucial for designing efficient optimization algorithms, but are not utilized by many current popular optimizers like Adam. In this paper, we present a novel optimization algorithm ASGO that capitalizes on these properties by employing a preconditioner that is adaptively updated using structured gradients. By a fine-grained theoretical analysis, ASGO is proven to achieve superior convergence rates compared to existing structured gradient methods. Based on this convergence theory, we further demonstrate that ASGO can benefit from low-rank gradients and block diagonal Hessians. We also discuss practical modifications of ASGO and empirically verify ASGO's effectiveness on language model tasks. Code is available at https://github.com/infinity-stars/ASGO.
comment: 39 pages
♻ ☆ Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks
Understanding the inductive bias and generalization properties of large overparametrized machine learning models requires to characterize the dynamics of the training algorithm. We study the learning dynamics of large two-layer neural networks via dynamical mean field theory, a well established technique of non-equilibrium statistical physics. We show that, for large network width $m$, and large number of samples per input dimension $n/d$, the training dynamics exhibits a separation of timescales which implies: $(i)$~The emergence of a slow time scale associated with the growth in Gaussian/Rademacher complexity of the network; $(ii)$~Inductive bias towards small complexity if the initialization has small enough complexity; $(iii)$~A dynamical decoupling between feature learning and overfitting regimes; $(iv)$~A non-monotone behavior of the test error, associated `feature unlearning' regime at large times.
comment: 88 pages; 63 pdf figures
♻ ☆ SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning NeurIPS 2025
Multimodal in-context learning (ICL) remains underexplored despite significant potential for domains such as medicine. Clinicians routinely encounter diverse, specialized tasks requiring adaptation from limited examples, such as drawing insights from a few relevant prior cases or considering a constrained set of differential diagnoses. While multimodal large language models (MLLMs) have shown advances in medical visual question answering (VQA), their ability to learn multimodal tasks from context is largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL benchmark for medical tasks. Eleven medical experts curated problems, each including a multimodal query and multimodal in-context examples as task demonstrations. SMMILE encompasses 111 problems (517 question-image-answer triplets) covering 6 medical specialties and 13 imaging modalities. We further introduce SMMILE++, an augmented variant with 1038 permuted problems. A comprehensive evaluation of 15 MLLMs demonstrates that most models exhibit moderate to poor multimodal ICL ability in medical tasks. In open-ended evaluations, ICL contributes only an 8% average improvement over zero-shot on SMMILE and 9.4% on SMMILE++. We observe a susceptibility for irrelevant in-context examples: even a single noisy or irrelevant example can degrade performance by up to 9.5%. Moreover, we observe that MLLMs are affected by a recency bias, where placing the most relevant example last can lead to substantial performance improvements of up to 71%. Our findings highlight critical limitations and biases in current MLLMs when learning multimodal medical tasks from context. SMMILE is available at https://smmile-benchmark.github.io.
comment: NeurIPS 2025 (Datasets & Benchmarks Track)
♻ ☆ Exact Sequence Interpolation with Transformers SC
We prove that transformers can exactly interpolate datasets of finite input sequences in $\mathbb{R}^d$, $d\geq 2$, with corresponding output sequences of smaller or equal length. Specifically, given $N$ sequences of arbitrary but finite lengths in $\mathbb{R}^d$ and output sequences of lengths $m^1, \dots, m^N \in \mathbb{N}$, we construct a transformer with $\mathcal{O}(\sum_{j=1}^N m^j)$ blocks and $\mathcal{O}(d \sum_{j=1}^N m^j)$ parameters that exactly interpolates the dataset. Our construction provides complexity estimates that are independent of the input sequence length, by alternating feed-forward and self-attention layers and by capitalizing on the clustering effect inherent to the latter. Our novel constructive method also uses low-rank parameter matrices in the self-attention mechanism, a common feature of practical transformer implementations. These results are first established in the hardmax self-attention setting, where the geometric structure permits an explicit and quantitative analysis, and are then extended to the softmax setting. Finally, we demonstrate the applicability of our exact interpolation construction to learning problems, in particular by providing convergence guarantees to a global minimizer under regularized training strategies. Our analysis contributes to the theoretical understanding of transformer models, offering an explanation for their excellent performance in exact sequence-to-sequence interpolation tasks.
comment: 27 pages, 9 figures. Funded by the European Union (Horizon Europe MSCA project ModConFlex, grant number 101073558)
♻ ☆ MP-FVM: Enhancing Finite Volume Method for Water Infiltration Modeling in Unsaturated Soils via Message-passing Encoder-decoder Network
The spatiotemporal water flow dynamics in unsaturated soils can generally be modeled by the Richards equation. To overcome the computational challenges associated with solving this highly nonlinear partial differential equation (PDE), we present a novel solution algorithm, which we name as the MP-FVM (Message Passing-Finite Volume Method), to holistically integrate adaptive fixed-point iteration scheme, encoder-decoder neural network architecture, Sobolev training, and message passing mechanism in a finite volume discretization framework. We thoroughly discuss the need and benefits of introducing these components to achieve synergistic improvements in accuracy and stability of the solution. We also show that our MP-FVM algorithm can accurately solve the mixed-form $n$-dimensional Richards equation with guaranteed convergence under reasonable assumptions. Through several illustrative examples, we demonstrate that our MP-FVM algorithm not only achieves superior accuracy, but also better preserves the underlying physical laws and mass conservation of the Richards equation compared to state-of-the-art solution algorithms and the commercial HYDRUS solver.
comment: 36 pages, 14 figures, Accepted by Computers and Geotechnics
♻ ☆ Towards Scaling Deep Neural Networks with Predictive Coding: Theory and Practice
Backpropagation (BP) is the standard algorithm for training the deep neural networks that power modern artificial intelligence including large language models. However, BP is energy inefficient and unlikely to be implemented by the brain. This thesis studies an alternative, potentially more efficient brain-inspired algorithm called predictive coding (PC). Unlike BP, PC networks (PCNs) perform inference by iterative equilibration of neuron activities before learning or weight updates. Recent work has suggested that this iterative inference procedure provides a range of benefits over BP, such as faster training. However, these advantages have not been consistently observed, the inference and learning dynamics of PCNs are still poorly understood, and deep PCNs remain practically untrainable. Here, we make significant progress towards scaling PCNs by taking a theoretical approach grounded in optimisation theory. First, we show that the learning dynamics of PC can be understood as an approximate trust-region method using second-order information, despite explicitly using only first-order local updates. Second, going beyond this approximation, we show that PC can in principle make use of arbitrarily higher-order information, such that for feedforward networks the effective landscape on which PC learns is far more benign and robust to vanishing gradients than the (mean squared error) loss landscape. Third, motivated by a study of the inference dynamics of PCNs, we propose a new parameterisation called "$\mu$PC", which for the first time allows stable training of 100+ layer networks with little tuning and competitive performance on simple tasks. Overall, this thesis significantly advances our fundamental understanding of the inference and learning dynamics of PCNs, while highlighting the need for future research to focus on hardware co-design if PC is to compete with BP at scale.
comment: PhD thesis
♻ ☆ Score-Aware Policy-Gradient and Performance Guarantees using Local Lyapunov Stability
In this paper, we introduce a policy-gradient method for model-based reinforcement learning (RL) that exploits a type of stationary distributions commonly obtained from Markov decision processes (MDPs) in stochastic networks, queueing systems, and statistical mechanics. Specifically, when the stationary distribution of the MDP belongs to an exponential family that is parametrized by policy parameters, we can improve existing policy gradient methods for average-reward RL. Our key identification is a family of gradient estimators, called score-aware gradient estimators (SAGEs), that enable policy gradient estimation without relying on value-function estimation in the aforementioned setting. We show that SAGE-based policy-gradient locally converges, and we obtain its regret. This includes cases when the state space of the MDP is countable and unstable policies can exist. Under appropriate assumptions such as starting sufficiently close to a maximizer and the existence of a local Lyapunov function, the policy under SAGE-based stochastic gradient ascent has an overwhelming probability of converging to the associated optimal policy. Furthermore, we conduct a numerical comparison between a SAGE-based policy-gradient method and an actor-critic method on several examples inspired from stochastic networks, queueing systems, and models derived from statistical physics. Our results demonstrate that a SAGE-based method finds close-to-optimal policies faster than an actor-critic method.
♻ ☆ Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling
We present LrcSSM, a $\textit{non-linear}$ recurrent model that processes long sequences as fast as today's linear state-space layers. By forcing its Jacobian matrix to be diagonal, the full sequence can be solved in parallel, giving $\mathcal{O}(TD)$ time and memory and only $\mathcal{O}(\log T)$ sequential depth, for input-sequence length $T$ and a state dimension $D$. Moreover, LrcSSM offers a formal gradient-stability guarantee that other input-varying systems such as Liquid-S4 and Mamba do not provide. Importantly, the diagonal Jacobian structure of our model results in no performance loss compared to the original model with dense Jacobian, and the approach can be generalized to other non-linear recurrent models, demonstrating broader applicability. On a suite of long-range forecasting tasks, we demonstrate that LrcSSM outperforms Transformers, LRU, S5, and Mamba.
♻ ☆ Exploring the In-Context Learning Capabilities of LLMs for Money Laundering Detection in Financial Graphs ICDM 2025
The complexity and interconnectivity of entities involved in money laundering demand investigative reasoning over graph-structured data. This paper explores the use of large language models (LLMs) as reasoning engines over localized subgraphs extracted from a financial knowledge graph. We propose a lightweight pipeline that retrieves k-hop neighborhoods around entities of interest, serializes them into structured text, and prompts an LLM via few-shot in-context learning to assess suspiciousness and generate justifications. Using synthetic anti-money laundering (AML) scenarios that reflect common laundering behaviors, we show that LLMs can emulate analyst-style logic, highlight red flags, and provide coherent explanations. While this study is exploratory, it illustrates the potential of LLM-based graph reasoning in AML and lays groundwork for explainable, language-driven financial crime analytics.
comment: Accepted at AI4FCF-ICDM 2025
♻ ☆ The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity
Traditional LLM alignment methods are vulnerable to heterogeneity in human preferences. Fitting a na\"ive probabilistic model to pairwise comparison data (say over prompt-completion pairs) yields an inconsistent estimate of the population-average utility -a canonical measure of social welfare. We propose a new method, dubbed the sign estimator, that provides a simple, provably consistent, and efficient estimator by replacing cross-entropy with binary classification loss in the aggregation step. This simple modification recovers consistent ordinal alignment under mild assumptions and achieves the first polynomial finite-sample error bounds in this setting. In realistic simulations of LLM alignment using digital twins, the sign estimator substantially reduces preference distortion over a panel of simulated personas, cutting (angular) estimation error by nearly 35% and decreasing disagreement with true population preferences from 12% to 8% compared to standard RLHF. Our method also compares favorably to panel data heuristics that explicitly model user heterogeneity and require tracking individual-level preference data-all while maintaining the implementation simplicity of existing LLM alignment pipelines.
♻ ☆ Handling Label Noise via Instance-Level Difficulty Modeling and Dynamic Optimization
Recent studies indicate that deep neural networks degrade in generalization performance under noisy supervision. Existing methods focus on isolating clean subsets or correcting noisy labels, facing limitations such as high computational costs, heavy hyperparameter tuning process, and coarse-grained optimization. To address these challenges, we propose a novel two-stage noisy learning framework that enables instance-level optimization through a dynamically weighted loss function, avoiding hyperparameter tuning. To obtain stable and accurate information about noise modeling, we introduce a simple yet effective metric, termed wrong event, which dynamically models the cleanliness and difficulty of individual samples while maintaining computational costs. Our framework first collects wrong event information and builds a strong base model. Then we perform noise-robust training on the base model, using a probabilistic model to handle the wrong event information of samples. Experiments on five synthetic and real-world LNL benchmarks demonstrate our method surpasses state-of-the-art methods in performance, achieves a nearly 75% reduction in computational time and improves model scalability.
♻ ☆ Decom-Renorm-Merge: Model Merging on the Right Space Improves Multitasking
In the era of large-scale training, model merging has evolved into a tool for creating multitasking models efficiently. It enables the knowledge of models to be fused, without the need for heavy computation as required in traditional multitask learning. Existing merging methods often assume that entries at identical positions in weight matrices serve the same function, enabling straightforward entry-wise comparison and merging. However, this assumption overlooks the complexity of finetuned neural networks, where neurons may develop distinct feature compositions, making direct entry-wise merging problematic. We present Decom-Renorm-Merge (DRM), a simple yet effective approach that leverages Singular Value Decomposition to decompose and coordinate weight matrices into an aligned joint space, where entry-wise merging becomes possible. We showcase the effectiveness of DRM across various settings ranging from smaller encoder-based such as ViT and DeBERTa, encoder-decoder-based such as T5, and larger decoder-based such as Llama3.1-8B. Our experimental results show that DRM outperforms several state-of-the-art merging techniques across full finetuning and low-rank adaptation settings. Moreover, our analysis reveals renormalization as the crucial component for creating a robust and even joint space for merging, significantly contributing to the method's performance.
comment: Code and models are available at https://github.com/yophis/decom-renorm-merge
♻ ☆ NGGAN: Noise Generation GAN Based on the Practical Measurement Dataset for Narrowband Powerline Communications
To effectively process impulse noise for narrowband powerline communications (NB-PLCs) transceivers, capturing comprehensive statistics of nonperiodic asynchronous impulsive noise (APIN) is a critical task. However, existing mathematical noise generative models only capture part of the characteristics of noise. In this study, we propose a novel generative adversarial network (GAN) called noise generation GAN (NGGAN) that learns the complicated characteristics of practically measured noise samples for data synthesis. To closely match the statistics of complicated noise over the NB-PLC systems, we measured the NB-PLC noise via the analog coupling and bandpass filtering circuits of a commercial NB-PLC modem to build a realistic dataset. To train NGGAN, we adhere to the following principles: 1) we design the length of input signals that the NGGAN model can fit to facilitate cyclostationary noise generation; 2) the Wasserstein distance is used as a loss function to enhance the similarity between the generated noise and training data; and 3) to measure the similarity performances of GAN-based models based on the mathematical and practically measured datasets, we conduct both quantitative and qualitative analyses. The training datasets include: 1) a piecewise spectral cyclostationary Gaussian model (PSCGM); 2) a frequency-shift (FRESH) filter; and 3) practical measurements from NB-PLC systems. Simulation results demonstrate that the generated noise samples from the proposed NGGAN are highly close to the real noise samples. The principal component analysis (PCA) scatter plots and Fr\'echet inception distance (FID) analysis have shown that NGGAN outperforms other GAN-based models by generating noise samples with superior fidelity and higher diversity.
comment: 16 pages, 15 figures, 11 tables, and published in IEEE Transactions on Instrumentation and Measurement, 2025
♻ ☆ A method for the systematic generation of graph XAI benchmarks via Weisfeiler-Leman coloring
Graph neural networks have become the de facto model for learning from structured data. However, the decision-making process of GNNs remains opaque to the end user, which undermines their use in safety-critical applications. Several explainable AI techniques for graphs have been developed to address this major issue. Focusing on graph classification, these explainers identify subgraph motifs that explain predictions. Therefore, a robust benchmarking of graph explainers is required to ensure that the produced explanations are of high quality, i.e., aligned with the GNN's decision process. However, current graph-XAI benchmarks are limited to simplistic synthetic datasets or a few real-world tasks curated by domain experts, hindering rigorous and reproducible evaluation, and consequently stalling progress in the field. To overcome these limitations, we propose a method to automate the construction of graph XAI benchmarks from generic graph classification datasets. Our approach leverages the Weisfeiler-Leman color refinement algorithm to efficiently perform approximate subgraph matching and mine class-discriminating motifs, which serve as proxy ground-truth class explanations. At the same time, we ensure that these motifs can be learned by GNNs because their discriminating power aligns with WL expressiveness. This work also introduces the OpenGraphXAI benchmark suite, which consists of 15 ready-made graph-XAI datasets derived by applying our method to real-world molecular classification datasets. The suite is available to the public along with a codebase to generate over 2,000 additional graph-XAI benchmarks. Finally, we present a use case that illustrates how the suite can be used to assess the effectiveness of a selection of popular graph explainers, demonstrating the critical role of a sufficiently large benchmark collection for improving the significance of experimental results.
♻ ☆ Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization
Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify ones crucial for irreversible unlearning. We introduce Disruption Masking, a technique in which we only allow updating weights, where the signs of the unlearning gradient and the retaining gradient are the same. This ensures all updates are non-disruptive. Additionally, we identify the need for normalizing the unlearning gradients, and also confirm the usefulness of meta-learning. We combine these insights into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and validate its effectiveness at preventing the recovery of dangerous capabilities. MUDMAN outperforms the prior TAR method by 40%, setting a new state-of-the-art for robust unlearning.
♻ ☆ Pearl: A Foundation Model for Placing Every Atom in the Right Location
Accurately predicting the three-dimensional structures of protein-ligand complexes remains a fundamental challenge in computational drug discovery that limits the pace and success of therapeutic design. Deep learning methods have recently shown strong potential as structural prediction tools, achieving promising accuracy across diverse biomolecular systems. However, their performance and utility are constrained by scarce experimental data, inefficient architectures, physically invalid poses, and the limited ability to exploit auxiliary information available at inference. To address these issues, we introduce Pearl (Placing Every Atom in the Right Location), a foundation model for protein-ligand cofolding at scale. Pearl addresses these challenges with three key innovations: (1) training recipes that include large-scale synthetic data to overcome data scarcity; (2) architectures that incorporate an SO(3)-equivariant diffusion module to inherently respect 3D rotational symmetries, improving generalization and sample efficiency, and (3) controllable inference, including a generalized multi-chain templating system supporting both protein and non-polymeric components as well as dual unconditional/conditional modes. Pearl establishes a new state-of-the-art performance in protein-ligand cofolding. On the key metric of generating accurate (RMSD < 2 \r{A}) and physically valid poses, Pearl surpasses AlphaFold 3 and other open source baselines on the public Runs N' Poses and PoseBusters benchmarks, delivering 14.5% and 14.2% improvements, respectively, over the next best model. In the pocket-conditional cofolding regime, Pearl delivers $3.6\times$ improvement on a proprietary set of challenging, real-world drug targets at the more rigorous RMSD < 1 \r{A} threshold. Finally, we demonstrate that model performance correlates directly with synthetic dataset size used in training.
comment: technical report
♻ ☆ Auto-Adaptive PINNs with Applications to Phase Transitions
We propose an adaptive sampling method for the training of Physics Informed Neural Networks (PINNs) which allows for sampling based on an arbitrary problem-specific heuristic which may depend on the network and its gradients. In particular we focus our analysis on the Allen-Cahn equations, attempting to accurately resolve the characteristic interfacial regions using a PINN without any post-hoc resampling. In experiments, we show the effectiveness of these methods over residual-adaptive frameworks.
♻ ☆ Continuous Domain Generalization NeurIPS25
Real-world data distributions often shift continuously across multiple latent factors such as time, geography, and socioeconomic contexts. However, existing domain generalization approaches typically treat domains as discrete or as evolving along a single axis (e.g., time). This oversimplification fails to capture the complex, multidimensional nature of real-world variation. This paper introduces the task of Continuous Domain Generalization (CDG), which aims to generalize predictive models to unseen domains defined by arbitrary combinations of continuous variations. We present a principled framework grounded in geometric and algebraic theories, showing that optimal model parameters across domains lie on a low-dimensional manifold. To model this structure, we propose a Neural Lie Transport Operator (NeuralLio), which enables structure-preserving parameter transitions by enforcing geometric continuity and algebraic consistency. To handle noisy or incomplete domain variation descriptors, we introduce a gating mechanism to suppress irrelevant dimensions and a local chart-based strategy for robust generalization. Extensive experiments on synthetic and real-world datasets, including remote sensing, scientific documents, and traffic forecasting, demonstrate that our method significantly outperforms existing baselines in both generalization accuracy and robustness.
comment: 23 pages, 9 figures. Accepted by NeurIPS25
♻ ☆ SGFusion: Stochastic Geographic Gradient Fusion in Federated Learning
This paper proposes Stochastic Geographic Gradient Fusion (SGFusion), a novel training algorithm to leverage the geographic information of mobile users in Federated Learning (FL). SGFusion maps the data collected by mobile devices onto geographical zones and trains one FL model per zone, which adapts well to the data and behaviors of users in that zone. SGFusion models the local data-based correlation among geographical zones as a hierarchical random graph (HRG) optimized by Markov Chain Monte Carlo sampling. At each training step, every zone fuses its local gradient with gradients derived from a small set of other zones sampled from the HRG. This approach enables knowledge fusion and sharing among geographical zones in a probabilistic and stochastic gradient fusion process with self-attention weights, such that "more similar" zones have "higher probabilities" of sharing gradients with "larger attention weights." SGFusion remarkably improves model utility without introducing undue computational cost. Extensive theoretical and empirical results using a heart-rate prediction dataset collected across 6 countries show that models trained with SGFusion converge with upper-bounded expected errors and significantly improve utility in all countries compared to existing approaches without notable cost in system scalability.
♻ ☆ Tracking the Median of Gradients with a Stochastic Proximal Point Method
There are several applications of stochastic optimization where one can benefit from a robust estimate of the gradient. For example, domains such as distributed learning with corrupted nodes, the presence of large outliers in the training data, learning under privacy constraints, or even heavy-tailed noise due to the dynamics of the algorithm itself. Here we study SGD with robust gradient estimators based on estimating the median. We first derive iterative methods based on the stochastic proximal point method for computing the median gradient and generalizations thereof. Then we propose an algorithm estimating the median gradient across iterations, and find that several well known methods are particular cases of this framework. For instance, we observe that different forms of clipping allow to compute online estimators of the median of gradients, in contrast to (heavy-ball) momentum, which corresponds to an online estimator of the mean. Finally, we provide a theoretical framework for an algorithm computing the median gradient across samples, and show that the resulting method can converge even under heavy-tailed, state-dependent noise.
♻ ☆ TuneNSearch: a hybrid transfer learning and local search approach for solving vehicle routing problems
This paper introduces TuneNSearch, a hybrid transfer learning and local search approach for addressing diverse variants of the vehicle routing problem (VRP). Our method uses reinforcement learning to generate high-quality solutions, which are subsequently refined by an efficient local search procedure. To ensure broad adaptability across VRP variants, TuneNSearch begins with a pre-training phase on the multi-depot VRP (MDVRP), followed by a fine-tuning phase to adapt it to other problem formulations. The learning phase utilizes a Transformer-based architecture enhanced with edge-aware attention, which integrates edge distances directly into the attention mechanism to better capture spatial relationships inherent to routing problems. We show that the pre-trained model generalizes effectively to single-depot variants, achieving performance comparable to models trained specifically on single-depot instances. Simultaneously, it maintains strong performance on multi-depot variants, an ability that models pre-trained solely on single-depot problems lack. For example, on 100-node instances of multi-depot variants, TuneNSearch outperforms a model pre-trained on the CVRP by 44%. In contrast, on 100-node instances of single-depot variants, TuneNSearch performs similar to the CVRP model. To validate the effectiveness of our method, we conduct extensive computational experiments on public benchmark and randomly generated instances. Across multiple CVRPLIB datasets, TuneNSearch consistently achieves performance deviations of less than 3% from the best-known solutions in the literature, compared to 6-25% for other neural-based models, depending on problem complexity. Overall, our approach demonstrates strong generalization to different problem sizes, instance distributions, and VRP formulations, while maintaining polynomial runtime complexity despite the integration of the local search algorithm.
♻ ☆ Lift What You Can: Green Online Learning with Heterogeneous Ensembles
Ensemble methods for stream mining necessitate managing multiple models and updating them as data distributions evolve. Considering the calls for more sustainability, established methods are however not sufficiently considerate of ensemble members' computational expenses and instead overly focus on predictive capabilities. To address these challenges and enable green online learning, we propose heterogeneous online ensembles (HEROS). For every training step, HEROS chooses a subset of models from a pool of models initialized with diverse hyperparameter choices under resource constraints to train. We introduce a Markov decision process to theoretically capture the trade-offs between predictive performance and sustainability constraints. Based on this framework, we present different policies for choosing which models to train on incoming data. Most notably, we propose the novel $\zeta$-policy, which focuses on training near-optimal models at reduced costs. Using a stochastic model, we theoretically prove that our $\zeta$-policy achieves near optimal performance while using fewer resources compared to the best performing policy. In our experiments across 11 benchmark datasets, we find empiric evidence that our $\zeta$-policy is a strong contribution to the state-of-the-art, demonstrating highly accurate performance, in some cases even outperforming competitors, and simultaneously being much more resource-friendly.
♻ ☆ Reinforcement Learning Teachers of Test Time Scaling NeurIPS 2025
Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework. Code available at: https://github.com/SakanaAI/RLT
comment: Accepted at NeurIPS 2025
♻ ☆ Group Interventions on Deep Networks for Causal Discovery in Subsystems
Causal discovery uncovers complex relationships between variables, enhancing predictions, decision-making, and insights into real-world systems, especially in nonlinear multivariate time series. However, most existing methods primarily focus on pairwise cause-effect relationships, overlooking interactions among groups of variables, i.e., subsystems and their collective causal influence. In this study, we introduce gCDMI, a novel multi-group causal discovery method that leverages group-level interventions on trained deep neural networks and employs model invariance testing to infer causal relationships. Our approach involves three key steps. First, we use deep learning to jointly model the structural relationships among groups of all time series. Second, we apply group-wise interventions to the trained model. Finally, we conduct model invariance testing to determine the presence of causal links among variable groups. We evaluate our method on simulated datasets, demonstrating its superior performance in identifying group-level causal relationships compared to existing methods. Additionally, we validate our approach on real-world datasets, including brain networks and climate ecosystems. Our results highlight that applying group-level interventions to deep learning models, combined with invariance testing, can effectively reveal complex causal structures, offering valuable insights for domains such as neuroscience and climate science.
comment: Submitted to IEEE Access. We are working on the revised version
♻ ☆ Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025)
Accurately predicting distributed cortical responses to naturalistic stimuli requires models that integrate visual, auditory and semantic information over time. We present a hierarchical multimodal recurrent ensemble that maps pretrained video, audio, and language embeddings to fMRI time series recorded while four subjects watched almost 80 hours of movies provided by the Algonauts 2025 challenge. Modality-specific bidirectional RNNs encode temporal dynamics; their hidden states are fused and passed to a second recurrent layer, and lightweight subject-specific heads output responses for 1000 cortical parcels. Training relies on a composite MSE-correlation loss and a curriculum that gradually shifts emphasis from early sensory to late association regions. Averaging 100 model variants further boosts robustness. The resulting system ranked third on the competition leaderboard, achieving an overall Pearson r = 0.2094 and the highest single-parcel peak score (mean r = 0.63) among all participants, with particularly strong gains for the most challenging subject (Subject 5). The approach establishes a simple, extensible baseline for future multimodal brain-encoding benchmarks.
comment: 8 pages, 2 figures, 1 table. Invited report, CCN 2025 Algonauts Project session (3rd-place team). Code: https://github.com/erensemih/Algonauts2025_ModalityRNN v3: Added equal contribution footnote to author list. Corrected reference list
♻ ☆ SATURN: SAT-based Reinforcement Learning to Unleash Language Model Reasoning
How to design reinforcement learning (RL) tasks that effectively unleash the reasoning capability of large language models (LLMs) remains an open question. Existing RL tasks (e.g., math, programming, and constructing reasoning tasks) suffer from three key limitations: (1) Scalability. They rely heavily on human annotation or expensive LLM synthesis to generate sufficient training data. (2) Verifiability. LLMs' outputs are hard to verify automatically and reliably. (3) Controllable Difficulty. Most tasks lack fine-grained difficulty control, making it hard to train LLMs to develop reasoning ability from easy to hard. To address these limitations, we propose Saturn, a SAT-based RL framework that uses Boolean Satisfiability (SAT) problems to train and evaluate LLMs reasoning. Saturn enables scalable task construction, rule-based verification, and precise difficulty control. Saturn designs a curriculum learning pipeline that continuously improves LLMs' reasoning capability by constructing SAT tasks of increasing difficulty and training LLMs from easy to hard. To ensure stable training, we design a principled mechanism to control difficulty transitions. We introduce Saturn-2.6k, a dataset of 2,660 SAT problems with varying difficulty. It supports the evaluation of how LLM reasoning changes with problem difficulty. We apply Saturn to DeepSeek-R1-Distill-Qwen and obtain Saturn-1.5B and Saturn-7B. We achieve several notable results: (1) On SAT problems, Saturn-1.5B and Saturn-7B achieve average pass@3 improvements of +14.0 and +28.1, respectively. (2) On math and programming tasks, Saturn-1.5B and Saturn-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g., AIME, LiveCodeBench). (3) Compared to the state-of-the-art (SOTA) approach in constructing RL tasks, Saturn achieves further improvements of +8.8%. We release the source code, data, and models to support future research.
♻ ☆ Brain-inspired Computational Intelligence via Predictive Coding
Artificial intelligence (AI) is rapidly becoming one of the key technologies of this century. The majority of results in AI thus far have been achieved using deep neural networks trained with a learning algorithm called error backpropagation, always considered biologically implausible. To this end, recent works have studied learning algorithms for deep neural networks inspired by the neurosciences. One such theory, called predictive coding (PC), has shown promising properties that make it potentially valuable for the machine learning community: it can model information processing in different areas of the brain, can be used in control and robotics, has a solid mathematical foundation in variational inference, and performs its computations asynchronously. Inspired by such properties, works that propose novel PC-like algorithms are starting to be present in multiple sub-fields of machine learning and AI at large. Here, we survey such efforts by first providing a broad overview of the history of PC to provide common ground for the understanding of the recent developments, then by describing current efforts and results, and concluding with a large discussion of possible implications and ways forward.
comment: 26 Pages, 9 Figures
♻ ☆ HyperMARL: Adaptive Hypernetworks for Multi-Agent RL NeurIPS 2025
Adaptive cooperation in multi-agent reinforcement learning (MARL) requires policies to express homogeneous, specialised, or mixed behaviours, yet achieving this adaptivity remains a critical challenge. While parameter sharing (PS) is standard for efficient learning, it notoriously suppresses the behavioural diversity required for specialisation. This failure is largely due to cross-agent gradient interference, a problem we find is surprisingly exacerbated by the common practice of coupling agent IDs with observations. Existing remedies typically add complexity through altered objectives, manual preset diversity levels, or sequential updates -- raising a fundamental question: can shared policies adapt without these intricacies? We propose a solution built on a key insight: an agent-conditioned hypernetwork can generate agent-specific parameters and decouple observation- and agent-conditioned gradients, directly countering the interference from coupling agent IDs with observations. Our resulting method, HyperMARL, avoids the complexities of prior work and empirically reduces policy gradient variance. Across diverse MARL benchmarks (22 scenarios, up to 30 agents), HyperMARL achieves performance competitive with six key baselines while preserving behavioural diversity comparable to non-parameter sharing methods, establishing it as a versatile and principled approach for adaptive MARL. The code is publicly available at https://github.com/KaleabTessera/HyperMARL.
comment: To appear at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025). A preliminary version of this work was presented at the CoCoMARL workshop, RLC 2025
♻ ☆ Reliable Evaluation and Benchmarks for Statement Autoformalization EMNLP 2025
Evaluating statement autoformalization, translating natural language mathematics into formal languages like Lean 4, remains a significant challenge, with few metrics, datasets, and standards to robustly measure progress. In this work, we present a comprehensive approach combining improved metrics, robust benchmarks, and systematic evaluation, to fill this gap. First, we introduce BEq+, an automated metric that correlates strongly with human judgment, along with ProofNetVerif, a new dataset for assessing the quality of evaluation metrics, containing 3,752 annotated examples. Second, we develop two new autoformalization benchmarks: ProofNet#, a corrected version of ProofNet, and RLM25, with 619 new pairs of research-level mathematics from six formalization projects. Through systematic experimentation across these benchmarks, we find that current techniques can achieve up to 45.1% accuracy on undergraduate mathematics but struggle with research-level content without proper context. Our work establishes a reliable foundation for evaluating and advancing autoformalization systems.
comment: Accepted to EMNLP 2025. New benchmarks released, see https://github.com/augustepoiroux/RLMEval , https://huggingface.co/datasets/PAug/ProofNetSharp , and https://huggingface.co/datasets/PAug/ProofNetVerif . For code, see https://github.com/augustepoiroux/LeanInteract
♻ ☆ OmegAMP: Targeted AMP Discovery through Biologically Informed Generation
Deep learning-based antimicrobial peptide (AMP) discovery faces critical challenges such as limited controllability, lack of representations that efficiently model antimicrobial properties, and low experimental hit rates. To address these challenges, we introduce OmegAMP, a framework designed for reliable AMP generation with increased controllability. Its diffusion-based generative model leverages a novel conditioning mechanism to achieve fine-grained control over desired physicochemical properties and to direct generation towards specific activity profiles, including species-specific effectiveness. This is further enhanced by a biologically informed encoding space that significantly improves overall generative performance. Complementing these generative capabilities, OmegAMP leverages a novel synthetic data augmentation strategy to train classifiers for AMP filtering, drastically reducing false positive rates and thereby increasing the likelihood of experimental success. Our in silico experiments demonstrate that OmegAMP delivers state-of-the-art performance across key stages of the AMP discovery pipeline, enabling us to achieve an unprecedented success rate in wet lab experiments. We tested 25 candidate peptides, 24 of them (96%) demonstrated antimicrobial activity, proving effective even against multi-drug resistant strains. Our findings underscore OmegAMP's potential to significantly advance computational frameworks in the fight against antimicrobial resistance.
♻ ☆ Transformers from Compressed Representations
Compressed file formats are the corner stone of efficient data storage and transmission, yet their potential for representation learning remains largely underexplored. We introduce TEMPEST (TransformErs froM comPressed rEpreSenTations), a method that exploits the inherent byte-stream structure of compressed files to design an effective tokenization and encoding strategy. By leveraging this compact encoding, a standard transformer can directly learn semantic representations from compressed data streams, bypassing the need for raw byte-level processing or full media decoding. Our proposal substantially reduces the number of tokens required for semantic classification, thereby lowering both computational complexity and memory usage. Through extensive experiments across diverse datasets, coding schemes, and modalities, we show that TEMPEST achieves accuracy competitive wit the state-of-the-art while delivering efficiency gains in memory and compute.
♻ ☆ Differential Privacy as a Perk: Federated Learning over Multiple-Access Fading Channels with a Multi-Antenna Base Station
Federated Learning (FL) is a distributed learning paradigm that preserves privacy by eliminating the need to exchange raw data during training. In its prototypical edge instantiation with underlying wireless transmissions enabled by analog over-the-air computing (AirComp), referred to as \emph{over-the-air FL (AirFL)}, the inherent channel noise plays a unique role of \emph{frenemy} in the sense that it degrades training due to noisy global aggregation while providing a natural source of randomness for privacy-preserving mechanisms, formally quantified by \emph{differential privacy (DP)}. It remains, nevertheless, challenging to effectively harness such channel impairments, as prior arts, under assumptions of either simple channel models or restricted types of loss functions, mostly considering (local) DP enhancement with a single-round or non-convergent bound on privacy loss. In this paper, we study AirFL over multiple-access fading channels with a multi-antenna base station (BS) subject to user-level DP requirements. Despite a recent study, which claimed in similar settings that artificial noise (AN) must be injected to ensure DP in general, we demonstrate, on the contrary, that DP can be gained as a \emph{perk} even \emph{without} employing any AN. Specifically, we derive a novel bound on DP that converges under general bounded-domain assumptions on model parameters, along with a convergence bound with general smooth and non-convex loss functions. Next, we optimize over receive beamforming and power allocations to characterize the optimal convergence-privacy trade-offs, which also reveal explicit conditions in which DP is achievable without compromising training. Finally, our theoretical findings are validated by extensive numerical results.
comment: 15 pages, 5 figures, submitted for possible publication
♻ ☆ Improving Robustness of AlphaZero Algorithms to Test-Time Environment Changes
The AlphaZero framework provides a standard way of combining Monte Carlo planning with prior knowledge provided by a previously trained policy-value neural network. AlphaZero usually assumes that the environment on which the neural network was trained will not change at test time, which constrains its applicability. In this paper, we analyze the problem of deploying AlphaZero agents in potentially changed test environments and demonstrate how the combination of simple modifications to the standard framework can significantly boost performance, even in settings with a low planning budget available. The code is publicly available on GitHub.
comment: Presented at the 37th Benelux Conference on Artificial Intelligence and the 34th Belgian Dutch Conference on Machine Learning (BNAIC/BeNeLearn 2025)
♻ ☆ Exploring End-to-end Differentiable Neural Charged Particle Tracking -- A Loss Landscape Perspective
Measurement and analysis of high energetic particles for scientific, medical or industrial applications is a complex procedure, requiring the design of sophisticated detector and data processing systems. The development of adaptive and differentiable software pipelines using a combination of conventional and machine learning algorithms is therefore getting ever more important to optimize and operate the system efficiently while maintaining end-to-end (E2E) differentiability. We propose for the application of charged particle tracking an E2E differentiable decision-focused learning scheme using graph neural networks with combinatorial components solving a linear assignment problem for each detector layer. We demonstrate empirically that including differentiable variations of discrete assignment operations allows for efficient network optimization, working better or on par with approaches that lack E2E differentiability. In additional studies, we dive deeper into the optimization process and provide further insights from a loss landscape perspective. We demonstrate that while both methods converge into similar performing, globally well-connected regions, they suffer under substantial predictive instability across initialization and optimization methods, which can have unpredictable consequences on the performance of downstream tasks such as image reconstruction. We also point out a dependency between the interpolation factor of the gradient estimator and the prediction stability of the model, suggesting the choice of sufficiently small values. Given the strong global connectivity of learned solutions and the excellent training performance, we argue that E2E differentiability provides, besides the general availability of gradient information, an important tool for robust particle tracking to mitigate prediction instabilities by favoring solutions that perform well on downstream tasks.
comment: Published in Transactions on Machine Learning Research (TMLR), 2025
♻ ☆ Taxonomy and Trends in Reinforcement Learning for Robotics and Control Systems: A Structured Review
Reinforcement learning (RL) has become a foundational approach for enabling intelligent robotic behavior in dynamic and uncertain environments. This work presents an in-depth review of RL principles, advanced deep reinforcement learning (DRL) algorithms, and their integration into robotic and control systems. Beginning with the formalism of Markov Decision Processes (MDPs), the study outlines essential elements of the agent-environment interaction and explores core algorithmic strategies including actor-critic methods, value-based learning, and policy gradients. Emphasis is placed on modern DRL techniques such as DDPG, TD3, PPO, and SAC, which have shown promise in solving high-dimensional, continuous control tasks. A structured taxonomy is introduced to categorize RL applications across domains such as locomotion, manipulation, multi-agent coordination, and human-robot interaction, along with training methodologies and deployment readiness levels. The review synthesizes recent research efforts, highlighting technical trends, design patterns, and the growing maturity of RL in real-world robotics. Overall, this work aims to bridge theoretical advances with practical implementations, providing a consolidated perspective on the evolving role of RL in autonomous robotic systems.
♻ ☆ TabArena: A Living Benchmark for Machine Learning on Tabular Data NeurIPS 2025
With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.
comment: Accepted (spotlight) at NeurIPS 2025 Datasets and Benchmarks Track. v3: NeurIPS camera-ready version. v2: fixed author list. 51 pages. Code available at https://tabarena.ai/code; examples at https://tabarena.ai/code-examples; dataset curation at https://tabarena.ai/data-tabular-ml-iid-study and https://tabarena.ai/dataset-curation
♻ ☆ GnnXemplar: Exemplars to Explanations -- Natural Language Rules for Global GNN Interpretability NeurIPS 2025
Graph Neural Networks (GNNs) are widely used for node classification, yet their opaque decision-making limits trust and adoption. While local explanations offer insights into individual predictions, global explanation methods, those that characterize an entire class, remain underdeveloped. Existing global explainers rely on motif discovery in small graphs, an approach that breaks down in large, real-world settings where subgraph repetition is rare, node attributes are high-dimensional, and predictions arise from complex structure-attribute interactions. We propose GnnXemplar, a novel global explainer inspired from Exemplar Theory from cognitive science. GnnXemplar identifies representative nodes in the GNN embedding space, exemplars, and explains predictions using natural language rules derived from their neighborhoods. Exemplar selection is framed as a coverage maximization problem over reverse k-nearest neighbors, for which we provide an efficient greedy approximation. To derive interpretable rules, we employ a self-refining prompt strategy using large language models (LLMs). Experiments across diverse benchmarks show that GnnXemplar significantly outperforms existing methods in fidelity, scalability, and human interpretability, as validated by a user study with 60 participants.
comment: 38 pages, 20 figures, NeurIPS 2025 (Oral)
♻ ☆ Probabilistic Kernel Function for Fast Angle Testing
In this paper, we study the angle testing problem in the context of similarity search in high-dimensional Euclidean spaces and propose two projection-based probabilistic kernel functions, one designed for angle comparison and the other for angle thresholding. Unlike existing approaches that rely on random projection vectors drawn from Gaussian distributions, our approach leverages reference angles and employs a deterministic structure for the projection vectors. Notably, our kernel functions do not require asymptotic assumptions, such as the number of projection vectors tending to infinity, and can be both theoretically and experimentally shown to outperform Gaussian-distribution-based kernel functions. We apply the proposed kernel function to Approximate Nearest Neighbor Search (ANNS) and demonstrate that our approach achieves a 2.5X ~ 3X higher query-per-second (QPS) throughput compared to the widely-used graph-based search algorithm HNSW.
♻ ☆ OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents NeurIPS 2025
Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score). We evaluate computer use agents based on a range of frontier models - such as o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro - and provide insights into their safety. In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm.
comment: NeurIPS 2025 Datasets & Benchmarks Track (Spotlight)
♻ ☆ SNN-Based Online Learning of Concepts and Action Laws in an Open World
We present the architecture of a fully autonomous, bio-inspired cognitive agent built around a spiking neural network (SNN) implementing the agent's semantic memory. This agent explores its universe and learns concepts of objects/situations and of its own actions in a one-shot manner. While object/situation concepts are unary, action concepts are triples made up of an initial situation, a motor activity, and an outcome. They embody the agent's knowledge of its universe's action laws. Both kinds of concepts have different degrees of generality. To make decisions the agent queries its semantic memory for the expected outcomes of envisaged actions and chooses the action to take on the basis of these predictions. Our experiments show that the agent handles new situations by appealing to previously learned general concepts and rapidly modifies its concepts to adapt to environment changes.
♻ ☆ Learning-Augmented Online Bipartite Fractional Matching NeurIPS 2025
Online bipartite matching is a fundamental problem in online optimization, extensively studied both in its integral and fractional forms due to its theoretical significance and practical applications, such as online advertising and resource allocation. Motivated by recent progress in learning-augmented algorithms, we study online bipartite fractional matching when the algorithm is given advice in the form of a suggested matching in each iteration. We develop algorithms for both the vertex-weighted and unweighted variants that provably dominate the naive "coin flip" strategy of randomly choosing between the advice-following and advice-free algorithms. Moreover, our algorithm for the vertex-weighted setting extends to the AdWords problem under the small bids assumption, yielding a significant improvement over the seminal work of Mahdian, Nazerzadeh, and Saberi (EC 2007, TALG 2012). Complementing our positive results, we establish a hardness bound on the robustness-consistency tradeoff that is attainable by any algorithm. We empirically validate our algorithms through experiments on synthetic and real-world data.
comment: To appear in NeurIPS 2025. Full version
♻ ☆ Differential Mamba AACL 2025
Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available: https://github.com/NadavSc/Diff-Mamba
comment: AACL 2025. We provide the code at https://github.com/NadavSc/Diff-Mamba
♻ ☆ Redistributing Rewards Across Time and Agents for Multi-Agent Reinforcement Learning
Credit assignmen, disentangling each agent's contribution to a shared reward, is a critical challenge in cooperative multi-agent reinforcement learning (MARL). To be effective, credit assignment methods must preserve the environment's optimal policy. Some recent approaches attempt this by enforcing return equivalence, where the sum of distributed rewards must equal the team reward. However, their guarantees are conditional on a learned model's regression accuracy, making them unreliable in practice. We introduce Temporal-Agent Reward Redistribution (TAR$^2$), an approach that decouples credit modeling from this constraint. A neural network learns unnormalized contribution scores, while a separate, deterministic normalization step enforces return equivalence by construction. We demonstrate that this method is equivalent to a valid Potential-Based Reward Shaping (PBRS), which guarantees the optimal policy is preserved regardless of model accuracy. Empirically, on challenging SMACLite and Google Research Football (GRF) benchmarks, TAR$^2$ accelerates learning and achieves higher final performance than strong baselines. These results establish our method as an effective solution for the agent-temporal credit assignment problem.
comment: 16 pages, 4 figures, 4 tables
Human-Computer Interaction 20
☆ User Misconceptions of LLM-Based Conversational Programming Assistants
Programming assistants powered by large language models (LLMs) have become widely available, with conversational assistants like ChatGPT proving particularly accessible to less experienced programmers. However, the varied capabilities of these tools across model versions and the mixed availability of extensions that enable web search, code execution, or retrieval-augmented generation create opportunities for user misconceptions about what systems can and cannot do. Such misconceptions may lead to over-reliance, unproductive practices, or insufficient quality control in LLM-assisted programming. Here, we aim to characterize misconceptions that users of conversational LLM-based assistants may have in programming contexts. Using a two-phase approach, we first brainstorm and catalog user misconceptions that may occur, and then conduct a qualitative analysis to examine whether these conceptual issues surface in naturalistic Python-programming conversations with an LLM-based chatbot drawn from an openly available dataset. Indeed, we see evidence that some users have misplaced expectations about the availability of LLM-based chatbot features like web access, code execution, or non-text output generation. We also see potential evidence for deeper conceptual issues around the scope of information required to debug, validate, and optimize programs. Our findings reinforce the need for designing LLM-based tools that more clearly communicate their programming capabilities to users.
☆ ggtime: A Grammar of Temporal Graphics
Visualizing changes over time is fundamental to learning from the past and anticipating the future. However, temporal semantics can be complicated, and existing visualization tools often struggle to accurately represent these complexities. It is common to use bespoke plot helper functions designed to produce specific graphics, due to the absence of flexible general tools that respect temporal semantics. We address this problem by proposing a grammar of temporal graphics, and an associated software implementation, 'ggtime', that encodes temporal semantics into a declarative grammar for visualizing temporal data. The grammar introduces new composable elements that support visualization across linear, cyclical, quasi-cyclical, and other granularities; standardization of irregular durations; and alignment of time points across different granularities and time zones. It is designed for interoperability with other semantic variables, allowing navigation across the space of visualizations while preserving temporal semantics.
☆ Psychoacoustic assessment of synthetic sounds for electric vehicles in a virtual reality experiment
The growing adoption of electric vehicles, known for their quieter operation compared to internal combustion engine vehicles, raises concerns about their detectability, particularly for vulnerable road users. To address this, regulations mandate the inclusion of exterior sound signals for electric vehicles, specifying minimum sound pressure levels at low speeds. These synthetic exterior sounds are often used in noisy urban environments, creating the challenge of enhancing detectability without introducing excessive noise annoyance. This study investigates the design of synthetic exterior sound signals that balance high noticeability with low annoyance. An audiovisual experiment with 14 participants was conducted using 15 virtual reality scenarios featuring a passing car. The scenarios included various sound signals, such as pure, intermittent, and complex tones at different frequencies. Two baseline cases, a diesel engine and only tyre noise, were also tested. Participants rated sounds for annoyance, noticeability, and informativeness using 11-point ICBEN scales. The findings highlight how psychoacoustic sound quality metrics predict annoyance ratings better than conventional sound metrics, providing insight into optimising sound design for electric vehicles. By improving pedestrian safety while minimising noise pollution, this research supports the development of effective and user-friendly exterior sound standards for electric vehicles.
☆ Small Talk, Big Impact? LLM-based Conversational Agents to Mitigate Passive Fatigue in Conditional Automated Driving
Passive fatigue during conditional automated driving can compromise driver readiness and safety. This paper presents findings from a test-track study with 40 participants in a real-world rural automated driving scenario. In this scenario, a Large Language Model (LLM) based conversational agent (CA) was designed to check in with drivers and re-engage them with their surroundings. Drawing on in-car video recordings, sleepiness ratings and interviews, we analysed how drivers interacted with the agent and how these interactions shaped alertness. Users found the CA helpful for supporting vigilance during passive fatigue. Thematic analysis of acceptability further revealed three user preference profiles that implicate future intention to use CAs. Positioning empirically observed profiles within existing CA archetype frameworks highlights the need for adaptive design sensitive to diverse user groups. This work underscores the potential of CAs as proactive Human-Machine Interface (HMI) interventions, demonstrating how natural language can support context-aware interaction during automated driving.
comment: Submitted to CHI '26 Conference on Human Factors in Computing Systems
☆ CGM-Led Multimodal Tracking with Chatbot Support: An Autoethnography in Sub-Health
Metabolic disorders present a pressing global health challenge, with China carrying the world's largest burden. While continuous glucose monitoring (CGM) has transformed diabetes care, its potential for supporting sub-health populations -- such as individuals who are overweight, prediabetic, or anxious -- remains underexplored. At the same time, large language models (LLMs) are increasingly used in health coaching, yet CGM is rarely incorporated as a first-class signal. To address this gap, we conducted a six-week autoethnography, combining CGM with multimodal indicators captured via common digital devices and a chatbot that offered personalized reflections and explanations of glucose fluctuations. Our findings show how CGM-led, data-first multimodal tracking, coupled with conversational support, shaped everyday practices of diet, activity, stress, and wellbeing. This work contributes to HCI by extending CGM research beyond clinical diabetes and demonstrating how LLM-driven agents can support preventive health and reflection in at-risk populations.
comment: conference paper, prepint
☆ Designing for Dignity while Driving: Interaction Needs of Blind and Low-Vision Passengers in Fully Automated Vehicles
Fully automated vehicles (FAVs) hold promise for enhancing the mobility of blind and low-vision (BLV) individuals. To understand the situated interaction needs of BLV passengers, we conducted six on-road, and in-lab focus groups with 16 participants, immersing them in real-world driving conditions. Our thematic analysis reveals that BLV participants express a high initial 'faith' in FAVs, but require layered, value-sensitive information during the ride to cultivate trust. The participants' modality preference for voice suggests re-evaluating the role of haptics for BLV users in FAVs. Our findings show the importance of a respectful interaction design in FAVs that both address BLV users' mobility challenges and uphold their dignity. While others have advocated for a dignity lens, our contribution lies in grounding this framework in empirical findings and unpacking what it means to design for dignity in the context of FAVs.
☆ On the Go with AR: Attention to Virtual and Physical Targets while Varying Augmentation Density
Augmented reality is projected to be a primary mode of information consumption on the go, seamlessly integrating virtual content into the physical world. However, the potential perceptual demands of viewing virtual annotations while navigating a physical environment could impact user efficacy and safety, and the implications of these demands are not well understood. Here, we investigate the impact of virtual path guidance and augmentation density (visual clutter) on search performance and memory. Participants walked along a predefined path, searching for physical or virtual items. They experienced two levels of augmentation density, and either walked freely or with enforced speed and path guidance. Augmentation density impacted behavior and reduced awareness of uncommon objects in the environment. Analysis of search task performance and post-experiment item recall revealed differing attention to physical and virtual objects. On the basis of these findings we outline considerations for AR apps designed for use on the go.
comment: Conference Paper, 16 pages. Published at the 2025 CHI Conference on Human Factors in Computing Systems
☆ Risks and Opportunities in Human-Machine Teaming in Operationalizing Machine Learning Target Variables
Predictive modeling has the potential to enhance human decision-making. However, many predictive models fail in practice due to problematic problem formulation in cases where the prediction target is an abstract concept or construct and practitioners need to define an appropriate target variable as a proxy to operationalize the construct of interest. The choice of an appropriate proxy target variable is rarely self-evident in practice, requiring both domain knowledge and iterative data modeling. This process is inherently collaborative, involving both domain experts and data scientists. In this work, we explore how human-machine teaming can support this process by accelerating iterations while preserving human judgment. We study the impact of two human-machine teaming strategies on proxy construction: 1) relevance-first: humans leading the process by selecting relevant proxies, and 2) performance-first: machines leading the process by recommending proxies based on predictive performance. Based on a controlled user study of a proxy construction task (N = 20), we show that the performance-first strategy facilitated faster iterations and decision-making, but also biased users towards well-performing proxies that are misaligned with the application goal. Our study highlights the opportunities and risks of human-machine teaming in operationalizing machine learning target variables, yielding insights for future research to explore the opportunities and mitigate the risks.
comment: 23 pages, 6 figures
☆ The Impact of Navigation Aids on Search Performance and Object Recall in Wide-Area Augmented Reality
Head-worn augmented reality (AR) is a hotly pursued and increasingly feasible contender paradigm for replacing or complementing smartphones and watches for continual information consumption. Here, we compare three different AR navigation aids (on-screen compass, on-screen radar and in-world vertical arrows) in a wide-area outdoor user study (n=24) where participants search for hidden virtual target items amongst physical and virtual objects. We analyzed participants' search task performance, movements, eye-gaze, survey responses and object recall. There were two key findings. First, all navigational aids enhanced search performance relative to a control condition, with some benefit and strongest user preference for in-world arrows. Second, users recalled fewer physical objects than virtual objects in the environment, suggesting reduced awareness of the physical environment. Together, these findings suggest that while navigational aids presented in AR can enhance search task performance, users may pay less attention to the physical environment, which could have undesirable side-effects.
comment: Conference Paper, 17 pages. Published at the 2023 CHI Conference on Human Factors in Computing Systems
☆ Humains-Junior: A 3.8B Language Model Achieving GPT-4o-Level Factual Accuracy by Directed Exoskeleton Reasoning
We introduce Humans-Junior, a 3.8B model that matches GPT-4o on the FACTS Grounding public subset within a $\pm 5$ pp equivalence margin. Results. On Q1--Q500 under identical judges, GPT-4o scores 73.5% (95% CI 69.5--77.2) and Humans-Junior 72.7% (95% CI 68.7--76.5); the paired difference is 0.8 pp (bootstrap 95% CI $-3.1$ to $+4.7$; permutation $p = 0.72$; Cohen's $d = 0.023$). TOST establishes equivalence at $\pm 5$ pp (not at $\pm 3$ pp). When purchased as managed APIs, Humans-Junior's base model (Phi-3.5-mini-instruct) is $\approx 19\times$ less expensive than GPT-4o on Microsoft AI Foundry pricing; self-hosted or edge deployments can drive incremental inference cost toward zero. Measured vs estimated pricing sources are tabulated in Appendix E. Method. Our approach combines minimal directed "Exoskeleton Reasoning" scaffolds with behavioral fine-tuning that teaches protocol compliance (epistemic discipline) rather than domain answers. Fine-tuning alone adds little; combined, they synergize (+17.7 pp, $p < 0.001$) and reduce variance ($\approx 25\%$). In prompt-only settings on frontier models (Q1--Q100; non-comparable), directed reasoning improved GPT-4o by +11.8 pp to 85.3% and Gemini-2.5-Pro by +5.0 pp to 93.3% (baseline 88.3%, $n = 100$); see Section~5. TL;DR. A 3.8B model achieves GPT-4o-level FACTS accuracy (equivalent within $\pm 5$ pp on Q1--Q500). Cloud pricing shows $\approx 19\times$ lower cost versus GPT-4o, and self-hosted/edge deployments can approach zero marginal cost. Pricing sources are listed in Appendix E. Frontier prompt-only gains (Q1--Q100; non-comparable) and optimized-prompt exploratory results under earlier judges are summarized in Appendix F. Keywords: Small Language Models, Factual Grounding, Directed Reasoning, Fine-Tuning, Model Alignment, Cost-Efficient AI
☆ Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, when human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLM models. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance the reliability of LLM raters.
☆ Symbolically Scaffolded Play: Designing Role-Sensitive Prompts for Generative NPC Dialogue
Large Language Models (LLMs) promise to transform interactive games by enabling non-player characters (NPCs) to sustain unscripted dialogue. Yet it remains unclear whether constrained prompts actually improve player experience. We investigate this question through The Interview, a voice-based detective game powered by GPT-4o. A within-subjects usability study ($N=10$) compared high-constraint (HCP) and low-constraint (LCP) prompts, revealing no reliable experiential differences beyond sensitivity to technical breakdowns. Guided by these findings, we redesigned the HCP into a hybrid JSON+RAG scaffold and conducted a synthetic evaluation with an LLM judge, positioned as an early-stage complement to usability testing. Results uncovered a novel pattern: scaffolding effects were role-dependent: the Interviewer (quest-giver NPC) gained stability, while suspect NPCs lost improvisational believability. These findings overturn the assumption that tighter constraints inherently enhance play. Extending fuzzy-symbolic scaffolding, we introduce \textit{Symbolically Scaffolded Play}, a framework in which symbolic structures are expressed as fuzzy, numerical boundaries that stabilize coherence where needed while preserving improvisation where surprise sustains engagement.
♻ ☆ To Rely or Not to Rely? Evaluating Interventions for Appropriate Reliance on Large Language Models
As Large Language Models become integral to decision-making, optimism about their power is tempered with concern over their errors. Users may over-rely on LLM advice that is confidently stated but wrong, or under-rely due to mistrust. Reliance interventions have been developed to help users of LLMs, but they lack rigorous evaluation for appropriate reliance. We benchmark the performance of three relevant interventions by conducting a randomized online experiment with 400 participants attempting two challenging tasks: LSAT logical reasoning and image-based numerical estimation. For each question, participants first answered independently, then received LLM advice modified by one of three reliance interventions and answered the question again. Our findings indicate that while interventions reduce over-reliance, they generally fail to improve appropriate reliance. Furthermore, people became more confident after making wrong reliance decisions in certain contexts, demonstrating poor calibration. Based on our findings, we discuss implications for designing effective reliance interventions in human-LLM collaboration.
♻ ☆ Which Demographic Features Are Relevant for Individual Fairness Evaluation of U.S. Recidivism Risk Assessment Tools?
Despite its constitutional relevance, the technical ``individual fairness'' criterion has not been operationalized in U.S. state or federal statutes/regulations. We conduct a human subjects experiment to address this gap, evaluating which demographic features are relevant for individual fairness evaluation of recidivism risk assessment (RRA) tools. Our analyses conclude that the individual similarity function should consider age and sex, but it should ignore race.
comment: ICAIL 2025
♻ ☆ Towards Human-AI Synergy in UI Design: Supporting Iterative Generation with LLMs
In automated UI design generation, a key challenge is the lack of support for iterative processes, as most systems focus solely on end-to-end output. This stems from limited capabilities in interpreting design intent and a lack of transparency for refining intermediate results. To better understand these challenges, we conducted a formative study that identified concrete and actionable requirements for supporting iterative design with Generative Tools. Guided by these findings, we propose PrototypeFlow, a human-centered system for automated UI generation that leverages multi-modal inputs and models. PrototypeFlow takes natural language descriptions and layout preferences as input to generate the high-fidelity UI design. At its core is a theme design module that clarifies implicit design intent through prompt enhancement and orchestrates sub-modules for component-level generation. Designers retain full control over inputs, intermediate results, and final prototypes, enabling flexible and targeted refinement by steering generation and directly editing outputs. Our experiments and user studies confirmed the effectiveness and usefulness of our proposed PrototypeFlow.
comment: ACM Transactions on Computer-Human Interaction
♻ ☆ Consistency of Responses and Continuations Generated by Large Language Models on Social Media AAAI
Large Language Models (LLMs) demonstrate remarkable capabilities in text generation, yet their emotional consistency and semantic coherence in social media contexts remain insufficiently understood. This study investigates how LLMs handle emotional content and maintain semantic relationships through continuation and response tasks using three open-source models: Gemma, Llama3 and Llama3.3 and one commercial Model:Claude. By analyzing climate change discussions from Twitter and Reddit, we examine emotional transitions, intensity patterns, and semantic consistency between human-authored and LLM-generated content. Our findings reveal that while both models maintain high semantic coherence, they exhibit distinct emotional patterns: these models show a strong tendency to moderate negative emotions. When the input text carries negative emotions such as anger, disgust, fear, or sadness, LLM tends to generate content with more neutral emotions, or even convert them into positive emotions such as joy or surprise. At the same time, we compared the LLM-generated content with human-authored content. The four models systematically generated responses with reduced emotional intensity and showed a preference for neutral rational emotions in the response task. In addition, these models all maintained a high semantic similarity with the original text, although their performance in the continuation task and the response task was different. These findings provide deep insights into the emotion and semantic processing capabilities of LLM, which are of great significance for its deployment in social media environments and human-computer interaction design.
comment: This paper has been accepted by the International AAAI Conference on Web and Social Media (ICWSM) 2026 (Los Angeles, California, U.S.)
♻ ☆ ConsistencyAI: A Benchmark to Assess LLMs' Factual Consistency When Responding to Different Demographic Groups
Is an LLM telling you different facts than it's telling me? This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) for different personas. ConsistencyAI tests whether, when users of different demographics ask identical questions, the model responds with factually inconsistent answers. Designed without involvement from LLM providers, this benchmark offers impartial evaluation and accountability. In our experiment, we queried 19 LLMs with prompts that requested 5 facts for each of 15 topics. We repeated this query 100 times for each LLM, each time adding prompt context from a different persona selected from a subset of personas modeling the general population. We processed the responses into sentence embeddings, computed cross-persona cosine similarity, and computed the weighted average of cross-persona cosine similarity to calculate factual consistency scores. In 100-persona experiments, scores ranged from 0.9065 to 0.7896, and the mean was 0.8656, which we adopt as a benchmark threshold. xAI's Grok-3 is most consistent, while several lightweight models rank lowest. Consistency varies by topic: the job market is least consistent, G7 world leaders most consistent, and issues like vaccines or the Israeli-Palestinian conflict diverge by provider. These results show that both the provider and the topic shape the factual consistency. We release our code and interactive demo to support reproducible evaluation and encourage persona-invariant prompting strategies.
comment: For associated code repository, see http://github.com/banyasp/consistencyAI For user-friendly web app, see http://v0-llm-comparison-webapp.vercel.app/
♻ ☆ Vital Insight: Assisting Experts' Context-Driven Sensemaking of Multi-modal Personal Tracking Data Using Visualization and Human-In-The-Loop LLM
Passive tracking methods, such as phone and wearable sensing, have become dominant in monitoring human behaviors in modern ubiquitous computing studies. While there have been significant advances in machine-learning approaches to translate periods of raw sensor data to model momentary behaviors, (e.g., physical activity recognition), there still remains a significant gap in the translation of these sensing streams into meaningful, high-level, context-aware insights that are required for various applications (e.g., summarizing an individual's daily routine). To bridge this gap, experts often need to employ a context-driven sensemaking process in real-world studies to derive insights. This process often requires manual effort and can be challenging even for experienced researchers due to the complexity of human behaviors. We conducted three rounds of user studies with 21 experts to explore solutions to address challenges with sensemaking. We follow a human-centered design process to identify needs and design, iterate, build, and evaluate Vital Insight (VI), a novel, LLM-assisted, prototype system to enable human-in-the-loop inference (sensemaking) and visualizations of multi-modal passive sensing data from smartphones and wearables. Using the prototype as a technology probe, we observe experts' interactions with it and develop an expert sensemaking model that explains how experts move between direct data representations and AI-supported inferences to explore, question, and validate insights. Through this iterative process, we also synthesize and discuss a list of design implications for the design of future AI-augmented visualization systems to better assist experts' sensemaking processes in multi-modal health sensing data.
comment: Jiachen Li, Xiwen Li, Justin Steinberg, Akshat Choube, Bingsheng Yao, Xuhai Xu, Dakuo Wang, Elizabeth Mynatt, and Varun Mishra. 2025. Vital Insight: Assisting Experts' Context-Driven Sensemaking of Multi-modal Personal Tracking Data Using Visualization and Human-in-the-Loop LLM. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 9, 3, Article 101 (September 2025), 37 pages
♻ ☆ Detecting the Use of Generative AI in Crowdsourced Surveys: Implications for Data Integrity SC
The widespread adoption of generative AI (GenAI) has introduced new challenges in crowdsourced data collection, particularly in survey-based research. While GenAI offers powerful capabilities, its unintended use in crowdsourcing, such as generating automated survey responses, threatens the integrity of empirical research and complicates efforts to understand public opinion and behavior. In this study, we investigate and evaluate two approaches for detecting AI-generated responses in online surveys: LLM-based detection and signature-based detection. We conducted experiments across seven survey studies, comparing responses collected before 2022 with those collected after the release of ChatGPT. Our findings reveal a significant increase in AI-generated responses in the post-2022 studies, highlighting how GenAI may silently distort crowdsourced data. This work raises broader concerns about evolving landscape of data integrity, where GenAI can compromise data quality, mislead researchers, and influence downstream findings in fields such as health, politics, and social behavior. By surfacing detection strategies and empirical evidence of GenAI's impact, we aim to contribute to ongoing conversation about safeguarding research integrity and supporting scholars navigating these methodological and ethical challenges.
comment: Accepted by CSCW 2025 workshop Beyond Information: Online Participatory Culture and Information Disorder
♻ ☆ Constraining Participation: Affordances of Feedback Features in Interfaces to Large Language Models
Large language models (LLMs) are now accessible to anyone with a computer, a web browser, and an internet connection via browser-based interfaces, shifting the dynamics of participation in AI development. This article examines how interactive feedback features in ChatGPT's interface afford user participation in LLM iteration. Drawing on a survey of early ChatGPT users and applying the mechanisms and conditions framework of affordances, we analyse how these features shape user input. Our analysis indicates that these features encourage simple, frequent, and performance-focused feedback while discouraging collective input and discussions among users. Drawing on participatory design literature, we argue such constraints, if replicated across broader user bases, risk reinforcing power imbalances between users, the public, and companies developing LLMs. Our analysis contributes to the growing literature on participatory AI by critically examining the limitations of existing feedback processes and proposing directions for redesign. Rather than focusing solely on aligning model outputs with specific user preferences, we advocate for creating infrastructure that supports sustained dialogue about the purpose and applications of LLMs. This approach requires attention to the ongoing work of "infrastructuring" - creating and sustaining the social, technical, and institutional structures necessary to address matters of concern to stakeholders impacted by LLM development and deployment.
comment: Version accepted for publication
Software Engineering 28
☆ Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents
Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup-planning, perception-driven error diagnosis, feedback-driven repair, and action to execute final environment configuration. Our task instances are automatically constructed by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, Enconda-bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.
☆ A Configuration-First Framework for Reproducible, Low-Code Localization
Machine learning is increasingly permeating radio-based localization services. To keep results credible and comparable, everyday workflows should make rigorous experiment specification and exact repeatability the default, without blocking advanced experimentation. However, in practice, researchers face a three-way gap that could be filled by a framework that offers (i) low coding effort for end-to-end studies, (ii) reproducibility by default including versioned code, data, and configurations, controlled randomness, isolated runs, and recorded artifacts, and (iii) built-in extensibility so new models, metrics, and stages can be added with minimal integration effort. Existing tools rarely deliver all three for machine learning in general and localization workflows in particular. In this paper we introduce LOCALIZE, a low-code, configuration-first framework for radio localization in which experiments are declared in human-readable configuration, a workflow orchestrator runs standardized pipelines from data preparation to reporting, and all artifacts, such as datasets, models, metrics, and reports, are versioned. The preconfigured, versioned datasets reduce initial setup and boilerplate, speeding up model development and evaluation. The design, with clear extension points, allows experts to add components without reworking the infrastructure. In a qualitative comparison and a head-to-head study against a plain Jupyter notebook baseline, we show that the framework reduces authoring effort while maintaining comparable runtime and memory behavior. Furthermore, using a Bluetooth Low Energy dataset, we show that scaling across training data (1x to 10x) keeps orchestration overheads bounded as data grows. Overall, the framework makes reproducible machine-learning-based localization experimentation practical, accessible, and extensible.
comment: 20 pages, 7 figures. Preprint submitted to ACM Transactions on Software Engineering and Methodology (TOSEM), 2025
☆ Fuzz Smarter, Not Harder: Towards Greener Fuzzing with GreenAFL
Fuzzing has become a key search-based technique for software testing, but continuous fuzzing campaigns consume substantial computational resources and generate significant carbon footprints. Existing grey-box fuzzing approaches like AFL++ focus primarily on coverage maximisation, without considering the energy costs of exploring different execution paths. This paper presents GreenAFL, an energy-aware framework that incorporates power consumption into the fuzzing heuristics to reduce the environmental impact of automated testing whilst maintaining coverage. GreenAFL introduces two key modifications to traditional fuzzing workflows: energy-aware corpus minimisation considering power consumption when reducing initial corpora, and energy-guided heuristics that direct mutation towards high-coverage, low-energy inputs. We conduct an ablation study comparing vanilla AFL++, energy-based corpus minimisation, and energy-based heuristics to evaluate the individual contributions of each component. Results show that highest coverage, and lowest energy usage is achieved whenever at least one of our modifications is used.
☆ Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies
Large Language Models have gained remarkable interest in industry and academia. The increasing interest in LLMs in academia is also reflected in the number of publications on this topic over the last years. For instance, alone 78 of the around 425 publications at ICSE 2024 performed experiments with LLMs. Conducting empirical studies with LLMs remains challenging and raises questions on how to achieve reproducible results, for both other researchers and practitioners. One important step towards excelling in empirical research on LLMs and their application is to first understand to what extent current research results are eventually reproducible and what factors may impede reproducibility. This investigation is within the scope of our work. We contribute an analysis of the reproducibility of LLM-centric studies, provide insights into the factors impeding reproducibility, and discuss suggestions on how to improve the current state. In particular, we studied the 86 articles describing LLM-centric studies, published at ICSE 2024 and ASE 2024. Of the 86 articles, 18 provided research artefacts and used OpenAI models. We attempted to replicate those 18 studies. Of the 18 studies, only five were fit for reproduction. For none of the five studies, we were able to fully reproduce the results. Two studies seemed to be partially reproducible, and three studies did not seem to be reproducible. Our results highlight not only the need for stricter research artefact evaluations but also for more robust study designs to ensure the reproducible value of future publications.
☆ Proceedings of the 12th Workshop on Horn Clauses for Verification and Synthesis
This volume contains the post-proceedings of the 12th Workshop on Horn Clauses for Verification and Synthesis (HCVS 2025), which took place in Zagreb, Croatia, on July 22, 2025, as affiliated workshop of the 37th International Conference on Computer Aided Verification (CAV 2025).
☆ What Challenges Do Developers Face in AI Agent Systems? An Empirical Study on Stack Overflow
AI agents have rapidly gained popularity across research and industry as systems that extend large language models with additional capabilities to plan, use tools, remember, and act toward specific goals. Yet despite their promise, developers face persistent and often underexplored challenges when building, deploying, and maintaining these emerging systems. To identify these challenges, we study developer discussions on Stack Overflow, the world's largest developer-focused Q and A platform with about 60 million questions and answers and 30 million users. We construct a taxonomy of developer challenges through tag expansion and filtering, apply LDA-MALLET for topic modeling, and manually validate and label the resulting themes. Our analysis reveals seven major areas of recurring issues encompassing 77 distinct technical challenges related to runtime integration, dependency management, orchestration complexity, and evaluation reliability. We further quantify topic popularity and difficulty to identify which issues are most common and hardest to resolve, map the tools and programming languages used in agent development, and track their evolution from 2021 to 2025 in relation to major AI model and framework releases. Finally, we present the implications of our results, offering concrete guidance for practitioners, researchers, and educators on agent reliability and developer support.
comment: 12 pages, 4 Figures
☆ Dissect-and-Restore: AI-based Code Verification with Transient Refactoring
Formal verification is increasingly recognized as a critical foundation for building reliable software systems. However, the need for specialized expertise to write precise specifications, navigate complex proof obligations, and learn annotations often makes verification an order of magnitude more expensive than implementation. While modern AI systems can recognize patterns in mathematical proofs and interpret natural language, effectively integrating them into the formal verification process remains an open challenge. We present Prometheus, a novel AI-assisted system that facilitates automated code verification with current AI capabilities in conjunction with modular software engineering principles (e.g., modular refactoring). Our approach begins by decomposing complex program logic, such as nested loops, into smaller, verifiable components. Once verified, these components are recomposed to construct a proof of the original program. This decomposition-recomposition workflow is non-trivial. Prometheus addresses this by guiding the proof search through structured decomposition of complex lemmas into smaller, verifiable sub-lemmas. When automated tools are insufficient, users can provide lightweight natural language guidance to steer the proof process effectively. Our evaluation demonstrates that transiently applying modular restructuring to the code substantially improves the AI's effectiveness in verifying individual components. This approach successfully verifies 86% of tasks in our curated dataset, compared to 68% for the baseline. Gains are more pronounced with increasing specification complexity, improving from 30% to 69%, and when integrating proof outlines for complex programs, from 25% to 87%.
☆ Understanding the Characteristics of LLM-Generated Property-Based Tests in Exploring Edge Cases
As Large Language Models (LLMs) increasingly generate code in software development, ensuring the quality of LLM-generated code has become important. Traditional testing approaches using Example-based Testing (EBT) often miss edge cases -- defects that occur at boundary values, special input patterns, or extreme conditions. This research investigates the characteristics of LLM-generated Property-based Testing (PBT) compared to EBT for exploring edge cases. We analyze 16 HumanEval problems where standard solutions failed on extended test cases, generating both PBT and EBT test codes using Claude-4-sonnet. Our experimental results reveal that while each method individually achieved a 68.75\% bug detection rate, combining both approaches improved detection to 81.25\%. The analysis demonstrates complementary characteristics: PBT effectively detects performance issues and edge cases through extensive input space exploration, while EBT effectively detects specific boundary conditions and special patterns. These findings suggest that a hybrid approach leveraging both testing methods can improve the reliability of LLM-generated code, providing guidance for test generation strategies in LLM-based code generation.
comment: Accepted for publication in 2nd IEEE/ACM international conference on AI-powered Software (AIware 2025) : 8 pages, 1 table, 8 figures
☆ TECS/Rust: Memory-safe Component Framework for Embedded Systems
As embedded systems grow in complexity and scale due to increased functional diversity, component-based development (CBD) emerges as a solution to streamline their architecture and enhance functionality reuse. CBD typically utilizes the C programming language for its direct hardware access and low-level operations, despite its susceptibility to memory-related issues. To address these concerns, this paper proposes TECS/Rust, a Rust-based framework specifically designed for TECS, which is a component framework for embedded systems. It leverages Rust's compile-time memory-safe features, such as lifetime and borrowing, to mitigate memory vulnerabilities common with C. The proposed framework not only ensures memory safety but also maintains the flexibility of CBD, automates Rust code generation for CBD components, and supports efficient integration with real-time operating systems. An evaluation of the amount of generated code indicates that the code generated by this paper framework accounts for a large percentage of the actual code. Compared to code developed without the proposed framework, the difference in execution time is minimal, indicating that the overhead introduced by the proposed framework is negligible.
comment: 10 pages. This version includes minor lstlisting configuration adjustments for successful compilation. No changes to content or layout. Originally published at IEEE ISORC 2024
☆ TECS/Rust-OE: Optimizing Exclusive Control in Rust-based Component Systems for Embedded Devices
The diversification of functionalities and the development of the IoT are making embedded systems larger and more complex in structure. Ensuring system reliability, especially in terms of security, necessitates selecting an appropriate programming language. As part of existing research, TECS/Rust has been proposed as a framework that combines Rust and component-based development (CBD) to enable scalable system design and enhanced reliability. This framework represents system structures using static mutable variables, but excessive exclusive controls applied to ensure thread safety have led to performance degradation. This paper proposes TECS/Rust-OE, a memory-safe CBD framework utilizing call flows to address these limitations. The proposed Rust code leverages real-time OS exclusive control mechanisms, optimizing performance without compromising reusability. Rust code is automatically generated based on component descriptions. Evaluations demonstrate reduced overhead due to optimized exclusion control and high reusability of the generated code.
comment: 5 pages (layout expanded from the 4-page IEEE version due to minor lstlisting configuration adjustments for compilation). Originally published as a poster paper at IEEE ISORC 2025
☆ Optimizing Knowledge Utilization for Multi-Intent Comment Generation with Large Language Models
Code comment generation aims to produce a generic overview of a code snippet, helping developers understand and maintain code. However, generic summaries alone are insufficient to meet the diverse needs of practitioners; for example, developers expect the implementation insights to be presented in an untangled manner, while users seek clear usage instructions. This highlights the necessity of multi-intent comment generation. With the widespread adoption of Large Language Models (LLMs) for code-related tasks, these models have been leveraged to tackle the challenge of multi-intent comment generation. Despite their successes, state-of-the-art LLM-based approaches often struggle to construct correct relationships among intents, code, and comments within a smaller number of demonstration examples. To mitigate this issue, we propose a framework named KUMIC for multi-intent comment generation. Built upon in-context learning, KUMIC leverages Chain-of-Thought (CoT) to optimize knowledge utilization for LLMs to generate intent-specific comments. Specifically, KUMIC first designs a retrieval mechanism to obtain similar demonstration examples, which exhibit high code-comment consistency. Then, KUMIC leverages CoT to guide LLMs to focus on statements facilitating the derivation of code comments aligned with specific intents. In this context, KUMIC constructs a mapping knowledge chain, linking code to intent-specific statements to comments, which enables LLMs to follow similar reasoning steps when generating the desired comments. We conduct extensive experiments to evaluate KUMIC, and the results demonstrate that KUMIC outperforms state-of-the-art baselines by 14.49\%, 22.41\%, 20.72\%, and 12.94\% in terms of BLEU, METEOR, ROUGE-L, and SBERT, respectively.
☆ Automated Program Repair Based on REST API Specifications Using Large Language Models
Many cloud services provide REST API accessible to client applications. However, developers often identify specification violations only during testing, as error messages typically lack the detail necessary for effective diagnosis. Consequently, debugging requires trial and error. This study proposes dcFix, a method for detecting and automatically repairing REST API misuses in client programs. In particular, dcFix identifies non-conforming code fragments, integrates them with the relevant API specifications into prompts, and leverages a Large Language Model (LLM) to produce the corrected code. Our evaluation demonstrates that dcFix accurately detects misuse and outperforms the baseline approach, in which prompts to the LLM omit any indication of code fragments non conforming to REST API specifications.
☆ Adaptive Proof Refinement with LLM-Guided Strategy Selection
Formal verification via theorem proving enables the expressive specification and rigorous proof of software correctness, but it is difficult to scale due to the significant manual effort and expertise required. While Large Language Models (LLMs) show potential in proof generation, they frequently produce incorrect proofs on the first attempt and require additional strategies for iterative refinement. However, existing approaches employ fixed refinement strategies and cannot dynamically choose an effective strategy based on the particular issues in a generated proof, which limits their performance. To overcome this limitation, we introduce Adapt, a novel proof refinement framework that leverages an LLM-guided decision-maker to dynamically select a suitable refinement strategy according to the state of the proof assistant and available context of an incorrect proof. We evaluate Adapt on two benchmarks against four existing methods and find that it significantly outperforms the best baseline on both by proving 16.63% and 18.58% more theorems, respectively. Furthermore, we demonstrate Adapt's generalizability by evaluating it across five different LLMs. We also conduct ablation studies to measure the contribution of each component and compare the trade-offs of alternative decision-maker designs.
comment: 11 pages, 11 figures
☆ Same Same But Different: Preventing Refactoring Attacks on Software Plagiarism Detection ICSE'26
Plagiarism detection in programming education faces growing challenges due to increasingly sophisticated obfuscation techniques, particularly automated refactoring-based attacks. While code plagiarism detection systems used in education practice are resilient against basic obfuscation, they struggle against structural modifications that preserve program behavior, especially caused by refactoring-based obfuscation. This paper presents a novel and extensible framework that enhances state-of-the-art detectors by leveraging code property graphs and graph transformations to counteract refactoring-based obfuscation. Our comprehensive evaluation of real-world student submissions, obfuscated using both algorithmic and AI-based obfuscation attacks, demonstrates a significant improvement in detecting plagiarized code.
comment: To be published at ICSE'26. 13 pages, 6 figures
☆ From Queries to Insights: Agentic LLM Pipelines for Spatio-Temporal Text-to-SQL SP
Natural-language-to-SQL (NL-to-SQL) systems hold promise for democratizing access to structured data, allowing users to query databases without learning SQL. Yet existing systems struggle with realistic spatio-temporal queries, where success requires aligning vague user phrasing with schema-specific categories, handling temporal reasoning, and choosing appropriate outputs. We present an agentic pipeline that extends a naive text-to-SQL baseline (llama-3-sqlcoder-8b) with orchestration by a Mistral-based ReAct agent. The agent can plan, decompose, and adapt queries through schema inspection, SQL generation, execution, and visualization tools. We evaluate on 35 natural-language queries over the NYC and Tokyo check-in dataset, covering spatial, temporal, and multi-dataset reasoning. The agent achieves substantially higher accuracy than the naive baseline 91.4% vs. 28.6% and enhances usability through maps, plots, and structured natural-language summaries. Crucially, our design enables more natural human-database interaction, supporting users who lack SQL expertise, detailed schema knowledge, or prompting skill. We conclude that agentic orchestration, rather than stronger SQL generators alone, is a promising foundation for interactive geospatial assistants.
comment: 8 pages, 5 figures, GeoGenAgent'25 - ACM SIGSPATIAL
☆ A Process Mining-Based System For The Analysis and Prediction of Software Development Workflows
CodeSight is an end-to-end system designed to anticipate deadline compliance in software development workflows. It captures development and deployment data directly from GitHub, transforming it into process mining logs for detailed analysis. From these logs, the system generates metrics and dashboards that provide actionable insights into PR activity patterns and workflow efficiency. Building on this structured representation, CodeSight employs an LSTM model that predicts remaining PR resolution times based on sequential activity traces and static features, enabling early identification of potential deadline breaches. In tests, the system demonstrates high precision and F1 scores in predicting deadline compliance, illustrating the value of integrating process mining with machine learning for proactive software project management.
comment: 16 pages, 7 figures, 4 tables
☆ PRISM: Proof-Carrying Artifact Generation through LLM x MDE Synergy and Stratified Constraints
PRISM unifies Large Language Models with Model-Driven Engineering to generate regulator-ready artifacts and machine-checkable evidence for safety- and compliance-critical domains. PRISM integrates three pillars: a Unified Meta-Model (UMM) reconciles heterogeneous schemas and regulatory text into a single semantic space; an Integrated Constraint Model (ICM) compiles structural and semantic requirements into enforcement artifacts including generation-time automata (GBNF, DFA) and post-generation validators (e.g., SHACL, SMT); and Constraint-Guided Verifiable Generation (CVG) applies these through two-layer enforcement - structural constraints drive prefix-safe decoding while semantic/logical validation produces machine-checkable certificates. When violations occur, PRISM performs audit-guided repair and records generation traces for compliance review. We evaluate PRISM in automotive software engineering (AUTOSAR) and cross-border legal jurisdiction (Brussels I bis). PRISM produces structurally valid, auditable artifacts that integrate with existing tooling and substantially reduce manual remediation effort, providing a practical path toward automated artifact generation with built-in assurance.
comment: 45 pages, 9 figures
☆ Internal Vulnerabilities, External Threats: A Grounded Framework for Enterprise Open Source Risk Governance
Enterprise engagement with open source has evolved from tactical adoption to strategic deep integration, exposing them to a complex risk landscape far beyond mere code. However, traditional risk management, narrowly focused on technical tools, is structurally inadequate for systemic threats like upstream "silent fixes", community conflicts, or sudden license changes, creating a dangerous governance blind spot. To address this governance vacuum and enable the necessary shift from tactical risk management to holistic risk governance, we conducted a grounded theory study with 15 practitioners to develop a holistic risk governance framework. Our study formalizes an analytical framework built on a foundational risk principle: an uncontrollable External Threat (e.g., a sudden license change in a key dependency) only becomes a critical risk when it exploits a controllable Internal Vulnerability (e.g., an undefined risk appetite for single-vendor projects), which then amplifies the impact.The framework operationalizes this principle through a clear logical chain: "Objectives -> Threats -> Vulnerabilities -> Mitigation" (OTVM). This provides a holistic decision model that transcends mere technical checklists. Based on this logic, our contributions are: (1) a "Strategic Objectives Matrix" to clarify goals; (2) a systematic dual taxonomy of External Threats (Ex-Tech, Ex-Comm, Ex-Eco) and Internal Vulnerabilities (In-Strat, In-Ops, In-Tech); and (3) an actionable mitigation framework mapping capability-building to these vulnerabilities. The framework's analytical utility was validated by three industry experts through retrospective case studies on real-world incidents. This work provides a novel diagnostic lens and a systematic path for enterprises to shift from reactive "firefighting" to proactively building an organizational "immune system".
☆ ScaleCall -- Agentic Tool Calling at Scale for Fintech: Challenges, Methods, and Deployment Insights
While Large Language Models (LLMs) excel at tool calling, deploying these capabilities in regulated enterprise environments such as fintech presents unique challenges due to on-premises constraints, regulatory compliance requirements, and the need to disambiguate large, functionally overlapping toolsets. In this paper, we present a comprehensive study of tool retrieval methods for enterprise environments through the development and deployment of ScaleCall, a prototype tool-calling framework within Mastercard designed for orchestrating internal APIs and automating data engineering workflows. We systematically evaluate embedding-based retrieval, prompt-based listwise ranking, and hybrid approaches, revealing that method effectiveness depends heavily on domain-specific factors rather than inherent algorithmic superiority. Through empirical investigation on enterprise-derived benchmarks, we find that embedding-based methods offer superior latency for large tool repositories, while listwise ranking provides better disambiguation for overlapping functionalities, with hybrid approaches showing promise in specific contexts. We integrate our findings into ScaleCall's flexible architecture and validate the framework through real-world deployment in Mastercard's regulated environment. Our work provides practical insights into the trade-offs between retrieval accuracy, computational efficiency, and operational requirements, contributing to the understanding of tool-calling system design for enterprise applications in regulated industries.
♻ ☆ RobEthiChor: Automated Context-aware Ethics-based Negotiation for Autonomous Robots
The presence of autonomous systems is growing at a fast pace and it is impacting many aspects of our lives. Designed to learn and act independently, these systems operate and perform decision-making without human intervention. However, they lack the ability to incorporate users' ethical preferences, which are unique for each individual in society and are required to personalize the decision-making processes. This reduces user trust and prevents autonomous systems from behaving according to the moral beliefs of their end-users. When multiple systems interact with differing ethical preferences, they must negotiate to reach an agreement that satisfies the ethical beliefs of all the parties involved and adjust their behavior consequently. To address this challenge, this paper proposes RobEthiChor, an approach that enables autonomous systems to incorporate user ethical preferences and contextual factors into their decision-making through ethics-based negotiation. RobEthiChor features a domain-agnostic reference architecture for designing autonomous systems capable of ethic-based negotiating. The paper also presents RobEthiChor-Ros, an implementation of RobEthiChor within the Robot Operating System (ROS), which can be deployed on robots to provide them with ethics-based negotiation capabilities. To evaluate our approach, we deployed RobEthiChor-Ros on real robots and ran scenarios where a pair of robots negotiate upon resource contention. Experimental results demonstrate the feasibility and effectiveness of the system in realizing ethics-based negotiation. RobEthiChor allowed robots to reach an agreement in more than 73% of the scenarios with an acceptable negotiation time (0.67s on average). Experiments also demonstrate that the negotiation approach implemented in RobEthiChor is scalable.
♻ ☆ Predicting Abandonment of Open Source Software Projects with An Integrated Feature Framework
Open Source Software (OSS) is a cornerstone of contemporary software development, yet the increasing prevalence of OSS project abandonment threatens global software supply chains. Although previous research has explored abandonment prediction methods, these methods often demonstrate unsatisfactory predictive performance, further plagued by imprecise abandonment discrimination, limited interpretability, and a lack of large, generalizable datasets. In this work, we address these challenges by reliably detecting OSS project abandonment through a dual approach: explicit archival status and rigorous semantic analysis of project documentation or description. Leveraging a precise and scalable labeling pipeline, we curate a comprehensive longitudinal dataset of 115,466 GitHub repositories, encompassing 57,733 confirmed abandonment repositories, enriched with detailed, timeline-based behavioral features. Building on this foundation, we introduce an integrated, multi-perspective feature framework for abandonment prediction, capturing user-centric, maintainer-centric, and project evolution features. Survival analysis using an AFT model yields a high C-index of 0.846, substantially outperforming models confined to surface features. Further, feature ablation and SHAP analyses confirm both the predictive power and interpretability of our approach. We further demonstrate practical deployment of a GBSA classifier for package risk in openEuler. By unifying precise labeling, multi-perspective features, and interpretable modeling, our work provides reproducible, scalable, and practitioner-oriented support for understanding and managing abandonment risk in large OSS ecosystems. Our tool not only predicts abandonment but also enhances program comprehension by providing actionable insights into the health and sustainability of OSS projects.
♻ ☆ The Strength of Weak Ties Between Open-Source Developers
In a real-world social network, weak ties (reflecting low-intensity, infrequent interactions) act as bridges and connect people to different social circles, giving them access to diverse information and opportunities that are not available within one's immediate, close-knit vicinity. Weak ties can be crucial for creativity and innovation, as they introduce ideas and approaches that people can then combine in novel ways, leading to innovative solutions. Do weak ties facilitate creativity in software in similar ways? This paper suggests that the answer is "yes." Concretely, we study the correlation between developers' knowledge acquisition through three distinct interaction networks on GitHub and the innovativeness of the projects they develop, across over 37,000 Python projects hosted on GitHub. Our findings suggest that the topical diversity of projects in which developers engage, rather than the volume, correlates positively with the innovativeness of their future code. Notably, exposure through weak interactions (e.g., starring) emerges as a stronger predictor of future novelty than via strong ones (e.g., committing)
comment: 15 pages, 7 figures
♻ ☆ Securing AI Agent Execution
Large Language Models (LLMs) have evolved into AI agents that interact with external tools and environments to perform complex tasks. The Model Context Protocol (MCP) has become the de facto standard for connecting agents with such resources, but security has lagged behind: thousands of MCP servers execute with unrestricted access to host systems, creating a broad attack surface. In this paper, we introduce AgentBound, the first access control framework for MCP servers. AgentBound combines a declarative policy mechanism, inspired by the Android permission model, with a policy enforcement engine that contains malicious behavior without requiring MCP server modifications. We build a dataset containing the 296 most popular MCP servers, and show that access control policies can be generated automatically from source code with 80.9% accuracy. We also show that AgentBound blocks the majority of security threats in several malicious MCP servers, and that policy enforcement engine introduces negligible overhead. Our contributions provide developers and project managers with a practical foundation for securing MCP servers while maintaining productivity, enabling researchers and tool builders to explore new directions for declarative access control and MCP security.
♻ ☆ Qualitative Analysis of the Teacher and Student Roles in Pair Programming
Background: Pair programming is a well-established and versatile agile practice. Previous research has found it to involve far more different roles than the well-known Driver and Observer/Navigator roles. Pair programming often involves heavy knowledge transfer from mainly one partner to the other. Objective: Understand how to fill the ensuing Teacher and Student roles well (positive behavioral patterns). Understand how they may break (anti-patterns). Method: Open coding and axial coding of 17 recorded pair programming sessions with 18 developers from 5 German software companies, plus interviews with 6 different developers from 4 other German companies. Results: We describe six facets of effective Teacher behavior (e.g. Prioritizing Knowledge Transfer) and two facets of effective Student behavior (e.g. Expressing Knowledge Wants). We describe four harmful would-be-Teacher behaviors (e.g. Pushing Unwanted Knowledge), and one harmful would-be-Student behavior (Failing to Provide a Back Channel). Conclusions: The role facets can serve as learning goals and to-do list for developers who want to develop strong pair programming skill. The anti-patterns can serve as warnings for one's own general behavior and as triggers for immediate meta-discussion if they occur within a pairing session.
♻ ☆ OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents NeurIPS 2025
Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score). We evaluate computer use agents based on a range of frontier models - such as o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro - and provide insights into their safety. In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm.
comment: NeurIPS 2025 Datasets & Benchmarks Track (Spotlight)
♻ ☆ Efficient Function Orchestration for Large Language Models
Function calling is a fundamental capability of today's large language models, but sequential function calling posed efficiency problems. Recent studies have proposed to request function calls with parallelism support in order to alleviate this issue. However, they either delegate the concurrent function calls to users for execution which are conversely executed sequentially, or overlook the relations among various function calls, rending limited efficiency. This paper introduces LLMOrch, an advanced framework for automated, parallel function calling in large language models. The key principle behind LLMOrch is to identify an available processor to execute a function call while preventing any single processor from becoming overburdened. To this end, LLMOrch models the data relations (i.e., def-use) among different function calls and coordinates their executions by their control relations (i.e., mutual-exclusion) as well as the working status of the underlying processors. When comparing with state-of-the-art techniques, LLMOrch demonstrated comparable efficiency improvements in orchestrating I/O-intensive functions, while significantly outperforming (2$\times$) them with compute-intensive functions. LLMOrch's performance even showed a linear correlation to the number of allocated processors. We believe that these results highlight the potential of LLMOrch as an efficient solution for parallel function orchestration in the context of large language models.
comment: Accepted for publication in IEEE Transactions on Software Engineering (TSE), 2025
♻ ☆ Causes and Canonicalization for Unreproducible Builds in Java
The increasing complexity of software supply chains and the rise of supply chain attacks have elevated concerns around software integrity. Users and stakeholders face significant challenges in validating that a given software artifact corresponds to its declared source. Reproducible Builds address this challenge by ensuring that independently performed builds from identical source code produce identical binaries. However, achieving reproducibility at scale remains difficult, especially in Java, due to a range of non-deterministic factors and caveats in the build process. In this work, we focus on reproducibility in Java-based software, archetypal of enterprise applications. We introduce a conceptual framework for reproducible builds, we analyze a large dataset from Reproducible Central, and we develop a novel taxonomy of six root causes of unreproducibility. We study actionable mitigations: artifact and bytecode canonicalization using OSS-Rebuild and jNorm respectively. Finally, we present Chains-Rebuild, a tool that achieve successfulcanonicalization for 26.60% on 12,803 unreproducible artifacts To sum up, our contributions are the first large-scale taxonomy of build unreproducibility causes in Java, a publicly available dataset of unreproducible builds, and Chains-Rebuild, a canonicalization tool for mitigating unreproducible builds in Java.
comment: Accepted to IEEE Transactions on Software Engineering
♻ ☆ Maven-Hijack: Software Supply Chain Attack Exploiting Packaging Order
Java projects frequently rely on package managers such as Maven to manage complex webs of external dependencies. While these tools streamline development, they also introduce subtle risks to the software supply chain. In this paper, we present Maven-Hijack, a novel attack that exploits the order in which Maven packages dependencies and the way the Java Virtual Machine resolves classes at runtime. By injecting a malicious class with the same fully qualified name as a legitimate one into a dependency that is packaged earlier, an attacker can silently override core application behavior without modifying the main codebase or library names. We demonstrate the real-world feasibility of this attack by compromising the Corona-Warn-App, a widely used open-source COVID-19 contact tracing system, and gaining control over its database connection logic. We evaluate three mitigation strategies, such as sealed JARs, Java Modules, and the Maven Enforcer plugin. Our results show that, while Java Modules offer strong protection, the Maven Enforcer plugin with duplicate class detection provides the most practical and effective defense for current Java projects. These findings highlight the urgent need for improved safeguards in Java's build and dependency management processes to prevent stealthy supply chain attacks.
Systems and Control 35
☆ Over 3 kV and Ultra-Low leakage Vertical (011) \b{eta}-Ga2O3 Power Diodes with Engineered Schottky Contact and High-permittivity Dielectric Field Plate
We report over 3 kV breakdown voltage and ultra-low leakage (011) \b{eta}-Ga2O3 power devices utilizing Schottky barrier engineering and high-permittivity (\k{appa}) dielectric (ZrO2) field plate. The (011) orientation of \b{eta}-Ga2O3 enabled low background doping and thick drift layers which are promising to support kV-class vertical \b{eta}-Ga2O3 power switches. The Schottky barrier engineering was performed with a composite Pt cap/PtOx/Pt (1.5 nm) anode contact to take advantage of the enhanced reverse blocking capabilities enabled by PtOx while allowing low turn-on voltage by the interfacing thin Pt layer. We also performed a systematic study using a co-processed Pt/(011) \b{eta}-Ga2O3 Schottky barrier diodes (SBDs) on the same wafer. The bare SBDs revealed a breakdown voltage of ~1.5 kV, while the field-plate Pt/(011) \b{eta}-Ga2O3 SBDs achieved an increased breakdown voltage of 2.75 kV owing to the edge field management. Further enhancement of the breakdown voltage was achieved by tunneling leakage management using composite Pt cap/PtOx/Pt (1.5 nm) Schottky contacts that ultimately enabled breakdown voltage of 3.7 kV for the field-plate diodes. Remarkably, the Pt cap/PtOx/Pt (1.5 nm) Schottky contacts maintained similar turn-on voltage as the Pt/(011) \b{eta}-Ga2O3 SBDs. The combination of efficient tunneling leakage management by composite Pt cap/PtOx/Pt (1.5 nm) contacts with similar turn-on voltage, edge field reduction by high-\k{appa} dielectric ZrO2 field plate, as well as the advantageous material properties offered by (011) \b{eta}-Ga2O3 demonstrate a promising strategy for developing ultra-low leakage and multi-kV class vertical (011) \b{eta}-Ga2O3 power devices.
☆ An OPF-based Control Framework for Hybrid AC-MTDC Power Systems under Uncertainty
The increasing integration of renewable energy, particularly offshore wind, introduces significant uncertainty into hybrid AC-HVDC systems due to forecast errors and power fluctuations. Conventional control strategies typically rely on fixed setpoints and neglect frequency deviations, which can compromise system stability under rapid renewable variations. To address this challenge, this paper presents a forecast-integrated, optimal power flow (OPF)-based adaptive control framework. Wind speed forecasts generated using a Random Forest model are incorporated into a time-coupled OPF to determine baseline converter setpoints in anticipation of wind fluctuations, which are further adjusted in real time based on actual operating conditions. An adaptive droop control scheme is developed that jointly considers DC voltage and AC frequency deviations. The effectiveness of the proposed control framework is validated through hardware-in-the-loop (HIL) simulations, demonstrating its capability to ensure stable and robust operation of hybrid AC-HVDC systems under high penetration of renewable energy.
☆ Incorporating Social Awareness into Control of Unknown Multi-Agent Systems: A Real-Time Spatiotemporal Tubes Approach
This paper presents a decentralized control framework that incorporates social awareness into multi-agent systems with unknown dynamics to achieve prescribed-time reach-avoid-stay tasks in dynamic environments. Each agent is assigned a social awareness index that quantifies its level of cooperation or self-interest, allowing heterogeneous social behaviors within the system. Building on the spatiotemporal tube (STT) framework, we propose a real-time STT framework that synthesizes tubes online for each agent while capturing its social interactions with others. A closed-form, approximation-free control law is derived to ensure that each agent remains within its evolving STT, thereby avoiding dynamic obstacles while also preventing inter-agent collisions in a socially aware manner, and reaching the target within a prescribed time. The proposed approach provides formal guarantees on safety and timing, and is computationally lightweight, model-free, and robust to unknown disturbances. The effectiveness and scalability of the framework are validated through simulation and hardware experiments on a 2D omnidirectional
☆ Optimal and Heuristic Approaches for Platooning Systems with Deadlines
Efficient truck platooning is a key strategy for reducing freight costs, lowering fuel consumption, and mitigating emissions. Deadlines are critical in this context, as trucks must depart within specific time windows to meet delivery requirements and avoid penalties. In this paper, we investigate the optimal formation and dispatch of truck platoons at a highway station with finite capacity $L$ and deadline constraints $T$. The system operates in discrete time, with each arriving truck assigned a deadline of $T$ slot units. The objective is to leverage the efficiency gains from forming large platoons while accounting for waiting costs and deadline violations. We formulate the problem as a Markov decision process and analyze the structure of the optimal policy $\pi^\star$ for $L = 3$, extending insights to arbitrary $L$. We prove that the $\pi^\star$ is monotone in the state space $\mathcal{S}$ and identify classes of unreachable states. Moreover, since $\mathcal{S}$ grows exponentially with $L$ and $T$, we propose heuristics-including conditional and deep-learning based approaches-that exploit these structural insights while maintaining low computational complexity.
☆ Deep Reinforcement Learning-Based Cooperative Rate Splitting for Satellite-to-Underground Communication Networks
Reliable downlink communication in satellite-to-underground networks remains challenging due to severe signal attenuation caused by underground soil and refraction in the air-soil interface. To address this, we propose a novel cooperative rate-splitting (CRS)-aided transmission framework, where an aboveground relay decodes and forwards the common stream to underground devices (UDs). Based on this framework, we formulate a max-min fairness optimization problem that jointly optimizes power allocation, message splitting, and time slot scheduling to maximize the minimum achievable rate across UDs. To solve this high-dimensional non-convex problem under uncertain channels, we develop a deep reinforcement learning solution framework based on the proximal policy optimization (PPO) algorithm that integrates distribution-aware action modeling and a multi-branch actor network. Simulation results under a realistic underground pipeline monitoring scenario demonstrate that the proposed approach achieves average max-min rate gains exceeding $167\%$ over conventional benchmark strategies across various numbers of UDs and underground conditions.
comment: 6 pages, 3 figures, 1 table, and submitted to IEEE TVT
☆ Sum-of-Squares Certificates for Almost-Sure Reachability of Stochastic Polynomial Systems
In this paper, we present a computational approach to certify almost sure reachability for discrete-time polynomial stochastic systems by turning drift--variant criteria into sum-of-squares (SOS) programs solved with standard semidefinite solvers. Specifically, we provide an SOS method based on two complementary certificates: (i) a drift certificate that enforces a radially unbounded function to be non-increasing in expectation outside a compact set of states; and (ii) a variant certificate that guarantees a one-step decrease with positive probability and ensures the target contains its nonpositive sublevel set. We transform these conditions to SOS constraints. For the variant condition, we enforce a robust decrease over a parameterized disturbance ball with nonzero probability and encode the constraints via an S-procedure with polynomial multipliers. The resulting bilinearities are handled by an alternating scheme that alternates between optimizing multipliers and updating the variant and radius until a positive slack is obtained. Two case studies illustrate the workflow and certifies almost-sure reachability.
comment: 8 Pages, 8 Figs
☆ A New Neural Network Paradigm for Scalable and Generalizable Stability Analysis of Power Systems
This paper presents a new neural network (NN) paradigm for scalable and generalizable stability analysis of power systems. The paradigm consists of two parts: the neural stability descriptor and the sample-augmented iterative training scheme. The first part, based on system decomposition, constructs the object (such as a stability function or condition) for stability analysis as a scalable aggregation of multiple NNs. These NNs remain fixed across varying power system structures and parameters, and are repeatedly shared within each system instance defined by these variations, thereby enabling the generalization of the neural stability descriptor across a class of power systems. The second part learns the neural stability descriptor by iteratively training the NNs with sample augmentation, guided by the tailored conservativeness-aware loss function. The training set is strategically constructed to promote the descriptor's generalizability, which is systematically evaluated by verification and validation during the training process. Specifically, the proposed NN paradigm is implemented for large-disturbance stability analysis of the bulk power grid and small-disturbance stability conditions of the microgrid system. Finally, numerical studies for the two implementations demonstrate the applicability and effectiveness of the proposed NN paradigm.
☆ Combining Moving Mass Actuators and Manoeuvring Models for Underwater Vehicles: A Lagrangian Approach
In this paper, we present a Newton-Euler formulation of the equations of motion for underwater vehicles with an interntal moving mass actuator. Furthermore, the moving mass dynamics are expressed as an extension to the manoeuvring model for underwater vehicles, originally introduced by Fossen (1991). The influence of the moving mass is described in body-frame and included as states in both an additional kinematic equation and as part of the coupled rigid-body kinetics of the underwater vehicle. The Coriolis-centripetal effects are derived from Kirchhoff's equations and the hydrostatics are derived using first principals. The proposed Newton-Euler model is validated through simulation and compared with the traditional Hamiltonian internal moving mass actuator formulation.
comment: \c{opyright} 2025 Alexander Rambech, Ivar Saksvik and Vahid Hassani. Accepted by IFAC for publication under a Creative Commons License CC-BY-NC-ND
☆ Data-Driven Stabilization Using Prior Knowledge on Stabilizability and Controllability
In this work, we study data-driven stabilization of linear time-invariant systems using prior knowledge of system-theoretic properties, specifically stabilizability and controllability. To formalize this, we extend the concept of data informativity by requiring the existence of a controller that stabilizes all systems consistent with the data and the prior knowledge. We show that if the system is controllable, then incorporating this as prior knowledge does not relax the conditions required for data-driven stabilization. Remarkably, however, we show that if the system is stabilizable, then using this as prior knowledge leads to necessary and sufficient conditions that are weaker than those for data-driven stabilization without prior knowledge. In other words, data-driven stabilization is easier if one knows that the underlying system is stabilizable. We also provide new data-driven control design methods in terms of linear matrix inequalities that complement the conditions for informativity.
comment: 6 pages
☆ Quantum-Resilient Threat Modelling for Secure RIS-Assisted ISAC in 6G UAV Corridors
The rapid deployment of unmanned aerial vehicle (UAV) corridors in sixth-generation (6G) networks requires safe, intelligence-driven integrated sensing and communications (ISAC). Reconfigurable intelligent surfaces (RIS) enhance spectrum efficiency, localisation accuracy, and situational awareness, while introducing new vulnerabilities. The rise of quantum computing increases the risks associated with harvest-now-decrypt-later strategies and quantum-enhanced spoofing. We propose a Quantum-Resilient Threat Modelling (QRTM) framework for RIS-assisted ISAC in UAV corridors to address these challenges. QRTM integrates classical, quantum-ready, and quantum-aided adversaries, countered using post-quantum cryptographic (PQC) primitives: ML-KEM for key establishment and Falcon for authentication, both embedded within RIS control signalling and UAV coordination. To strengthen security sensing, the framework introduces RIS-coded scene watermarking validated through a generalised likelihood ratio test (GLRT), with its detection probability characterised by the Marcum Q function. Furthermore, a Secure ISAC Utility (SIU) jointly optimises secrecy rate, spoofing detection, and throughput under RIS constraints, enabled by a scheduler with computational complexity of O(n^2). Monte Carlo evaluations using 3GPP Release 19 mid-band urban-canyon models (7-15 GHz) demonstrate a spoof-detection probability approaching 0.99 at a false-alarm rate of 1e-3, secrecy-rate retention exceeding 90 percent against quantum-capable adversaries, and signal-interference utilisation improvements of about 25 percent compared with baselines. These results show a standards-compliant path towards reliable, quantum-resilient ISAC for UAV corridors in smart cities and non-terrestrial networks.
comment: 6 Pages, 5figures
☆ Integrating Legal and Logical Specifications in Perception, Prediction, and Planning for Automated Driving: A Survey of Methods
This survey provides an analysis of current methodologies integrating legal and logical specifications into the perception, prediction, and planning modules of automated driving systems. We systematically explore techniques ranging from logic-based frameworks to computational legal reasoning approaches, emphasizing their capability to ensure regulatory compliance and interpretability in dynamic and uncertain driving environments. A central finding is that significant challenges arise at the intersection of perceptual reliability, legal compliance, and decision-making justifiability. To systematically analyze these challenges, we introduce a taxonomy categorizing existing approaches by their theoretical foundations, architectural implementations, and validation strategies. We particularly focus on methods that address perceptual uncertainty and incorporate explicit legal norms, facilitating decisions that are both technically robust and legally defensible. The review covers neural-symbolic integration methods for perception, logic-driven rule representation, and norm-aware prediction strategies, all contributing toward transparent and accountable autonomous vehicle operation. We highlight critical open questions and practical trade-offs that must be addressed, offering multidisciplinary insights from engineering, logic, and law to guide future developments in legally compliant autonomous driving systems.
comment: Accepted to 2025 IEEE International Automated Vehicle Validation Conference (IAVVC)
☆ Lightweight Federated Learning in Mobile Edge Computing with Statistical and Device Heterogeneity Awareness
Federated learning enables collaborative machine learning while preserving data privacy, but high communication and computation costs, exacerbated by statistical and device heterogeneity, limit its practicality in mobile edge computing. Existing compression methods like sparsification and pruning reduce per-round costs but may increase training rounds and thus the total training cost, especially under heterogeneous environments. We propose a lightweight personalized FL framework built on parameter decoupling, which separates the model into shared and private subspaces, enabling us to uniquely apply gradient sparsification to the shared component and model pruning to the private one. This structural separation confines communication compression to global knowledge exchange and computation reduction to local personalization, protecting personalization quality while adapting to heterogeneous client resources. We theoretically analyze convergence under the combined effects of sparsification and pruning, revealing a sparsity-pruning trade-off that links to the iteration complexity. Guided by this analysis, we formulate a joint optimization that selects per-client sparsity and pruning rates and wireless bandwidth to reduce end-to-end training time. Simulation results demonstrate faster convergence and substantial reductions in overall communication and computation costs with negligible accuracy loss, validating the benefits of coordinated and resource-aware personalization in resource-constrained heterogeneous environments.
☆ Tight Collision Avoidance for Stochastic Optimal Control: with Applications in Learning-based, Interactive Motion Planning
Trajectory planning in dense, interactive traffic scenarios presents significant challenges for autonomous vehicles, primarily due to the uncertainty of human driver behavior and the non-convex nature of collision avoidance constraints. This paper introduces a stochastic optimal control framework to address these issues simultaneously, without excessively conservative approximations. We opt to model human driver decisions as a Markov Decision Process and propose a method for handling collision avoidance between non-convex vehicle shapes by imposing a positive distance constraint between compact sets. In this framework, we investigate three alternative chance constraint formulations. To ensure computational tractability, we introduce tight, continuously differentiable reformulations of both the non-convex distance constraints and the chance constraints. The efficacy of our approach is demonstrated through simulation studies of two challenging interactive scenarios: an unregulated intersection crossing and a highway lane change in dense traffic.
comment: Preprint article, submitted for publication
☆ Data-Enabled Predictive Control and Guidance for Autonomous Underwater Vehicles
This paper presents a fully data-driven control framework for autonomous underwater vehicles (AUVs) based on Data-Enabled Predictive Control (DeePC). The approach eliminates the need for explicit hydrodynamic modeling by exploiting measured input-output data to predict and optimize future system behavior. Classic DeePC was employed in the heading control, while a cascaded DeePC architecture is proposed for depth regulation, incorporating a loop-frequency separation to handle the different dynamic modes of input and output. For 3-D waypoint path following, the Adaptive Line-of-Sight algorithm is extended to a predictive formulation and integrated with DeePC. All methods are validated in extensive simulation on the REMUS 100 AUV and compared with classical PI/PID control. The results demonstrate superior tracking performance and robustness of DeePC under ocean-current disturbances and nonlinear operating conditions, while significantly reducing modeling effort.
comment: 12 pages, 6 figures
☆ Shared Control for Vehicle Lane-Changing with Uncertain Driver Behaviors
Lane changes are common yet challenging driving maneuvers that require continuous decision-making and dynamic interaction with surrounding vehicles. Relying solely on human drivers for lane-changing can lead to traffic disturbances due to the stochastic nature of human behavior and its variability under different task demands. Such uncertainties may significantly degrade traffic string stability, which is critical for suppressing disturbance propagation and ensuring smooth merging of the lane-changing vehicles. This paper presents a human-automation shared lane-changing control framework that preserves driver authority while allowing automated assistance to achieve stable maneuvers in the presence of driver's behavioral uncertainty. Human driving behavior is modeled as a Markov jump process with transitions driven by task difficulty, providing a tractable representation of stochastic state switching. Based on this model, we first design a nominal stabilizing controller that guarantees stochastic ${L}_2$ string stability under imperfect mode estimation. To further balance performance and automated effort, we then develop a Minimal Intervention Controller (MIC) that retains acceptable stability while limiting automation. Simulations using lane-changing data from the NGSIM dataset verify that the nominal controller reduces speed perturbations and shorten lane-changing time, while the MIC further reduces automated effort and enhances comfort but with moderate stability and efficiency loss. Validations on the TGSIM dataset with SAE Level 2 vehicles show that the MIC enables earlier lane changes than Level 2 control while preserving driver authority with a slight stability compromise. These findings highlight the potential of shared control strategies to balance stability, efficiency, and driver acceptance.
☆ Photoacoustics on the go: An Embedded Photoacoustic Sensing Platform
Several centimeters below the skin lie multiple biomarkers, such as glucose, oxygenation, and blood flow. Monitoring these biomarkers regularly and in a non-invasive manner would enable early insight into metabolic status and vascular health. Currently, there are only a handful of non-invasive monitoring systems. Optical methods offer molecular specificity (i.e., multi-biomarker monitoring) but have shallow reach (a few millimeters); ultrasound penetrates deeper but lacks specificity; and MRI is large, slow, and costly. Photoacoustic (PA) sensing combines the best of optical and ultrasound methods. A laser transmitter emits pulses that are absorbed by different molecules, providing specificity. These light pulses generate pressure changes that are captured by an ultrasound receiver, providing depth. Photoacoustic sensing is promising, but the current platforms are bulky, complex, and costly. We propose the first embedded PA platform. Our contributions are fourfold. First, inspired by LiDAR technology, we propose a novel transmitter that emits pulses similar to those in the state-of-the-art (SoA), but instead of using high-voltage sources and complex electronic interfaces, we use a simple low-power microcontroller (MCU). Second, we carry out a thorough analysis of our custom transmitter and a commercial system. Third, we build a basic ultrasound receiver that is able to process the faint signal generated by our transmitter. Lastly, we compare the performance of our platform against a SoA commercial system, and show that we can detect glucose and (de)oxygenated hemoglobin in two controlled solution studies. The resulting signal characteristics indicate a plausible path toward noninvasive, real-time, at-home sensing relevant to diabetes care. More broadly, this platform lays the groundwork for translating the promise of PA sensing into a broader practical reality.
☆ Minimum time consensus for damped second order agents using Gröbner basis
A problem of achieving minimum time consensus for a set of $N$ second-order LTI system agents with bounded inputs and fuel constraints is considered. Unlike our other works, here the damping effect in agent dynamics is included. First, the attainable set for each agent with fuel budget constraints is characterized, and its boundary equations are derived. Then, using the convexity property, the minimum time at which attainable sets of all agents have a non-empty intersection is computed. By applying Helly's theorem, the computation reduces to finding the minimum time to consensus and the corresponding consensus point for each of the triplets separately.
☆ Silicon-based Josephson junction field-effect transistors enabling cryogenic logic and quantum technologies
The continuous miniaturisation of metal-oxide-semiconductor field-effect transistors (MOSFETs) from long- to short-channel architectures has advanced beyond the predictions of Moore's Law. Continued advances in semiconductor electronics, even near current scaling and performance boundaries under cryogenic conditions, are driving the development of innovative device paradigms that enable ultra-low-power and high-speed functionality. Among emerging candidates, the Josephson Junction Field-Effect Transistor (JJFET or JoFET) provides an alternative by integrating superconducting source and drain electrodes for efficient, phase-coherent operation at ultra-low temperatures. These hybrid devices have the potential to bridge conventional semiconductor electronics with cryogenic logic and quantum circuits, enabling energy-efficient and high-coherence signal processing across temperature domains. This review traces the evolution from Josephson junctions to field-effect transistors, emphasising the structural and functional innovations that underpin modern device scalability. The performance and material compatibility of JJFETs fabricated on Si, GaAs, and InGaAs substrates are analysed, alongside an assessment of their switching dynamics and material compatibility. Particular attention is given to superconductor-silicon-superconductor Josephson junctions as the active core of JJFET architectures. By unfolding more than four decades of experimental progress, this work highlights the promise of JJFETs as foundational building blocks for next-generation cryogenic logic and quantum electronic systems.
☆ Machine Learning and CPU (Central Processing Unit) Scheduling Co-Optimization over a Network of Computing Centers
In the rapidly evolving research on artificial intelligence (AI) the demand for fast, computationally efficient, and scalable solutions has increased in recent years. The problem of optimizing the computing resources for distributed machine learning (ML) and optimization is considered in this paper. Given a set of data distributed over a network of computing-nodes/servers, the idea is to optimally assign the CPU (central processing unit) usage while simultaneously training each computing node locally via its own share of data. This formulates the problem as a co-optimization setup to (i) optimize the data processing and (ii) optimally allocate the computing resources. The information-sharing network among the nodes might be time-varying, but with balanced weights to ensure consensus-type convergence of the algorithm. The algorithm is all-time feasible, which implies that the computing resource-demand balance constraint holds at all iterations of the proposed solution. Moreover, the solution allows addressing possible log-scale quantization over the information-sharing channels to exchange log-quantized data. For some example applications, distributed support-vector-machine (SVM) and regression are considered as the ML training models. Results from perturbation theory, along with Lyapunov stability and eigen-spectrum analysis, are used to prove the convergence towards the optimal case. As compared to existing CPU scheduling solutions, the proposed algorithm improves the cost optimality gap by more than $50\%$.
comment: EAAI Journal
☆ The Waterbed Effect on Quasiperiodic Disturbance Observer: Avoidance of Sensitivity Tradeoff with Time Delays
In linear time-invariant systems, the sensitivity function to disturbances is designed under a sensitivity tradeoff known as the waterbed effect. To compensate for a quasiperiodic disturbance, a quasiperiodic disturbance observer using time delays was proposed. Its sensitivity function avoids the sensitivity tradeoff, achieving wideband harmonic suppression without amplifying aperiodic disturbances or shifting harmonic suppression frequencies. However, its open-loop transfer function is not rational and does not satisfy the assumptions of existing Bode sensitivity integrals due to its time delays. This paper provides Bode-like sensitivity integrals for the quasiperiodic disturbance observer in both continuous-time and discrete-time representations and clarifies the avoided sensitivity tradeoff with time delays.
☆ Stochastic Long-Term Joint Decarbonization Planning for Power Systems and Data Centers: A Case Study in PJM
With the rapid growth of artificial intelligence (AI) and cloud services, data centers have become critical infrastructures driving digital economies, with increasing energy demand heightening concerns over electricity use and carbon emissions, emphasizing the need for carbon-aware infrastructure planning. Most studies assume static power systems, focus only on operational emissions, and overlook co-optimization. This paper proposes a dynamic joint planning framework that co-optimizes long-term data center and power system development over 15 years. The model determines siting, capacity, and type of data centers alongside power generation expansion, storage deployment, and retirements, accounting for both operational and embodied emissions. To handle multi-scale uncertainty, a large-scale two-stage stochastic program is formulated and solved via an enhanced Benders decomposition. Applied to the PJM Interconnection, with curated datasets released on GitHub, results show the system can support up to 55 GW peak data center demand, with Virginia (DOM) and Northern Illinois (ComEd) as optimal hosts. Compared to non-joint planning, the framework cuts investment cost by 12.6%, operational cost by 8.25%, and emissions by 5.63%. Including lifecycle emissions further raises renewable deployment by 25.5%, highlighting embodied carbon's role in deeper decarbonization.
☆ Control Synthesis with Reinforcement Learning: A Modeling Perspective
Controllers designed with reinforcement learning can be sensitive to model mismatch. We demonstrate that designing such controllers in a virtual simulation environment with an inaccurate model is not suitable for deployment in a physical setup. Controllers designed using an accurate model is robust against disturbance and small mismatch between the physical setup and the mathematical model derived from first principles; while a poor model results in a controller that performs well in simulation but fails in physical experiments. Sensitivity analysis is used to justify these discrepancies and an empirical region of attraction estimation help us visualize their robustness.
☆ A Parallelized Cutting-Plane Algorithm for Computationally Efficient Modelling to Generate Alternatives
Contemporary macro energy systems modelling is characterized by the need to represent strategic and operational decisions with high temporal and spatial resolution and represent discrete investment and retirement decisions. This drive towards greater fidelity, however, conflicts with a simultaneous push towards greater model representation of inherent complexity in decision making, including methods like Modelling to Generate Alternatives (MGA). MGA aims to map the feasible space of a model within a cost slack by varying investment parameters without changing the operational constraints, a process which frequently requires hundreds of solutions. For large, detailed energy system models this is impossible with traditional methods, leading researchers to reduce complexity with linearized investments and zonal or temporal aggregation. This research presents a new solution method for MGA type problems using cutting-plane methods based on a tailored reformulation of Benders Decomposition. We accelerate the algorithm by sharing cuts between MGA master problems and grouping MGA objectives. We find that our new solution method consistently solves MGA problems times faster and requires less memory than existing monolithic Modelling to Generate Alternatives solution methods on linear problems, enabling rapid computation of a greater number of solutions to highly resolved models. We also show that our novel cutting-plane algorithm enables the solution of very large MGA problems with integer investment decisions.
☆ A New Type of Axis-Angle Attitude Control Law for Rotational Systems: Synthesis, Analysis, and Experiments
Over the past few decades, continuous quaternion-based attitude control has been proven highly effective for driving rotational systems that can be modeled as rigid bodies, such as satellites and drones. However, methods rooted in this approach do not enforce the existence of a unique closed-loop (CL) equilibrium attitude-error quaternion (AEQ); and, for rotational errors about the attitude-error Euler axis larger than {\pi}rad, their proportional-control effect diminishes as the system state moves away from the stable equilibrium of the CL rotational dynamics. In this paper, we introduce a new type of attitude control law that more effectively leverages the attitude-error Euler axis-angle information to guarantee a unique CL equilibrium AEQ and to provide greater flexibility in the use of proportional-control efforts. Furthermore, using two different control laws as examples-through the construction of a strict Lyapunov function for the CL dynamics-we demonstrate that the resulting unique equilibrium of the CL rotational system can be enforced to be uniformly asymptotically stable. To assess and demonstrate the functionality and performance of the proposed approach, we performed numerical simulations and executed dozens of real-time tumble-recovery maneuvers using a small quadrotor. These simulations and flight tests compellingly demonstrate that the proposed axis-angle-based method achieves superior flight performance-compared with that obtained using a high-performance quaternion-based controller-in terms of stabilization time.
comment: 2025 International Conference on Advanced Robotics (ICAR)
☆ Risk-Aware Safety Filters with Poisson Safety Functions and Laplace Guidance Fields
Robotic systems navigating in real-world settings require a semantic understanding of their environment to properly determine safe actions. This work aims to develop the mathematical underpinnings of such a representation -- specifically, the goal is to develop safety filters that are risk-aware. To this end, we take a two step approach: encoding an understanding of the environment via Poisson's equation, and associated risk via Laplace guidance fields. That is, we first solve a Dirichlet problem for Poisson's equation to generate a safety function that encodes system safety as its 0-superlevel set. We then separately solve a Dirichlet problem for Laplace's equation to synthesize a safe \textit{guidance field} that encodes variable levels of caution around obstacles -- by enforcing a tunable flux boundary condition. The safety function and guidance fields are then combined to define a safety constraint and used to synthesize a risk-aware safety filter which, given a semantic understanding of an environment with associated risk levels of environmental features, guarantees safety while prioritizing avoidance of higher risk obstacles. We demonstrate this method in simulation and discuss how \textit{a priori} understandings of obstacle risk can be directly incorporated into the safety filter to generate safe behaviors that are risk-aware.
☆ Targeted Resilient Zoning for High Impact Events via Multi Circuit Polelines
The increasing frequency and severity of High Impact and Low Probability events such as hurricanes and windstorms pose significant challenges to the resilience of electrical power distribution systems, particularly in regions of New England where there is a significant amount of overhead infrastructure in areas where vegetation is predominant. Traditional reliability-focused planning is insufficient to address the systemic vulnerabilities exposed by such extreme events. This paper presents a novel risk based framework for long term resilience planning of active overhead distribution systems, with a specific focus on mitigating the impacts of high wind and hurricane induced outages.
♻ ☆ Distribution System Reconfiguration to Mitigate Load Altering Attacks via Stackelberg Games
The widespread integration of IoT-controllable devices (e.g., smart EV charging stations and heat pumps) into modern power systems enhances capabilities but introduces critical cybersecurity risks. Specifically, these devices are susceptible to load-altering attacks (LAAs) that can compromise power system safety. This paper quantifies the impact of LAAs on nodal voltage constraint violations in distribution networks (DNs). We first present closed-form expressions to analytically characterize LAA effects and quantify the minimum number of compromised devices for a successful LAA. Based on these insights, we propose a reactive defense mechanism that mitigates LAAs through DN reconfiguration. To address strategic adversaries, we then formulate defense strategies using a non-cooperative sequential game, which models the knowledgeable and strategic attacker, accounting for the worst-case scenario and enabling the reactive defender to devise an efficient and robust defense. Further, our formulation also accounts for uncertainties in attack localization. A novel Bayesian optimization approach is introduced to compute the Stackelberg equilibrium, significantly reducing computational burden efficiently. The game-theoretic strategy effectively mitigates the attack's impact while ensuring minimal system reconfiguration.
♻ ☆ Dual-Regularized Riccati Recursions for Interior-Point Optimal Control
We derive closed-form extensions of Riccati's recursions (both sequential and parallel) for solving dual-regularized LQR problems. We show how these methods can be used to solve general constrained, non-convex, discrete-time optimal control problems via a regularized interior point method, while guaranteeing that each step is a descent direction of an Augmented Barrier-Lagrangian merit function. We provide MIT-licensed implementations of our methods in C++ and JAX.
♻ ☆ Continuity Conditions for Piecewise Quadratic Functions on Simplicial Conic Partitions are Equivalent
Analysis of continuous-time piecewise linear systems based on piecewise quadratic (PWQ) Lyapunov functions typically requires continuity of these functions over a partition of the state space. Several conditions for guaranteeing continuity of PWQ functions over state space partitions can be found in the literature. In this technical note, we show that these continuity conditions are equivalent over so-called simplicial conic partitions. As a consequence, the choice of which condition to impose can be based solely on practical considerations such as specific application or numerical aspects, without introducing additional conservatism in the analysis.
comment: 8 pages, 3 figures
♻ ☆ Binary Decision Process in Pre-Evacuation Behavior
In crowd evacuation the time interval before decisive movement towards a safe place is defined as the pre-evacuation phase, and it has crucial impact on the total time required for safe egress. This process mainly refers to situation awareness and response to an external stressors, e.g., fire alarms. Due to the complexity of human cognitive process, simulation is used to study this important time interval. In this paper a binary decision process is formulated to simulate pre-evacuation time of many evacuees in a given social context. The model combines the classic opinion dynamics (the French-DeGroot model) with binary phase transition to describe how group pre-evacuation time emerges from individual interaction. The model parameters are quantitatively meaningful to human factors research within socio-psychological background, e.g., whether an individual is stubborn or open-minded, or what kind of the social topology exists among the individuals and how it matters in aggregating individuals into social groups. The modeling framework also describes collective motion of many evacuee agents in a planar space, and the resulting multi-agent system is partly similar to the Vicsek flocking model, and it is meaningful to explore complex social behavior during phase transition of a non-equilibrium process.
comment: 5 pages
♻ ☆ Multi-robot Motion Planning based on Nets-within-Nets Modeling and Simulation
This paper focuses on designing motion plans for a heterogeneous team of robots that must cooperate to fulfill a global mission. Robots move in an environment that contains some regions of interest, while the specification for the entire team can include avoidance, visits, or sequencing of these regions of interest. The mission is expressed in terms of a Petri net corresponding to an automaton, while each robot is also modeled by a state machine Petri net. The current work brings about the following contributions with respect to existing solutions for related problems. First, we propose a novel model, denoted High-Level robot team Petri Net (HLrtPN) system, to incorporate the specification and robot models into the Nets-within-Nets paradigm. A guard function, named Global Enabling Function, is designed to synchronize the firing of transitions so that robot motions do not violate the specification. Then, the solution is found by simulating the HLrtPN system in a specific software tool that accommodates Nets-within-Nets. Illustrative examples based on Linear Temporal Logic missions support the computational feasibility of the proposed framework.
comment: [Note for readers] This paper has been extended from a previous submission to 62nd IEEE Conference on Decision and Control, Dec. 13-15, 2023. This work has been submitted to the IEEE for possible publication
♻ ☆ ES-HPC-MPC: Exponentially Stable Hybrid Perception Constrained MPC for Quadrotor with Suspended Payloads
Aerial transportation using quadrotors with cable-suspended payloads holds great potential for applications in disaster response, logistics, and infrastructure maintenance. However, their hybrid and underactuated dynamics pose significant control and perception challenges. Traditional approaches often assume a taut cable condition, limiting their effectiveness in real-world applications where slack-to-taut transitions occur due to disturbances. We introduce ES-HPC-MPC, a model predictive control framework that enforces exponential stability and perception-constrained control under hybrid dynamics. Our method leverages Exponentially Stabilizing Control Lyapunov Functions (ES-CLFs) to enforce stability during the tasks and Control Barrier Functions (CBFs) to maintain the payload within the onboard camera's field of view (FoV). We validate our method through both simulation and real-world experiments, demonstrating stable trajectory tracking and reliable payload perception. We validate that our method maintains stability and satisfies perception constraints while tracking dynamically infeasible trajectories and when the system is subjected to hybrid mode transitions caused by unexpected disturbances.
comment: Accepted to IEEE Robotics and Automation Letters
♻ ☆ Privacy Preservation by Local Design in Cooperative Networked Control Systems
In this paper, we study the privacy preservation problem in a cooperative networked control system, which has closed-loop dynamics, working for the task of linear quadratic Guassian (LQG) control. The system consists of a user and a server: the user owns the plant to control, while the server provides computation capability, and the user employs the server to compute control inputs for it. To enable the server's computation, the user needs to provide the measurements of the plant states to the server, who then calculates estimates of the states, based on which the control inputs are computed. However, the user regards the states as privacy, and makes an interesting request: the user wants the server to have "incorrect" knowledge of the state estimates rather than the true values. Regarding that, we propose a novel design methodology for the privacy preservation, in which the privacy scheme is locally equipped at the user side not open to the server, which manages to create a deviation in the server's knowledge of the state estimates from the true values. However, this methodology also raises significant challenges: in a closed-loop dynamic system, when the server's seized knowledge is incorrect, the system's behavior becomes complex to analyze; even the stability of the system becomes questionable, as the incorrectness will accumulate through the closed loop as time evolves. In this paper, we succeed in showing that the performance loss in LQG control caused by the proposed privacy scheme is bounded by rigorous mathematical proofs, which convinces the availability of the proposed design methodology. We also propose an associated novel privacy metric and obtain the analytical result on evaluating the privacy performance. Finally, we study the performance trade-off between privacy and control, where the accordingly proposed optimization problems are solved by numerical methods efficiently.
comment: 14 pages, 7 figures
♻ ☆ Cryo-CMOS Antenna for Wireless Communications within a Quantum Computer Cryostat
Scaling quantum computers from a few qubits to large numbers remains one of the critical challenges in realizing practical quantum advantage. Multi-core quantum architectures have emerged as a promising solution, enabling scalability through distributed quantum processing units (QPUs) interconnected via classical and quantum links. However, the bottleneck of wired connections persists, as densely packed wired interconnects, both vertically across temperature stages and horizontally within the same layer, introduce spatial constraints, power dissipation, and latency, which could hinder performance as the number of QPUs increases. To overcome these limitations, this work proposes a cryo-compatible on-chip differential dipole antenna operating at 28 GHz to enable short-range wireless communication within a quantum computer cryostat. Temperature-dependent material properties are incorporated to accurately capture antenna behavior at 4 K. Moreover, by embedding the antenna in a realistic cryostat structure, we evaluate the feasibility of antenna operation within the cryogenic environment. The proposed antenna achieves a reflection coefficient of -20.8 dB in free space and -18.38 dB within the cryostat, demonstrating efficient impedance matching.
♻ ☆ Data-Efficient Excavation Force Estimation for Wheel Loaders
Accurate prediction of excavation forces is critical for enabling autonomous operation and optimizing control strategies in earthmoving machinery. Conventional approaches often depend on extensive data collection or computationally expensive simulations across multiple soil types, which limits their scalability and adaptability. This study presents a data-efficient framework that calibrates soil parameters using force measurements from the preceding bucket-loading cycle. The proposed method is based on an analytical soil-tool interaction model formulated through the fundamental earthmoving equation, and employs a multi-stage optimization procedure during the loading phase to identify relevant soil parameters. These estimated parameters are then used to predict excavation forces in the subsequent cycle, allowing the system to adapt its control inputs without relying on large-scale datasets or machine learning model training. The framework is validated through high-fidelity simulations in the Algoryx Dynamics engine under different soil types and excavation trajectories, achieving root-mean-square prediction errors between 10% and 15%. This cycle-to-cycle adaptation demonstrates strong potential for scalable, online force estimation and efficient path planning in wheel loader operations.
Graphics 8
☆ FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility.
☆ mitransient: Transient light transport in Mitsuba 3
mitransient is a light transport simulation tool that extends Mitsuba 3 with support for time-resolved simulations. In essence, mitransient extends conventional rendering by adding a temporal dimension which accounts for the time of flight of light. This allows rapid prototyping of novel transient imaging systems without the need of costly or difficult-to-operate hardware. Our code is trivially easy to install through pip, and consists of Python modules that can run both in CPU and GPU by leveraging the JIT capabilities of Mitsuba 3. It provides physically-based simulations of complex phenomena, including a wide variety of realistic materials and participating media such as fog or smoke. In addition, we extend Mitsuba 3's functionality to support time-resolved polarization tracking of light and transient differentiable rendering. Finally, we also include tools that simplify the use of our simulations for non-line-of-sight imaging, enabling realistic scene setups with capture noise to be simulated in just seconds of minutes. Altogether, we hope that mitransient will support the research community in developing novel algorithms for transient imaging.
comment: 3 pages, 3 figures. For further documentation for mitransient see https://mitransient.readthedocs.io
☆ 4-Doodle: Text to 3D Sketches that Move!
We present a novel task: text-to-3D sketch animation, which aims to bring freeform sketches to life in dynamic 3D space. Unlike prior works focused on photorealistic content generation, we target sparse, stylized, and view-consistent 3D vector sketches, a lightweight and interpretable medium well-suited for visual communication and prototyping. However, this task is very challenging: (i) no paired dataset exists for text and 3D (or 4D) sketches; (ii) sketches require structural abstraction that is difficult to model with conventional 3D representations like NeRFs or point clouds; and (iii) animating such sketches demands temporal coherence and multi-view consistency, which current pipelines do not address. Therefore, we propose 4-Doodle, the first training-free framework for generating dynamic 3D sketches from text. It leverages pretrained image and video diffusion models through a dual-space distillation scheme: one space captures multi-view-consistent geometry using differentiable B\'ezier curves, while the other encodes motion dynamics via temporally-aware priors. Unlike prior work (e.g., DreamFusion), which optimizes from a single view per step, our multi-view optimization ensures structural alignment and avoids view ambiguity, critical for sparse sketches. Furthermore, we introduce a structure-aware motion module that separates shape-preserving trajectories from deformation-aware changes, enabling expressive motion such as flipping, rotation, and articulated movement. Extensive experiments show that our method produces temporally realistic and structurally stable 3D sketch animations, outperforming existing baselines in both fidelity and controllability. We hope this work serves as a step toward more intuitive and accessible 4D content creation.
☆ Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation
Expressions are fundamental to conveying human emotions. With the rapid advancement of AI-generated content (AIGC), realistic and expressive 3D facial animation has become increasingly crucial. Despite recent progress in speech-driven lip-sync for talking-face animation, generating emotionally expressive talking faces remains underexplored. A major obstacle is the scarcity of real emotional 3D talking-face datasets due to the high cost of data capture. To address this, we model facial animation driven by both speech and emotion as a linear additive problem. Leveraging a 3D talking-face dataset with neutral expressions (VOCAset) and a dataset of 3D expression sequences (Florence4D), we jointly learn a set of blendshapes driven by speech and emotion. We introduce a sparsity constraint loss to encourage disentanglement between the two types of blendshapes while allowing the model to capture inherent secondary cross-domain deformations present in the training data. The learned blendshapes can be further mapped to the expression and jaw pose parameters of the FLAME model, enabling the animation of 3D Gaussian avatars. Qualitative and quantitative experiments demonstrate that our method naturally generates talking faces with specified expressions while maintaining accurate lip synchronization. Perceptual studies further show that our approach achieves superior emotional expressivity compared to existing methods, without compromising lip-sync quality.
comment: 18 pages, 6 figures, accepted to ICXR 2025 conference
☆ Fast and Robust Point Containment Queries on Trimmed Surface
Point containment queries on trimmed surfaces are fundamental to CAD modeling, solid geometry processing, and surface tessellation. Existing approaches such as ray casting and generalized winding numbers often face limitations in robustness and computational efficiency. We propose a fast and numerically stable method for performing containment queries on trimmed surfaces, including those with periodic parameterizations. Our approach introduces a recursive winding number computation scheme that replaces costly curve subdivision with an ellipse-based bound for Bezier segments, enabling linear-time evaluation. For periodic surfaces, we lift trimming curves to the universal covering space, allowing accurate and consistent winding number computation even for non-contractible or discontinuous loops in parameter domain. Experiments show that our method achieves substantial speedups over existing winding-number algorithms while maintaining high robustness in the presence of geometric noise, open boundaries, and periodic topologies. We further demonstrate its effectiveness in processing real B-Rep models and in robust tessellation of trimmed surfaces.
☆ Off-Centered WoS-Type Solvers with Statistical Weighting SIGGRAPH
Stochastic PDE solvers have emerged as a powerful alternative to traditional discretization-based methods for solving partial differential equations (PDEs), especially in geometry processing and graphics. While off-centered estimators enhance sample reuse in WoS-type Monte Carlo solvers, they introduce correlation artifacts and bias when Green's functions are approximated. In this paper, we propose a statistically weighted off-centered WoS-type estimator that leverages local similarity filtering to selectively combine samples across neighboring evaluation points. Our method balances bias and variance through a principled weighting strategy that suppresses unreliable estimators. We demonstrate our approach's effectiveness on various PDEs,including screened Poisson equations and boundary conditions, achieving consistent improvements over existing solvers such as vanilla Walk on Spheres, mean value caching, and boundary value caching. Our method also naturally extends to gradient field estimation and mixed boundary problems.
comment: SIGGRAPH Asia 2025 conference paper
♻ ☆ HRM^2Avatar: High-Fidelity Real-Time Mobile Avatars from Monocular Phone Scans SIGGRAPH
We present HRM$^2$Avatar, a framework for creating high-fidelity avatars from monocular phone scans, which can be rendered and animated in real time on mobile devices. Monocular capture with smartphones provides a low-cost alternative to studio-grade multi-camera rigs, making avatar digitization accessible to non-expert users. Reconstructing high-fidelity avatars from single-view video sequences poses challenges due to limited visual and geometric data. To address these limitations, at the data level, our method leverages two types of data captured with smartphones: static pose sequences for texture reconstruction and dynamic motion sequences for learning pose-dependent deformations and lighting changes. At the representation level, we employ a lightweight yet expressive representation to reconstruct high-fidelity digital humans from sparse monocular data. We extract garment meshes from monocular data to model clothing deformations effectively, and attach illumination-aware Gaussians to the mesh surface, enabling high-fidelity rendering and capturing pose-dependent lighting. This representation efficiently learns high-resolution and dynamic information from monocular data, enabling the creation of detailed avatars. At the rendering level, real-time performance is critical for animating high-fidelity avatars in AR/VR, social gaming, and on-device creation. Our GPU-driven rendering pipeline delivers 120 FPS on mobile devices and 90 FPS on standalone VR devices at 2K resolution, over $2.7\times$ faster than representative mobile-engine baselines. Experiments show that HRM$^2$Avatar delivers superior visual realism and real-time interactivity, outperforming state-of-the-art monocular methods.
comment: SIGGRAPH Asia 2025, Project Page: https://acennr-engine.github.io/HRM2Avatar
♻ ☆ Neural Atlas Graphs for Dynamic Scene Decomposition and Editing
Learning editable high-resolution scene representations for dynamic scenes is an open problem with applications across the domains from autonomous driving to creative editing - the most successful approaches today make a trade-off between editability and supporting scene complexity: neural atlases represent dynamic scenes as two deforming image layers, foreground and background, which are editable in 2D, but break down when multiple objects occlude and interact. In contrast, scene graph models make use of annotated data such as masks and bounding boxes from autonomous-driving datasets to capture complex 3D spatial relationships, but their implicit volumetric node representations are challenging to edit view-consistently. We propose Neural Atlas Graphs (NAGs), a hybrid high-resolution scene representation, where every graph node is a view-dependent neural atlas, facilitating both 2D appearance editing and 3D ordering and positioning of scene elements. Fit at test-time, NAGs achieve state-of-the-art quantitative results on the Waymo Open Dataset - by 5 dB PSNR increase compared to existing methods - and make environmental editing possible in high resolution and visual quality - creating counterfactual driving scenarios with new backgrounds and edited vehicle appearance. We find that the method also generalizes beyond driving scenes and compares favorably - by more than 7 dB in PSNR - to recent matting and video editing baselines on the DAVIS video dataset with a diverse set of human and animal-centric scenes. Project Page: https://princeton-computational-imaging.github.io/nag/
Computation and Language 126
☆ Gaperon: A Peppered English-French Generative Language Model Suite
We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models trained on 2-4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Through this work, we study how data filtering and contamination interact to shape both benchmark and generative performance. We find that filtering for linguistic quality enhances text fluency and coherence but yields subpar benchmark results, and that late deliberate contamination -- continuing training on data mixes that include test sets -- recovers competitive scores while only reasonably harming generation quality. We discuss how usual neural filtering can unintentionally amplify benchmark leakage. To support further research, we also introduce harmless data poisoning during pretraining, providing a realistic testbed for safety studies. By openly releasing all models, datasets, code, and checkpoints, Gaperon establishes a reproducible foundation for exploring the trade-offs between data curation, evaluation, safety, and openness in multilingual language model development.
☆ Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models
Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.
comment: Post-hoc attribution
☆ DiagramEval: Evaluating LLM-Generated Diagrams via Graphs
Diagrams play a central role in research papers for conveying ideas, yet they are often notoriously complex and labor-intensive to create. Although diagrams are presented as images, standard image generative models struggle to produce clear diagrams with well-defined structure. We argue that a promising direction is to generate demonstration diagrams directly in textual form as SVGs, which can leverage recent advances in large language models (LLMs). However, due to the complexity of components and the multimodal nature of diagrams, sufficiently discriminative and explainable metrics for evaluating the quality of LLM-generated diagrams remain lacking. In this paper, we propose DiagramEval, a novel evaluation metric designed to assess demonstration diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges, and evaluates diagram quality using two new groups of metrics: node alignment and path alignment. For the first time, we effectively evaluate diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of our metrics. Furthermore, we show how the enhanced explainability of our proposed metrics offers valuable insights into the characteristics of LLM-generated diagrams. Code: https://github.com/ulab-uiuc/diagram-eval.
☆ Task Completion Agents are Not Ideal Collaborators
Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent's utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.
comment: 22 pages, 5 figures, 3 tables
☆ Scaling Latent Reasoning via Looped Language Models
Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models enjoy superior performance that match the results of up to 12B SOTA LLMs across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model could be found in: http://ouro-llm.github.io.
☆ The Limits of Obliviate: Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework
Unlearning in large language models (LLMs) is crucial for managing sensitive data and correcting misinformation, yet evaluating its effectiveness remains an open problem. We investigate whether persuasive prompting can recall factual knowledge from deliberately unlearned LLMs across models ranging from 2.7B to 13B parameters (OPT-2.7B, LLaMA-2-7B, LLaMA-3.1-8B, LLaMA-2-13B). Drawing from ACT-R and Hebbian theory (spreading activation theories), as well as communication principles, we introduce Stimulus-Knowledge Entanglement-Behavior Framework (SKeB), which models information entanglement via domain graphs and tests whether factual recall in unlearned models is correlated with persuasive framing. We develop entanglement metrics to quantify knowledge activation patterns and evaluate factuality, non-factuality, and hallucination in outputs. Our results show persuasive prompts substantially enhance factual knowledge recall (14.8% baseline vs. 24.5% with authority framing), with effectiveness inversely correlated to model size (128% recovery in 2.7B vs. 15% in 13B). SKeB provides a foundation for assessing unlearning completeness, robustness, and overall behavior in LLMs.
comment: 14 pages, 11 figures
☆ The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we may have revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, requiring interacting with multiple Apps over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool calling turns on average, while the top open-weights model DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.
comment: Website: https://toolathlon.xyz/
☆ Interpreting LLMs as Credit Risk Classifiers: Do Their Feature Explanations Align with Classical ML? CIKM 2025
Large Language Models (LLMs) are increasingly explored as flexible alternatives to classical machine learning models for classification tasks through zero-shot prompting. However, their suitability for structured tabular data remains underexplored, especially in high-stakes financial applications such as financial risk assessment. This study conducts a systematic comparison between zero-shot LLM-based classifiers and LightGBM, a state-of-the-art gradient-boosting model, on a real-world loan default prediction task. We evaluate their predictive performance, analyze feature attributions using SHAP, and assess the reliability of LLM-generated self-explanations. While LLMs are able to identify key financial risk indicators, their feature importance rankings diverge notably from LightGBM, and their self-explanations often fail to align with empirical SHAP attributions. These findings highlight the limitations of LLMs as standalone models for structured financial risk prediction and raise concerns about the trustworthiness of their self-generated explanations. Our results underscore the need for explainability audits, baseline comparisons with interpretable models, and human-in-the-loop oversight when deploying LLMs in risk-sensitive financial environments.
comment: 8 pages, 6 figures, 3 tables, CIKM 2025 FinFAI workshop
☆ Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents
Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup-planning, perception-driven error diagnosis, feedback-driven repair, and action to execute final environment configuration. Our task instances are automatically constructed by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, Enconda-bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.
☆ PairUni: Pairwise Training for Unified Multimodal Language Models
Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Code: \href{https://github.com/Haochen-Wang409/PairUni}{github.com/Haochen-Wang409/PairUni}
☆ ZK-SenseLM: Verifiable Large-Model Wireless Sensing with Selective Abstention and Zero-Knowledge Attestation
ZK-SenseLM is a secure and auditable wireless sensing framework that pairs a large-model encoder for Wi-Fi channel state information (and optionally mmWave radar or RFID) with a policy-grounded decision layer and end-to-end zero-knowledge proofs of inference. The encoder uses masked spectral pretraining with phase-consistency regularization, plus a light cross-modal alignment that ties RF features to compact, human-interpretable policy tokens. To reduce unsafe actions under distribution shift, we add a calibrated selective-abstention head; the chosen risk-coverage operating point is registered and bound into the proof. We implement a four-stage proving pipeline: (C1) feature sanity and commitment, (C2) threshold and version binding, (C3) time-window binding, and (C4) PLONK-style proofs that the quantized network, given the committed window, produced the logged action and confidence. Micro-batched proving amortizes cost across adjacent windows, and a gateway option offloads proofs from low-power devices. The system integrates with differentially private federated learning and on-device personalization without weakening verifiability: model hashes and the registered threshold are part of each public statement. Across activity, presence or intrusion, respiratory proxy, and RF fingerprinting tasks, ZK-SenseLM improves macro-F1 and calibration, yields favorable coverage-risk curves under perturbations, and rejects tamper and replay with compact proofs and fast verification.
comment: 45 pages
☆ EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis
Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances of large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and lack of EHR-oriented reasoning capabilities. This paper aims to bridge the gap, specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset, comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables to generate high-quality reasoning data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, including domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving a 10\% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench have significantly advanced the development for more reliable and clinically relevant EHR analysis.
☆ Are Language Models Efficient Reasoners? A Perspective from Logic Programming NeurIPS 2025
Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language -- as generated by an LM -- with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with various number of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions -- even with minimal, domain-consistent distractions -- and the proofs they generate frequently exhibit detours through irrelevant inferences.
comment: Accepted to NeurIPS 2025
☆ Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks EMNLP
Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming \citep{snell2024scaling, chen2024more}, its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-$N$) and process-level (tree search) verification under realistic low-$N$ budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.
comment: Accepted to EMNLP - NLLP Workshop
☆ FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering
The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high-stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the Persian-speaking Muslim community, where accuracy and trustworthiness are paramount. Existing Retrieval-Augmented Generation (RAG) systems, relying on simplistic single-pass pipelines, fall short on complex, multi-hop queries requiring multi-step reasoning and evidence aggregation. To address this gap, we introduce FARSIQA, a novel, end-to-end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG. FAIR-RAG employs a dynamic, self-correcting process: it adaptively decomposes complex queries, assesses evidence sufficiency, and enters an iterative loop to generate sub-queries, progressively filling information gaps. Operating on a curated knowledge base of over one million authoritative Islamic documents, FARSIQA demonstrates superior performance. Rigorous evaluation on the challenging IslamicPCQA benchmark shows state-of-the-art performance: the system achieves a remarkable 97.0% in Negative Rejection - a 40-point improvement over baselines - and a high Answer Correctness score of 74.3%. Our work establishes a new standard for Persian Islamic QA and validates that our iterative, adaptive architecture is crucial for building faithful, reliable AI systems in sensitive domains.
comment: 37 pages, 5 figures, 10 tables. Keywords: Retrieval-Augmented Generation (RAG), Question Answering (QA), Islamic Knowledge Base, Faithful AI, Persian NLP, Multi-hop Reasoning, Large Language Models (LLMs)
☆ Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry ICML 2025
While Large Language Model (LLM) agents are often approached from the angle of action planning/generation to accomplish a goal (e.g., given by language descriptions), their abilities to collaborate with each other to achieve a joint goal are not well explored. To address this limitation, this paper studies LLM agents in task collaboration, particularly under the condition of information asymmetry, where agents have disparities in their knowledge and skills and need to work together to complete a shared task. We extend Einstein Puzzles, a classical symbolic puzzle, to a table-top game. In this game, two LLM agents must reason, communicate, and act to satisfy spatial and relational constraints required to solve the puzzle. We apply a fine-tuning-plus-verifier framework in which LLM agents are equipped with various communication strategies and verification signals from the environment. Empirical results highlight the critical importance of aligned communication, especially when agents possess both information-seeking and -providing capabilities. Interestingly, agents without communication can still achieve high task performance; however, further analysis reveals a lack of true rule understanding and lower trust from human evaluators. Instead, by integrating an environment-based verifier, we enhance agents' ability to comprehend task rules and complete tasks, promoting both safer and more interpretable collaboration in AI systems. https://github.com/Roihn/EinsteinPuzzles
comment: Workshop on Multi-Agent System @ ICML 2025
☆ Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models LREC 2026
Recent advances in speech foundation models (SFMs) have enabled the direct processing of spoken language from raw audio, bypassing intermediate textual representations. This capability allows SFMs to be exposed to, and potentially respond to, rich paralinguistic variations embedded in the input speech signal. One under-explored dimension of paralinguistic variation is voice quality, encompassing phonation types such as creaky and breathy voice. These phonation types are known to influence how listeners infer affective state, stance and social meaning in speech. Existing benchmarks for speech understanding largely rely on multiple-choice question answering (MCQA) formats, which are prone to failure and therefore unreliable in capturing the nuanced ways paralinguistic features influence model behaviour. In this paper, we probe SFMs through open-ended generation tasks and speech emotion recognition, evaluating whether model behaviours are consistent across different phonation inputs. We introduce a new parallel dataset featuring synthesized modifications to voice quality, designed to evaluate SFM responses to creaky and breathy voice. Our work provides the first examination of SFM sensitivity to these particular non-lexical aspects of speech perception.
comment: 8 pages, 3 figures, 4 tables, submitted to LREC 2026
☆ Hybrid Quantum-Classical Recurrent Neural Networks
We present a hybrid quantum-classical recurrent neural network (QRNN) architecture in which the entire recurrent core is realized as a parametrized quantum circuit (PQC) controlled by a classical feedforward network. The hidden state is the quantum state of an $n$-qubit PQC, residing in an exponentially large Hilbert space $\mathbb{C}^{2^n}$. The PQC is unitary by construction, making the hidden-state evolution norm-preserving without external constraints. At each timestep, mid-circuit readouts are combined with the input embedding and processed by the feedforward network, which provides explicit classical nonlinearity. The outputs parametrize the PQC, which updates the hidden state via unitary dynamics. The QRNN is compact and physically consistent, and it unifies (i) unitary recurrence as a high-capacity memory, (ii) partial observation via mid-circuit measurements, and (iii) nonlinear classical control for input-conditioned parametrization. We evaluate the model in simulation with up to 14 qubits on sentiment analysis, MNIST, permuted MNIST, copying memory, and language modeling, adopting projective measurements as a limiting case to obtain mid-circuit readouts while maintaining a coherent recurrent quantum memory. We further devise a soft attention mechanism over the mid-circuit readouts in a sequence-to-sequence model and show its effectiveness for machine translation. To our knowledge, this is the first model (RNN or otherwise) grounded in quantum operations to achieve competitive performance against strong classical baselines across a broad class of sequence-learning tasks.
☆ TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation
Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of the capability requirement. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities, including opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short of capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.
comment: Main paper: 11 pages, 3 figures, 6 tables. Appendix: 28 pages. Bangde Du and Minghao Guo contributed equally. Corresponding authors: Ziyi Ye (ziyiye@fudan.edu.cn), Qingyao Ai (aiqy@tsinghua.edu.cn)
☆ Fine-Tuned Language Models for Domain-Specific Summarization and Tagging
This paper presents a pipeline integrating fine-tuned large language models (LLMs) with named entity recognition (NER) for efficient domain-specific text summarization and tagging. The authors address the challenge posed by rapidly evolving sub-cultural languages and slang, which complicate automated information extraction and law enforcement monitoring. By leveraging the LLaMA Factory framework, the study fine-tunes LLMs on both generalpurpose and custom domain-specific datasets, particularly in the political and security domains. The models are evaluated using BLEU and ROUGE metrics, demonstrating that instruction fine-tuning significantly enhances summarization and tagging accuracy, especially for specialized corpora. Notably, the LLaMA3-8B-Instruct model, despite its initial limitations in Chinese comprehension, outperforms its Chinese-trained counterpart after domainspecific fine-tuning, suggesting that underlying reasoning capabilities can transfer across languages. The pipeline enables concise summaries and structured entity tagging, facilitating rapid document categorization and distribution. This approach proves scalable and adaptable for real-time applications, supporting efficient information management and the ongoing need to capture emerging language trends. The integration of LLMs and NER offers a robust solution for transforming unstructured text into actionable insights, crucial for modern knowledge management and security operations.
☆ Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs
Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent ``reality gap''. To bridge this gap, we introduce \texttt{Learn-to-Ask}, a general, simulator-free framework for learning and deploying proactive dialogue agents \textit{directly from offline expert data}, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the \textbf{observed future} of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and training a policy to output a structured \texttt{(action, state_assessment)} tuple, governing both \textbf{what to ask} and, crucially, \textbf{when to stop}. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of \texttt{Learn-to-Ask} in a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our model was launched and achieved performance even superior to human experts, proving our framework's ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.
comment: 27 pages, 5 figures
☆ More than a Moment: Towards Coherent Sequences of Audio Descriptions
Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
☆ A Critical Study of Automatic Evaluation in Sign Language Translation LREC 2026
Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably capture the quality of SLT outputs. To address this gap, we investigate the limitations of text-based SLT evaluation metrics by analyzing six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other hand. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture semantic equivalence often missed by conventional metrics, they can also exhibit bias toward LLM-paraphrased translations. Moreover, although all metrics are able to detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. This motivates the need for multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.
comment: Submitted to the LREC 2026 conference
☆ Depth and Autonomy: A Framework for Evaluating LLM Applications in Social Science Research
Large language models (LLMs) are increasingly utilized by researchers across a wide range of domains, and qualitative social science is no exception; however, this adoption faces persistent challenges, including interpretive bias, low reliability, and weak auditability. We introduce a framework that situates LLM usage along two dimensions, interpretive depth and autonomy, thereby offering a straightforward way to classify LLM applications in qualitative research and to derive practical design recommendations. We present the state of the literature with respect to these two dimensions, based on all published social science papers available on Web of Science that use LLMs as a tool and not strictly as the subject of study. Rather than granting models expansive freedom, our approach encourages researchers to decompose tasks into manageable segments, much as they would when delegating work to capable undergraduate research assistants. By maintaining low levels of autonomy and selectively increasing interpretive depth only where warranted and under supervision, one can plausibly reap the benefits of LLMs while preserving transparency and reliability.
comment: Presented at the Annual Meeting of the American Political Science Association, Vancouver, BC, September 11--14 2025
☆ RLMEval: Evaluating Research-Level Neural Theorem Proving EMNLP 2025
Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3 % pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.
comment: Accepted to EMNLP 2025 Findings. RLMEval benchmark released: https://github.com/augustepoiroux/RLMEval
☆ Implicature in Interaction: Understanding Implicature Improves Alignment in Human-LLM Interaction
The rapid advancement of Large Language Models (LLMs) is positioning language at the core of human-computer interaction (HCI). We argue that advancing HCI requires attention to the linguistic foundations of interaction, particularly implicature (meaning conveyed beyond explicit statements through shared context) which is essential for human-AI (HAI) alignment. This study examines LLMs' ability to infer user intent embedded in context-driven prompts and whether understanding implicature improves response generation. Results show that larger models approximate human interpretations more closely, while smaller models struggle with implicature inference. Furthermore, implicature-based prompts significantly enhance the perceived relevance and quality of responses across models, with notable gains in smaller models. Overall, 67.6% of participants preferred responses with implicature-embedded prompts to literal ones, highlighting a clear preference for contextually nuanced communication. Our work contributes to understanding how linguistic theory can be used to address the alignment problem by making HAI interaction more natural and contextually grounded.
comment: The manuscript is approximately 7360 words and contains 12 figures and 6 tables
☆ Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media
Most existing sign language translation (SLT) datasets are limited in scale, lack multilingual coverage, and are costly to curate due to their reliance on expert annotation and controlled recording setup. Recently, Vision Language Models (VLMs) have demonstrated strong capabilities as evaluators and real-time assistants. Despite these advancements, their potential remains untapped in the context of sign language dataset acquisition. To bridge this gap, we introduce the first automated annotation and filtering framework that utilizes VLMs to reduce reliance on manual effort while preserving data quality. Our method is applied to TikTok videos across eight sign languages and to the already curated YouTube-SL-25 dataset in German Sign Language for the purpose of additional evaluation. Our VLM-based pipeline includes a face visibility detection, a sign activity recognition, a text extraction from video content, and a judgment step to validate alignment between video and text, implementing generic filtering, annotation and validation steps. Using the resulting corpus, TikTok-SL-8, we assess the performance of two off-the-shelf SLT models on our filtered dataset for German and American Sign Languages, with the goal of establishing baselines and evaluating the robustness of recent models on automatically extracted, slightly noisy data. Our work enables scalable, weakly supervised pretraining for SLT and facilitates data acquisition from social media.
comment: Accepted by RANLP 2025
☆ Serve Programs, Not Prompts SOSP 2025
Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to customize token prediction and KV cache management at runtime and to offload parts of their application logic, such as tool execution, to the server. We describe an example of this architecture through a system named Symphony, which functions as an operating system for LIPs. Symphony exposes LLM model computations via system calls and virtualizes KV cache with a dedicated file system, while ensuring GPU efficiency with a two-level process scheduling scheme. Symphony has the potential to open the door to a more efficient and extensible ecosystem for LLM applications.
comment: HotOS 2025. Follow-up implementation work (SOSP 2025) is available at arXiv:2510.24051
☆ BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains
The rapid advancement of large language models(LLMs) has intensified the need for domain and culture specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain and language specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content compared to Hindi across all domains. Subdomain-level analysis shows that areas such as Cyber Law, International Finance perform relatively well, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India's diverse knowledge domains. It enables assessment of models' ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.
☆ Roleplaying with Structure: Synthetic Therapist-Client Conversation Generation from Questionnaires
The development of AI for mental health is hindered by a lack of authentic therapy dialogues, due to strict privacy regulations and the fact that clinical sessions were historically rarely recorded. We present an LLM-driven pipeline that generates synthetic counseling dialogues based on structured client profiles and psychological questionnaires. Grounded on the principles of Cognitive Behavioral Therapy (CBT), our method creates synthetic therapeutic conversations for clinical disorders such as anxiety and depression. Our framework, SQPsych (Structured Questionnaire-based Psychotherapy), converts structured psychological input into natural language dialogues through therapist-client simulations. Due to data governance policies and privacy restrictions prohibiting the transmission of clinical questionnaire data to third-party services, previous methodologies relying on proprietary models are infeasible in our setting. We address this limitation by generating a high-quality corpus using open-weight LLMs, validated through human expert evaluation and LLM-based assessments. Our SQPsychLLM models fine-tuned on SQPsychConv achieve strong performance on counseling benchmarks, surpassing baselines in key therapeutic skills. Our findings highlight the potential of synthetic data to enable scalable, data-secure, and clinically informed AI for mental health support. We will release our code, models, and corpus at https://ai-mh.github.io/SQPsych
☆ Hallucinations in Bibliographic Recommendation: Citation Frequency as a Proxy for Training Data Redundancy
Large language models (LLMs) have been increasingly applied to a wide range of tasks, from natural language understanding to code generation. While they have also been used to assist in bibliographic recommendation, the hallucination of non-existent papers remains a major issue. Building on prior studies, this study hypothesizes that an LLM's ability to correctly produce bibliographic information depends on whether the underlying knowledge is generated or memorized, with highly cited papers (i.e., more frequently appear in the training corpus) showing lower hallucination rates. We therefore assume citation count as a proxy for training data redundancy (i.e., the frequency with which a given bibliographic record is repeatedly represented in the pretraining corpus) and investigate how citation frequency affects hallucinated references in LLM outputs. Using GPT-4.1, we generated and manually verified 100 bibliographic records across twenty computer-science domains, and measured factual consistency via cosine similarity between generated and authentic metadata. The results revealed that (i) hallucination rates vary across research domains, (ii) citation count is strongly correlated with factual accuracy, and (iii) bibliographic information becomes almost verbatimly memorized beyond approximately 1,000 citations. These findings suggest that highly cited papers are nearly verbatimly retained in the model, indicating a threshold where generalization shifts into memorization.
☆ Monitoring Transformative Technological Convergence Through LLM-Extracted Semantic Entity Triple Graphs
Forecasting transformative technologies remains a critical but challenging task, particularly in fast-evolving domains such as Information and Communication Technologies (ICTs). Traditional expert-based methods struggle to keep pace with short innovation cycles and ambiguous early-stage terminology. In this work, we propose a novel, data-driven pipeline to monitor the emergence of transformative technologies by identifying patterns of technological convergence. Our approach leverages advances in Large Language Models (LLMs) to extract semantic triples from unstructured text and construct a large-scale graph of technology-related entities and relations. We introduce a new method for grouping semantically similar technology terms (noun stapling) and develop graph-based metrics to detect convergence signals. The pipeline includes multi-stage filtering, domain-specific keyword clustering, and a temporal trend analysis of topic co-occurence. We validate our methodology on two complementary datasets: 278,625 arXiv preprints (2017--2024) to capture early scientific signals, and 9,793 USPTO patent applications (2018-2024) to track downstream commercial developments. Our results demonstrate that the proposed pipeline can identify both established and emerging convergence patterns, offering a scalable and generalizable framework for technology forecasting grounded in full-text analysis.
☆ CLASS-IT: Conversational and Lecture-Aligned Small-Scale Instruction Tuning for BabyLMs EMNLP2025
This work investigates whether small-scale LMs can benefit from instruction tuning. We compare conversational and question-answering instruction tuning datasets, applied either in a merged or sequential curriculum, using decoder-only models with 100M and 140M parameters. Evaluation spans both fine-tuning (SuperGLUE) and zero-shot (BLiMP, EWoK, WUGs, entity tracking, and psycholinguistic correlation) settings. Results show that instruction tuning yields small but consistent gains in fine-tuning scenarios, with sequential curricula outperforming merged data; however, improvements do not consistently transfer to zero-shot tasks, suggesting a trade-off between interaction-focused adaptation and broad linguistic generalization. These results highlight both the potential and the constraints of adapting human-inspired learning strategies to low-resource LMs, and point toward hybrid, curriculum-based approaches for enhancing generalization under ecological training limits.
comment: Paper accepted for oral presentation at the BabyLM Challange 2025 (EMNLP2025)
☆ Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments
Legal interpretation frequently involves assessing how a legal text, as understood by an 'ordinary' speaker of the language, applies to the set of facts characterizing a legal dispute in the U.S. judicial system. Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM interpretation as recently practiced by legal scholars and federal judges. Our investigation in English shows that models do not provide stable interpretive judgments: varying the question format can lead the model to wildly different conclusions. Moreover, the models show weak to moderate correlation with human judgment, with large variance across model and question variant, suggesting that it is dangerous to give much credence to the conclusions produced by generative AI.
☆ CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories
Recent years have witnessed the rapid development of LLM-based agents, which shed light on using language agents to solve complex real-world problems. A prominent application lies in business agents, which interact with databases and internal knowledge bases via tool calls to fulfill diverse user requirements. However, this domain is characterized by intricate data relationships and a wide range of heterogeneous tasks, from statistical data queries to knowledge-based question-answering. To address these challenges, we propose CRMWeaver, a novel approach that enhances business agents in such complex settings. To acclimate the agentic model to intricate business environments, we employ a synthesis data generation and RL-based paradigm during training, which significantly improves the model's ability to handle complex data and varied tasks. During inference, a shared memories mechanism is introduced, prompting the agent to learn from task guidelines in similar problems, thereby further boosting its effectiveness and generalization, especially in unseen scenarios. We validate the efficacy of our approach on the CRMArena-Pro dataset, where our lightweight model achieves competitive results in both B2B and B2C business scenarios, underscoring its practical value for real-world applications.
☆ GAP: Graph-Based Agent Planning with Parallel Tool Use and Reinforcement Learning
Autonomous agents powered by large language models (LLMs) have shown impressive capabilities in tool manipulation for complex task-solving. However, existing paradigms such as ReAct rely on sequential reasoning and execution, failing to exploit the inherent parallelism among independent sub-tasks. This sequential bottleneck leads to inefficient tool utilization and suboptimal performance in multi-step reasoning scenarios. We introduce Graph-based Agent Planning (GAP), a novel framework that explicitly models inter-task dependencies through graph-based planning to enable adaptive parallel and serial tool execution. Our approach trains agent foundation models to decompose complex tasks into dependency-aware sub-task graphs, autonomously determining which tools can be executed in parallel and which must follow sequential dependencies. This dependency-aware orchestration achieves substantial improvements in both execution efficiency and task accuracy. To train GAP, we construct a high-quality dataset of graph-based planning traces derived from the Multi-Hop Question Answering (MHQA) benchmark. We employ a two-stage training strategy: supervised fine-tuning (SFT) on the curated dataset, followed by reinforcement learning (RL) with a correctness-based reward function on strategically sampled queries where tool-based reasoning provides maximum value. Experimental results on MHQA datasets demonstrate that GAP significantly outperforms traditional ReAct baselines, particularly on multi-step retrieval tasks, while achieving dramatic improvements in tool invocation efficiency through intelligent parallelization. The project page is available at: https://github.com/WJQ7777/Graph-Agent-Planning.
☆ Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning
Natural language chain-of-thought (N-CoT) and Program chain-of-thought (P-CoT) have emerged as two primary paradigms for large language models (LLMs) to solve mathematical reasoning problems. Current research typically endeavors to achieve unidirectional enhancement: P-CoT enhanced N-CoT or N-CoT enhanced P-CoT. In this paper, we seek to fully unleash the two paradigms' strengths for mutual enhancement and ultimately achieve simultaneous improvements. We conduct a detailed analysis of the error types across two paradigms, based on which we propose Parrot, a novel training pipeline for mathematical problems: 1) Three target-designed subtasks integrate sequential P-CoT and N-CoT generation. 2) A subtask hybrid training strategy to facilitate natural language semantic transferability. 3) The converted N-CoT auxiliary reward is designed to alleviate the sparse rewards in P-CoT optimization. Extensive experiments demonstrate that Parrot significantly enhances both the performance of N-CoT and P-CoT, especially on N-CoT. Using Parrot SFT, the N-CoT performance of LLaMA2 and CodeLLaMA achieve gains of +21.87 and +21.48 on MathQA over the RL baseline, which is resource-intensive.
☆ Teaching Sarcasm: Few-Shot Multimodal Sarcasm Detection via Distillation to a Parameter-Efficient Student
Multimodal sarcasm detection is challenging, especially in low-resource settings where subtle image-text contradictions are hard to learn due to scarce annotated data, which hinders the model's performance. Parameter-efficient fine-tuning (PEFT) methods like adapters, LoRA, and prompt tuning reduce overfitting but struggle to reach optimal performance due to limited supervision from few-shot data. We propose PEKD, a unified framework that enhances PEFT methods via distillation from an expert model trained on large-scale sarcasm data, which acts as the teacher. To mitigate unreliable signals from the teacher, we introduce an entropy-aware gating mechanism that dynamically adjusts the distillation strength based on teacher confidence. Experiments on two public datasets demonstrate that our PEKD framework enables PEFT methods to outperform both prior parameter-efficient approaches and large multimodal models, achieving strong results in the few-shot scenario. The framework is modular and adaptable to a wide range of multimodal models and tasks.
☆ Adapting Small Language Models to Low-Resource Domains: A Case Study in Hindi Tourism QA
Domain-specific question answering in low-resource languages faces two key challenges: scarcity of annotated datasets and limited domain knowledge in general-purpose language models. In this work, we present a multi-stage finetuning strategy to adapt lightweight language models to the Hindi tourism domain by leveraging both original and synthetic training data. Synthetic question-answer pairs are generated using large LLMs (LLaMA-70B, Phi-14B) and used to augment the limited original dataset. We explore several training methodologies and analyse their impact on domain generalisation. Our results demonstrate that large models can efficiently generate synthetic data, while small models can effectively adapt to it, offering a scalable pathway for low-resource, domain-specific QA.
comment: Accepted at the Forum for Information Retrieval Evaluation 2025 (VATIKA Track)
☆ From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity
Psychiatric comorbidity is clinically significant yet challenging due to the complexity of multiple co-occurring disorders. To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation. We create 502 synthetic EMRs for common comorbid conditions using a pipeline that ensures clinical relevance and diversity. Our multi-agent framework transfers the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards. Through this rigorous process, we construct PsyCoTalk, the first large-scale dialogue dataset supporting comorbidity, containing 3,000 multi-turn diagnostic dialogues validated by psychiatrists. This dataset enhances diagnostic accuracy and treatment planning, offering a valuable resource for psychiatric comorbidity research. Compared to real-world clinical transcripts, PsyCoTalk exhibits high structural and linguistic fidelity in terms of dialogue length, token distribution, and diagnostic reasoning strategies. Licensed psychiatrists confirm the realism and diagnostic validity of the dialogues. This dataset enables the development and evaluation of models capable of multi-disorder psychiatric screening in a single conversational pass.
☆ ProMediate: A Socio-cognitive framework for evaluating proactive agents in multi-party negotiation
While Large Language Models (LLMs) are increasingly used in agentic frameworks to assist individual users, there is a growing need for agents that can proactively manage complex, multi-party collaboration. Systematic evaluation methods for such proactive agents remain scarce, limiting progress in developing AI that can effectively support multiple people together. Negotiation offers a demanding testbed for this challenge, requiring socio-cognitive intelligence to navigate conflicting interests between multiple participants and multiple topics and build consensus. Here, we present ProMediate, the first framework for evaluating proactive AI mediator agents in complex, multi-topic, multi-party negotiations. ProMediate consists of two core components: (i) a simulation testbed based on realistic negotiation cases and theory-driven difficulty levels (ProMediate-Easy, ProMediate-Medium, and ProMediate-Hard), with a plug-and-play proactive AI mediator grounded in socio-cognitive mediation theories, capable of flexibly deciding when and how to intervene; and (ii) a socio-cognitive evaluation framework with a new suite of metrics to measure consensus changes, intervention latency, mediator effectiveness, and intelligence. Together, these components establish a systematic framework for assessing the socio-cognitive intelligence of proactive AI agents in multi-party settings. Our results show that a socially intelligent mediator agent outperforms a generic baseline, via faster, better-targeted interventions. In the ProMediate-Hard setting, our social mediator increases consensus change by 3.6 percentage points compared to the generic baseline (10.65\% vs 7.01\%) while being 77\% faster in response (15.98s vs. 3.71s). In conclusion, ProMediate provides a rigorous, theory-grounded testbed to advance the development of proactive, socially intelligent agents.
☆ RAVR: Reference-Answer-guided Variational Reasoning for Large Language Models
Reinforcement learning (RL) can refine the reasoning abilities of large language models (LLMs), but critically depends on a key prerequisite: the LLM can already generate high-utility reasoning paths with non-negligible probability. For tasks beyond the LLM's current competence, such reasoning path can be hard to sample, and learning risks reinforcing familiar but suboptimal reasoning. We are motivated by the insight from cognitive science that Why is this the answer is often an easier question than What is the answer, as it avoids the heavy cognitive load of open-ended exploration, opting instead for explanatory reconstruction-systematically retracing the reasoning that links a question to its answer. We show that LLMs can similarly leverage answers to derive high-quality reasoning paths. We formalize this phenomenon and prove that conditioning on answer provably increases the expected utility of sampled reasoning paths, thereby transforming intractable problems into learnable ones. Building on this insight, we introduce RAVR (Reference-Answer-guided Variational Reasoning), an end-to-end framework that uses answer-conditioned reasoning as a variational surrogate for question-only reasoning. Experiments in both general and math domains demonstrate consistent improvements over strong baselines. We further analyze the reasoning behavior and find that RAVR reduces hesitation, strengthens conclusion consolidation, and promotes problem-specific strategies in reasoning.
comment: 17 pages, 11 figures
☆ Testing Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction
While large language models are trained on massive datasets, this data is heavily skewed towards English. Does their impressive performance reflect genuine ability or just this data advantage? To find out, we tested them in a setting where they could not rely on data abundance: low-resource languages. Building on prior work Agarwal et al. (2025) that used Next Sentence Prediction (NSP) as a test, we created a large-scale benchmark with 10,000 questions each for English (a high-resource language), Swahili (medium-resource), and Hausa (low-resource). We then tested several top models, including GPT-4 Turbo, Gemini 1.5 Flash, and LLaMA 3 70B, to see how their performance holds up. The results painted a clear picture of how levels of language resources impact outcomes. While all models excelled in English, their accuracy dropped in Swahili and fell sharply in Hausa, with LLaMA 3 struggling the most. The story became even more interesting when we introduced Chain-of-Thought (CoT) prompting. For the struggling LLaMA 3, CoT acted as a helpful guide, significantly boosting its accuracy. However, for the more capable GPT-4 and Gemini, the same technique often backfired, leading to a kind of "overthinking" that hurt their results in the cross-lingual context. This reveals that Chain-of-Thought is not a universal solution; its effectiveness depends heavily on the model's baseline capability and the specific context of the task. Our framework pinpoints LLM weaknesses, highlights when CoT helps or hinders cross-lingual NSP performance, and factors influencing their decisions.
☆ Model-Document Protocol for AI Search
AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.
comment: 10 pages
☆ Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR SC 2025
Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but are not always optimized for noisy or real-world environments. Building on existing works that quantize Whisper embeddings for speech-to-unit modeling, we propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors as quantization residue which are supervised via a lightweight classifier. We show that our approach improves alignment between clean/noisy speech and text, producing speech tokens that display a high degree of noiseinvariance, and improves ASR performance. Keeping Whisper frozen, we show an 82% reduction in error rate compared to Whisper, and 35% improvement over baseline methods on the VBDemand test set. Further analyses show that the learned token space generalizes well to both seen and unseen acoustic conditions.
comment: Awarded Best Student Paper at APSIPA ASC 2025
☆ A Survey on Unlearning in Large Language Models
The advancement of Large Language Models (LLMs) has revolutionized natural language processing, yet their training on massive corpora poses significant risks, including the memorization of sensitive personal data, copyrighted material, and knowledge that could facilitate malicious activities. To mitigate these issues and align with legal and ethical standards such as the "right to be forgotten", machine unlearning has emerged as a critical technique to selectively erase specific knowledge from LLMs without compromising their overall performance. This survey provides a systematic review of over 180 papers on LLM unlearning published since 2021, focusing exclusively on large-scale generative models. Distinct from prior surveys, we introduce novel taxonomies for both unlearning methods and evaluations. We clearly categorize methods into training-time, post-training, and inference-time based on the training stage at which unlearning is applied. For evaluations, we not only systematically compile existing datasets and metrics but also critically analyze their advantages, disadvantages, and applicability, providing practical guidance to the research community. In addition, we discuss key challenges and promising future research directions. Our comprehensive overview aims to inform and guide the ongoing development of secure and reliable LLMs.
Pretraining Strategies using Monolingual and Parallel Data for Low-Resource Machine Translation
This research article examines the effectiveness of various pretraining strategies for developing machine translation models tailored to low-resource languages. Although this work considers several low-resource languages, including Afrikaans, Swahili, and Zulu, the translation model is specifically developed for Lingala, an under-resourced African language, building upon the pretraining approach introduced by Reid and Artetxe (2021), originally designed for high-resource languages. Through a series of comprehensive experiments, we explore different pretraining methodologies, including the integration of multiple languages and the use of both monolingual and parallel data during the pretraining phase. Our findings indicate that pretraining on multiple languages and leveraging both monolingual and parallel data significantly enhance translation quality. This study offers valuable insights into effective pretraining strategies for low-resource machine translation, helping to bridge the performance gap between high-resource and low-resource languages. The results contribute to the broader goal of developing more inclusive and accurate NLP models for marginalized communities and underrepresented populations. The code and datasets used in this study are publicly available to facilitate further research and ensure reproducibility, with the exception of certain data that may no longer be accessible due to changes in public availability.
comment: 8 pages, 1. figure
☆ DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates
Accurately modeling opinion change through social interactions is crucial for addressing issues like misinformation and polarization. While role-playing large language models (LLMs) offer a promising way to simulate human-like interactions, existing research shows that single-agent alignment does not guarantee authentic multi-agent group dynamics. Current LLM role-play setups often produce unnatural dynamics (e.g., premature convergence), without an empirical benchmark to measure authentic human opinion trajectories. To bridge this gap, we introduce DEBATE, the first large-scale empirical benchmark explicitly designed to evaluate the authenticity of the interaction between multi-agent role-playing LLMs. DEBATE contains 29,417 messages from multi-round debate conversations among over 2,792 U.S.-based participants discussing 107 controversial topics, capturing both publicly-expressed messages and privately-reported opinions. Using DEBATE, we systematically evaluate and identify critical discrepancies between simulated and authentic group dynamics. We further demonstrate DEBATE's utility for aligning LLMs with human behavior through supervised fine-tuning, achieving improvements in surface-level metrics (e.g., ROUGE-L and message length) while highlighting limitations in deeper semantic alignment (e.g., semantic similarity). Our findings highlight both the potential and current limitations of role-playing LLM agents for realistically simulating human-like social dynamics.
☆ KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA
Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate its corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen the agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning on KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision via a multi-stage curriculum reinforcement learning with an easy-to-hard curriculum. To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.
☆ BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs
Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs' performance with four prompting experiments that vary in their use of local, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Our results demonstrate that while LLMs exhibit strong surface-level coreference capabilities, especially when supplemented with domain-grounding prompts, their performance remains sensitive to long-range context and mentions ambiguity. Notably, the LLaMA 8B and 17B models show superior precision and F1 scores under entity-augmented prompting, highlighting the potential of lightweight prompt engineering for enhancing LLM utility in biomedical NLP tasks.
☆ TOPol: Capturing and Explaining Multidimensional Semantic Polarity Fields and Vectors
Traditional approaches to semantic polarity in computational linguistics treat sentiment as a unidimensional scale, overlooking the multidimensional structure of language. This work introduces TOPol (Topic-Orientation POLarity), a semi-unsupervised framework for reconstructing and interpreting multidimensional narrative polarity fields under human-on-the-loop (HoTL) defined contextual boundaries (CBs). The framework embeds documents using a transformer-based large language model (tLLM), applies neighbor-tuned UMAP projection, and segments topics via Leiden partitioning. Given a CB between discourse regimes A and B, TOPol computes directional vectors between corresponding topic-boundary centroids, yielding a polarity field that quantifies fine-grained semantic displacement during regime shifts. This vectorial representation enables assessing CB quality and detecting polarity changes, guiding HoTL CB refinement. To interpret identified polarity vectors, the tLLM compares their extreme points and produces contrastive labels with estimated coverage. Robustness analyses show that only CB definitions (the main HoTL-tunable parameter) significantly affect results, confirming methodological stability. We evaluate TOPol on two corpora: (i) U.S. Central Bank speeches around a macroeconomic breakpoint, capturing non-affective semantic shifts, and (ii) Amazon product reviews across rating strata, where affective polarity aligns with NRC valence. Results demonstrate that TOPol consistently captures both affective and non-affective polarity transitions, providing a scalable, generalizable, and interpretable framework for context-sensitive multidimensional discourse analysis.
comment: 7 pages, 3 figures and 2 tables
☆ Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?
Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before it is administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity between options, cognitive features that arise during answer reasoning are not readily extractable using existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions-Evidence Scope and Transformation Level-that indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for prior difficulty analysis. Further analysis reveals a gap between LLMs' reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.
☆ GAPMAP: Mapping Scientific Knowledge Gaps in Biomedical Literature Using Large Language Models
Scientific progress is driven by the deliberate articulation of what remains unknown. This study investigates the ability of large language models (LLMs) to identify research knowledge gaps in the biomedical literature. We define two categories of knowledge gaps: explicit gaps, clear declarations of missing knowledge; and implicit gaps, context-inferred missing knowledge. While prior work has focused mainly on explicit gap detection, we extend this line of research by addressing the novel task of inferring implicit gaps. We conducted two experiments on almost 1500 documents across four datasets, including a manually annotated corpus of biomedical articles. We benchmarked both closed-weight models (from OpenAI) and open-weight models (Llama and Gemma 2) under paragraph-level and full-paper settings. To address the reasoning of implicit gaps inference, we introduce \textbf{\small TABI}, a Toulmin-Abductive Bucketed Inference scheme that structures reasoning and buckets inferred conclusion candidates for validation. Our results highlight the robust capability of LLMs in identifying both explicit and implicit knowledge gaps. This is true for both open- and closed-weight models, with larger variants often performing better. This suggests a strong ability of LLMs for systematically identifying candidate knowledge gaps, which can support early-stage research formulation, policymakers, and funding decisions. We also report observed failure modes and outline directions for robust deployment, including domain adaptation, human-in-the-loop verification, and benchmarking across open- and closed-weight models.
☆ Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech
Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models' generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, indicating that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.
comment: This work has been submitted to the IEEE for possible publication
☆ Rethinking Cross-lingual Alignment: Balancing Transfer and Cultural Erasure in Multilingual LLMs
Cross-lingual alignment (CLA) aims to align multilingual representations, enabling Large Language Models (LLMs) to seamlessly transfer knowledge across languages. While intuitive, we hypothesize, this pursuit of representational convergence can inadvertently cause "cultural erasure", the functional loss of providing culturally-situated responses that should diverge based on the query language. In this work, we systematically analyze this trade-off by introducing a holistic evaluation framework, the transfer-localization plane, which quantifies both desirable knowledge transfer and undesirable cultural erasure. Using this framework, we re-evaluate recent CLA approaches and find that they consistently improve factual transfer at the direct cost of cultural localization across all six languages studied. Our investigation into the internal representations of these models reveals a key insight: universal factual transfer and culturally-specific knowledge are optimally steerable at different model layers. Based on this finding, we propose Surgical Steering, a novel inference-time method that disentangles these two objectives. By applying targeted activation steering to distinct layers, our approach achieves a better balance between the two competing dimensions, effectively overcoming the limitations of current alignment techniques.
☆ PORTool: Tool-Use LLM Training with Rewarded Tree
Current tool-use large language models (LLMs) are trained on static datasets, enabling them to interact with external tools and perform multi-step, tool-integrated reasoning, which produces tool-call trajectories. However, these models imitate how a query is resolved in a generic tool-call routine, thereby failing to explore possible solutions and demonstrating limited performance in an evolved, dynamic tool-call environment. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories yielding the correct answer. Specifically, this method starts with generating multiple rollouts for a given query, and some of them share the first few tool-call steps, thereby forming a tree-like structure. Next, we assign rewards to each step, based on its ability to produce a correct answer and make successful tool calls. A shared step across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to calculate fork-relative advantages, blended with trajectory-relative advantages, to train the LLM for tool use. The experiments utilize 17 tools to address user queries, covering both time-sensitive and time-invariant topics. We conduct ablation studies to systematically justify the necessity and the design robustness of step-wise rewards. Furthermore, we compare the proposed PORTool with other training approaches and demonstrate significant improvements in final accuracy and the number of tool-call steps.
☆ CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments
Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks: anomaly description, explanation, and justification; with fine-grained annotations for visual grounding and categorizing anomalies based on their visual manifestations, their complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.
☆ Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
☆ AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many realworld workloads such as classification, question answering, recommendation, and text embedding rely solely on the prefill stage of inference, where the model encodes input sequences without performing autoregressive decoding. In these prefill only scenarios, the self-attention computation becomes the primary performance bottleneck due to its quadratic complexity with respect to sequence length. In this paper, we observe that semantically different sentences often produce similar attention maps across layers and heads. Building on this insight, we propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. Based on an attention map memorization database, AttnCache employs efficient caching and similarity search techniques to identify and reuse pre-cached attention maps during inference, thereby reducing the computational overhead of self-attention. Experimental results show that AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.
comment: 10 pages, 6 figures, submitted to Ninth Annual Conference on Machine Learning and Systems (MLSys'26)
☆ NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium EuroSys'26
AI accelerators, customized to AI workloads, provide cost-effective and high-performance solutions for training and inference. Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM training and inference through its heterogeneous architecture. However, leveraging Trainium architecture for high performance can be challenging because of its systolic array architecture and special requirement on data layout. In this paper, we design high-performance matrix multiplication (matmul), a critical compute kernel, for LLM inference on Trainium. We introduce a series of techniques customized to Trainium based on kernel fusion and novel caching strategies to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transpose. Evaluating with nine datasets and four recent LLMs, we show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the level of matmul kernel, it achieves an average 1.35x speedup (up to 2.22x), which translates to an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference.
comment: 12 pages, 8 figures, submitted to the Proceedings of the Twenty-First European Conference on Computer Systems (EuroSys'26)
☆ SymCode: A Neurosymbolic Approach to Mathematical Reasoning via Verifiable Code Generation
Large Language Models (LLMs) often struggle with complex mathematical reasoning, where prose-based generation leads to unverified and arithmetically unsound solutions. Current prompting strategies like Chain of Thought still operate within this unreliable medium, lacking a mechanism for deterministic verification. To address these limitations, we introduce SymCode, a neurosymbolic framework that reframes mathematical problem-solving as a task of verifiable code generation using the SymPy library. We evaluate SymCode on challenging benchmarks, including MATH-500 and OlympiadBench, demonstrating significant accuracy improvements of up to 13.6 percentage points over baselines. Our analysis shows that SymCode is not only more token-efficient but also fundamentally shifts model failures from opaque logical fallacies towards transparent, programmatic errors. By grounding LLM reasoning in a deterministic symbolic engine, SymCode represents a key step towards more accurate and trustworthy AI in formal domains.
☆ Semantic Label Drift in Cross-Cultural Translation
Machine Translation (MT) is widely employed to address resource scarcity in low-resource languages by generating synthetic data from high-resource counterparts. While sentiment preservation in translation has long been studied, a critical but underexplored factor is the role of cultural alignment between source and target languages. In this paper, we hypothesize that semantic labels are drifted or altered during MT due to cultural divergence. Through a series of experiments across culturally sensitive and neutral domains, we establish three key findings: (1) MT systems, including modern Large Language Models (LLMs), induce label drift during translation, particularly in culturally sensitive domains; (2) unlike earlier statistical MT tools, LLMs encode cultural knowledge, and leveraging this knowledge can amplify label drift; and (3) cultural similarity or dissimilarity between source and target languages is a crucial determinant of label preservation. Our findings highlight that neglecting cultural factors in MT not only undermines label fidelity but also risks misinterpretation and cultural conflict in downstream applications.
☆ Revisiting Multilingual Data Mixtures in Language Model Pretraining
The impact of different multilingual data mixtures in pretraining large language models (LLMs) has been a topic of ongoing debate, often raising concerns about potential trade-offs between language coverage and model performance (i.e., the curse of multilinguality). In this work, we investigate these assumptions by training 1.1B and 3B parameter LLMs on diverse multilingual corpora, varying the number of languages from 25 to 400. Our study challenges common beliefs surrounding multilingual training. First, we find that combining English and multilingual data does not necessarily degrade the in-language performance of either group, provided that languages have a sufficient number of tokens included in the pretraining corpus. Second, we observe that using English as a pivot language (i.e., a high-resource language that serves as a catalyst for multilingual generalization) yields benefits across language families, and contrary to expectations, selecting a pivot language from within a specific family does not consistently improve performance for languages within that family. Lastly, we do not observe a significant "curse of multilinguality" as the number of training languages increases in models at this scale. Our findings suggest that multilingual data, when balanced appropriately, can enhance language model capabilities without compromising performance, even in low-resource settings
comment: Under Review
☆ RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline
If we cannot inspect the training data of a large language model (LLM), how can we ever know what it has seen? We believe the most compelling evidence arises when the model itself freely reproduces the target content. As such, we propose RECAP, an agentic pipeline designed to elicit and verify memorized training data from LLM outputs. At the heart of RECAP is a feedback-driven loop, where an initial extraction attempt is evaluated by a secondary language model, which compares the output against a reference passage and identifies discrepancies. These are then translated into minimal correction hints, which are fed back into the target model to guide subsequent generations. In addition, to address alignment-induced refusals, RECAP includes a jailbreaking module that detects and overcomes such barriers. We evaluate RECAP on EchoTrace, a new benchmark spanning over 30 full books, and the results show that RECAP leads to substantial gains over single-iteration approaches. For instance, with GPT-4.1, the average ROUGE-L score for the copyrighted text extraction improved from 0.38 to 0.47 - a nearly 24% increase.
☆ FakeZero: Real-Time, Privacy-Preserving Misinformation Detection for Facebook and X
Social platforms distribute information at unprecedented speed, which in turn accelerates the spread of misinformation and threatens public discourse. We present FakeZero, a fully client-side, cross-platform browser extension that flags unreliable posts on Facebook and X (formerly Twitter) while the user scrolls. All computation, DOM scraping, tokenisation, Transformer inference, and UI rendering run locally through the Chromium messaging API, so no personal data leaves the device.FakeZero employs a three-stage training curriculum: baseline fine-tuning and domain-adaptive training enhanced with focal loss, adversarial augmentation, and post-training quantisation. Evaluated on a dataset of 239,000 posts, the DistilBERT-Quant model (67.6 MB) reaches 97.1% macro-F1, 97.4% accuracy, and an AUROC of 0.996, with a median latency of approximately 103 ms on a commodity laptop. A memory-efficient TinyBERT-Quant variant retains 95.7% macro-F1 and 96.1% accuracy while shrinking the model to 14.7 MB and lowering latency to approximately 40 ms, showing that high-quality fake-news detection is feasible under tight resource budgets with only modest performance loss.By providing inline credibility cues, the extension can serve as a valuable tool for policymakers seeking to curb the spread of misinformation across social networks. With user consent, FakeZero also opens the door for researchers to collect large-scale datasets of fake news in the wild, enabling deeper analysis and the development of more robust detection techniques.
comment: Accepted for publication in the Proceedings of the 24th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom 2025) Privacy track, 11 pages, 8 figures
☆ Evaluating the Impact of LLM-Assisted Annotation in a Perspectivized Setting: the Case of FrameNet Annotation
The use of LLM-based applications as a means to accelerate and/or substitute human labor in the creation of language resources and dataset is a reality. Nonetheless, despite the potential of such tools for linguistic research, comprehensive evaluation of their performance and impact on the creation of annotated datasets, especially under a perspectivized approach to NLP, is still missing. This paper contributes to reduction of this gap by reporting on an extensive evaluation of the (semi-)automatization of FrameNet-like semantic annotation by the use of an LLM-based semantic role labeler. The methodology employed compares annotation time, coverage and diversity in three experimental settings: manual, automatic and semi-automatic annotation. Results show that the hybrid, semi-automatic annotation setting leads to increased frame diversity and similar annotation coverage, when compared to the human-only setting, while the automatic setting performs considerably worse in all metrics, except for annotation time.
☆ Approximating Human Preferences Using a Multi-Judge Learned System
Aligning LLM-based judges with human preferences is a significant challenge, as they are difficult to calibrate and often suffer from rubric sensitivity, bias, and instability. Overcoming this challenge advances key applications, such as creating reliable reward models for Reinforcement Learning from Human Feedback (RLHF) and building effective routing systems that select the best-suited model for a given user query. In this work, we propose a framework for modeling diverse, persona-based preferences by learning to aggregate outputs from multiple rubric-conditioned judges. We investigate the performance of this approach against naive baselines and assess its robustness through case studies on both human and LLM-judges biases. Our primary contributions include a persona-based method for synthesizing preference labels at scale and two distinct implementations of our aggregator: Generalized Additive Model (GAM) and a Multi-Layer Perceptron (MLP).
☆ Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, when human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLM models. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance the reliability of LLM raters.
♻ ☆ Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation
Generative AI systems for music and video commonly use text-based filters to prevent the regurgitation of copyrighted material. We expose a fundamental flaw in this approach by introducing Adversarial PhoneTic Prompting (APT), a novel attack that bypasses these safeguards by exploiting phonetic memorization. The APT attack replaces iconic lyrics with homophonic but semantically unrelated alternatives (e.g., "mom's spaghetti" becomes "Bob's confetti"), preserving acoustic structure while altering meaning; we identify high-fidelity phonetic matches using CMU pronouncing dictionary. We demonstrate that leading Lyrics-to-Song (L2S) models like SUNO and YuE regenerate songs with striking melodic and rhythmic similarity to their copyrighted originals when prompted with these altered lyrics. More surprisingly, this vulnerability extends across modalities. When prompted with phonetically modified lyrics from a song, a Text-to-Video (T2V) model like Veo 3 reconstructs visual scenes from the original music video-including specific settings and character archetypes-despite the absence of any visual cues in the prompt. Our findings reveal that models memorize deep, structural patterns tied to acoustics, not just verbatim text. This phonetic-to-visual leakage represents a critical vulnerability in transcript-conditioned generative models, rendering simple copyright filters ineffective and raising urgent concerns about the secure deployment of multimodal AI systems. Demo examples are available at our project page (https://jrohsc.github.io/music_attack/).
♻ ☆ LLMs are Better Than You Think: Label-Guided In-Context Learning for Named Entity Recognition EMNLP 2025
In-context learning (ICL) enables large language models (LLMs) to perform new tasks using only a few demonstrations. However, in Named Entity Recognition (NER), existing ICL methods typically rely on task-agnostic semantic similarity for demonstration retrieval, which often yields less relevant examples and leads to inferior results. We introduce DEER, a training-free ICL approach that enables LLMs to make more informed entity predictions through the use of label-grounded statistics. DEER leverages token-level statistics from training labels to identify tokens most informative for entity recognition, enabling entity-focused demonstrations. It further uses these statistics to detect and refine error-prone tokens through a targeted reflection step. Evaluated on five NER datasets across four LLMs, DEER consistently outperforms existing ICL methods and achieves performance comparable to supervised fine-tuning. Further analyses demonstrate that DEER improves example retrieval, remains effective on both seen and unseen entities, and exhibits strong robustness in low-resource settings.
comment: Accepted to EMNLP 2025
♻ ☆ Spontaneous Giving and Calculated Greed in Language Models EMNLP 2025
Large language models demonstrate strong problem-solving abilities through reasoning techniques such as chain-of-thought prompting and reflection. However, it remains unclear whether these reasoning capabilities extend to a form of social intelligence: making effective decisions in cooperative contexts. We examine this question using economic games that simulate social dilemmas. First, we apply chain-of-thought and reflection prompting to GPT-4o in a Public Goods Game. We then evaluate multiple off-the-shelf models across six cooperation and punishment games, comparing those with and without explicit reasoning mechanisms. We find that reasoning models consistently reduce cooperation and norm enforcement, favoring individual rationality. In repeated interactions, groups with more reasoning agents exhibit lower collective gains. These behaviors mirror human patterns of "spontaneous giving and calculated greed." Our findings underscore the need for LLM architectures that incorporate social intelligence alongside reasoning, to help address--rather than reinforce--the challenges of collective action.
comment: Accepted to EMNLP 2025 main conference and selected as an Oral Presentation
♻ ☆ Precise In-Parameter Concept Erasure in Large Language Models EMNLP 2025
Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES (Precise In-parameter Suppression for Concept EraSure), a novel framework for precisely erasing entire concepts from model parameters by directly editing directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept using automated interpretability techniques, and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7%, while dramatically improving erasure specificity (by up to 31%) and robustness (by up to 38%). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models.
comment: Accepted to EMNLP 2025 Main Conference
♻ ☆ SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation NeurIPS 2025
Simultaneous Speech Translation (SimulST) enables real-time cross-lingual communication by jointly optimizing speech recognition and machine translation under strict latency constraints. Existing systems struggle to balance translation quality, latency, and semantic coherence, particularly in multilingual many-to-many scenarios where divergent read and write policies hinder unified strategy learning. In this paper, we present SimulMEGA (Simultaneous Generation by Mixture-of-Experts Gating), an unsupervised policy learning framework that combines prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions in an implicit manner, without adding inference-time overhead. Our design requires only minimal modifications to standard transformer architectures and generalizes across both speech-to-text and text-to-speech streaming tasks. Through comprehensive evaluation on six language pairs, our 500M parameter speech-to-text model outperforms the Seamless baseline, achieving under 7 percent BLEU degradation at 1.5 seconds average lag and under 3 percent at 3 seconds. We further demonstrate the versatility of SimulMEGA by extending it to streaming TTS with a unidirectional backbone, yielding superior latency quality tradeoffs.
comment: NeurIPS 2025 poster
♻ ☆ SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports four capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruct tuning datasets and the evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.
comment: technical report
♻ ☆ OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning
Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and long-form tasks, where evaluating correctness requires grounding beyond the model's internal knowledge. This limitation hinders them from reliably discriminating subtle quality differences, especially when external evidence is necessary. To address this, we introduce OpenRM, a tool-augmented long-form reward model that systematically judges open-ended responses by invoking external tools to gather relevant evidence. We train OpenRM with Group Relative Policy Optimization (GRPO) on over 27K synthesized pairwise examples generated through a controllable data synthesis framework. The training objective jointly supervises intermediate tool usage and final outcome accuracy, incentivizing our reward model to learn effective evidence-based judgment strategies. Extensive experiments on three newly-collected datasets and two widely-used benchmarks demonstrate that OpenRM substantially outperforms existing reward modeling approaches. As a further step, we integrate OpenRM into both inference-time response selection and training-time data selection. This yields consistent gains in downstream LLM alignment tasks, highlighting the potential of tool-augmented reward models for scaling reliable long-form evaluation.
♻ ☆ RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
Traditional feedback learning for hallucination reduction relies on labor-intensive manual labeling or expensive proprietary models. This leaves the community without foundational knowledge about how to build high-quality feedback with open-source MLLMs. In this work, we introduce RLAIF-V, a novel framework that aligns MLLMs in a fully open-source paradigm. RLAIF-V maximally explores open-source MLLMs from two perspectives, including high-quality feedback data generation for preference learning and self-feedback guidance for inference-time scaling. Extensive experiments on six benchmarks in both automatic and human evaluation show that RLAIF-V substantially enhances the trustworthiness of models at both preference learning and inference time. RLAIF-V 7B reduces object hallucination by 80.7\% and overall hallucination by 33.7\%. Remarkably, RLAIF-V 12B further reveals the self-alignment potential of open-source MLLMs, where the model can learn from feedback of itself to achieve super GPT-4V trustworthiness.
comment: Project Website: https://github.com/RLHF-V/RLAIF-V
♻ ☆ Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization
Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify ones crucial for irreversible unlearning. We introduce Disruption Masking, a technique in which we only allow updating weights, where the signs of the unlearning gradient and the retaining gradient are the same. This ensures all updates are non-disruptive. Additionally, we identify the need for normalizing the unlearning gradients, and also confirm the usefulness of meta-learning. We combine these insights into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and validate its effectiveness at preventing the recovery of dangerous capabilities. MUDMAN outperforms the prior TAR method by 40%, setting a new state-of-the-art for robust unlearning.
♻ ☆ Quantum Transformer: Accelerating model inference via quantum linear algebra
Powerful generative artificial intelligence from large language models (LLMs) harnesses extensive computational resources for inference. In this work, we investigate the transformer architecture, a key component of these models, under the lens of fault-tolerant quantum computing. We develop quantum subroutines to construct the building blocks in the transformer, including the self-attention, residual connection with layer normalization, and feed-forward network. As an important subroutine, we show how to efficiently implement the Hadamard product and element-wise functions of matrices on quantum computers. Our algorithm prepares an amplitude encoding of the transformer output, which can be measured for prediction or use in the next layer. We find that the matrix norm of the input sequence plays a dominant role in the quantum complexity. With numerical experiments on open-source LLMs, including for bio-informatics applications, we demonstrate the potential of a quantum speedup for transformer inference in practical regimes.
comment: 45 pages
♻ ☆ GradeSQL: Test-Time Inference with Outcome Reward Models for Text-to-SQL Generation from Large Language Models
Text-to-SQL, the task of translating natural language questions into SQL queries, has significantly advanced with the introduction of Large Language Models (LLMs), broadening database accessibility for a wide range of users. Despite substantial progress in generating valid SQL, current LLMs still struggle with complex queries. To address this limitation, test-time strategies such as Best-of-N (BoN) and Majority Voting (Maj) are often employed, based on the assumption that LLMs can produce correct answers after multiple attempts. However, these methods rely on surface-level heuristics, selecting the syntactically correct query through execution-based BoN (ex-BoN) or the most frequently generated one through Majority Voting. Recently, Outcome Reward Models (ORMs), which assign utility scores to generated outputs based on semantic correctness, have emerged as a promising reinforcement learning approach for improving model alignment. We argue that ORMs could serve as an effective new test-time heuristic, although their application in this context remains largely underexplored. In this work, we propose a unified framework for training ORMs tailored to the Text-to-SQL task and assess their effectiveness as a test-time heuristic within the BoN strategy. We benchmark ORMs against ex-BoN and Maj across the BIRD and Spider datasets, fine-tuning diverse open-source LLMs from the Qwen2, Granite3, and Llama3 families. Results show that ORMs outperform ex-BoN and Maj, achieving execution accuracy gains of +4.33% (BIRD) and +2.10% (Spider) over ex-BoN, and +2.91% (BIRD) and +0.93% (Spider) over Maj. We further demonstrate that finetuning models already aligned with SQL generation, such as OmniSQL, yields superior ORM performance. Additionally, we observe that ORMs achieve competitive results on simple queries and benefit more from an increased number of candidates compared to ex-BoN and Maj.
♻ ☆ Reinforcement Learning Teachers of Test Time Scaling NeurIPS 2025
Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework. Code available at: https://github.com/SakanaAI/RLT
comment: Accepted at NeurIPS 2025
♻ ☆ Many LLMs Are More Utilitarian Than One NeurIPS 2025
Moral judgment is integral to large language models' (LLMs) social reasoning. As multi-agent systems gain prominence, it becomes crucial to understand how LLMs function when collaborating compared to operating as individual agents. In human moral judgment, group deliberation leads to a Utilitarian Boost: a tendency to endorse norm violations that inflict harm but maximize benefits for the greatest number of people. We study whether a similar dynamic emerges in multi-agent LLM systems. We test six models on well-established sets of moral dilemmas across two conditions: (1) Solo, where models reason independently, and (2) Group, where they engage in multi-turn discussions in pairs or triads. In personal dilemmas, where agents decide whether to directly harm an individual for the benefit of others, all models rated moral violations as more acceptable when part of a group, demonstrating a Utilitarian Boost similar to that observed in humans. However, the mechanism for the Boost in LLMs differed: While humans in groups become more utilitarian due to heightened sensitivity to decision outcomes, LLM groups showed either reduced sensitivity to norms or enhanced impartiality. We report model differences in when and how strongly the Boost manifests. We also discuss prompt and agent compositions that enhance or mitigate the effect. We end with a discussion of the implications for AI alignment, multi-agent design, and artificial moral reasoning. Code available at: https://github.com/baltaci-r/MoralAgents
comment: Accepted to the Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision-Language-Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.
♻ ☆ A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation
Large Language Models (LLMs) can generate human-like disinformation, yet their ability to personalise such content across languages and demographics remains underexplored. This study presents the first large-scale, multilingual analysis of persona-targeted disinformation generation by LLMs. Employing a red teaming methodology, we prompt eight state-of-the-art LLMs with 324 false narratives and 150 demographic personas (combinations of country, generation, and political orientation) across four languages--English, Russian, Portuguese, and Hindi--resulting in AI-TRAITS, a comprehensive dataset of 1.6 million personalised disinformation texts. Results show that the use of even simple personalisation prompts significantly increases the likelihood of jailbreaks across all studied LLMs, up to 10 percentage points, and alters linguistic and rhetorical patterns that enhance narrative persuasiveness. Models such as Grok and GPT exhibited jailbreak rates and personalisation scores both exceeding 85%. These insights expose critical vulnerabilities in current state-of-the-art LLMs and offer a foundation for improving safety alignment and detection strategies in multilingual and cross-demographic contexts.
♻ ☆ Reliable Evaluation and Benchmarks for Statement Autoformalization EMNLP 2025
Evaluating statement autoformalization, translating natural language mathematics into formal languages like Lean 4, remains a significant challenge, with few metrics, datasets, and standards to robustly measure progress. In this work, we present a comprehensive approach combining improved metrics, robust benchmarks, and systematic evaluation, to fill this gap. First, we introduce BEq+, an automated metric that correlates strongly with human judgment, along with ProofNetVerif, a new dataset for assessing the quality of evaluation metrics, containing 3,752 annotated examples. Second, we develop two new autoformalization benchmarks: ProofNet#, a corrected version of ProofNet, and RLM25, with 619 new pairs of research-level mathematics from six formalization projects. Through systematic experimentation across these benchmarks, we find that current techniques can achieve up to 45.1% accuracy on undergraduate mathematics but struggle with research-level content without proper context. Our work establishes a reliable foundation for evaluating and advancing autoformalization systems.
comment: Accepted to EMNLP 2025. New benchmarks released, see https://github.com/augustepoiroux/RLMEval , https://huggingface.co/datasets/PAug/ProofNetSharp , and https://huggingface.co/datasets/PAug/ProofNetVerif . For code, see https://github.com/augustepoiroux/LeanInteract
♻ ☆ When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails
Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex reasoning tasks but remain vulnerable to severe safety risks, including harmful content generation and jailbreak attacks. Existing mitigation strategies rely on injecting heuristic safety signals during training, which often suppress reasoning ability and fail to resolve the safety-reasoning trade-off. To systematically investigate this issue, we analyze the reasoning trajectories of diverse LRMs and uncover a phenomenon we term Self-Jailbreak, where models override their own risk assessments and justify responding to unsafe prompts. This finding reveals that LRMs inherently possess the ability to reject unsafe queries, but this ability is compromised, resulting in harmful outputs. Building on these insights, we propose the Chain-of-Guardrail (CoG), a training framework that recomposes or backtracks unsafe reasoning steps, steering the model back onto safe trajectories while preserving valid reasoning chains. Extensive experiments across multiple reasoning and safety benchmarks demonstrate that CoG substantially improves the safety of current LRMs while preserving comparable reasoning ability, significantly outperforming prior methods that suffer from severe safety-reasoning trade-offs.
comment: First two authors contributed equally. The main text is 10 pages, with an appendix of 19 pages. The paper contains 18 figures and 16 tables
♻ ☆ Robust Preference Optimization via Dynamic Target Margins ACL 2025
The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO heavily depends on the data quality, which is frequently compromised by noise. In this work, we propose $\gamma$-PO, a dynamic target margin preference optimization algorithm that adjust reward margins at the pairwise level. By introducing instance-specific margin calibration, $\gamma$-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, $\gamma$-PO is a plug-and-play method, compatible with variants of DPO that rely on reward margin between preference pairs. Across benchmarks such as AlpacaEval2 and Arena-Hard, $\gamma$-PO achieves an average 4.4\% improvement over other baselines, setting new benchmarks for state-of-the-art performance. Additionally, $\gamma$-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLMs alignment. Our codes are available at \href{https://github.com/sunjie279/gammaPO}{https://github.com/sunjie279/gammaPO}.
comment: 18 pages, 6 figures, accepted to Findings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)
♻ ☆ Differential Mamba AACL 2025
Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available: https://github.com/NadavSc/Diff-Mamba
comment: AACL 2025. We provide the code at https://github.com/NadavSc/Diff-Mamba
♻ ☆ Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?
Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations but dialect specific model training alleviates the issue. Our dataset also serves as a out of-distribution (OOD) resource for ASR modeling under constrained resources in ASR algorithms. The dataset and code developed for this project are publicly available
comment: The manuscript has to be withdrawn to address an authorship and intellectual property clarification
♻ ☆ ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards
Search agents powered by Large Language Models (LLMs) have demonstrated significant potential in tackling knowledge-intensive tasks. Reinforcement learning (RL) has emerged as a powerful paradigm for training these agents to perform complex, multi-step reasoning. However, prior RL-based methods often rely on sparse or rule-based rewards, which can lead agents to commit to suboptimal or erroneous reasoning paths without the ability to recover. To address these limitations, we propose ReSeek, a novel self-correcting framework for training search agents. Our framework introduces a self-correction mechanism that empowers the agent to dynamically identify and recover from erroneous search paths during an episode. By invoking a special JUDGE action, the agent can judge the information and re-plan its search strategy. To guide this process, we design a dense, instructive process reward function, which decomposes into a correctness reward for retrieving factual information and a utility reward for finding information genuinely useful for the query. Furthermore, to mitigate the risk of data contamination in existing datasets, we introduce FictionalHot, a new and challenging benchmark with recently curated questions requiring complex reasoning. Being intuitively reasonable and practically simple, extensive experiments show that agents trained with ReSeek significantly outperform SOTA baselines in task success rate and path faithfulness.
comment: 19 pages
♻ ☆ Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation NeurIPS 2025
In recent years, integrating large language models (LLMs) into recommender systems has created new opportunities for improving recommendation quality. However, a comprehensive benchmark is needed to thoroughly evaluate and compare the recommendation capabilities of LLMs with traditional recommender systems. In this paper, we introduce RecBench, which systematically investigates various item representation forms (including unique identifier, text, semantic embedding, and semantic identifier) and evaluates two primary recommendation tasks, i.e., click-through rate prediction (CTR) and sequential recommendation (SeqRec). Our extensive experiments cover up to 17 large models and are conducted across five diverse datasets from fashion, news, video, books, and music domains. Our findings indicate that LLM-based recommenders outperform conventional recommenders, achieving up to a 5% AUC improvement in the CTR scenario and up to a 170% NDCG@10 improvement in the SeqRec scenario. However, these substantial performance gains come at the expense of significantly reduced inference efficiency, rendering the LLM-as-RS paradigm impractical for real-time recommendation environments. We aim for our findings to inspire future research, including recommendation-specific model acceleration methods. We will release our code, data, configurations, and platform to enable other researchers to reproduce and build upon our experimental results.
comment: NeurIPS 2025 DB Track Accepted Paper
♻ ☆ MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs
The widespread adoption of Large Language Models (LLMs) raises critical concerns about the factual accuracy of their outputs, especially in high-risk domains such as biomedicine, law, and education. Existing evaluation methods for short texts often fail on long-form content due to complex reasoning chains, intertwined perspectives, and cumulative information. To address this, we propose a systematic approach integrating large-scale long-form datasets, multi-agent verification mechanisms, and weighted evaluation metrics. We construct LongHalluQA, a Chinese long-form factuality dataset; and develop MAD-Fact, a debate-based multi-agent verification system. We introduce a fact importance hierarchy to capture the varying significance of claims in long-form texts. Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency, while domestic models excel on Chinese content. Our work provides a structured framework for evaluating and enhancing factual reliability in long-form LLM outputs, guiding their safe deployment in sensitive domains.
comment: The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: {10.1007/s11704-025-51369-x}
♻ ☆ NL-Debugging: Exploiting Natural Language as an Intermediate Representation for Code Debugging
Debugging is a critical aspect of LLM's coding ability. Early debugging efforts primarily focused on code-level analysis, which often falls short when addressing complex programming errors that require a deeper understanding of algorithmic logic. Recent advancements in large language models (LLMs) have shifted attention toward leveraging natural language reasoning to enhance code-related tasks. However, two fundamental questions remain unanswered: What type of natural language format is most effective for debugging tasks? And what specific benefits does natural language reasoning bring to the debugging process? In this paper, we introduce NL-DEBUGGING, a novel framework that employs natural language as an intermediate representation to improve code debugging. By debugging at a natural language level, we demonstrate that NL-DEBUGGING outperforms traditional debugging methods and enables a broader modification space through direct refinement guided by execution feedback. Our findings highlight the potential of natural language reasoning to advance automated code debugging and address complex programming challenges.
♻ ☆ WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction
In this paper, we present WEST(WE Speech Toolkit), a speech toolkit based on a large language model (LLM) for speech understanding, generation, and interaction. There are three key features of WEST: 1) Fully LLM-based: Standing on the shoulders of giants by reusing mature architectures, ecosystems (e.g., Hugging Face), and methods (e.g., sequence packing) from large models. 2) Full-stack: Supports tasks such as recognition, synthesis, understanding, dialogue, and multimodal capabilities, with extensibility to incorporate open-source models. 3) Simple and Stupid: A simple and stupid speech toolkit that everyone can Touch. In addition, WEST provides two types of recipes, models, and experimental results. The first is entirely based on open-source models and open-source data, allowing users to fully reproduce the experiments in this paper and serving as a verification system or minimal system baseline. The second is trained on massive data, offering superior performance so the user can directly apply it out of the box. WEST is publicly avilable at https://github.com/wenet-e2e/west/
♻ ☆ Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection
Multimodal sarcasm detection has attracted growing interest due to the rise of multimedia posts on social media. Understanding sarcastic image-text posts often requires external contextual knowledge, such as cultural references or commonsense reasoning. However, existing models struggle to capture the deeper rationale behind sarcasm, relying mainly on shallow cues like image captions or object-attribute pairs from images. To address this, we propose \textbf{MiDRE} (\textbf{Mi}xture of \textbf{D}ual \textbf{R}easoning \textbf{E}xperts), which integrates an internal reasoning expert for detecting incongruities within the image-text pair and an external reasoning expert that utilizes structured rationales generated via Chain-of-Thought prompting to a Large Vision-Language Model. An adaptive gating mechanism dynamically weighs the two experts, selecting the most relevant reasoning path. Unlike prior methods that treat external knowledge as static input, MiDRE selectively adapts to when such knowledge is beneficial, mitigating the risks of hallucinated or irrelevant signals from large models. Experiments on two benchmark datasets show that MiDRE achieves superior performance over baselines. Various qualitative analyses highlight the crucial role of external rationales, revealing that even when they are occasionally noisy, they provide valuable cues that guide the model toward a better understanding of sarcasm.
♻ ☆ Blind Spot Navigation in Large Language Model Reasoning with Thought Space Explorer
Large language models have shown strong reasoning capabilities through chain-structured methods such as Chain-of-Thought. Recent studies optimize thought structures by generating parallel or tree-like structures, switching between long and short reasoning modes, or aligning reasoning steps with task performance. However, these approaches mainly rely on previously generated logical directions of the chains, which ignore the unexplored regions of the solution space. Such a phenomenon is defined as blind spots, which limit the diversity and effectiveness of the reasoning process. To this end, we propose the ``Thought Space Explorer'' (TSE), a framework for navigating and expanding thought structures to overcome blind spots in LLM reasoning. Our TSE first identifies key nodes with high impact, then generates new nodes by integrating information from multiple chains. Finally, it extends new branches through connection strategies. We conduct a series of experiments on math and QA benchmarks. Compared with existing baseline methods, TSE improves the accuracy of both the final answer and intermediate reasoning steps, while maintaining a better effectiveness-efficiency trade-off for practical deployment.
♻ ☆ Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both the pass@1 and pass@k . Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.
♻ ☆ Augmenting Dialog with Think-Aloud Utterances for Modeling Individual Personality Traits by LLM EMNLP2025
This study proposes augmenting dialog data with think-aloud utterances (TAUs) for modeling individual personalities in text chat by LLM. TAU is a verbalization of a speaker's thought before articulating the utterance. We expect "persona LLMs" trained with TAU-augmented data can mimic the speaker's personality trait better. We tested whether the trained persona LLMs obtain the human personality with respect to Big Five, a framework characterizing human personality traits from five aspects. The results showed that LLMs trained with TAU-augmented data more closely align to the speakers' Agreeableness and Neuroticism of Big Five than those trained with original dialog data. We also found that the quality of TAU-augmentation impacts persona LLM's performance.
comment: 8 pages, 1 figure. Accepted at the First Workshop on Tailoring AI: Exploring Active and Passive LLM Personalization (PALS2025@EMNLP2025)
♻ ☆ The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.
♻ ☆ Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards
Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards would diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promotes trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibits prolonged similarity during simulation. Compared with stochastic Sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at https://github.com/starreeze/latr.
♻ ☆ Adapter-state Sharing CLIP for Parameter-efficient Multimodal Sarcasm Detection
The growing prevalence of multimodal image-text sarcasm on social media poses challenges for opinion mining systems. Existing approaches rely on full fine-tuning of large models, making them unsuitable to adapt under resource-constrained settings. While recent parameter-efficient fine-tuning (PEFT) methods offer promise, their off-the-shelf use underperforms on complex tasks like sarcasm detection. We propose AdS-CLIP (Adapter-state Sharing in CLIP), a lightweight framework built on CLIP that inserts adapters only in the upper layers to preserve low-level unimodal representations in the lower layers and introduces a novel adapter-state sharing mechanism, where textual adapters guide visual ones to promote efficient cross-modal learning in the upper layers. Experiments on two public benchmarks demonstrate that AdS-CLIP not only outperforms standard PEFT methods but also existing multimodal baselines with significantly fewer trainable parameters.
♻ ☆ Face the Facts! Evaluating RAG-based Pipelines for Professional Fact-Checking
Natural Language Processing and Generation systems have recently shown the potential to complement and streamline the costly and time-consuming job of professional fact-checkers. In this work, we lift several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, following professional fact-checking practices, RAG-based methods for the generation of verdicts - i.e., short texts discussing the veracity of a claim - evaluating them on stylistically complex claims and heterogeneous, yet reliable, knowledge bases. Our findings show a complex landscape, where, for example, LLM-based retrievers outperform other retrieval techniques, though they still struggle with heterogeneous knowledge bases; larger models excel in verdict faithfulness, while smaller models provide better context adherence, with human evaluations favouring zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment.
comment: Code and data at https://github.com/drusso98/face-the-facts - Accepted for publication at INLG 2025
♻ ☆ S'MoRE: Structural Mixture of Residual Experts for Parameter-Efficient LLM Fine-tuning NeurIPS 2025
Fine-tuning pre-trained large language models (LLMs) presents a dual challenge of balancing parameter efficiency and model capacity. Existing methods like low-rank adaptations (LoRA) are efficient but lack flexibility, while Mixture-of-Experts (MoE) enhance model capacity at the cost of more & under-utilized parameters. To address these limitations, we propose Structural Mixture of Residual Experts (S'MoRE), a novel framework that seamlessly integrates the efficiency of LoRA with the flexibility of MoE. Conceptually, S'MoRE employs hierarchical low-rank decomposition of expert weights, yielding residuals of varying orders interconnected in a multi-layer structure. By routing input tokens through sub-trees of residuals, S'MoRE emulates the capacity of numerous experts by instantiating and assembling just a few low-rank matrices. We craft the inter-layer propagation of S'MoRE's residuals as a special type of Graph Neural Network (GNN), and prove that under similar parameter budget, S'MoRE improves structural flexibility of traditional MoE (or Mixture-of-LoRA) by exponential order. Comprehensive theoretical analysis and empirical results demonstrate that S'MoRE achieves superior fine-tuning performance, offering a transformative approach for efficient LLM adaptation. Our implementation is available at: https://github.com/ZimpleX/SMoRE-LLM.
comment: NeurIPS 2025
♻ ☆ Steering Information Utility in Key-Value Memory for Language Model Post-Training NeurIPS 2025
Recent advancements in language models (LMs) have marked a shift toward the growing importance of post-training. Yet, post-training approaches such as supervised fine-tuning (SFT) do not guarantee the effective use of knowledge acquired during pretraining. We therefore introduce InfoSteer, a lightweight method that encourages parametric information utilization in LMs during post-training. Specifically, InfoSteer treats the feed-forward network (FFN) layer as associate key-value memory and promotes the use of stored memory vectors via forward-pass interventions or regularization during backpropagation. This simple guidance during post-training phase yields consistent performance improvements across diverse model families -- including Qwen, Gemma and Llama -- spanning 15 downstream tasks in both in-distribution (ID) and out-of-distribution (OOD) evaluations. Beyond performance gains, we also find that steered LMs can adaptively allocate information by placing more emphasis on generating semantically meaningful tokens, while using fewer resources on simple transition ones (e.g., `\texttt{,}' or `\texttt{and}'). Our work underscores that vanilla post-training does not fully exploit the potential gained during pre-training, and that steering LMs in latent representation space offers a promising approach to enhance both performance and interpretability. The code is available at: https://github.com/chili-lab/InfoSteer.
comment: NeurIPS 2025
♻ ☆ OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models
As large language models (LLMs) are increasingly integrated into real-world applications, ensuring their safety, robustness, and privacy compliance has become critical. We present OpenGuardrails, the first fully open-source platform that unifies large-model-based safety detection, manipulation defense, and deployable guardrail infrastructure. OpenGuardrails protects against three major classes of risks: (1) content-safety violations such as harmful or explicit text generation, (2) model-manipulation attacks including prompt injection, jailbreaks, and code-interpreter abuse, and (3) data leakage involving sensitive or private information. Unlike prior modular or rule-based frameworks, OpenGuardrails introduces three core innovations: (1) a Configurable Policy Adaptation mechanism that allows per-request customization of unsafe categories and sensitivity thresholds; (2) a Unified LLM-based Guard Architecture that performs both content-safety and manipulation detection within a single model; and (3) a Quantized, Scalable Model Design that compresses a 14B dense base model to 3.3B via GPTQ while preserving over 98 of benchmark accuracy. The system supports 119 languages, achieves state-of-the-art performance across multilingual safety benchmarks, and can be deployed as a secure gateway or API-based service for enterprise use. All models, datasets, and deployment scripts are released under the Apache 2.0 license.
♻ ☆ AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System EMNLP-2025
Although large language models (LLMs) have revolutionized natural language processing capabilities, their practical implementation as autonomous multi-agent systems (MAS) for industrial problem-solving encounters persistent barriers. Conventional MAS architectures are fundamentally restricted by inflexible, hand-crafted graph topologies that lack contextual responsiveness, resulting in diminished efficacy across varied academic and commercial workloads. To surmount these constraints, we introduce AMAS, a paradigm-shifting framework that redefines LLM-based MAS through a novel dynamic graph designer. This component autonomously identifies task-specific optimal graph configurations via lightweight LLM adaptation, eliminating the reliance on monolithic, universally applied structural templates. Instead, AMAS exploits the intrinsic properties of individual inputs to intelligently direct query trajectories through task-optimized agent pathways. Rigorous validation across question answering, mathematical deduction, and code generation benchmarks confirms that AMAS systematically exceeds state-of-the-art single-agent and multi-agent approaches across diverse LLM architectures. Our investigation establishes that context-sensitive structural adaptability constitutes a foundational requirement for high-performance LLM MAS deployments.
comment: Accepted by EMNLP-2025
♻ ☆ FT-MDT: Extracting Decision Trees from Medical Texts via a Novel Low-rank Adaptation Method EMNLP-2025
Knowledge of the medical decision process, which can be modeled as medical decision trees (MDTs), is critical to building clinical decision support systems. However, current MDT construction methods rely heavily on time-consuming and laborious manual annotation. To address this challenge, we propose PI-LoRA (Path-Integrated LoRA), a novel low-rank adaptation method for automatically extracting MDTs from clinical guidelines and textbooks. We integrate gradient path information to capture synergistic effects between different modules, enabling more effective and reliable rank allocation. This framework ensures that the most critical modules receive appropriate rank allocations while less important ones are pruned, resulting in a more efficient and accurate model for extracting medical decision trees from clinical texts. Extensive experiments on medical guideline datasets demonstrate that our PI-LoRA method significantly outperforms existing parameter-efficient fine-tuning approaches for the Text2MDT task, achieving better accuracy with substantially reduced model complexity. The proposed method achieves state-of-the-art results while maintaining a lightweight architecture, making it particularly suitable for clinical decision support systems where computational resources may be limited.
comment: Accepted by EMNLP-2025
♻ ☆ Consistency of Responses and Continuations Generated by Large Language Models on Social Media AAAI
Large Language Models (LLMs) demonstrate remarkable capabilities in text generation, yet their emotional consistency and semantic coherence in social media contexts remain insufficiently understood. This study investigates how LLMs handle emotional content and maintain semantic relationships through continuation and response tasks using three open-source models: Gemma, Llama3 and Llama3.3 and one commercial Model:Claude. By analyzing climate change discussions from Twitter and Reddit, we examine emotional transitions, intensity patterns, and semantic consistency between human-authored and LLM-generated content. Our findings reveal that while both models maintain high semantic coherence, they exhibit distinct emotional patterns: these models show a strong tendency to moderate negative emotions. When the input text carries negative emotions such as anger, disgust, fear, or sadness, LLM tends to generate content with more neutral emotions, or even convert them into positive emotions such as joy or surprise. At the same time, we compared the LLM-generated content with human-authored content. The four models systematically generated responses with reduced emotional intensity and showed a preference for neutral rational emotions in the response task. In addition, these models all maintained a high semantic similarity with the original text, although their performance in the continuation task and the response task was different. These findings provide deep insights into the emotion and semantic processing capabilities of LLM, which are of great significance for its deployment in social media environments and human-computer interaction design.
comment: This paper has been accepted by the International AAAI Conference on Web and Social Media (ICWSM) 2026 (Los Angeles, California, U.S.)
♻ ☆ PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions NeurIPS 2025
Doctor-patient consultations require multi-turn, context-aware communication tailored to diverse patient personas. Training or evaluating doctor LLMs in such settings requires realistic patient interaction systems. However, existing simulators often fail to reflect the full range of personas seen in clinical practice. To address this, we introduce PatientSim, a patient simulator that generates realistic and diverse patient personas for clinical scenarios, grounded in medical expertise. PatientSim operates using: 1) clinical profiles, including symptoms and medical history, derived from real-world data in the MIMIC-ED and MIMIC-IV datasets, and 2) personas defined by four axes: personality, language proficiency, medical history recall level, and cognitive confusion level, resulting in 37 unique combinations. We evaluate eight LLMs for factual accuracy and persona consistency. The top-performing open-source model, Llama 3.3 70B, is validated by four clinicians to confirm the robustness of our framework. As an open-source, customizable platform, PatientSim provides a reproducible and scalable solution that can be customized for specific training needs. Offering a privacy-compliant environment, it serves as a robust testbed for evaluating medical dialogue systems across diverse patient presentations and shows promise as an educational tool for healthcare. The code is available at https://github.com/dek924/PatientSim.
comment: Accepted as a Spotlight at NeurIPS 2025 Datasets and Benchmarks Track (10 pages for main text, 4 pages for references, 36 pages for supplementary materials)
♻ ☆ DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans
The emerging large language model role-playing agents (LLM RPAs) aim to simulate individual human behaviors, but the persona fidelity is often undermined by manually-created profiles (e.g., cherry-picked information and personality characteristics) without validating the alignment with the target individuals. To address this limitation, our work introduces the Dynamic Persona Refinement Framework (DPRF). DPRF aims to optimize the alignment of LLM RPAs' behaviors with those of target individuals by iteratively identifying the cognitive divergence, either through free-form or theory-grounded, structured analysis, between generated behaviors and human ground truth, and refining the persona profile to mitigate these divergences. We evaluate DPRF with five LLMs on four diverse behavior-prediction scenarios: formal debates, social media posts with mental health issues, public interviews, and movie reviews. DPRF can consistently improve behavioral alignment considerably over baseline personas and generalizes across models and scenarios. Our work provides a robust methodology for creating high-fidelity persona profiles and enhancing the validity of downstream applications, such as user simulation, social studies, and personalized AI.
comment: In Submission
♻ ☆ ConsistencyAI: A Benchmark to Assess LLMs' Factual Consistency When Responding to Different Demographic Groups
Is an LLM telling you different facts than it's telling me? This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) for different personas. ConsistencyAI tests whether, when users of different demographics ask identical questions, the model responds with factually inconsistent answers. Designed without involvement from LLM providers, this benchmark offers impartial evaluation and accountability. In our experiment, we queried 19 LLMs with prompts that requested 5 facts for each of 15 topics. We repeated this query 100 times for each LLM, each time adding prompt context from a different persona selected from a subset of personas modeling the general population. We processed the responses into sentence embeddings, computed cross-persona cosine similarity, and computed the weighted average of cross-persona cosine similarity to calculate factual consistency scores. In 100-persona experiments, scores ranged from 0.9065 to 0.7896, and the mean was 0.8656, which we adopt as a benchmark threshold. xAI's Grok-3 is most consistent, while several lightweight models rank lowest. Consistency varies by topic: the job market is least consistent, G7 world leaders most consistent, and issues like vaccines or the Israeli-Palestinian conflict diverge by provider. These results show that both the provider and the topic shape the factual consistency. We release our code and interactive demo to support reproducible evaluation and encourage persona-invariant prompting strategies.
comment: For associated code repository, see http://github.com/banyasp/consistencyAI For user-friendly web app, see http://v0-llm-comparison-webapp.vercel.app/
♻ ☆ AutoLibra: Agent Metric Induction from Open-Ended Human Feedback
Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose **AutoLibra**, a framework for agent evaluation, that transforms open-ended human feedback *e.g.* "If you find that the button is disabled, don't click it again", or "This agent has too much autonomy to decide what to do on its own" into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge as evaluators. We further propose two meta metrics to evaluate the alignment of a set of (induced) metrics with open feedback: "coverage" and "redundancy". Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and discover new metrics to analyze agents. We also present two applications of AutoLibra in agent improvement: First, we show that AutoLibra serve human prompt engineers for diagonalize agent failures and improve prompts iterative. Moreover, we find that AutoLibra can induce metrics for automatic optimization for agents, which makes agents improve through self-regulation. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
comment: https://github.com/Open-Social-World/autolibra
♻ ☆ Reward Collapse in Aligning Large Language Models
The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences, which are often represented as rankings of responses to prompts. In this paper, we document the phenomenon of \textit{reward collapse}, an empirical observation where the prevailing ranking-based approach results in an \textit{identical} reward distribution \textit{regardless} of the prompts during the terminal phase of training. This outcome is undesirable as open-ended prompts like ``write a short story about your best friend'' should yield a continuous range of rewards for their completions, while specific prompts like ``what is the capital of New Zealand'' should generate either high or low rewards. Our theoretical investigation reveals that reward collapse is primarily due to the insufficiency of the ranking-based objective function to incorporate prompt-related information during optimization. This insight allows us to derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic regime. To overcome reward collapse, we introduce a prompt-aware optimization scheme that provably admits a prompt-dependent reward distribution within the interpolating regime. Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
comment: Accepted for publication in the Journal of Data Science (JDS), reference JDS1201
♻ ☆ Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It
Does vision-and-language (VL) training change the linguistic representations of language models in meaningful ways? Most results in the literature have shown inconsistent or marginal differences, both behaviorally and representationally. In this work, we start from the hypothesis that the domain in which VL training could have a significant effect is lexical-conceptual knowledge, in particular its taxonomic organization. Through comparing minimal pairs of text-only LMs and their VL-trained counterparts, we first show that the VL models often outperform their text-only counterparts on a text-only question-answering task that requires taxonomic understanding of concepts mentioned in the questions. Using an array of targeted behavioral and representational analyses, we show that the LMs and VLMs do not differ significantly in terms of their taxonomic knowledge itself, but they differ in how they represent questions that contain concepts in a taxonomic relation vs. a non-taxonomic relation. This implies that the taxonomic knowledge itself does not change substantially through additional VL training, but VL training does improve the deployment of this knowledge in the context of a specific task, even when the presentation of the task is purely linguistic.
♻ ☆ Language Models can Self-Improve at State-Value Estimation for Better Search
Collecting ground-truth rewards or human demonstrations for multi-step reasoning tasks is often prohibitively expensive, particularly in interactive domains such as web tasks. We introduce Self-Taught Lookahead (STL), a reward-free framework that improves language model-based value functions by reasoning explicitly about state transitions. STL can be viewed as a chain-of-thought analogue of the value iteration algorithm: instead of regressing directly on numeric values, a value LLM is trained to simulate a step of lookahead in natural language - predicting the next action, resulting state, and rationale for its value, thereby refining value estimates without any labeled data. This self-supervised procedure yields more accurate state-value predictions, which in turn enable lightweight search algorithms to expand fewer states while maintaining strong performance. Empirically, STL-trained value models built on moderately sized (8B parameter) open-weight LLMs boost web agent success rates by 39%, achieving comparable performance with proprietary models. STL also generalizes to multi-hop QA and math puzzles. We find that STL enables small open-source models to guide efficient search, reducing inference costs by integrating explicit reasoning with value learning.
♻ ☆ TabSTAR: A Tabular Foundation Model for Tabular Data with Text Fields NeurIPS 2025
While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees. However, recent advancements are paving the way for Tabular Foundation Models, which can leverage real-world knowledge and generalize across diverse datasets, particularly when the data contains free-text. Although incorporating language model capabilities into tabular tasks has been explored, most existing methods utilize static, target-agnostic textual representations, limiting their effectiveness. We introduce TabSTAR: a Tabular Foundation Model with Semantically Target-Aware Representations. TabSTAR is designed to enable transfer learning on tabular data with textual features, with an architecture free of dataset-specific parameters. It unfreezes a pretrained text encoder and takes as input target tokens, which provide the model with the context needed to learn task-specific embeddings. TabSTAR achieves state-of-the-art performance for both medium- and large-sized datasets across known benchmarks of classification tasks with text features, and its pretraining phase exhibits scaling laws in the number of datasets, offering a pathway for further performance improvements.
comment: Accepted to NeurIPS 2025
♻ ☆ This Candidate is [MASK]. Prompt-based Sentiment Extraction and Reference Letters
I propose a relatively simple way to deploy pre-trained large language models (LLMs) in order to extract sentiment and other useful features from text data. The method, which I refer to as prompt-based sentiment extraction, offers multiple advantages over other methods used in economics and finance. In particular, it accepts the text input as is (without pre-processing) and produces a sentiment score that has a probability interpretation. Unlike other LLM-based approaches, it does not require any fine-tuning or labeled data. I apply my prompt-based strategy to a hand-collected corpus of confidential reference letters (RLs). I show that the sentiment contents of RLs are clearly reflected in job market outcomes. Candidates with higher average sentiment in their RLs perform markedly better regardless of the measure of success chosen. Moreover, I show that sentiment dispersion among letter writers negatively affects the job market candidate's performance. I compare my sentiment extraction approach to other commonly used methods for sentiment analysis: `bag-of-words' approaches, fine-tuned language models, and querying advanced chatbots. No other method can fully reproduce the results obtained by prompt-based sentiment extraction. Finally, I slightly modify the method to obtain `gendered' sentiment scores (as in Eberhardt et al., 2023). I show that RLs written for female candidates emphasize `grindstone' personality traits, whereas male candidates' letters emphasize `standout' traits. These gender differences negatively affect women's job market outcomes.
♻ ☆ M-Prometheus: A Suite of Open Multilingual LLM Judges
The use of language models for automatically evaluating long-form text (LLM-as-a-judge) is becoming increasingly common, yet most LLM judges are optimized exclusively for English, with strategies for enhancing their multilingual evaluation capabilities remaining largely unexplored in the current literature. This has created a disparity in the quality of automatic evaluation methods for non-English languages, ultimately hindering the development of models with better multilingual capabilities. To bridge this gap, we introduce M-Prometheus, a suite of open-weight LLM judges ranging from 3B to 14B parameters that can provide both direct assessment and pairwise comparison feedback on multilingual outputs. M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation (MT) evaluation covering 4 language pairs. Furthermore, M-Prometheus models can be leveraged at decoding time to significantly improve generated outputs across all 3 tested languages, showcasing their utility for the development of better multilingual models. Lastly, through extensive ablations, we identify the key factors for obtaining an effective multilingual judge, including backbone model selection and training on synthetic multilingual feedback data instead of translated data. We release our models, training dataset, and code.
♻ ☆ Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
As language models improve and become capable of performing more complex tasks across modalities, evaluating them automatically becomes increasingly challenging. Developing strong and robust task-specific automatic metrics gets harder, and human-annotated test sets -- which are expensive to create -- saturate more quickly. A compelling alternative is to design reliable strategies to automate the creation of test data and evaluation, but previous attempts either rely on pre-existing data, or focus solely on individual tasks. We present Zero-shot Benchmarking (ZSB), a framework for creating high-quality benchmarks for any task by leveraging language models for both synthetic test data creation and evaluation. ZSB is simple and flexible: it requires only the creation of a prompt for data generation and one for evaluation; it is scalable to tasks and languages where collecting real-world data is costly or impractical; it is model-agnostic, allowing the creation of increasingly challenging benchmarks as models improve. To assess the effectiveness of our framework, we create benchmarks for five text-only tasks and a multi-modal one: general capabilities in four languages (English, Chinese, French, and Korean), translation, and general vision-language capabilities in English. We then rank a broad range of open and closed systems on our benchmarks. ZSB rankings consistently correlate strongly with human rankings, outperforming widely-adopted standard benchmarks. Through ablations, we find that strong benchmarks can be created with open models, and that judge model size and dataset variety are crucial drivers of performance. We release all our benchmarks, and code to reproduce our experiments and to produce new benchmarks.
♻ ☆ Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study
Large language models (LLMs) have shown impressive capabilities across tasks such as mathematics, coding, and reasoning, yet their learning ability, which is crucial for adapting to dynamic environments and acquiring new knowledge, remains underexplored. In this work, we address this gap by introducing a framework inspired by cognitive psychology and education. Specifically, we decompose general learning ability into three distinct, complementary dimensions: Learning from Instructor (acquiring knowledge via explicit guidance), Learning from Concept (internalizing abstract structures and generalizing to new contexts), and Learning from Experience (adapting through accumulated exploration and feedback). We conduct a comprehensive empirical study across the three learning dimensions and identify several insightful findings, such as (i) interaction improves learning; (ii) conceptual understanding is scale-emergent and benefits larger models; and (iii) LLMs are effective few-shot learners but not many-shot learners. Based on our framework and empirical findings, we introduce a benchmark that provides a unified and realistic evaluation of LLMs' general learning abilities across three learning cognition dimensions. It enables diagnostic insights and supports evaluation and development of more adaptive and human-like models.
♻ ☆ The LSCD Benchmark: a Testbed for Diachronic Word Meaning Tasks
Lexical Semantic Change Detection (LSCD) is a complex, lemma-level task, which is usually operationalized based on two subsequently applied usage-level tasks: First, Word-in-Context (WiC) labels are derived for pairs of usages. Then, these labels are represented in a graph on which Word Sense Induction (WSI) is applied to derive sense clusters. Finally, LSCD labels are derived by comparing sense clusters over time. This modularity is reflected in most LSCD datasets and models. It also leads to a large heterogeneity in modeling options and task definitions, which is exacerbated by a variety of dataset versions, preprocessing options and evaluation metrics. This heterogeneity makes it difficult to evaluate models under comparable conditions, to choose optimal model combinations or to reproduce results. Hence, we provide a benchmark repository standardizing LSCD evaluation. Through transparent implementation results become easily reproducible and by standardization different components can be freely combined. The repository reflects the task's modularity by allowing model evaluation for WiC, WSI and LSCD. This allows for careful evaluation of increasingly complex model components providing new ways of model optimization. We use the implemented benchmark to conduct a number of experiments with recent models and systematically improve the state-of-the-art.
♻ ☆ GradEscape: A Gradient-Based Evader Against AI-Generated Text Detectors USENIX Security'25
In this paper, we introduce GradEscape, the first gradient-based evader designed to attack AI-generated text (AIGT) detectors. GradEscape overcomes the undifferentiable computation problem, caused by the discrete nature of text, by introducing a novel approach to construct weighted embeddings for the detector input. It then updates the evader model parameters using feedback from victim detectors, achieving high attack success with minimal text modification. To address the issue of tokenizer mismatch between the evader and the detector, we introduce a warm-started evader method, enabling GradEscape to adapt to detectors across any language model architecture. Moreover, we employ novel tokenizer inference and model extraction techniques, facilitating effective evasion even in query-only access. We evaluate GradEscape on four datasets and three widely-used language models, benchmarking it against four state-of-the-art AIGT evaders. Experimental results demonstrate that GradEscape outperforms existing evaders in various scenarios, including with an 11B paraphrase model, while utilizing only 139M parameters. We have successfully applied GradEscape to two real-world commercial AIGT detectors. Our analysis reveals that the primary vulnerability stems from disparity in text expression styles within the training data. We also propose a potential defense strategy to mitigate the threat of AIGT evaders. We open-source our GradEscape for developing more robust AIGT detectors.
comment: Accepted by USENIX Security'25; Update badges and Artifact Appendix
♻ ☆ VC4VG: Optimizing Video Captions for Text-to-Video Generation EMNLP 2025
Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/alimama-creative/VC4VG to support further research.
comment: Accepted by EMNLP 2025
♻ ☆ Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text NeurIPS 2025
The increasing capabilities of Large Language Models (LLMs) have raised concerns about their misuse in AI-generated plagiarism and social engineering. While various AI-generated text detectors have been proposed to mitigate these risks, many remain vulnerable to simple evasion techniques such as paraphrasing. However, recent detectors have shown greater robustness against such basic attacks. In this work, we introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection. Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, compared to simple paraphrasing attack--which, ironically, increases the true positive at 1% false positive (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT--adversarial paraphrasing, guided by OpenAI-RoBERTa-Large, reduces T@1%F by 64.49% on RADAR and a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors--including neural network-based, watermark-based, and zero-shot approaches--our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large. We also analyze the tradeoff between text quality and attack success to find that our method can significantly reduce detection rates, with mostly a slight degradation in text quality. Our adversarial setup highlights the need for more robust and resilient detection strategies in the light of increasingly sophisticated evasion techniques.
comment: NeurIPS 2025
♻ ☆ Latent Chain-of-Thought for Visual Reasoning NeurIPS 2025
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
comment: NeurIPS 2025
♻ ☆ AI Debate Aids Assessment of Controversial Claims
As AI grows more powerful, it will increasingly shape how we understand the world. But with this influence comes the risk of amplifying misinformation and deepening social divides-especially on consequential topics where factual accuracy directly impacts well-being. Scalable Oversight aims to ensure AI systems remain truthful even when their capabilities exceed those of their evaluators. Yet when humans serve as evaluators, their own beliefs and biases can impair judgment. We study whether AI debate can guide biased judges toward the truth by having two AI systems debate opposing sides of controversial factuality claims on COVID-19 and climate change where people hold strong prior beliefs. We conduct two studies. Study I recruits human judges with either mainstream or skeptical beliefs who evaluate claims through two protocols: debate (interaction with two AI advisors arguing opposing sides) or consultancy (interaction with a single AI advisor). Study II uses AI judges with and without human-like personas to evaluate the same protocols. In Study I, debate consistently improves human judgment accuracy and confidence calibration, outperforming consultancy by 4-10% across COVID-19 and climate change claims. The improvement is most significant for judges with mainstream beliefs (up to +15.2% accuracy on COVID-19 claims), though debate also helps skeptical judges who initially misjudge claims move toward accurate views (+4.7% accuracy). In Study II, AI judges with human-like personas achieve even higher accuracy (78.5%) than human judges (70.1%) and default AI judges without personas (69.8%), suggesting their potential for supervising frontier AI models. These findings highlight AI debate as a promising path toward scalable, bias-resilient oversight in contested domains.
♻ ☆ Fuzzy, Symbolic, and Contextual: Enhancing LLM Instruction via Cognitive Scaffolding
We study how prompt-level inductive biases influence the cognitive behavior of large language models (LLMs) in instructional dialogue. We introduce a symbolic scaffolding method paired with a short-term memory schema designed to promote adaptive, structured reasoning in Socratic tutoring. Using controlled ablation across five system variants, we evaluate model outputs via expert-designed rubrics covering scaffolding, responsiveness, symbolic reasoning, and conversational memory. We present preliminary results using an LLM-based evaluation framework aligned to a cognitively grounded rubric. This enables scalable, systematic comparisons across architectural variants in early-stage experimentation. The preliminary results show that our full system consistently outperforms baseline variants. Analysis reveals that removing memory or symbolic structure degrades key cognitive behaviors, including abstraction, adaptive probing, and conceptual continuity. These findings support a processing-level account in which prompt-level cognitive scaffolds can reliably shape emergent instructional strategies in LLMs.
Information Retrieval 22
☆ Retrieval-Augmented Search for Large-Scale Map Collections with ColPali
Multimodal approaches have shown great promise for searching and navigating digital collections held by libraries, archives, and museums. In this paper, we introduce map-RAS: a retrieval-augmented search system for historic maps. In addition to introducing our framework, we detail our publicly-hosted demo for searching 101,233 map images held by the Library of Congress. With our system, users can multimodally query the map collection via ColPali, summarize search results using Llama 3.2, and upload their own collections to perform inter-collection search. We articulate potential use cases for archivists, curators, and end-users, as well as future work with our system in both machine learning and the digital humanities. Our demo can be viewed at: http://www.mapras.com.
comment: 5 pages, 5 figures
☆ MMQ-v2: Align, Denoise, and Amplify: Adaptive Behavior Mining for Semantic IDs Learning in Recommendation
Industrial recommender systems rely on unique Item Identifiers (ItemIDs). However, this method struggles with scalability and generalization in large, dynamic datasets that have sparse long-tail data.Content-based Semantic IDs (SIDs) address this by sharing knowledge through content quantization. However, by ignoring dynamic behavioral properties, purely content-based SIDs have limited expressive power. Existing methods attempt to incorporate behavioral information but overlook a critical distinction: unlike relatively uniform content features, user-item interactions are highly skewed and diverse, creating a vast information gap in quality and quantity between popular and long-tail items. This oversight leads to two critical limitations: (1) Noise Corruption: Indiscriminate behavior-content alignment allows collaborative noise from long-tail items to corrupt their content representations, leading to the loss of critical multimodal information. (2)Signal Obscurity: The equal-weighting scheme for SIDs fails to reflect the varying importance of different behavioral signals, making it difficult for downstream tasks to distinguish important SIDs from uninformative ones. To tackle these issues, we propose a mixture-of-quantization framework, MMQ-v2, to adaptively Align, Denoise, and Amplify multimodal information from content and behavior modalities for semantic IDs learning. The semantic IDs generated by this framework named ADA-SID. It introduces two innovations: an adaptive behavior-content alignment that is aware of information richness to shield representations from noise, and a dynamic behavioral router to amplify critical signals by applying different weights to SIDs. Extensive experiments on public and large-scale industrial datasets demonstrate ADA-SID's significant superiority in both generative and discriminative recommendation tasks.
☆ FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering
The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high-stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the Persian-speaking Muslim community, where accuracy and trustworthiness are paramount. Existing Retrieval-Augmented Generation (RAG) systems, relying on simplistic single-pass pipelines, fall short on complex, multi-hop queries requiring multi-step reasoning and evidence aggregation. To address this gap, we introduce FARSIQA, a novel, end-to-end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG. FAIR-RAG employs a dynamic, self-correcting process: it adaptively decomposes complex queries, assesses evidence sufficiency, and enters an iterative loop to generate sub-queries, progressively filling information gaps. Operating on a curated knowledge base of over one million authoritative Islamic documents, FARSIQA demonstrates superior performance. Rigorous evaluation on the challenging IslamicPCQA benchmark shows state-of-the-art performance: the system achieves a remarkable 97.0% in Negative Rejection - a 40-point improvement over baselines - and a high Answer Correctness score of 74.3%. Our work establishes a new standard for Persian Islamic QA and validates that our iterative, adaptive architecture is crucial for building faithful, reliable AI systems in sensitive domains.
comment: 37 pages, 5 figures, 10 tables. Keywords: Retrieval-Augmented Generation (RAG), Question Answering (QA), Islamic Knowledge Base, Faithful AI, Persian NLP, Multi-hop Reasoning, Large Language Models (LLMs)
☆ Generalized Pseudo-Relevance Feedback
Query rewriting is a fundamental technique in information retrieval (IR). It typically employs the retrieval result as relevance feedback to refine the query and thereby addresses the vocabulary mismatch between user queries and relevant documents. Traditional pseudo-relevance feedback (PRF) and its vector-based extension (VPRF) improve retrieval performance by leveraging top-retrieved documents as relevance feedback. However, they are constructed based on two major hypotheses: the relevance assumption (top documents are relevant) and the model assumption (rewriting methods need to be designed specifically for particular model architectures). While recent large language models (LLMs)-based generative relevance feedback (GRF) enables model-free query reformulation, it either suffers from severe LLM hallucination or, again, relies on the relevance assumption to guarantee the effectiveness of rewriting quality. To overcome these limitations, we introduce an assumption-relaxed framework: \textit{Generalized Pseudo Relevance Feedback} (GPRF), which performs model-free, natural language rewriting based on retrieved documents, not only eliminating the model assumption but also reducing dependence on the relevance assumption. Specifically, we design a utility-oriented training pipeline with reinforcement learning to ensure robustness against noisy feedback. Extensive experiments across multiple benchmarks and retrievers demonstrate that GPRF consistently outperforms strong baselines, establishing it as an effective and generalizable framework for query rewriting.
☆ Alibaba International E-commerce Product Search Competition DcuRAGONs Team Technical Report CIKM 2025
This report details our methodology and results developed for the Multilingual E-commerce Search Competition. The problem aims to recognize relevance between user queries versus product items in a multilingual context and improve recommendation performance on e-commerce platforms. Utilizing Large Language Models (LLMs) and their capabilities in other tasks, our data-centric method achieved the highest score compared to other solutions during the competition. Final leaderboard is publised at https://alibaba-international-cikm2025.github.io. The source code for our project is published at https://github.com/nhtlongcs/e-commerce-product-search.
comment: Alibaba International E-commerce Product Search Competition @ CIKM 2025
☆ Towards Automated Quality Assurance of Patent Specifications: A Multi-Dimensional LLM Framework
Despite the surge in patent applications and emergence of AI drafting tools, systematic evaluation of patent content quality has received limited research attention. To address this gap, We propose to evaluate patents using regulatory compliance, technical coherence, and figure-reference consistency detection modules, and then generate improvement suggestions via an integration module. The framework is validated on a comprehensive dataset comprising 80 human-authored and 80 AI-generated patents from two patent drafting tools. Experimental results show balanced accuracies of 99.74\%, 82.12\%, and 91.2\% respectively across the three detection modules when validated against expert annotations. Additional analysis was conducted to examine defect distributions across patent sections, technical domains, and authoring sources. Section-based analysis indicates that figure-text consistency and technical detail precision require particular attention. Mechanical Engineering and Construction show more claim-specification inconsistencies due to complex technical documentation requirements. AI-generated patents show a significant gap compared to human-authored ones. While human-authored patents primarily contain surface-level errors like typos, AI-generated patents exhibit more structural defects in figure-text alignment and cross-references.
☆ Revisiting scalable sequential recommendation with Multi-Embedding Approach and Mixture-of-Experts
In recommendation systems, how to effectively scale up recommendation models has been an essential research topic. While significant progress has been made in developing advanced and scalable architectures for sequential recommendation(SR) models, there are still challenges due to items' multi-faceted characteristics and dynamic item relevance in the user context. To address these issues, we propose Fuxi-MME, a framework that integrates a multi-embedding strategy with a Mixture-of-Experts (MoE) architecture. Specifically, to efficiently capture diverse item characteristics in a decoupled manner, we decompose the conventional single embedding matrix into several lower-dimensional embedding matrices. Additionally, by substituting relevant parameters in the Fuxi Block with an MoE layer, our model achieves adaptive and specialized transformation of the enriched representations. Empirical results on public datasets show that our proposed framework outperforms several competitive baselines.
☆ Measuring the Research Output and Performance of the University of Ibadan from 2014 to 2023: A Scientometric Analysis
This study employs scientometric methods to assess the research output and performance of the University of Ibadan from 2014 to 2023. By analyzing publication trends, citation patterns, and collaboration networks, the research aims to comprehensively evaluate the university's research productivity, impact, and disciplinary focus. This article's endeavors are characterized by innovation, interdisciplinary collaboration, and commitment to excellence, making the University of Ibadan a significant hub for cutting-edge research in Nigeria and beyond. The goal of the current study is to ascertain the influence of the university's research output and publication patterns between 2014 and 2023. The study focuses on the departments at the University of Ibadan that contribute the most, the best journals for publishing, the nations that collaborate, the impact of citations both locally and globally, well-known authors and their total production, and the research output broken down by year. According to the university's ten-year publication data, 7159 papers with an h-index of 75 were published between 2014 and 2023, garnering 218572 citations. Furthermore, the VOSviewer software mapping approach is used to illustrate the stenographical mapping of data through graphs. The findings of this study will contribute to understanding the university's research strengths, weaknesses, and potential areas for improvement. Additionally, the results will inform evidence-based decision-making for enhancing research strategies and policies at the University of Ibadan.
comment: 16 pages, 5 figures, Research Paper
☆ TV-Rec: Time-Variant Convolutional Filter for Sequential Recommendation NeurIPS 2025
Recently, convolutional filters have been increasingly adopted in sequential recommendation for their ability to capture local sequential patterns. However, most of these models complement convolutional filters with self-attention. This is because convolutional filters alone, generally fixed filters, struggle to capture global interactions necessary for accurate recommendation. We propose Time-Variant Convolutional Filters for Sequential Recommendation (TV-Rec), a model inspired by graph signal processing, where time-variant graph filters capture position-dependent temporal variations in user sequences. By replacing both fixed kernels and self-attention with time-variant filters, TV-Rec achieves higher expressive power and better captures complex interaction patterns in user behavior. This design not only eliminates the need for self-attention but also reduces computation while accelerating inference. Extensive experiments on six public benchmarks show that TV-Rec outperforms state-of-the-art baselines by an average of 7.49%.
comment: The 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ GReF: A Unified Generative Framework for Efficient Reranking via Ordered Multi-token Prediction CIKM 2025
In a multi-stage recommendation system, reranking plays a crucial role in modeling intra-list correlations among items. A key challenge lies in exploring optimal sequences within the combinatorial space of permutations. Recent research follows a two-stage (generator-evaluator) paradigm, where a generator produces multiple feasible sequences, and an evaluator selects the best one. In practice, the generator is typically implemented as an autoregressive model. However, these two-stage methods face two main challenges. First, the separation of the generator and evaluator hinders end-to-end training. Second, autoregressive generators suffer from inference efficiency. In this work, we propose a Unified Generative Efficient Reranking Framework (GReF) to address the two primary challenges. Specifically, we introduce Gen-Reranker, an autoregressive generator featuring a bidirectional encoder and a dynamic autoregressive decoder to generate causal reranking sequences. Subsequently, we pre-train Gen-Reranker on the item exposure order for high-quality parameter initialization. To eliminate the need for the evaluator while integrating sequence-level evaluation during training for end-to-end optimization, we propose post-training the model through Rerank-DPO. Moreover, for efficient autoregressive inference, we introduce ordered multi-token prediction (OMTP), which trains Gen-Reranker to simultaneously generate multiple future items while preserving their order, ensuring practical deployment in real-time recommender systems. Extensive offline experiments demonstrate that GReF outperforms state-of-the-art reranking methods while achieving latency that is nearly comparable to non-autoregressive models. Additionally, GReF has also been deployed in a real-world video app Kuaishou with over 300 million daily active users, significantly improving online recommendation quality.
comment: Accepted by CIKM 2025
☆ Model-Document Protocol for AI Search
AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.
comment: 10 pages
☆ Continual Low-Rank Adapters for LLM-based Generative Recommender Systems
While large language models (LLMs) achieve strong performance in recommendation, they face challenges in continual learning as users, items, and user preferences evolve over time. Existing LoRA-based continual methods primarily focus on preserving performance on previous tasks, but this overlooks the unique nature of recommendation: the goal is not to predict past preferences, and outdated preferences can even harm performance when current interests shift significantly. To address this, we propose PESO (Proximally rEgularized Single evolving lOra, a continual adaptation method for LoRA in recommendation. PESO introduces a proximal regularizer that anchors the current adapter to its most recent frozen state, enabling the model to flexibly balance adaptation and preservation, and to better capture recent user behaviors. Theoretically, we show that this proximal design provides data-aware, direction-wise guidance in the LoRA subspace. Empirically, PESO consistently outperforms existing LoRA-based continual learning methods.
☆ The Quest for Reliable Metrics of Responsible AI
The development of Artificial Intelligence (AI), including AI in Science (AIS), should be done following the principles of responsible AI. Progress in responsible AI is often quantified through evaluation metrics, yet there has been less work on assessing the robustness and reliability of the metrics themselves. We reflect on prior work that examines the robustness of fairness metrics for recommender systems as a type of AI application and summarise their key takeaways into a set of non-exhaustive guidelines for developing reliable metrics of responsible AI. Our guidelines apply to a broad spectrum of AI applications, including AIS.
comment: Accepted for presentation at the AI in Science Summit 2025
☆ DGAI: Decoupled On-Disk Graph-Based ANN Index for Efficient Updates and Queries
On-disk graph-based indexes are widely used in approximate nearest neighbor (ANN) search systems for large-scale, high-dimensional vectors. However, traditional coupled storage methods, which store vectors within the index, are inefficient for index updates. Coupled storage incurs excessive redundant vector reads and writes when updating the graph topology, leading to significant invalid I/O. To address this issue, we propose a decoupled storage architecture. While a decoupled architecture reduces query performance. To overcome this limitation, we design two tailored strategies: (i) a three-stage query mechanism that leverages multiple PQ compressed vectors to filter invalid I/O and computations, and (ii) an incremental page-level topological reordering strategy that incrementally inserts new nodes into pages containing their most similar neighbors to mitigate read amplification. Together, these techniques substantially reduce both I/O and computational overhead during ANN search. Experimental results show that the decoupled architecture improves update speed by 10.05x for insertions and 6.89x for deletions, while the three-stage query and incremental reordering enhance query efficiency by 2.66x compared to the traditional coupled architecture.
comment: 12 pages
☆ Forecasting Occupational Survivability of Rickshaw Pullers in a Changing Climate with Wearable Data
Cycle rickshaw pullers are highly vulnerable to extreme heat, yet little is known about how their physiological biomarkers respond under such conditions. This study collected real-time weather and physiological data using wearable sensors from 100 rickshaw pullers in Dhaka, Bangladesh. In addition, interviews with 12 pullers explored their knowledge, perceptions, and experiences related to climate change. We developed a Linear Gaussian Bayesian Network (LGBN) regression model to predict key physiological biomarkers based on activity, weather, and demographic features. The model achieved normalized mean absolute error values of 0.82, 0.47, 0.65, and 0.67 for skin temperature, relative cardiac cost, skin conductance response, and skin conductance level, respectively. Using projections from 18 CMIP6 climate models, we layered the LGBN on future climate forecasts to analyze survivability for current (2023-2025) and future years (2026-2100). Based on thresholds of WBGT above 31.1{\deg}C and skin temperature above 35{\deg}C, 32% of rickshaw pullers already face high heat exposure risk. By 2026-2030, this percentage may rise to 37% with average exposure lasting nearly 12 minutes, or about two-thirds of the trip duration. A thematic analysis of interviews complements these findings, showing that rickshaw pullers recognize their increasing climate vulnerability and express concern about its effects on health and occupational survivability.
comment: This is a preprint version of a manuscript accepted and to be published in the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)
☆ LookSync: Large-Scale Visual Product Search System for AI-Generated Fashion Looks KDD
Generative AI is reshaping fashion by enabling virtual looks and avatars making it essential to find real products that best match AI-generated styles. We propose an end-to-end product search system that has been deployed in a real-world, internet scale which ensures that AI-generated looks presented to users are matched with the most visually and semantically similar products from the indexed vector space. The search pipeline is composed of four key components: query generation, vectorization, candidate retrieval, and reranking based on AI-generated looks. Recommendation quality is evaluated using human-judged accuracy scores. The system currently serves more than 350,000 AI Looks in production per day, covering diverse product categories across global markets of over 12 million products. In our experiments, we observed that across multiple annotators and categories, CLIP outperformed alternative models by a small relative margin of 3--7\% in mean opinion scores. These improvements, though modest in absolute numbers, resulted in noticeably better user perception matches, establishing CLIP as the most reliable backbone for production deployment.
comment: 4 pages, 5 figures. Accepted at the International Conference on Data Science (IKDD CODS 2025), Demonstration Track. Demo video: https://youtu.be/DZdlWmTUwjc
♻ ☆ HyMiRec: A Hybrid Multi-interest Learning Framework for LLM-based Sequential Recommendation
Large language models (LLMs) have recently demonstrated strong potential for sequential recommendation. However, current LLM-based approaches face critical limitations in modeling users' long-term and diverse interests. First, due to inference latency and feature fetching bandwidth constraints, existing methods typically truncate user behavior sequences to include only the most recent interactions, resulting in the loss of valuable long-range preference signals. Second, most current methods rely on next-item prediction with a single predicted embedding, overlooking the multifaceted nature of user interests and limiting recommendation diversity. To address these challenges, we propose HyMiRec, a hybrid multi-interest sequential recommendation framework, which leverages a lightweight recommender to extracts coarse interest embeddings from long user sequences and an LLM-based recommender to captures refined interest embeddings. To alleviate the overhead of fetching features, we introduce a residual codebook based on cosine similarity, enabling efficient compression and reuse of user history embeddings. To model the diverse preferences of users, we design a disentangled multi-interest learning module, which leverages multiple interest queries to learn disentangles multiple interest signals adaptively, allowing the model to capture different facets of user intent. Extensive experiments are conducted on both benchmark datasets and a collected industrial dataset, demonstrating our effectiveness over existing state-of-the-art methods. Furthermore, online A/B testing shows that HyMiRec brings consistent improvements in real-world recommendation systems. Code is available at https://github.com/FireRedTeam/FireRedSeqRec.
♻ ☆ Who You Are Matters: Bridging Topics and Social Roles via LLM-Enhanced Logical Recommendation NeurIPS 2025
Recommender systems filter contents/items valuable to users by inferring preferences from user features and historical behaviors. Mainstream approaches follow the learning-to-rank paradigm, which focus on discovering and modeling item topics (e.g., categories), and capturing user preferences on these topics based on historical interactions. However, this paradigm often neglects the modeling of user characteristics and their social roles, which are logical confounders influencing the correlated interest and user preference transition. To bridge this gap, we introduce the user role identification task and the behavioral logic modeling task that aim to explicitly model user roles and learn the logical relations between item topics and user social roles. We show that it is possible to explicitly solve these tasks through an efficient integration framework of Large Language Model (LLM) and recommendation systems, for which we propose TagCF. On the one hand, TagCF exploits the (Multi-modal) LLM's world knowledge and logic inference ability to extract realistic tag-based virtual logic graphs that reveal dynamic and expressive knowledge of users, refining our understanding of user behaviors. On the other hand, TagCF presents empirically effective integration modules that take advantage of the extracted tag-logic information, augmenting the recommendation performance. We conduct both online experiments and offline experiments with industrial and public datasets as verification of TagCF's effectiveness, and we empirically show that the user role modeling strategy is potentially a better choice than the modeling of item topics. Additionally, we provide evidence that the extracted logic graphs are empirically a general and transferable knowledge that can benefit a wide range of recommendation tasks. Our code is available in https://github.com/Code2Q/TagCF.
comment: to be published in NeurIPS 2025
♻ ☆ Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation NeurIPS 2025
In recent years, integrating large language models (LLMs) into recommender systems has created new opportunities for improving recommendation quality. However, a comprehensive benchmark is needed to thoroughly evaluate and compare the recommendation capabilities of LLMs with traditional recommender systems. In this paper, we introduce RecBench, which systematically investigates various item representation forms (including unique identifier, text, semantic embedding, and semantic identifier) and evaluates two primary recommendation tasks, i.e., click-through rate prediction (CTR) and sequential recommendation (SeqRec). Our extensive experiments cover up to 17 large models and are conducted across five diverse datasets from fashion, news, video, books, and music domains. Our findings indicate that LLM-based recommenders outperform conventional recommenders, achieving up to a 5% AUC improvement in the CTR scenario and up to a 170% NDCG@10 improvement in the SeqRec scenario. However, these substantial performance gains come at the expense of significantly reduced inference efficiency, rendering the LLM-as-RS paradigm impractical for real-time recommendation environments. We aim for our findings to inspire future research, including recommendation-specific model acceleration methods. We will release our code, data, configurations, and platform to enable other researchers to reproduce and build upon our experimental results.
comment: NeurIPS 2025 DB Track Accepted Paper
♻ ☆ Capturing User Interests from Data Streams for Continual Sequential Recommendation WSDM'26
Transformer-based sequential recommendation (SR) models excel at modeling long-range dependencies in user behavior via self-attention. However, updating them with continuously arriving behavior sequences incurs high computational costs or leads to catastrophic forgetting. Although continual learning, a standard approach for non-stationary data streams, has recently been applied to recommendation, existing methods gradually forget long-term user preferences and remain underexplored in SR. In this paper, we introduce Continual Sequential Transformer for Recommendation (CSTRec). CSTRec is designed to effectively adapt to current interests by leveraging well-preserved historical ones, thus capturing the trajectory of user interests over time. The core of CSTRec is Continual Sequential Attention (CSA), a linear attention tailored for continual SR, which enables CSTRec to partially retain historical knowledge without direct access to prior data. CSA has two key components: (1) Cauchy-Schwarz Normalization that stabilizes learning over time under uneven user interaction frequencies; (2) Collaborative Interest Enrichment that alleviates forgetting through shared, learnable interest pools. In addition, we introduce a new technique to facilitate the adaptation of new users by transferring historical knowledge from existing users with similar interests. Extensive experiments on three real-world datasets show that CSTRec outperforms state-of-the-art models in both knowledge retention and acquisition.
comment: WSDM'26
♻ ☆ Can We Hide Machines in the Crowd? Quantifying Equivalence in LLM-in-the-loop Annotation Tasks SIGIR
Many evaluations of large language models (LLMs) in text annotation focus primarily on the correctness of the output, typically comparing model-generated labels to human-annotated ``ground truth'' using standard performance metrics. In contrast, our study moves beyond effectiveness alone. We aim to explore how labeling decisions -- by both humans and LLMs -- can be statistically evaluated across individuals. Rather than treating LLMs purely as annotation systems, we approach LLMs as an alternative annotation mechanism that may be capable of mimicking the subjective judgments made by humans. To assess this, we develop a statistical evaluation method based on Krippendorff's $\alpha$, paired bootstrapping, and the Two One-Sided t-Tests (TOST) equivalence test procedure. This evaluation method tests whether an LLM can blend into a group of human annotators without being distinguishable. We apply this approach to two datasets -- MovieLens 100K and PolitiFact -- and find that the LLM is statistically indistinguishable from a human annotator in the former ($p = 0.004$), but not in the latter ($p = 0.155$), highlighting task-dependent differences. It also enables early evaluation on a small sample of human data to inform whether LLMs are suitable for large-scale annotation in a given application.
comment: Accepted at SIGIR-AP 2025
♻ ☆ Large Language Models for Few-Shot Named Entity Recognition
Named entity recognition (NER) is a fundamental task in numerous downstream applications. Recently, researchers have employed pre-trained language models (PLMs) and large language models (LLMs) to address this task. However, fully leveraging the capabilities of PLMs and LLMs with minimal human effort remains challenging. In this paper, we propose GPT4NER, a method that prompts LLMs to resolve the few-shot NER task. GPT4NER constructs effective prompts using three key components: entity definition, few-shot examples, and chain-of-thought. By prompting LLMs with these effective prompts, GPT4NER transforms few-shot NER, which is traditionally considered as a sequence-labeling problem, into a sequence-generation problem. We conduct experiments on two benchmark datasets, CoNLL2003 and OntoNotes5.0, and compare the performance of GPT4NER to representative state-of-the-art models in both few-shot and fully supervised settings. Experimental results demonstrate that GPT4NER achieves the $F_1$ of 83.15\% on CoNLL2003 and 70.37\% on OntoNotes5.0, significantly outperforming few-shot baselines by an average margin of 7 points. Compared to fully-supervised baselines, GPT4NER achieves 87.9\% of their best performance on CoNLL2003 and 76.4\% of their best performance on OntoNotes5.0. We also utilize a relaxed-match metric for evaluation and report performance in the sub-task of named entity extraction (NEE), and experiments demonstrate their usefulness to help better understand model behaviors in the NER task.
comment: 17 pages, 2 figures. Accepted by AI, Computer Science and Robotics Technology (ACRT)
Multimedia 8
☆ ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents
Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses the page by index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule-based cross-level reward that provides both turn- and token-level signals. To address the empirically observed training instability caused by numerous visual tokens from long documents, we further propose a visual-semantic anchoring mechanism that applies a dual-path KL-divergence constraint to stabilize visual and textual representations separately during training. Trained on a corpus constructed from three open-source datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks. Overall, ALDEN marks a step beyond passive document reading toward agents that autonomously navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long-document understanding.
☆ PureKV: Plug-and-Play KV Cache Optimization with Spatial-Temporal Sparse Attention for Vision-Language Large Models
Vision-Language Large Models (VLLMs) face significant efficiency challenges when processing high-resolution inputs. The quadratic complexity in attention and autoregressive generation, as well as the constantly growing key value (KV) cache size, severely hinder the prefilling and decoding stages. Recent efforts have attempted to compress KV cache by identifying and pruning KV cache of less important tokens, but these methods typically rely on attention scores to estimate token importance, making them incompatible with efficient attention mechanisms such as FlashAttention and Sparse Attention, which do not explicitly compute attention matrices. Moreover, existing methods overlook how sparse attention, while accelerating the prefilling stage, alters the information structure of the KV cache, thereby compromising the effectiveness of downstream KV cache compression strategies. To address this issue, we propose PureKV, a plug-and-play framework for joint optimization of sparse attention and KV cache compression. We first introduce a KV cache compression strategy that is fully compatible with efficient attention accelerators. Our method utilizes lower layer attention scores to estimate the importance of high layers' KV cache, enabling active pruning without compromising accuracy. In addition, we have designed a Spatial-Temporal Sparse Attention (ST-SpAttn) module specifically tailored for video KV cache compression algorithms. This module combines spatial and temporal attention sparsity to improve the compression efficiency of KV cache optimization algorithms by purifying spatial noise and temporal redundancy in KV cache. At the same time, ST-SpAttn also accelerated the prefilling stage of VLLMs. Extensive experiments on VLLMs (VideoLLaMA2, Qwen2.5-VL) have shown that PureKV achieves 5.0 times KV cache compression and 3.16 times prefill acceleration, with negligible quality degradation.
☆ Energy consumption assessment of a Virtual Reality Remote Rendering application over 5G networks
This paper investigates the energy implications of remote rendering for Virtual Reality (VR) applications within a real 5G testbed. Remote rendering enables lightweight devices to access high-performance graphical content by offloading computationally intensive tasks to Cloud-native Network Functions (CNFs) running on remote servers. However, this approach raises concerns regarding energy consumption across the various network components involved, including the remote computing node, the 5G Core, the Radio Access Network (RAN), and the User Equipment (UE). This work proposes and evaluates two complementary energy monitoring solutions, one hardware-based and one software-based, to measure energy consumption at different system levels. A VR remote renderer, deployed as CNF and leveraging the Media over QUIC (MoQ) protocol, is used as test case for assessing its energy footprint under different multimedia and network configurations. The results provide critical insights into the trade-off between energy consumption and performance of a real-world VR application running in a 5G environment.
☆ Hallucination Localization in Video Captioning
We propose a novel task, hallucination localization in video captioning, which aims to identify hallucinations in video captions at the span level (i.e. individual words or phrases). This allows for a more detailed analysis of hallucinations compared to existing sentence-level hallucination detection task. To establish a benchmark for hallucination localization, we construct HLVC-Dataset, a carefully curated dataset created by manually annotating 1,167 video-caption pairs from VideoLLM-generated captions. We further implement a VideoLLM-based baseline method and conduct quantitative and qualitative evaluations to benchmark current performance on hallucination localization.
comment: under review
☆ Performance Evaluation of Multimedia Traffic in Cloud Storage Services over Wi-Fi and LTE Networks
The performance of Dropbox, Google Drive, and OneDrive cloud storage services was evaluated under Wi-Fi and LTE network conditions during multimedia file uploads. Traffic was captured using Wireshark, and key metrics (including delay, jitter, bandwidth, and packet loss) were analyzed. Google Drive maintained the most consistent performance across both types of networks, showing low latency and reduced jitter. Dropbox showed efficient bandwidth utilization, but experienced a longer delay over LTE, attributed to a greater number of intermediate hops. OneDrive presented variable behavior, with elevated packet rates and increased sensitivity to fluctuations in the mobile network. A bimodal distribution of packet sizes was observed and modeled using a dual Poisson function. In general, Wi-Fi connections provided greater stability for multimedia transfers, while LTE performance varied depending on platform-specific implementations. The results contribute to a better understanding of traffic behavior in cloud-based storage applications and suggest further analysis with larger datasets and heterogeneous access networks.
comment: 2025 20th Iberian Conference on Information Systems and Technologies (CISTI), Lecture Notes in Networks and Systems
☆ Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
Audio-visual speech enhancement (AVSE) is a task that uses visual auxiliary information to extract a target speaker's speech from mixed audio. In real-world scenarios, there often exist complex acoustic environments, accompanied by various interfering sounds and reverberation. Most previous methods struggle to cope with such complex conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a "separation before dereverberation" pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in multimodal complex environments. We validated the performance of our system in AVSEC-4: we achieved excellent results in the three objective metrics on the competition leaderboard, and ultimately secured first place in the human subjective listening test.
♻ ☆ Artificial Neural Networks Trained on Noisy Speech Exhibit the McGurk Effect
Humans are able to fuse information from both auditory and visual modalities to help with understanding speech. This is demonstrated through a phenomenon known as the McGurk Effect, during which a listener is presented with incongruent auditory and visual speech that fuse together into the percept of illusory intermediate phonemes. Building on a recent framework that proposes how to address developmental 'why' questions using artificial neural networks, we evaluated a set of recent artificial neural networks trained on audiovisual speech by testing them with audiovisually incongruent words designed to elicit the McGurk effect. We show that networks trained entirely on congruent audiovisual speech nevertheless exhibit the McGurk percept. We further investigated 'why' by comparing networks trained on clean speech to those trained on noisy speech, and discovered that training with noisy speech led to a pronounced increase in both visual responses and McGurk responses across all models. Furthermore, we observed that systematically increasing the level of auditory noise during ANN training also increased the amount of audiovisual integration up to a point, but at extreme noise levels, this integration failed to develop. These results suggest that excessive noise exposure during critical periods of audiovisual learning may negatively influence the development of audiovisual speech integration. This work also demonstrates that the McGurk effect reliably emerges untrained from the behaviour of both supervised and unsupervised networks, even networks trained only on congruent speech. This supports the notion that artificial neural networks might be useful models for certain aspects of perception and cognition.
♻ ☆ XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark
Document Reading Order Recovery is a fundamental task in document image understanding, playing a pivotal role in enhancing Retrieval-Augmented Generation (RAG) and serving as a critical preprocessing step for large language models (LLMs). Existing methods often struggle with complex layouts(e.g., multi-column newspapers), high-overhead interactions between cross-modal elements (visual regions and textual semantics), and a lack of robust evaluation benchmarks. We introduce XY-Cut++, an advanced layout ordering method that integrates pre-mask processing, multi-granularity segmentation, and cross-modal matching to address these challenges. Our method significantly enhances layout ordering accuracy compared to traditional XY-Cut techniques. Specifically, XY-Cut++ achieves state-of-the-art performance (98.8 BLEU overall) while maintaining simplicity and efficiency. It outperforms existing baselines by up to 24\% and demonstrates consistent accuracy across simple and complex layouts on the newly introduced DocBench-100 dataset. This advancement establishes a reliable foundation for document structure recovery, setting a new standard for layout ordering tasks and facilitating more effective RAG and LLM preprocessing.
Image and Video Processing 14
☆ Physics-Guided Conditional Diffusion Networks for Microwave Image Reconstruction
A conditional latent-diffusion based framework for solving the electromagnetic inverse scattering problem associated with microwave imaging is introduced. This generative machine-learning model explicitly mirrors the non-uniqueness of the ill-posed inverse problem. Unlike existing inverse solvers utilizing deterministic machine learning techniques that produce a single reconstruction, the proposed latent-diffusion model generates multiple plausible permittivity maps conditioned on measured scattered-field data, thereby generating several potential instances in the range-space of the non-unique inverse mapping. A forward electromagnetic solver is integrated into the reconstruction pipeline as a physics-based evaluation mechanism. The space of candidate reconstructions form a distribution of possibilities consistent with the conditioning data and the member of this space yielding the lowest scattered-field data discrepancy between the predicted and measured scattered fields is reported as the final solution. Synthetic and experimental labeled datasets are used for training and evaluation of the model. An innovative labeled synthetic dataset is created that exemplifies a varied set of scattering features. Training of the model using this new dataset produces high quality permittivity reconstructions achieving improved generalization with excellent fidelity to shape recognition. The results highlight the potential of hybrid generative physics frameworks as a promising direction for robust, data-driven microwave imaging.
☆ Improving Temporal Consistency and Fidelity at Inference-time in Perceptual Video Restoration by Zero-shot Image-based Diffusion Models
Diffusion models have emerged as powerful priors for single-image restoration, but their application to zero-shot video restoration suffers from temporal inconsistencies due to the stochastic nature of sampling and complexity of incorporating explicit temporal modeling. In this work, we address the challenge of improving temporal coherence in video restoration using zero-shot image-based diffusion models without retraining or modifying their architecture. We propose two complementary inference-time strategies: (1) Perceptual Straightening Guidance (PSG) based on the neuroscience-inspired perceptual straightening hypothesis, which steers the diffusion denoising process towards smoother temporal evolution by incorporating a curvature penalty in a perceptual space to improve temporal perceptual scores, such as Fr\'echet Video Distance (FVD) and perceptual straightness; and (2) Multi-Path Ensemble Sampling (MPES), which aims at reducing stochastic variation by ensembling multiple diffusion trajectories to improve fidelity (distortion) scores, such as PSNR and SSIM, without sacrificing sharpness. Together, these training-free techniques provide a practical path toward temporally stable high-fidelity perceptual video restoration using large pretrained diffusion models. We performed extensive experiments over multiple datasets and degradation types, systematically evaluating each strategy to understand their strengths and limitations. Our results show that while PSG enhances temporal naturalness, particularly in case of temporal blur, MPES consistently improves fidelity and spatio-temporal perception--distortion trade-off across all tasks.
☆ Energy consumption assessment of a Virtual Reality Remote Rendering application over 5G networks
This paper investigates the energy implications of remote rendering for Virtual Reality (VR) applications within a real 5G testbed. Remote rendering enables lightweight devices to access high-performance graphical content by offloading computationally intensive tasks to Cloud-native Network Functions (CNFs) running on remote servers. However, this approach raises concerns regarding energy consumption across the various network components involved, including the remote computing node, the 5G Core, the Radio Access Network (RAN), and the User Equipment (UE). This work proposes and evaluates two complementary energy monitoring solutions, one hardware-based and one software-based, to measure energy consumption at different system levels. A VR remote renderer, deployed as CNF and leveraging the Media over QUIC (MoQ) protocol, is used as test case for assessing its energy footprint under different multimedia and network configurations. The results provide critical insights into the trade-off between energy consumption and performance of a real-world VR application running in a 5G environment.
☆ Seeing Clearly and Deeply: An RGBD Imaging Approach with a Bio-inspired Monocentric Design
Achieving high-fidelity, compact RGBD imaging presents a dual challenge: conventional compact optics struggle with RGB sharpness across the entire depth-of-field, while software-only Monocular Depth Estimation (MDE) is an ill-posed problem reliant on unreliable semantic priors. While deep optics with elements like DOEs can encode depth, they introduce trade-offs in fabrication complexity and chromatic aberrations, compromising simplicity. To address this, we first introduce a novel bio-inspired all-spherical monocentric lens, around which we build the Bionic Monocentric Imaging (BMI) framework, a holistic co-design. This optical design naturally encodes depth into its depth-varying Point Spread Functions (PSFs) without requiring complex diffractive or freeform elements. We establish a rigorous physically-based forward model to generate a synthetic dataset by precisely simulating the optical degradation process. This simulation pipeline is co-designed with a dual-head, multi-scale reconstruction network that employs a shared encoder to jointly recover a high-fidelity All-in-Focus (AiF) image and a precise depth map from a single coded capture. Extensive experiments validate the state-of-the-art performance of the proposed framework. In depth estimation, the method attains an Abs Rel of 0.026 and an RMSE of 0.130, markedly outperforming leading software-only approaches and other deep optics systems. For image restoration, the system achieves an SSIM of 0.960 and a perceptual LPIPS score of 0.082, thereby confirming a superior balance between image fidelity and depth accuracy. This study illustrates that the integration of bio-inspired, fully spherical optics with a joint reconstruction algorithm constitutes an effective strategy for addressing the intrinsic challenges in high-performance compact RGBD imaging. Source code will be publicly available at https://github.com/ZongxiYu-ZJU/BMI.
comment: The source code will be publicly available at https://github.com/ZongxiYu-ZJU/BMI
Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning
We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
comment: This work is to appear in the Proceedings of MICAD 2025, the 6th International Conference on Medical Imaging and Computer-Aided Diagnosis
☆ Neighborhood Feature Pooling for Remote Sensing Image Classification WACV 2026
In this work, we propose neighborhood feature pooling (NFP) as a novel texture feature extraction method for remote sensing image classification. The NFP layer captures relationships between neighboring inputs and efficiently aggregates local similarities across feature dimensions. Implemented using convolutional layers, NFP can be seamlessly integrated into any network. Results comparing the baseline models and the NFP method indicate that NFP consistently improves performance across diverse datasets and architectures while maintaining minimal parameter overhead.
comment: 9 pages, 5 figures. Accepted to WACV 2026 (Winter Conference on Applications of Computer Vision)
☆ Groupwise Registration with Physics-Informed Test-Time Adaptation on Multi-parametric Cardiac MRI
Multiparametric mapping MRI has become a viable tool for myocardial tissue characterization. However, misalignment between multiparametric maps makes pixel-wise analysis challenging. To address this challenge, we developed a generalizable physics-informed deep-learning model using test-time adaptation to enable group image registration across contrast weighted images acquired from multiple physical models (e.g., a T1 mapping model and T2 mapping model). The physics-informed adaptation utilized the synthetic images from specific physics model as registration reference, allows for transductive learning for various tissue contrast. We validated the model in healthy volunteers with various MRI sequences, demonstrating its improvement for multi-modal registration with a wide range of image contrast variability.
☆ Diffusion-Driven Generation of Minimally Preprocessed Brain MRI
The purpose of this study is to present and compare three denoising diffusion probabilistic models (DDPMs) that generate 3D $T_1$-weighted MRI human brain images. Three DDPMs were trained using 80,675 image volumes from 42,406 subjects spanning 38 publicly available brain MRI datasets. These images had approximately 1 mm isotropic resolution and were manually inspected by three human experts to exclude those with poor quality, field-of-view issues, and excessive pathology. The images were minimally preprocessed to preserve the visual variability of the data. Furthermore, to enable the DDPMs to produce images with natural orientation variations and inhomogeneity, the images were neither registered to a common coordinate system nor bias field corrected. Evaluations included segmentation, Frechet Inception Distance (FID), and qualitative inspection. Regarding results, all three DDPMs generated coherent MR brain volumes. The velocity and flow prediction models achieved lower FIDs than the sample prediction model. However, all three models had higher FIDs compared to real images across multiple cohorts. In a permutation experiment, the generated brain regional volume distributions differed statistically from real data. However, the velocity and flow prediction models had fewer statistically different volume distributions in the thalamus and putamen. In conclusion this work presents and releases the first 3D non-latent diffusion model for brain data without skullstripping or registration. Despite the negative results in statistical testing, the presented DDPMs are capable of generating high-resolution 3D $T_1$-weighted brain images. All model weights and corresponding inference code are publicly available at https://github.com/piksl-research/medforj .
☆ R3GAN-based Optimal Strategy for Augmenting Small Medical Dataset
Medical image analysis often suffers from data scarcity and class imbalance, limiting the effectiveness of deep learning models in clinical applications. Using human embryo time-lapse imaging (TLI) as a case study, this work investigates how generative adversarial networks (GANs) can be optimized for small datasets to generate realistic and diagnostically meaningful images. Based on systematic experiments with R3GAN, we established effective training strategies and designed an optimized configuration for 256x256-resolution datasets, featuring a full burn-in phase and a low, gradually increasing gamma range (5 -> 40). The generated samples were used to balance an imbalanced embryo dataset, leading to substantial improvement in classification performance. The recall and F1-score of t3 increased from 0.06 to 0.69 and 0.11 to 0.60, respectively, without compromising other classes. These results demonstrate that tailored R3GAN training strategies can effectively alleviate data scarcity and improve model robustness in small-scale medical imaging tasks.
☆ UP2D: Uncertainty-aware Progressive Pseudo-label Denoising for Source-Free Domain Adaptive Medical Image Segmentation
Medical image segmentation models face severe performance drops under domain shifts, especially when data sharing constraints prevent access to source images. We present a novel Uncertainty-aware Progressive Pseudo-label Denoising (UP2D) framework for source-free domain adaptation (SFDA), designed to mitigate noisy pseudo-labels and class imbalance during adaptation. UP2D integrates three key components: (i) a Refined Prototype Filtering module that suppresses uninformative regions and constructs reliable class prototypes to denoise pseudo-labels, (ii) an Uncertainty-Guided EMA (UG-EMA) strategy that selectively updates the teacher model based on spatially weighted boundary uncertainty, and (iii) a quantile-based entropy minimization scheme that focuses learning on ambiguous regions while avoiding overconfidence on easy pixels. This single-stage student-teacher framework progressively improves pseudo-label quality and reduces confirmation bias. Extensive experiments on three challenging retinal fundus benchmarks demonstrate that UP2D achieves state-of-the-art performance across both standard and open-domain settings, outperforming prior UDA and SFDA approaches while maintaining superior boundary precision.
♻ ☆ GENRE-CMR: Generalizable Deep Learning for Diverse Multi-Domain Cardiac MRI Reconstruction
Accelerated Cardiovascular Magnetic Resonance (CMR) image reconstruction remains a critical challenge due to the trade-off between scan time and image quality, particularly when generalizing across diverse acquisition settings. We propose GENRE-CMR, a generative adversarial network (GAN)-based architecture employing a residual deep unrolled reconstruction framework to enhance reconstruction fidelity and generalization. The architecture unrolls iterative optimization into a cascade of convolutional subnetworks, enriched with residual connections to enable progressive feature propagation from shallow to deeper stages. To further improve performance, we integrate two loss functions: (1) an Edge-Aware Region (EAR) loss, which guides the network to focus on structurally informative regions and helps prevent common reconstruction blurriness; and (2) a Statistical Distribution Alignment (SDA) loss, which regularizes the feature space across diverse data distributions via a symmetric KL divergence formulation. Extensive experiments confirm that GENRE-CMR surpasses state-of-the-art methods on training and unseen data, achieving 0.9552 SSIM and 38.90 dB PSNR on unseen distributions across various acceleration factors and sampling trajectories. Ablation studies confirm the contribution of each proposed component to reconstruction quality and generalization. Our framework presents a unified and robust solution for high-quality CMR reconstruction, paving the way for clinically adaptable deployment across heterogeneous acquisition protocols.
♻ ☆ Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles
Road traffic accidents remain a significant global concern, with human error, particularly distracted and impaired driving, among the leading causes. This study introduces a novel driver behaviour classification system that uses external observation techniques to detect indicators of distraction and impairment. The proposed framework employs advanced computer vision methodologies, including real-time object tracking, lateral displacement analysis, and lane position monitoring. The system identifies unsafe driving behaviours such as excessive lateral movement and erratic trajectory patterns by implementing the YOLO object detection model and custom lane estimation algorithms. Unlike systems reliant on inter-vehicular communication, this vision-based approach enables behavioural analysis of non-connected vehicles. Experimental evaluations on diverse video datasets demonstrate the framework's reliability and adaptability across varying road and environmental conditions.
♻ ☆ AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets
Background: Development of artificial intelligence (AI) models for lung cancer screening requires large, well-annotated low-dose computed tomography (CT) datasets and rigorous performance benchmarks. Purpose: To create a reproducible benchmarking resource leveraging the Duke Lung Cancer Screening (DLCS) and multiple public datasets to develop and evaluate models for nodule detection and classification. Materials & Methods: This retrospective study uses the DLCS dataset (1,613 patients; 2,487 nodules) and external datasets including LUNA16, LUNA25, and NLST-3D. For detection, MONAI RetinaNet models were trained on DLCS (DLCS-De) and LUNA16 (LUNA16-De) and evaluated using the Competition Performance Metric (CPM). For nodule-level classification, we compare five strategies: pretrained models (Models Genesis, Med3D), a self-supervised foundation model (FMCB), and ResNet50 with random initialization versus Strategic Warm-Start (ResNet50-SWS) pretrained with detection-derived candidate patches stratified by confidence. Results: For detection on the DLCS test set, DLCS-De achieved sensitivity 0.82 at 2 false positives/scan (CPM 0.63) versus LUNA16-De (0.62, CPM 0.45). For external validation on NLST-3D, DLCS-De (sensitivity 0.72, CPM 0.58) also outperformed LUNA16-De (sensitivity 0.64, CPM 0.49). For classification across multiple datasets, ResNet50-SWS attained AUCs of 0.71 (DLCS; 95% CI, 0.61-0.81), 0.90 (LUNA16; 0.87-0.93), 0.81 (NLST-3D; 0.79-0.82), and 0.80 (LUNA25; 0.78-0.82), matching or exceeding pretrained/self-supervised baselines. Performance differences reflected dataset label standards. Conclusion: This work establishes a standardized benchmarking resource for lung cancer AI research, supporting model development, validation, and translation. All code, models, and data are publicly released to promote reproducibility.
comment: 2 tables, 5 figures
♻ ☆ Caption-Driven Explainability: Probing CNNs for Bias via CLIP ICIP 2025
Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the models prediction. This explanation minimizes the risk of the standalone model falling for a covariate shift and contributes significantly towards developing robust ML models. Our code is available at https://github.com/patch0816/caption-driven-xai
comment: Accepted and presented at the IEEE ICIP 2025 Satellite Workshop "Generative AI for World Simulations and Communications & Celebrating 40 Years of Excellence in Education: Honoring Professor Aggelos Katsaggelos", Anchorage, Alaska, USA, September 14, 2025. Camera-ready preprint; the official IEEE Xplore publication will follow. Code: https://github.com/patch0816/caption-driven-xai
Computer Vision and Pattern Recognition 171
☆ Generative View Stitching
Autoregressive video diffusion models are capable of long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce Omni Guidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersv\"ard's Impossible Staircase. Results are best viewed as videos at https://andrewsonga.github.io/gvs.
comment: Project website: https://andrewsonga.github.io/gvs
☆ Uniform Discrete Diffusion with Metric Path for Video Generation
Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for the scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA
comment: 19 pages, 10 figures
☆ Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance
Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. To this end, we present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. Code and models will be made publicly available.
☆ Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers? NeurIPS 2025
Object binding, the brain's ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term IsSameObject. We decode IsSameObject from patch embeddings across ViT layers using a similarity probe, which reaches over 90% accuracy. Crucially, this object-binding capability emerges reliably in self-supervised ViTs (DINO, MAE, CLIP), but markedly weaker in ImageNet-supervised models, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. Our findings challenge the view that ViTs lack object binding and highlight how symbolic knowledge of "which parts belong together" emerges naturally in a connectionist system.
comment: Accepted as a Spotlight at NeurIPS 2025
☆ MIC-BEV: Multi-Infrastructure Camera Bird's-Eye-View Transformer with Relation-Aware Fusion for 3D Object Detection
Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios due to challenges such as multi-view infrastructure setup, diverse camera configurations, degraded visual inputs, and various road layouts. We introduce MIC-BEV, a Transformer-based bird's-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection, featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment. The dataset and source code are available at: https://github.com/HandsomeYun/MIC-BEV.
☆ SAGE: Structure-Aware Generative Video Transitions between Diverse Clips
Video transitions aim to synthesize intermediate frames between two clips, but naive approaches such as linear blending introduce artifacts that limit professional use or break temporal coherence. Traditional techniques (cross-fades, morphing, frame interpolation) and recent generative inbetweening methods can produce high-quality plausible intermediates, but they struggle with bridging diverse clips involving large temporal gaps or significant semantic differences, leaving a gap for content-aware and visually coherent transitions. We address this challenge by drawing on artistic workflows, distilling strategies such as aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity. Building on this, we propose SAGE (Structure-Aware Generative vidEo transitions) as a zeroshot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis, enabling smooth, semantically consistent transitions without fine-tuning. Extensive experiments and comparison with current alternatives, namely [FILM, TVG, DiffMorpher, VACE, GI], demonstrate that SAGE outperforms both classical and generative baselines on quantitative metrics and user studies for producing transitions between diverse clips. Code to be released on acceptance.
comment: Website: https://kan32501.github.io/sage.github.io/
☆ Group Relative Attention Guidance for Image Editing
Recently, image editing based on Diffusion-in-Transformer models has undergone rapid development. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes the content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance, a simple yet effective method that reweights the delta values of different tokens to modulate the focus of the model on the input image relative to the editing instruction, enabling continuous and fine-grained control over editing intensity without any tuning. Extensive experiments conducted on existing image editing frameworks demonstrate that GRAG can be integrated with as few as four lines of code, consistently enhancing editing quality. Moreover, compared to the commonly used Classifier-Free Guidance, GRAG achieves smoother and more precise control over the degree of editing. Our code will be released at https://github.com/little-misfit/GRAG-Image-Editing.
☆ Eye-Tracking, Mouse Tracking, Stimulus Tracking,and Decision-Making Datasets in Digital Pathology
Interpretation of giga-pixel whole-slide images (WSIs) is an important but difficult task for pathologists. Their diagnostic accuracy is estimated to average around 70%. Adding a second pathologist does not substantially improve decision consistency. The field lacks adequate behavioral data to explain diagnostic errors and inconsistencies. To fill in this gap, we present PathoGaze1.0, a comprehensive behavioral dataset capturing the dynamic visual search and decision-making processes of the full diagnostic workflow during cancer diagnosis. The dataset comprises 18.69 hours of eye-tracking, mouse interaction, stimulus tracking, viewport navigation, and diagnostic decision data (EMSVD) collected from 19 pathologists interpreting 397 WSIs. The data collection process emphasizes ecological validity through an application-grounded testbed, called PTAH. In total, we recorded 171,909 fixations, 263,320 saccades, and 1,867,362 mouse interaction events. In addition, such data could also be used to improve the training of both pathologists and AI systems that might support human experts. All experiments were preregistered at https://osf.io/hj9a7, and the complete dataset along with analysis code is available at https://go.osu.edu/pathogaze.
comment: 16 pages, 9 figures, submitted to Nature Scientific Data
☆ A Dual-Branch CNN for Robust Detection of AI-Generated Facial Forgeries
The rapid advancement of generative AI has enabled the creation of highly realistic forged facial images, posing significant threats to AI security, digital media integrity, and public trust. Face forgery techniques, ranging from face swapping and attribute editing to powerful diffusion-based image synthesis, are increasingly being used for malicious purposes such as misinformation, identity fraud, and defamation. This growing challenge underscores the urgent need for robust and generalizable face forgery detection methods as a critical component of AI security infrastructure. In this work, we propose a novel dual-branch convolutional neural network for face forgery detection that leverages complementary cues from both spatial and frequency domains. The RGB branch captures semantic information, while the frequency branch focuses on high-frequency artifacts that are difficult for generative models to suppress. A channel attention module is introduced to adaptively fuse these heterogeneous features, highlighting the most informative channels for forgery discrimination. To guide the network's learning process, we design a unified loss function, FSC Loss, that combines focal loss, supervised contrastive loss, and a frequency center margin loss to enhance class separability and robustness. We evaluate our model on the DiFF benchmark, which includes forged images generated from four representative methods: text-to-image, image-to-image, face swap, and face edit. Our method achieves strong performance across all categories and outperforms average human accuracy. These results demonstrate the model's effectiveness and its potential contribution to safeguarding AI ecosystems against visual forgery attacks.
☆ GroundLoc: Efficient Large-Scale Outdoor LiDAR-Only Localization
In this letter, we introduce GroundLoc, a LiDAR-only localization pipeline designed to localize a mobile robot in large-scale outdoor environments using prior maps. GroundLoc employs a Bird's-Eye View (BEV) image projection focusing on the perceived ground area and utilizes the place recognition network R2D2, or alternatively, the non-learning approach Scale-Invariant Feature Transform (SIFT), to identify and select keypoints for BEV image map registration. Our results demonstrate that GroundLoc outperforms state-of-the-art methods on the SemanticKITTI and HeLiPR datasets across various sensors. In the multi-session localization evaluation, GroundLoc reaches an Average Trajectory Error (ATE) well below 50 cm on all Ouster OS2 128 sequences while meeting online runtime requirements. The system supports various sensor models, as evidenced by evaluations conducted with Velodyne HDL-64E, Ouster OS2 128, Aeva Aeries II, and Livox Avia sensors. The prior maps are stored as 2D raster image maps, which can be created from a single drive and require only 4 MB of storage per square kilometer. The source code is available at https://github.com/dcmlr/groundloc.
☆ Physics-Inspired Gaussian Kolmogorov-Arnold Networks for X-ray Scatter Correction in Cone-Beam CT
Cone-beam CT (CBCT) employs a flat-panel detector to achieve three-dimensional imaging with high spatial resolution. However, CBCT is susceptible to scatter during data acquisition, which introduces CT value bias and reduced tissue contrast in the reconstructed images, ultimately degrading diagnostic accuracy. To address this issue, we propose a deep learning-based scatter artifact correction method inspired by physical prior knowledge. Leveraging the fact that the observed point scatter probability density distribution exhibits rotational symmetry in the projection domain. The method uses Gaussian Radial Basis Functions (RBF) to model the point scatter function and embeds it into the Kolmogorov-Arnold Networks (KAN) layer, which provides efficient nonlinear mapping capabilities for learning high-dimensional scatter features. By incorporating the physical characteristics of the scattered photon distribution together with the complex function mapping capacity of KAN, the model improves its ability to accurately represent scatter. The effectiveness of the method is validated through both synthetic and real-scan experiments. Experimental results show that the model can effectively correct the scatter artifacts in the reconstructed images and is superior to the current methods in terms of quantitative metrics.
comment: 8 pages, 6 figures
☆ OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
With advances in decision-making and reasoning capabilities, multimodal agents show strong potential in computer application scenarios. Past evaluations have mainly assessed GUI interaction skills, while tool invocation abilities, such as those enabled by the Model Context Protocol (MCP), have been largely overlooked. Comparing agents with integrated tool invocation to those evaluated only on GUI interaction is inherently unfair. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents' tool invocation, GUI operation, and decision-making abilities in a real-world environment. We design a novel automated code-generation pipeline to create tools and combine them with a curated selection from existing tools. Rigorous manual validation yields 158 high-quality tools (covering 7 common applications), each verified for correct functionality, practical applicability, and versatility. Extensive evaluations of state-of-the-art multimodal agents on OSWorld-MCP show that MCP tools generally improve task success rates (e.g., from 8.3% to 20.4% for OpenAI o3 at 15 steps, from 40.1% to 43.3% for Claude 4 Sonnet at 50 steps), underscoring the importance of assessing tool invocation capabilities. However, even the strongest models have relatively low tool invocation rates, Only 36.3%, indicating room for improvement and highlighting the benchmark's challenge. By explicitly measuring MCP tool usage skills, OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments. Our code, environment, and data are publicly available at https://osworld-mcp.github.io.
☆ Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
While Multimodal Large Language Models (MLLMs) excel at visual understanding, they often struggle in complex scenarios that require visual planning and imagination. Inspired by how humans use sketching as a form of visual thinking to develop and communicate ideas, we introduce Latent Sketchpad, a framework that equips MLLMs with an internal visual scratchpad. The internal visual representations of MLLMs have traditionally been confined to perceptual understanding. We repurpose them to support generative visual thought without compromising reasoning ability. Building on frontier MLLMs, our approach integrates visual generation directly into their native autoregressive reasoning process. It allows the model to interleave textual reasoning with the generation of visual latents. These latents guide the internal thought process and can be translated into sketch images for interpretability. To realize this, we introduce two components: a Context-Aware Vision Head autoregressively produces visual representations, and a pretrained Sketch Decoder renders these into human-interpretable images. We evaluate the framework on our new dataset MazePlanning. Experiments across various MLLMs show that Latent Sketchpad delivers comparable or even superior reasoning performance to their backbone. It further generalizes across distinct frontier MLLMs, including Gemma3 and Qwen2.5-VL. By extending model's textual reasoning to visual thinking, our framework opens new opportunities for richer human-computer interaction and broader applications. More details and resources are available on our project page: https://latent-sketchpad.github.io/.
☆ Local Performance vs. Out-of-Distribution Generalization: An Empirical Analysis of Personalized Federated Learning in Heterogeneous Data Environments
In the context of Federated Learning with heterogeneous data environments, local models tend to converge to their own local model optima during local training steps, deviating from the overall data distributions. Aggregation of these local updates, e.g., with FedAvg, often does not align with the global model optimum (client drift), resulting in an update that is suboptimal for most clients. Personalized Federated Learning approaches address this challenge by exclusively focusing on the average local performances of clients' models on their own data distribution. Generalization to out-of-distribution samples, which is a substantial benefit of FedAvg and represents a significant component of robustness, appears to be inadequately incorporated into the assessment and evaluation processes. This study involves a thorough evaluation of Federated Learning approaches, encompassing both their local performance and their generalization capabilities. Therefore, we examine different stages within a single communication round to enable a more nuanced understanding of the considered metrics. Furthermore, we propose and incorporate a modified approach of FedAvg, designated as Federated Learning with Individualized Updates (FLIU), extending the algorithm by a straightforward individualization step with an adaptive personalization factor. We evaluate and compare the approaches empirically using MNIST and CIFAR-10 under various distributional conditions, including benchmark IID and pathological non-IID, as well as additional novel test environments with Dirichlet distribution specifically developed to stress the algorithms on complex data heterogeneity.
☆ Fast and accurate neural reflectance transformation imaging through knowledge distillation
Reflectance Transformation Imaging (RTI) is very popular for its ability to visually analyze surfaces by enhancing surface details through interactive relighting, starting from only a few tens of photographs taken with a fixed camera and variable illumination. Traditional methods like Polynomial Texture Maps (PTM) and Hemispherical Harmonics (HSH) are compact and fast, but struggle to accurately capture complex reflectance fields using few per-pixel coefficients and fixed bases, leading to artifacts, especially in highly reflective or shadowed areas. The NeuralRTI approach, which exploits a neural autoencoder to learn a compact function that better approximates the local reflectance as a function of light directions, has been shown to produce superior quality at comparable storage cost. However, as it performs interactive relighting with custom decoder networks with many parameters, the rendering step is computationally expensive and not feasible at full resolution for large images on limited hardware. Earlier attempts to reduce costs by directly training smaller networks have failed to produce valid results. For this reason, we propose to reduce its computational cost through a novel solution based on Knowledge Distillation (DisK-NeuralRTI). ...
comment: 18 pages
☆ Decoupled MeanFlow: Turning Flow Models into Flow Maps for Accelerated Sampling
Denoising generative models, such as diffusion and flow-based models, produce high-quality samples but require many denoising steps due to discretization error. Flow maps, which estimate the average velocity between timesteps, mitigate this error and enable faster sampling. However, their training typically demands architectural changes that limit compatibility with pretrained flow models. We introduce Decoupled MeanFlow, a simple decoding strategy that converts flow models into flow map models without architectural modifications. Our method conditions the final blocks of diffusion transformers on the subsequent timestep, allowing pretrained flow models to be directly repurposed as flow maps. Combined with enhanced training techniques, this design enables high-quality generation in as few as 1 to 4 steps. Notably, we find that training flow models and subsequently converting them is more efficient and effective than training flow maps from scratch. On ImageNet 256x256 and 512x512, our models attain 1-step FID of 2.16 and 2.12, respectively, surpassing prior art by a large margin. Furthermore, we achieve FID of 1.51 and 1.68 when increasing the steps to 4, which nearly matches the performance of flow models while delivering over 100x faster inference.
☆ Kineo: Calibration-Free Metric Motion Capture From Sparse RGB Cameras
Markerless multiview motion capture is often constrained by the need for precise camera calibration, limiting accessibility for non-experts and in-the-wild captures. Existing calibration-free approaches mitigate this requirement but suffer from high computational cost and reduced reconstruction accuracy. We present Kineo, a fully automatic, calibration-free pipeline for markerless motion capture from videos captured by unsynchronized, uncalibrated, consumer-grade RGB cameras. Kineo leverages 2D keypoints from off-the-shelf detectors to simultaneously calibrate cameras, including Brown-Conrady distortion coefficients, and reconstruct 3D keypoints and dense scene point maps at metric scale. A confidence-driven spatio-temporal keypoint sampling strategy, combined with graph-based global optimization, ensures robust calibration at a fixed computational cost independent of sequence length. We further introduce a pairwise reprojection consensus score to quantify 3D reconstruction reliability for downstream tasks. Evaluations on EgoHumans and Human3.6M demonstrate substantial improvements over prior calibration-free methods. Compared to previous state-of-the-art approaches, Kineo reduces camera translation error by approximately 83-85%, camera angular error by 86-92%, and world mean-per-joint error (W-MPJPE) by 83-91%. Kineo is also efficient in real-world scenarios, processing multi-view sequences faster than their duration in specific configuration (e.g., 36min to process 1h20min of footage). The full pipeline and evaluation code are openly released to promote reproducibility and practical adoption at https://liris-xr.github.io/kineo/.
☆ A Critical Study towards the Detection of Parkinsons Disease using ML Technologies
The proposed solution is Deep Learning Technique that will be able classify three types of tea leaves diseases from which two diseases are caused by the pests and one due to pathogens (infectious organisms) and environmental conditions and also show the area damaged by a disease in leaves. Namely Red Rust, Helopeltis and Red spider mite respectively. In this paper we have evaluated two models namely SSD MobileNet V2 and Faster R-CNN ResNet50 V1 for the object detection. The SSD MobileNet V2 gave precision of 0.209 for IOU range of 0.50:0.95 with recall of 0.02 on IOU 0.50:0.95 and final mAP of 20.9%. While Faster R-CNN ResNet50 V1 has precision of 0.252 on IOU range of 0.50:0.95 and recall of 0.044 on IOU of 0.50:0.95 with a mAP of 25%, which is better than SSD. Also used Mask R-CNN for Object Instance Segmentation where we have implemented our custom method to calculate the damaged diseased portion of leaves. Keywords: Tea Leaf Disease, Deep Learning, Red Rust, Helopeltis and Red Spider Mite, SSD MobileNet V2, Faster R-CNN ResNet50 V1 and Mask RCNN.
☆ Rethinking Visual Intelligence: Insights from Video Pretraining
Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
comment: Updated version from preprint arXiv:2506.07280 (Gen2Gen) focused on visual intelligence. This work can be considered as v2
☆ SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space
Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases-crucial in real-world applications where users express the same intent in varied ways-remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA-a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing-even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.
☆ Deeply-Conditioned Image Compression via Self-Generated Priors
Learned image compression (LIC) has shown great promise for achieving high rate-distortion performance. However, current LIC methods are often limited in their capability to model the complex correlation structures inherent in natural images, particularly the entanglement of invariant global structures with transient local textures within a single monolithic representation. This limitation precipitates severe geometric deformation at low bitrates. To address this, we introduce a framework predicated on functional decomposition, which we term Deeply-Conditioned Image Compression via self-generated priors (DCIC-sgp). Our central idea is to first encode a potent, self-generated prior to encapsulate the image's structural backbone. This prior is subsequently utilized not as mere side-information, but to holistically modulate the entire compression pipeline. This deep conditioning, most critically of the analysis transform, liberates it to dedicate its representational capacity to the residual, high-entropy details. This hierarchical, dependency-driven approach achieves an effective disentanglement of information streams. Our extensive experiments validate this assertion; visual analysis demonstrates that our method substantially mitigates the geometric deformation artifacts that plague conventional codecs at low bitrates. Quantitatively, our framework establishes highly competitive performance, achieving significant BD-rate reductions of 14.4%, 15.7%, and 15.1% against the VVC test model VTM-12.1 on the Kodak, CLIC, and Tecnick datasets.
☆ XAI Evaluation Framework for Semantic Segmentation
Ensuring transparency and trust in artificial intelligence (AI) models is essential, particularly as they are increasingly applied in safety-critical and high-stakes domains. Explainable AI (XAI) has emerged as a promising approach to address this challenge, yet the rigorous evaluation of XAI methods remains crucial for optimizing the trade-offs between model complexity, predictive performance, and interpretability. While extensive progress has been achieved in evaluating XAI techniques for classification tasks, evaluation strategies tailored to semantic segmentation remain relatively underexplored. This work introduces a comprehensive and systematic evaluation framework specifically designed for assessing XAI in semantic segmentation, explicitly accounting for both spatial and contextual task complexities. The framework employs pixel-level evaluation strategies and carefully designed metrics to provide fine-grained interpretability insights. Simulation results using recently adapted class activation mapping (CAM)-based XAI schemes demonstrate the efficiency, robustness, and reliability of the proposed methodology. These findings contribute to advancing transparent, trustworthy, and accountable semantic segmentation models.
☆ 50 Years of Water Body Monitoring: The Case of Qaraaoun Reservoir, Lebanon
The sustainable management of the Qaraaoun Reservoir, the largest surface water body in Lebanon located in the Bekaa Plain, depends on reliable monitoring of its storage volume despite frequent sensor malfunctions and limited maintenance capacity. This study introduces a sensor-free approach that integrates open-source satellite imagery, advanced water-extent segmentation, and machine learning to estimate the reservoir surface area and volume in near real time. Sentinel-2 and Landsat images are processed, where surface water is delineated using a newly proposed water segmentation index. A machine learning model based on Support Vector Regression (SVR) is trained on a curated dataset that includes water surface area, water level, and water volume calculations using a reservoir bathymetry survey. The model is then able to estimate reservoir volume relying solely on surface area extracted from satellite imagery, without the need for ground measurements. Water segmentation using the proposed index aligns with ground truth for more than 95 percent of the shoreline. Hyperparameter tuning with GridSearchCV yields an optimized SVR performance with error under 1.5 percent of full reservoir capacity and coefficients of determination exceeding 0.98. These results demonstrate the robustness and cost-effectiveness of the method, offering a practical solution for continuous, sensor-independent monitoring of reservoir storage. The proposed methodology can be replicated for other water bodies, and the resulting 50 years of time-series data is valuable for research on climate change and environmental patterns.
☆ OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents.
comment: work in progress
☆ A Hybrid Approach for Visual Multi-Object Tracking
This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2
comment: This work has been submitted to the IEEE for possible publication
☆ GenTrack: A New Generation of Multi-Object Tracking
This paper introduces a novel multi-object tracking (MOT) method, dubbed GenTrack, whose main contributions include: a hybrid tracking approach employing both stochastic and deterministic manners to robustly handle unknown and time-varying numbers of targets, particularly in maintaining target identity (ID) consistency and managing nonlinear dynamics, leveraging particle swarm optimization (PSO) with some proposed fitness measures to guide stochastic particles toward their target distribution modes, enabling effective tracking even with weak and noisy object detectors, integration of social interactions among targets to enhance PSO-guided particles as well as improve continuous updates of both strong (matched) and weak (unmatched) tracks, thereby reducing ID switches and track loss, especially during occlusions, a GenTrack-based redefined visual MOT baseline incorporating a comprehensive state and observation model based on space consistency, appearance, detection confidence, track penalties, and social scores for systematic and efficient target updates, and the first-ever publicly available source-code reference implementation with minimal dependencies, featuring three variants, including GenTrack Basic, PSO, and PSO-Social, facilitating flexible reimplementation. Experimental results have shown that GenTrack provides superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with integrated implementations of baselines for fair comparison. Potential directions for future work are also discussed. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack
comment: This work has been submitted to the IEEE for possible publication
☆ Unsupervised Detection of Post-Stroke Brain Abnormalities
Post-stroke MRI not only delineates focal lesions but also reveals secondary structural changes, such as atrophy and ventricular enlargement. These abnormalities, increasingly recognised as imaging biomarkers of recovery and outcome, remain poorly captured by supervised segmentation methods. We evaluate REFLECT, a flow-based generative model, for unsupervised detection of both focal and non-lesional abnormalities in post-stroke patients. Using dual-expert central-slice annotations on ATLAS data, performance was assessed at the object level with Free-Response ROC analysis for anomaly maps. Two models were trained on lesion-free slices from stroke patients (ATLAS) and on healthy controls (IXI) to test the effect of training data. On ATLAS test subjects, the IXI-trained model achieved higher lesion segmentation (Dice = 0.37 vs 0.27) and improved sensitivity to non-lesional abnormalities (FROC = 0.62 vs 0.43). Training on fully healthy anatomy improves the modelling of normal variability, enabling broader and more reliable detection of structural abnormalities.
☆ When are radiology reports useful for training medical image classifiers?
Medical images used to train machine learning models are often accompanied by radiology reports containing rich expert annotations. However, relying on these reports as inputs for clinical prediction requires the timely manual work of a trained radiologist. This raises a natural question: when can radiology reports be leveraged during training to improve image-only classification? Prior works are limited to evaluating pre-trained image representations by fine-tuning them to predict diagnostic labels, often extracted from reports, ignoring tasks with labels that are weakly associated with the text. To address this gap, we conduct a systematic study of how radiology reports can be used during both pre-training and fine-tuning, across diagnostic and prognostic tasks (e.g., 12-month readmission), and under varying training set sizes. Our findings reveal that: (1) Leveraging reports during pre-training is beneficial for downstream classification tasks where the label is well-represented in the text; however, pre-training through explicit image-text alignment can be detrimental in settings where it's not; (2) Fine-tuning with reports can lead to significant improvements and even have a larger impact than the pre-training method in certain settings. These results provide actionable insights into when and how to leverage privileged text data to train medical image classifiers while highlighting gaps in current research.
☆ A Luminance-Aware Multi-Scale Network for Polarization Image Fusion with a Multi-Scene Dataset
Polarization image fusion combines S0 and DOLP images to reveal surface roughness and material properties through complementary texture features, which has important applications in camouflage recognition, tissue pathology analysis, surface defect detection and other fields. To intergrate coL-Splementary information from different polarized images in complex luminance environment, we propose a luminance-aware multi-scale network (MLSN). In the encoder stage, we propose a multi-scale spatial weight matrix through a brightness-branch , which dynamically weighted inject the luminance into the feature maps, solving the problem of inherent contrast difference in polarized images. The global-local feature fusion mechanism is designed at the bottleneck layer to perform windowed self-attention computation, to balance the global context and local details through residual linking in the feature dimension restructuring stage. In the decoder stage, to further improve the adaptability to complex lighting, we propose a Brightness-Enhancement module, establishing the mapping relationship between luminance distribution and texture features, realizing the nonlinear luminance correction of the fusion result. We also present MSP, an 1000 pairs of polarized images that covers 17 types of indoor and outdoor complex lighting scenes. MSP provides four-direction polarization raw maps, solving the scarcity of high-quality datasets in polarization image fusion. Extensive experiment on MSP, PIF and GAND datasets verify that the proposed MLSN outperms the state-of-the-art methods in subjective and objective evaluations, and the MS-SSIM and SD metircs are higher than the average values of other methods by 8.57%, 60.64%, 10.26%, 63.53%, 22.21%, and 54.31%, respectively. The source code and dataset is avalable at https://github.com/1hzf/MLS-UNet.
☆ Stroke Lesion Segmentation in Clinical Workflows: A Modular, Lightweight, and Deployment-Ready Tool
Deep learning frameworks such as nnU-Net achieve state-of-the-art performance in brain lesion segmentation but remain difficult to deploy clinically due to heavy dependencies and monolithic design. We introduce \textit{StrokeSeg}, a modular and lightweight framework that translates research-grade stroke lesion segmentation models into deployable applications. Preprocessing, inference, and postprocessing are decoupled: preprocessing relies on the Anima toolbox with BIDS-compliant outputs, and inference uses ONNX Runtime with \texttt{Float16} quantisation, reducing model size by about 50\%. \textit{StrokeSeg} provides both graphical and command-line interfaces and is distributed as Python scripts and as a standalone Windows executable. On a held-out set of 300 sub-acute and chronic stroke subjects, segmentation performance was equivalent to the original PyTorch pipeline (Dice difference $<10^{-3}$), demonstrating that high-performing research pipelines can be transformed into portable, clinically usable tools.
☆ Decoupling What to Count and Where to See for Referring Expression Counting
Referring Expression Counting (REC) extends class-level object counting to the fine-grained subclass-level, aiming to enumerate objects matching a textual expression that specifies both the class and distinguishing attribute. A fundamental challenge, however, has been overlooked: annotation points are typically placed on class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information from other visual regions (e.g., legs for "walking"). To address this, we propose W2-Net, a novel framework that explicitly decouples the problem into "what to count" and "where to see" via a dual-query mechanism. Specifically, alongside the standard what-to-count (w2c) queries that localize the object, we introduce dedicated where-to-see (w2s) queries. The w2s queries are guided to seek and extract features from attribute-specific visual regions, enabling precise subclass discrimination. Furthermore, we introduce Subclass Separable Matching (SSM), a novel matching strategy that incorporates a repulsive force to enhance inter-subclass separability during label assignment. W2-Net significantly outperforms the state-of-the-art on the REC-8K dataset, reducing counting error by 22.5% (validation) and 18.0% (test), and improving localization F1 by 7% and 8%, respectively. Code will be available.
☆ Adaptive Knowledge Transferring with Switching Dual-Student Framework for Semi-Supervised Medical Image Segmentation
Teacher-student frameworks have emerged as a leading approach in semi-supervised medical image segmentation, demonstrating strong performance across various tasks. However, the learning effects are still limited by the strong correlation and unreliable knowledge transfer process between teacher and student networks. To overcome this limitation, we introduce a novel switching Dual-Student architecture that strategically selects the most reliable student at each iteration to enhance dual-student collaboration and prevent error reinforcement. We also introduce a strategy of Loss-Aware Exponential Moving Average to dynamically ensure that the teacher absorbs meaningful information from students, improving the quality of pseudo-labels. Our plug-and-play framework is extensively evaluated on 3D medical image segmentation datasets, where it outperforms state-of-the-art semi-supervised methods, demonstrating its effectiveness in improving segmentation accuracy under limited supervision.
comment: The paper is under review at Pattern Recognition Journal
☆ NVSim: Novel View Synthesis Simulator for Large Scale Indoor Navigation
We present NVSim, a framework that automatically constructs large-scale, navigable indoor simulators from only common image sequences, overcoming the cost and scalability limitations of traditional 3D scanning. Our approach adapts 3D Gaussian Splatting to address visual artifacts on sparsely observed floors a common issue in robotic traversal data. We introduce Floor-Aware Gaussian Splatting to ensure a clean, navigable ground plane, and a novel mesh-free traversability checking algorithm that constructs a topological graph by directly analyzing rendered views. We demonstrate our system's ability to generate valid, large-scale navigation graphs from real-world data. A video demonstration is avilable at https://youtu.be/tTiIQt6nXC8
comment: 9 pages, 10 figures
☆ Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes
Purpose: Surgical scene understanding is key to advancing computer-aided and intelligent surgical systems. Current approaches predominantly rely on visual data or end-to-end learning, which limits fine-grained contextual modeling. This work aims to enhance surgical scene representations by integrating 3D acoustic information, enabling temporally and spatially aware multimodal understanding of surgical environments. Methods: We propose a novel framework for generating 4D audio-visual representations of surgical scenes by projecting acoustic localization information from a phased microphone array onto dynamic point clouds from an RGB-D camera. A transformer-based acoustic event detection module identifies relevant temporal segments containing tool-tissue interactions which are spatially localized in the audio-visual scene representation. The system was experimentally evaluated in a realistic operating room setup during simulated surgical procedures performed by experts. Results: The proposed method successfully localizes surgical acoustic events in 3D space and associates them with visual scene elements. Experimental evaluation demonstrates accurate spatial sound localization and robust fusion of multimodal data, providing a comprehensive, dynamic representation of surgical activity. Conclusion: This work introduces the first approach for spatial sound localization in dynamic surgical scenes, marking a significant advancement toward multimodal surgical scene representations. By integrating acoustic and visual data, the proposed framework enables richer contextual understanding and provides a foundation for future intelligent and autonomous surgical systems.
☆ What do vision-language models see in the context? Investigating multimodal in-context learning
In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. Although it has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic study of ICL in VLMs, evaluating seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL. To our knowledge, we are the first to analyze how attention patterns in VLMs vary with an increasing number of in-context demonstrations. Our results reveal that training on imag-text interleaved data enhances ICL performance but does not imply effective integration of visual and textual information from demonstration examples. In contrast, instruction tuning improves instruction-following but can reduce reliance on in-context demonstrations, suggesting a trade-off between instruction alignment and in-context adaptation. Attention analyses further show that current VLMs primarily focus on textual cues and fail to leverage visual information, suggesting a limited capacity for multimodal integration. These findings highlight key limitations in the ICL abilities of current VLMs and provide insights for enhancing their ability to learn from multimodal in-context examples.
☆ Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning
Remote sensing applications increasingly rely on deep learning for scene classification. However, their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor domains. While recent vision-language models like CLIP have shown promise by learning transferable representations at scale by aligning visual and textual modalities, their direct application to remote sensing remains suboptimal due to significant domain gaps and the need for task-specific semantic adaptation. To address this critical challenge, we systematically explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We evaluate several representative methods, including Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints. These approaches reflect complementary design philosophies: from static context optimization to conditional prompts for enhanced generalization, multi-modal prompts for joint vision-language adaptation, and semantically regularized prompts for stable learning without forgetting. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Through extensive experiments on multiple benchmark remote sensing datasets, including cross-dataset generalization tests, we demonstrate that prompt learning consistently outperforms both baselines in few-shot scenarios. Notably, Prompting with Self-Regulating Constraints achieves the most robust cross-domain performance. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery, providing a strong foundation for future research in this field.
☆ ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7% on seven comprehensive benchmarks spanning various tasks and up to 6.0% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough to developing more autonomous and capable VLMs.
☆ Training-free Source Attribution of AI-generated Images via Resynthesis
Synthetic image source attribution is a challenging task, especially in data scarcity conditions requiring few-shot or zero-shot classification capabilities. We present a new training-free one-shot attribution method based on image resynthesis. A prompt describing the image under analysis is generated, then it is used to resynthesize the image with all the candidate sources. The image is attributed to the model which produced the resynthesis closest to the original image in a proper feature space. We also introduce a new dataset for synthetic image attribution consisting of face images from commercial and open-source text-to-image generators. The dataset provides a challenging attribution framework, useful for developing new attribution models and testing their capabilities on different generative architectures. The dataset structure allows to test approaches based on resynthesis and to compare them to few-shot methods. Results from state-of-the-art few-shot approaches and other baselines show that the proposed resynthesis method outperforms existing techniques when only a few samples are available for training or fine-tuning. The experiments also demonstrate that the new dataset is a challenging one and represents a valuable benchmark for developing and evaluating future few-shot and zero-shot methods.
comment: 14 pages, 4 figures, 1 table, accepted at "The 17th IEEE INTERNATIONAL WORKSHOP ON INFORMATION FORENSICS AND SECURITY (WIFS2025)", Perth, Australia
☆ DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation NeurIPS 2025
Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
comment: Accepted to NeurIPS 2025
☆ UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation NeurIPS 2025
Data augmentation using generative models has emerged as a powerful paradigm for enhancing performance in computer vision tasks. However, most existing augmentation approaches primarily focus on optimizing intrinsic data attributes -- such as fidelity and diversity -- to generate visually high-quality synthetic data, while often neglecting task-specific requirements. Yet, it is essential for data generators to account for the needs of downstream tasks, as training data requirements can vary significantly across different tasks and network architectures. To address these limitations, we propose UtilGen, a novel utility-centric data augmentation framework that adaptively optimizes the data generation process to produce task-specific, high-utility training data via downstream task feedback. Specifically, we first introduce a weight allocation network to evaluate the task-specific utility of each synthetic sample. Guided by these evaluations, UtilGen iteratively refines the data generation process using a dual-level optimization strategy to maximize the synthetic data utility: (1) model-level optimization tailors the generative model to the downstream task, and (2) instance-level optimization adjusts generation policies -- such as prompt embeddings and initial noise -- at each generation round. Extensive experiments on eight benchmark datasets of varying complexity and granularity demonstrate that UtilGen consistently achieves superior performance, with an average accuracy improvement of 3.87% over previous SOTA. Further analysis of data influence and distribution reveals that UtilGen produces more impactful and task-relevant synthetic data, validating the effectiveness of the paradigm shift from visual characteristics-centric to task utility-centric data augmentation.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ DeshadowMamba: Deshadowing as 1D Sequential Similarity
Recent deep models for image shadow removal often rely on attention-based architectures to capture long-range dependencies. However, their fixed attention patterns tend to mix illumination cues from irrelevant regions, leading to distorted structures and inconsistent colors. In this work, we revisit shadow removal from a sequence modeling perspective and explore the use of Mamba, a selective state space model that propagates global context through directional state transitions. These transitions yield an efficient global receptive field while preserving positional continuity. Despite its potential, directly applying Mamba to image data is suboptimal, since it lacks awareness of shadow-non-shadow semantics and remains susceptible to color interference from nearby regions. To address these limitations, we propose CrossGate, a directional modulation mechanism that injects shadow-aware similarity into Mamba's input gate, allowing selective integration of relevant context along transition axes. To further ensure appearance fidelity, we introduce ColorShift regularization, a contrastive learning objective driven by global color statistics. By synthesizing structured informative negatives, it guides the model to suppress color contamination and achieve robust color restoration. Together, these components adapt sequence modeling to the structural integrity and chromatic consistency required for shadow removal. Extensive experiments on public benchmarks demonstrate that DeshadowMamba achieves state-of-the-art visual quality and strong quantitative performance.
☆ Delving into Cascaded Instability: A Lipschitz Continuity View on Image Restoration and Object Detection Synergy NeurIPS 2025
To improve detection robustness in adverse conditions (e.g., haze and low light), image restoration is commonly applied as a pre-processing step to enhance image quality for the detector. However, the functional mismatch between restoration and detection networks can introduce instability and hinder effective integration -- an issue that remains underexplored. We revisit this limitation through the lens of Lipschitz continuity, analyzing the functional differences between restoration and detection networks in both the input space and the parameter space. Our analysis shows that restoration networks perform smooth, continuous transformations, while object detectors operate with discontinuous decision boundaries, making them highly sensitive to minor perturbations. This mismatch introduces instability in traditional cascade frameworks, where even imperceptible noise from restoration is amplified during detection, disrupting gradient flow and hindering optimization. To address this, we propose Lipschitz-regularized object detection (LROD), a simple yet effective framework that integrates image restoration directly into the detector's feature learning, harmonizing the Lipschitz continuity of both tasks during training. We implement this framework as Lipschitz-regularized YOLO (LR-YOLO), extending seamlessly to existing YOLO detectors. Extensive experiments on haze and low-light benchmarks demonstrate that LR-YOLO consistently improves detection stability, optimization smoothness, and overall accuracy.
comment: NeurIPS 2025
☆ Benchmarking Microsaccade Recognition with Event Cameras: A Novel Dataset and Evaluation BMVC
Microsaccades are small, involuntary eye movements vital for visual perception and neural processing. Traditional microsaccade studies typically use eye trackers or frame-based analysis, which, while precise, are costly and limited in scalability and temporal resolution. Event-based sensing offers a high-speed, low-latency alternative by capturing fine-grained spatiotemporal changes efficiently. This work introduces a pioneering event-based microsaccade dataset to support research on small eye movement dynamics in cognitive computing. Using Blender, we render high-fidelity eye movement scenarios and simulate microsaccades with angular displacements from 0.5 to 2.0 degrees, divided into seven distinct classes. These are converted to event streams using v2e, preserving the natural temporal dynamics of microsaccades, with durations ranging from 0.25 ms to 2.25 ms. We evaluate the dataset using Spiking-VGG11, Spiking-VGG13, and Spiking-VGG16, and propose Spiking-VGG16Flow, an optical-flow-enhanced variant implemented in SpikingJelly. The models achieve around 90 percent average accuracy, successfully classifying microsaccades by angular displacement, independent of event count or duration. These results demonstrate the potential of spiking neural networks for fine motion recognition and establish a benchmark for event-based vision research. The dataset, code, and trained models will be publicly available at https://waseemshariff126.github.io/microsaccades/ .
comment: Accepted in British Machine Vision Conference (BMVC) 2025, Main Conference
☆ SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs NeurIPS 2025
Multimodal Large Language Models (MLLMs) typically process a large number of visual tokens, leading to considerable computational overhead, even though many of these tokens are redundant. Existing visual token pruning methods primarily focus on selecting the most salient tokens based on attention scores, resulting in the semantic incompleteness of the selected tokens. In this paper, we propose a novel visual token pruning strategy, called \textbf{S}aliency-\textbf{C}overage \textbf{O}riented token \textbf{P}runing for \textbf{E}fficient MLLMs (SCOPE), to jointly model both the saliency and coverage of the selected visual tokens to better preserve semantic completeness. Specifically, we introduce a set-coverage for a given set of selected tokens, computed based on the token relationships. We then define a token-coverage gain for each unselected token, quantifying how much additional coverage would be obtained by including it. By integrating the saliency score into the token-coverage gain, we propose our SCOPE score and iteratively select the token with the highest SCOPE score. We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-Next models. Experimental results demonstrate that our method consistently outperforms prior approaches. Our code is available at \href{https://github.com/kinredon/SCOPE}{https://github.com/kinredon/SCOPE}.
comment: NeurIPS 2025
☆ Beyond Inference Intervention: Identity-Decoupled Diffusion for Face Anonymization
Face anonymization aims to conceal identity information while preserving non-identity attributes. Mainstream diffusion models rely on inference-time interventions such as negative guidance or energy-based optimization, which are applied post-training to suppress identity features. These interventions often introduce distribution shifts and entangle identity with non-identity attributes, degrading visual fidelity and data utility. To address this, we propose \textbf{ID\textsuperscript{2}Face}, a training-centric anonymization framework that removes the need for inference-time optimization. The rationale of our method is to learn a structured latent space where identity and non-identity information are explicitly disentangled, enabling direct and controllable anonymization at inference. To this end, we design a conditional diffusion model with an identity-masked learning scheme. An Identity-Decoupled Latent Recomposer uses an Identity Variational Autoencoder to model identity features, while non-identity attributes are extracted from same-identity pairs and aligned through bidirectional latent alignment. An Identity-Guided Latent Harmonizer then fuses these representations via soft-gating conditioned on noisy feature prediction. The model is trained with a recomposition-based reconstruction loss to enforce disentanglement. At inference, anonymization is achieved by sampling a random identity vector from the learned identity space. To further suppress identity leakage, we introduce an Orthogonal Identity Mapping strategy that enforces orthogonality between sampled and source identity vectors. Experiments demonstrate that ID\textsuperscript{2}Face outperforms existing methods in visual quality, identity suppression, and utility preservation.
☆ MC-SJD : Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration
While autoregressive (AR) modeling has recently emerged as a new paradigm in visual generation, its practical adoption is severely constrained by the slow inference speed of per-token generation, which often requires thousands of steps to produce a single sample. To address this challenge, we propose MC-SJD, a training-free, lossless parallel decoding framework designed to accelerate AR visual generation by extending the recently introduced Speculative Jacobi Decoding (SJD). Although SJD shows strong potential for accelerating AR generation, we demonstrate that token instability across iterations significantly reduces the acceptance rate, a limitation that primarily arises from the independent sampling process used during draft token generation. To overcome this, we introduce MC-SJD, an information-theoretic approach based on coupling, which substantially accelerates standard SJD by maximizing the probability of sampling identical draft tokens across consecutive iterations, all while preserving its lossless property. Remarkably, this method requires only a single-line modification to the existing algorithm, yet achieves substantial performance gains, delivering up to a ~4.2x acceleration in image generation and ~13.3x acceleration in video generation compared to standard AR decoding, without any degradation in output quality.
☆ CLFSeg: A Fuzzy-Logic based Solution for Boundary Clarity and Uncertainty Reduction in Medical Image Segmentation BMVC
Accurate polyp and cardiac segmentation for early detection and treatment is essential for the diagnosis and treatment planning of cancer-like diseases. Traditional convolutional neural network (CNN) based models have represented limited generalizability, robustness, and inability to handle uncertainty, which affects the segmentation performance. To solve these problems, this paper introduces CLFSeg, an encoder-decoder based framework that aggregates the Fuzzy-Convolutional (FC) module leveraging convolutional layers and fuzzy logic. This module enhances the segmentation performance by identifying local and global features while minimizing the uncertainty, noise, and ambiguity in boundary regions, ensuring computing efficiency. In order to handle class imbalance problem while focusing on the areas of interest with tiny and boundary regions, binary cross-entropy (BCE) with dice loss is incorporated. Our proposed model exhibits exceptional performance on four publicly available datasets, including CVC-ColonDB, CVC-ClinicDB, EtisLaribPolypDB, and ACDC. Extensive experiments and visual studies show CLFSeg surpasses the existing SOTA performance and focuses on relevant regions of interest in anatomical structures. The proposed CLFSeg improves performance while ensuring computing efficiency, which makes it a potential solution for real-world medical diagnostic scenarios. Project page is available at https://visdomlab.github.io/CLFSeg/
comment: The 36th British Machine Vision Conference (BMVC) 2025
☆ Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2 NeurIPS 2025
Recent studies reveal the vulnerability of the image segmentation foundation model SAM to adversarial examples. Its successor, SAM2, has attracted significant attention due to its strong generalization capability in video segmentation. However, its robustness remains unexplored, and it is unclear whether existing attacks on SAM can be directly transferred to SAM2. In this paper, we first analyze the performance gap of existing attacks between SAM and SAM2 and highlight two key challenges arising from their architectural differences: directional guidance from the prompt and semantic entanglement across consecutive frames. To address these issues, we propose UAP-SAM2, the first cross-prompt universal adversarial attack against SAM2 driven by dual semantic deviation. For cross-prompt transferability, we begin by designing a target-scanning strategy that divides each frame into k regions, each randomly assigned a prompt, to reduce prompt dependency during optimization. For effectiveness, we design a dual semantic deviation framework that optimizes a UAP by distorting the semantics within the current frame and disrupting the semantic consistency across consecutive frames. Extensive experiments on six datasets across two segmentation tasks demonstrate the effectiveness of the proposed method for SAM2. The comparative results show that UAP-SAM2 significantly outperforms state-of-the-art (SOTA) attacks by a large margin.
comment: Accepted by NeurIPS 2025
☆ Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning IROS 2025
This technical report presents our solution for the RoboSense Challenge at IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components. First, a Mixture-of-Prompts router classifies questions and dispatches them to task-specific expert prompts, eliminating interference across diverse question types. Second, task-specific prompts embed explicit coordinate systems, spatial reasoning rules, role-playing, Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples tailored to each task. Third, a visual assembly module composes multi-view images with object crops, magenta markers, and adaptive historical frames based on question requirements. Fourth, we configure model inference parameters (temperature, top-p, message roles) per task to optimize output quality. Implemented on Qwen2.5-VL-72B, our approach achieves 70.87% average accuracy on Phase-1 (clean data) and 72.85% on Phase-2 (corrupted data), demonstrating that structured prompting and spatial grounding substantially enhance VLM performance on safety-critical autonomous driving tasks. Code and prompt are available at https://github.com/wuaodi/UCAS-CSU-phase2.
comment: RoboSense Challenge with IROS 2025
☆ MSRANetV2: An Explainable Deep Learning Architecture for Multi-class Classification of Colorectal Histopathological Images
Colorectal cancer (CRC) is a leading worldwide cause of cancer-related mortality, and the role of prompt precise detection is of paramount interest in improving patient outcomes. Conventional diagnostic methods such as colonoscopy and histological examination routinely exhibit subjectivity, are extremely time-consuming, and are susceptible to variation. Through the development of digital pathology, deep learning algorithms have become a powerful approach in enhancing diagnostic precision and efficiency. In our work, we proposed a convolutional neural network architecture named MSRANetV2, specially optimized for the classification of colorectal tissue images. The model employs a ResNet50V2 backbone, extended with residual attention mechanisms and squeeze-and-excitation (SE) blocks, to extract deep semantic and fine-grained spatial features. With channel alignment and upsampling operations, MSRANetV2 effectively fuses multi-scale representations, thereby enhancing the robustness of the classification. We evaluated our model on a five-fold stratified cross-validation strategy on two publicly available datasets: CRC-VAL-HE-7K and NCT-CRC-HE-100K. The proposed model achieved remarkable average Precision, recall, F1-score, AUC, and test accuracy were 0.9884 plus-minus 0.0151, 0.9900 plus-minus 0.0151, 0.9900 plus-minus 0.0145, 0.9999 plus-minus 0.00006, and 0.9905 plus-minus 0.0025 on the 7K dataset. On the 100K dataset, they were 0.9904 plus-minus 0.0091, 0.9900 plus-minus 0.0071, 0.9900 plus-minus 0.0071, 0.9997 plus-minus 0.00016, and 0.9902 plus-minus 0.0006. Additionally, Grad-CAM visualizations were incorporated to enhance model interpretability by highlighting tissue areas that are medically relevant. These findings validate that MSRANetV2 is a reliable, interpretable, and high-performing architectural model for classifying CRC tissues.
☆ VC4VG: Optimizing Video Captions for Text-to-Video Generation EMNLP 2025
Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models.We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements.Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/qyr0403/VC4VG to support further research.
comment: Accepted by EMNLP 2025
☆ Compositional Image Synthesis with Inference-Time Scaling
Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where a object-centric vision-language model (VLM) judge reranks multiple candidates to select the most prompt-aligned outcome iteratively. By unifying explicit layout-grounding with self-refine-based inference-time scaling, our framework achieves stronger scene alignment with prompts compared to recent text-to-image models. The code are available at https://github.com/gcl-inha/ReFocus.
comment: projcet page: https://github.com/gcl-inha/ReFocus
☆ ETC: training-free diffusion models acceleration with Error-aware Trend Consistency
Diffusion models have achieved remarkable generative quality but remain bottlenecked by costly iterative sampling. Recent training-free methods accelerate diffusion process by reusing model outputs. However, these methods ignore denoising trends and lack error control for model-specific tolerance, leading to trajectory deviations under multi-step reuse and exacerbating inconsistencies in the generated results. To address these issues, we introduce Error-aware Trend Consistency (ETC), a framework that (1) introduces a consistent trend predictor that leverages the smooth continuity of diffusion trajectories, projecting historical denoising patterns into stable future directions and progressively distributing them across multiple approximation steps to achieve acceleration without deviating; (2) proposes a model-specific error tolerance search mechanism that derives corrective thresholds by identifying transition points from volatile semantic planning to stable quality refinement. Experiments show that ETC achieves a 2.65x acceleration over FLUX with negligible (-0.074 SSIM score) degradation of consistency.
comment: 17 pages, 10 figures
☆ DogMo: A Large-Scale Multi-View RGB-D Dataset for 4D Canine Motion Recovery
We present DogMo, a large-scale multi-view RGB-D video dataset capturing diverse canine movements for the task of motion recovery from images. DogMo comprises 1.2k motion sequences collected from 10 unique dogs, offering rich variation in both motion and breed. It addresses key limitations of existing dog motion datasets, including the lack of multi-view and real 3D data, as well as limited scale and diversity. Leveraging DogMo, we establish four motion recovery benchmark settings that support systematic evaluation across monocular and multi-view, RGB and RGB-D inputs. To facilitate accurate motion recovery, we further introduce a three-stage, instance-specific optimization pipeline that fits the SMAL model to the motion sequences. Our method progressively refines body shape and pose through coarse alignment, dense correspondence supervision, and temporal regularization. Our dataset and method provide a principled foundation for advancing research in dog motion recovery and open up new directions at the intersection of computer vision, computer graphics, and animal behavior modeling.
comment: 19 pages
☆ UHKD: A Unified Framework for Heterogeneous Knowledge Distillation via Frequency-Domain Representations
Knowledge distillation (KD) is an effective model compression technique that transfers knowledge from a high-performance teacher to a lightweight student, reducing cost while maintaining accuracy. In visual applications, where large-scale image models are widely used, KD enables efficient deployment. However, architectural diversity introduces semantic discrepancies that hinder the use of intermediate representations. Most existing KD methods are designed for homogeneous models and degrade in heterogeneous scenarios, especially when intermediate features are involved. Prior studies mainly focus on the logits space, making limited use of the semantic information in intermediate layers. To address this limitation, Unified Heterogeneous Knowledge Distillation (UHKD) is proposed as a framework that leverages intermediate features in the frequency domain for cross-architecture transfer. Fourier transform is applied to capture global feature information, alleviating representational discrepancies between heterogeneous teacher-student pairs. A Feature Transformation Module (FTM) produces compact frequency-domain representations of teacher features, while a learnable Feature Alignment Module (FAM) projects student features and aligns them via multi-level matching. Training is guided by a joint objective combining mean squared error on intermediate features with Kullback-Leibler divergence on logits. Experiments on CIFAR-100 and ImageNet-1K demonstrate gains of 5.59% and 0.83% over the latest method, highlighting UHKD as an effective approach for unifying heterogeneous representations and enabling efficient utilization of visual knowledge
comment: 14 pages, 4 figures
☆ ZTRS: Zero-Imitation End-to-end Autonomous Driving with Trajectory Scoring
End-to-end autonomous driving maps raw sensor inputs directly into ego-vehicle trajectories to avoid cascading errors from perception modules and to leverage rich semantic cues. Existing frameworks largely rely on Imitation Learning (IL), which can be limited by sub-optimal expert demonstrations and covariate shift during deployment. On the other hand, Reinforcement Learning (RL) has recently shown potential in scaling up with simulations, but is typically confined to low-dimensional symbolic inputs (e.g. 3D objects and maps), falling short of full end-to-end learning from raw sensor data. We introduce ZTRS (Zero-Imitation End-to-End Autonomous Driving with Trajectory Scoring), a framework that combines the strengths of both worlds: sensor inputs without losing information and RL training for robust planning. To the best of our knowledge, ZTRS is the first framework that eliminates IL entirely by only learning from rewards while operating directly on high-dimensional sensor data. ZTRS utilizes offline reinforcement learning with our proposed Exhaustive Policy Optimization (EPO), a variant of policy gradient tailored for enumerable actions and rewards. ZTRS demonstrates strong performance across three benchmarks: Navtest (generic real-world open-loop planning), Navhard (open-loop planning in challenging real-world and synthetic scenarios), and HUGSIM (simulated closed-loop driving). Specifically, ZTRS achieves the state-of-the-art result on Navhard and outperforms IL-based baselines on HUGSIM. Code will be available at https://github.com/woxihuanjiangguo/ZTRS.
☆ Enhancing Pre-trained Representation Classifiability can Boost its Interpretability ICLR 2025
The visual representation of a pre-trained model prioritizes the classifiability on downstream tasks, while the widespread applications for pre-trained visual models have posed new requirements for representation interpretability. However, it remains unclear whether the pre-trained representations can achieve high interpretability and classifiability simultaneously. To answer this question, we quantify the representation interpretability by leveraging its correlation with the ratio of interpretable semantics within the representations. Given the pre-trained representations, only the interpretable semantics can be captured by interpretations, whereas the uninterpretable part leads to information loss. Based on this fact, we propose the Inherent Interpretability Score (IIS) that evaluates the information loss, measures the ratio of interpretable semantics, and quantifies the representation interpretability. In the evaluation of the representation interpretability with different classifiability, we surprisingly discover that the interpretability and classifiability are positively correlated, i.e., representations with higher classifiability provide more interpretable semantics that can be captured in the interpretations. This observation further supports two benefits to the pre-trained representations. First, the classifiability of representations can be further improved by fine-tuning with interpretability maximization. Second, with the classifiability improvement for the representations, we obtain predictions based on their interpretations with less accuracy degradation. The discovered positive correlation and corresponding applications show that practitioners can unify the improvements in interpretability and classifiability for pre-trained vision models. Codes are available at https://github.com/ssfgunner/IIS.
comment: ICLR 2025 (Spotlight)
☆ OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation
Recent advancements in diffusion-based text synthesis have demonstrated significant performance in inserting and editing text within images via inpainting. However, despite the potential of text inpainting methods, three key limitations hinder their applicability to broader Text Image Manipulation (TIM) tasks: (i) the inability to remove text, (ii) the lack of control over the style of rendered text, and (iii) a tendency to generate duplicated letters. To address these challenges, we propose OmniText, a training-free generalist capable of performing a wide range of TIM tasks. Specifically, we investigate two key properties of cross- and self-attention mechanisms to enable text removal and to provide control over both text styles and content. Our findings reveal that text removal can be achieved by applying self-attention inversion, which mitigates the model's tendency to focus on surrounding text, thus reducing text hallucinations. Additionally, we redistribute cross-attention, as increasing the probability of certain text tokens reduces text hallucination. For controllable inpainting, we introduce novel loss functions in a latent optimization framework: a cross-attention content loss to improve text rendering accuracy and a self-attention style loss to facilitate style customization. Furthermore, we present OmniText-Bench, a benchmark dataset for evaluating diverse TIM tasks. It includes input images, target text with masks, and style references, covering diverse applications such as text removal, rescaling, repositioning, and insertion and editing with various styles. Our OmniText framework is the first generalist method capable of performing diverse TIM tasks. It achieves state-of-the-art performance across multiple tasks and metrics compared to other text inpainting methods and is comparable with specialist methods.
comment: The first two authors contributed equally to this work. The last two authors are co-corresponding authors
☆ Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating effective synthetic training data for classification remains challenging. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data; however, it may also cause overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification. Given a small set of real examples, we first extract class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, preserves the T2I model's generative prior, reduces estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets show that our method achieves state-of-the-art performance in low-shot fine-grained classification when augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with five real images augmented with 100 synthetic images). In three of the four benchmarks, fine-tuning downstream models with 5 real images augmented with BOB achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with 2+% accuracy improvements in 14 of these settings.
☆ Enhancing CLIP Robustness via Cross-Modality Alignment NeurIPS 2025
Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization; they often overlook the gaps in CLIP's encoded features, which is shown as the text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose Cross-modality Alignment, dubbed COLA, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space. (1) COLA first projects adversarial image embeddings onto a subspace spanned by class text features, effectively filtering out non-semantic distortions while preserving discriminative information. (2) It then models images and texts as discrete distributions over multiple augmented views and refines their alignment via OT, with the subspace projection seamlessly integrated into the cost computation. This design ensures stable cross-modal alignment even under adversarial conditions. COLA is training-free and compatible with existing fine-tuned models. Extensive evaluations across 14 zero-shot classification benchmarks demonstrate the effectiveness of COLA, especially with an average improvement of 6.7% on ImageNet and its variants under PGD adversarial attacks, while maintaining high accuracy on clean samples.
comment: NeurIPS 2025 Spotlight
☆ Kernelized Sparse Fine-Tuning with Bi-level Parameter Competition for Vision Models
Parameter-efficient fine-tuning (PEFT) aims to adapt pre-trained vision models to downstream tasks. Among PEFT paradigms, sparse tuning achieves remarkable performance by adjusting only the weights most relevant to downstream tasks, rather than densely tuning the entire weight matrix. Current methods follow a two-stage paradigm. First, it locates task-relevant weights by gradient information, which overlooks the parameter adjustments during fine-tuning and limits the performance. Second, it updates only the located weights by applying a sparse mask to the gradient of the weight matrix, which results in high memory usage due to the storage of all weight matrices in the optimizer. In this paper, we propose a one-stage method named SNELLA to overcome the above limitations. For memory usage, SNELLA selectively updates the weight matrix by adding it to another sparse matrix that is merged by two low-rank learnable matrices. We extend the low-rank decomposition by introducing nonlinear kernel functions, thereby increasing the rank of the resulting merged matrix to prevent the interdependency among weight updates, enabling better adaptation to downstream tasks. For locating task-relevant weights, we propose an adaptive bi-level sparsity allocation mechanism that encourages weights to compete across and inside layers based on their importance scores in an end-to-end manner. Extensive experiments are conducted on classification, segmentation, and generation tasks using different pre-trained vision models. The results show that SNELLA achieves SOTA performance with low memory usage. Notably, SNELLA obtains 1.8% (91.9% v.s. 90.1%) higher Top-1 accuracy on the FGVC benchmark compared to SPT-LoRA. Compared to previous methods, SNELLA achieves a memory reduction of 31.1%-39.9% across models with parameter scales from 86M to 632M. Our source codes are available at https://github.com/ssfgunner/SNELL.
☆ ResNet: Enabling Deep Convolutional Neural Networks through Residual Learning
Convolutional Neural Networks (CNNs) has revolutionized computer vision, but training very deep networks has been challenging due to the vanishing gradient problem. This paper explores Residual Networks (ResNet), introduced by He et al. (2015), which overcomes this limitation by using skip connections. ResNet enables the training of networks with hundreds of layers by allowing gradients to flow directly through shortcut connections that bypass intermediate layers. In our implementation on the CIFAR-10 dataset, ResNet-18 achieves 89.9% accuracy compared to 84.1% for a traditional deep CNN of similar depth, while also converging faster and training more stably.
comment: 3 pages, 5 figures, 1 table
☆ AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts ICCV 2025
Despite rapid advancements in text-to-image (T2I) models, their safety mechanisms are vulnerable to adversarial prompts, which maliciously generate unsafe images. Current red-teaming methods for proactively assessing such vulnerabilities usually require white-box access to T2I models, and rely on inefficient per-prompt optimization, as well as inevitably generate semantically meaningless prompts easily blocked by filters. In this paper, we propose APT (AutoPrompT), a black-box framework that leverages large language models (LLMs) to automatically generate human-readable adversarial suffixes for benign prompts. We first introduce an alternating optimization-finetuning pipeline between adversarial suffix optimization and fine-tuning the LLM utilizing the optimized suffix. Furthermore, we integrates a dual-evasion strategy in optimization phase, enabling the bypass of both perplexity-based filter and blacklist word filter: (1) we constrain the LLM generating human-readable prompts through an auxiliary LLM perplexity scoring, which starkly contrasts with prior token-level gibberish, and (2) we also introduce banned-token penalties to suppress the explicit generation of banned-tokens in blacklist. Extensive experiments demonstrate the excellent red-teaming performance of our human-readable, filter-resistant adversarial prompts, as well as superior zero-shot transferability which enables instant adaptation to unseen prompts and exposes critical vulnerabilities even in commercial APIs (e.g., Leonardo.Ai.).
comment: Accepted by ICCV 2025
☆ Listening without Looking: Modality Bias in Audio-Visual Captioning
Audio-visual captioning aims to generate holistic scene descriptions by jointly modeling sound and vision. While recent methods have improved performance through sophisticated modality fusion, it remains unclear to what extent the two modalities are complementary in current audio-visual captioning models and how robust these models are when one modality is degraded. We address these questions by conducting systematic modality robustness tests on LAVCap, a state-of-the-art audio-visual captioning model, in which we selectively suppress or corrupt the audio or visual streams to quantify sensitivity and complementarity. The analysis reveals a pronounced bias toward the audio stream in LAVCap. To evaluate how balanced audio-visual captioning models are in their use of both modalities, we augment AudioCaps with textual annotations that jointly describe the audio and visual streams, yielding the AudioVisualCaps dataset. In our experiments, we report LAVCap baseline results on AudioVisualCaps. We also evaluate the model under modality robustness tests on AudioVisualCaps and the results indicate that LAVCap trained on AudioVisualCaps exhibits less modality bias than when trained on AudioCaps.
comment: under review
☆ Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks NeurIPS 2025
Foundation models have enabled rapid progress across many specialized domains by leveraging large-scale pre-training on unlabeled data, demonstrating strong generalization to a variety of downstream tasks. While such models have gained significant attention in fields like Earth Observation, their application to Mars science remains limited. A key enabler of progress in other domains has been the availability of standardized benchmarks that support systematic evaluation. In contrast, Mars science lacks such benchmarks and standardized evaluation frameworks, which have limited progress toward developing foundation models for Martian tasks. To address this gap, we introduce Mars-Bench, the first benchmark designed to systematically evaluate models across a broad range of Mars-related tasks using both orbital and surface imagery. Mars-Bench comprises 20 datasets spanning classification, segmentation, and object detection, focused on key geologic features such as craters, cones, boulders, and frost. We provide standardized, ready-to-use datasets and baseline evaluations using models pre-trained on natural images, Earth satellite data, and state-of-the-art vision-language models. Results from all analyses suggest that Mars-specific foundation models may offer advantages over general-domain counterparts, motivating further exploration of domain-adapted pre-training. Mars-Bench aims to establish a standardized foundation for developing and comparing machine learning models for Mars science. Our data, models, and code are available at: https://mars-bench.github.io/.
comment: Accepted at NeurIPS 2025
☆ Towards the Automatic Segmentation, Modeling and Meshing of the Aortic Vessel Tree from Multicenter Acquisitions: An Overview of the SEG.A. 2023 Segmentation of the Aorta Challenge
The automated analysis of the aortic vessel tree (AVT) from computed tomography angiography (CTA) holds immense clinical potential, but its development has been impeded by a lack of shared, high-quality data. We launched the SEG.A. challenge to catalyze progress in this field by introducing a large, publicly available, multi-institutional dataset for AVT segmentation. The challenge benchmarked automated algorithms on a hidden test set, with subsequent optional tasks in surface meshing for computational simulations. Our findings reveal a clear convergence on deep learning methodologies, with 3D U-Net architectures dominating the top submissions. A key result was that an ensemble of the highest-ranking algorithms significantly outperformed individual models, highlighting the benefits of model fusion. Performance was strongly linked to algorithmic design, particularly the use of customized post-processing steps, and the characteristics of the training data. This initiative not only establishes a new performance benchmark but also provides a lasting resource to drive future innovation toward robust, clinically translatable tools.
☆ AdvBlur: Adversarial Blur for Robust Diabetic Retinopathy Classification and Cross-Domain Generalization
Diabetic retinopathy (DR) is a leading cause of vision loss worldwide, yet early and accurate detection can significantly improve treatment outcomes. While numerous Deep learning (DL) models have been developed to predict DR from fundus images, many face challenges in maintaining robustness due to distributional variations caused by differences in acquisition devices, demographic disparities, and imaging conditions. This paper addresses this critical limitation by proposing a novel DR classification approach, a method called AdvBlur. Our method integrates adversarial blurred images into the dataset and employs a dual-loss function framework to address domain generalization. This approach effectively mitigates the impact of unseen distributional variations, as evidenced by comprehensive evaluations across multiple datasets. Additionally, we conduct extensive experiments to explore the effects of factors such as camera type, low-quality images, and dataset size. Furthermore, we perform ablation studies on blurred images and the loss function to ensure the validity of our choices. The experimental results demonstrate the effectiveness of our proposed method, achieving competitive performance compared to state-of-the-art domain generalization DR models on unseen external datasets.
☆ TeleEgo: Benchmarking Egocentric AI Assistants in the Wild
Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce \textbf{TeleEgo}, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work \& study, lifestyle \& routines, social activities, and outings \& culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement.TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose two key metrics -- Real-Time Accuracy and Memory Persistence Time -- to jointly assess correctness, temporal responsiveness, and long-term retention. TeleEgo provides a realistic and comprehensive evaluation to advance the development of practical AI assistants.
☆ Efficient Cost-and-Quality Controllable Arbitrary-scale Super-resolution with Fourier Constraints
Cost-and-Quality (CQ) controllability in arbitrary-scale super-resolution is crucial. Existing methods predict Fourier components one by one using a recurrent neural network. However, this approach leads to performance degradation and inefficiency due to independent prediction. This paper proposes predicting multiple components jointly to improve both quality and efficiency.
comment: 9 pages
☆ Synergistic Neural Forecasting of Air Pollution with Stochastic Sampling
Air pollution remains a leading global health and environmental risk, particularly in regions vulnerable to episodic air pollution spikes due to wildfires, urban haze and dust storms. Accurate forecasting of particulate matter (PM) concentrations is essential to enable timely public health warnings and interventions, yet existing models often underestimate rare but hazardous pollution events. Here, we present SynCast, a high-resolution neural forecasting model that integrates meteorological and air composition data to improve predictions of both average and extreme pollution levels. Built on a regionally adapted transformer backbone and enhanced with a diffusion-based stochastic refinement module, SynCast captures the nonlinear dynamics driving PM spikes more accurately than existing approaches. Leveraging on harmonized ERA5 and CAMS datasets, our model shows substantial gains in forecasting fidelity across multiple PM variables (PM$_1$, PM$_{2.5}$, PM$_{10}$), especially under extreme conditions. We demonstrate that conventional loss functions underrepresent distributional tails (rare pollution events) and show that SynCast, guided by domain-aware objectives and extreme value theory, significantly enhances performance in highly impacted regions without compromising global accuracy. This approach provides a scalable foundation for next-generation air quality early warning systems and supports climate-health risk mitigation in vulnerable regions.
☆ Reasoning Visual Language Model for Chest X-Ray Analysis
Vision-language models (VLMs) have shown strong promise for medical image analysis, but most remain opaque, offering predictions without the transparent, stepwise reasoning clinicians rely on. We present a framework that brings chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by reasoning-first training paradigms, our approach is designed to learn how experts reason, not just what they conclude, by aligning intermediate steps with observable image evidence and radiology workflow. Beyond accuracy, the explicit reasoning traces support clinical auditability: they reveal why a conclusion was reached, which alternatives were considered, and where uncertainty remains, enabling quality assurance, error analysis, and safer human-AI collaboration. Our model couples high-fidelity visual encoding with a two-stage training recipe: a reasoning-style supervised fine-tuning (SFT) followed by reinforcement learning (RL) that uses verifiable rewards over a list of X-ray abnormalities. The model outputs reasoning that mirrors radiologists systematic thought process, uncertainty, and differential diagnosis. In out-of-distribution evaluation, the approach achieves competitive multi-label classification while improving interpretability. In a reader study with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced time to finalize reports. We release code and the model NV-Reason-CXR-3B to support community progress toward trustworthy, explainable AI in chest radiography and other medical imaging tasks where reasoning quality is as critical as prediction quality.
comment: NV-Reason-CXR-3B
☆ SafeVision: Efficient Image Guardrail with Robust Policy Adherence and Explainability
With the rapid proliferation of digital media, the need for efficient and transparent safeguards against unsafe content is more critical than ever. Traditional image guardrail models, constrained by predefined categories, often misclassify content due to their pure feature-based learning without semantic reasoning. Moreover, these models struggle to adapt to emerging threats, requiring costly retraining for new threats. To address these limitations, we introduce SafeVision, a novel image guardrail that integrates human-like reasoning to enhance adaptability and transparency. Our approach incorporates an effective data collection and generation framework, a policy-following training pipeline, and a customized loss function. We also propose a diverse QA generation and training strategy to enhance learning effectiveness. SafeVision dynamically aligns with evolving safety policies at inference time, eliminating the need for retraining while ensuring precise risk assessments and explanations. Recognizing the limitations of existing unsafe image benchmarks, which either lack granularity or cover limited risks, we introduce VisionHarm, a high-quality dataset comprising two subsets: VisionHarm Third-party (VisionHarm-T) and VisionHarm Comprehensive(VisionHarm-C), spanning diverse harmful categories. Through extensive experiments, we show that SafeVision achieves state-of-the-art performance on different benchmarks. SafeVision outperforms GPT-4o by 8.6% on VisionHarm-T and by 15.5% on VisionHarm-C, while being over 16x faster. SafeVision sets a comprehensive, policy-following, and explainable image guardrail with dynamic adaptation to emerging threats.
comment: 42 pages, 9 figures
☆ Neural USD: An object-centric framework for iterative editing and control
Amazing progress has been made in controllable generative modeling, especially over the last few years. However, some challenges remain. One of them is precise and iterative object editing. In many of the current methods, trying to edit the generated image (for example, changing the color of a particular object in the scene or changing the background while keeping other elements unchanged) by changing the conditioning signals often leads to unintended global changes in the scene. In this work, we take the first steps to address the above challenges. Taking inspiration from the Universal Scene Descriptor (USD) standard developed in the computer graphics community, we introduce the "Neural Universal Scene Descriptor" or Neural USD. In this framework, we represent scenes and objects in a structured, hierarchical manner. This accommodates diverse signals, minimizes model-specific constraints, and enables per-object control over appearance, geometry, and pose. We further apply a fine-tuning approach which ensures that the above control signals are disentangled from one another. We evaluate several design considerations for our framework, demonstrating how Neural USD enables iterative and incremental workflows. More information at: https://escontrela.me/neural_usd .
comment: 22 pages, 16 figures, 1 table
☆ Efficient License Plate Recognition via Pseudo-Labeled Supervision with Grounding DINO and YOLOv8 SP
Developing a highly accurate automatic license plate recognition system (ALPR) is challenging due to environmental factors such as lighting, rain, and dust. Additional difficulties include high vehicle speeds, varying camera angles, and low-quality or low-resolution images. ALPR is vital in traffic control, parking, vehicle tracking, toll collection, and law enforcement applications. This paper proposes a deep learning strategy using YOLOv8 for license plate detection and recognition tasks. This method seeks to enhance the performance of the model using datasets from Ontario, Quebec, California, and New York State. It achieved an impressive recall rate of 94% on the dataset from the Center for Pattern Recognition and Machine Intelligence (CENPARMI) and 91% on the UFPR-ALPR dataset. In addition, our method follows a semi-supervised learning framework, combining a small set of manually labeled data with pseudo-labels generated by Grounding DINO to train our detection model. Grounding DINO, a powerful vision-language model, automatically annotates many images with bounding boxes for license plates, thereby minimizing the reliance on labor-intensive manual labeling. By integrating human-verified and model-generated annotations, we can scale our dataset efficiently while maintaining label quality, which significantly enhances the training process and overall model performance. Furthermore, it reports character error rates for both datasets, providing additional insight into system performance.
comment: 6 pages, 8 figures. Presented at 2025 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), August 31 - September 3, 2025, Istanbul, Turkey
☆ Resi-VidTok: An Efficient and Decomposed Progressive Tokenization Framework for Ultra-Low-Rate and Lightweight Video Transmission
Real-time transmission of video over wireless networks remains highly challenging, even with advanced deep models, particularly under severe channel conditions such as limited bandwidth and weak connectivity. In this paper, we propose Resi-VidTok, a Resilient Tokenization-Enabled framework designed for ultra-low-rate and lightweight video transmission that delivers strong robustness while preserving perceptual and semantic fidelity on commodity digital hardware. By reorganizing spatio--temporal content into a discrete, importance-ordered token stream composed of key tokens and refinement tokens, Resi-VidTok enables progressive encoding, prefix-decodable reconstruction, and graceful quality degradation under constrained channels. A key contribution is a resilient 1D tokenization pipeline for video that integrates differential temporal token coding, explicitly supporting reliable recovery from incomplete token sets using a single shared framewise decoder--without auxiliary temporal extractors or heavy generative models. Furthermore, stride-controlled frame sparsification combined with a lightweight decoder-side interpolator reduces transmission load while maintaining motion continuity. Finally, a channel-adaptive source--channel coding and modulation scheme dynamically allocates rate and protection according to token importance and channel condition, yielding stable quality across adverse SNRs. Evaluation results indicate robust visual and semantic consistency at channel bandwidth ratios (CBR) as low as 0.0004 and real-time reconstruction at over 30 fps, demonstrating the practicality of Resi-VidTok for energy-efficient, latency-sensitive, and reliability-critical wireless applications.
☆ FT-ARM: Fine-Tuned Agentic Reflection Multimodal Language Model for Pressure Ulcer Severity Classification with Reasoning
Pressure ulcers (PUs) are a serious and prevalent healthcare concern. Accurate classification of PU severity (Stages I-IV) is essential for proper treatment but remains challenging due to subtle visual distinctions and subjective interpretation, leading to variability among clinicians. Prior AI-based approaches using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) achieved promising accuracy but offered limited interpretability. We present FT-ARM (Fine-Tuned Agentic Reflection Multimodal model), a fine-tuned multimodal large language model (MLLM) with an agentic self-reflection mechanism for pressure ulcer severity classification. Inspired by clinician-style diagnostic reassessment, FT-ARM iteratively refines its predictions by reasoning over visual features and encoded clinical knowledge from text, enhancing both accuracy and consistency. On the publicly available Pressure Injury Image Dataset (PIID), FT-ARM, fine-tuned from LLaMA 3.2 90B, achieved 85% accuracy in classifying PU stages I-IV, surpassing prior CNN-based models by +4%. Unlike earlier CNN/ViT studies that relied solely on offline evaluations, FT-ARM is designed and tested for live inference, reflecting real-time deployment conditions. Furthermore, it produces clinically grounded natural-language explanations, improving interpretability and trust. By integrating fine-tuning and reflective reasoning across multimodal inputs, FT-ARM advances the reliability, transparency, and clinical applicability of automated wound assessment systems, addressing the critical need for consistent and explainable PU staging to support improved patient care.
☆ SCOUT: A Lightweight Framework for Scenario Coverage Assessment in Autonomous Driving
Assessing scenario coverage is crucial for evaluating the robustness of autonomous agents, yet existing methods rely on expensive human annotations or computationally intensive Large Vision-Language Models (LVLMs). These approaches are impractical for large-scale deployment due to cost and efficiency constraints. To address these shortcomings, we propose SCOUT (Scenario Coverage Oversight and Understanding Tool), a lightweight surrogate model designed to predict scenario coverage labels directly from an agent's latent sensor representations. SCOUT is trained through a distillation process, learning to approximate LVLM-generated coverage labels while eliminating the need for continuous LVLM inference or human annotation. By leveraging precomputed perception features, SCOUT avoids redundant computations and enables fast, scalable scenario coverage estimation. We evaluate our method across a large dataset of real-life autonomous navigation scenarios, demonstrating that it maintains high accuracy while significantly reducing computational cost. Our results show that SCOUT provides an effective and practical alternative for large-scale coverage analysis. While its performance depends on the quality of LVLM-generated training labels, SCOUT represents a major step toward efficient scenario coverage oversight in autonomous systems.
☆ IBIS: A Powerful Hybrid Architecture for Human Activity Recognition
The increasing interest in Wi-Fi sensing stems from its potential to capture environmental data in a low-cost, non-intrusive way, making it ideal for applications like healthcare, space occupancy analysis, and gesture-based IoT control. However, a major limitation in this field is the common problem of overfitting, where models perform well on training data but fail to generalize to new data. To overcome this, we introduce a novel hybrid architecture that integrates Inception-BiLSTM with a Support Vector Machine (SVM), which we refer to as IBIS. Our IBIS approach is uniquely engineered to improve model generalization and create more robust classification boundaries. By applying this method to Doppler-derived data, we achieve a movement recognition accuracy of nearly 99%. Comprehensive performance metrics and confusion matrices confirm the significant effectiveness of our proposed solution.
comment: 8 pages. 8 figures. Wireless Days Conference, December 2025
☆ Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning
In multimodal learning, dominant modalities often overshadow others, limiting generalization. We propose Modality-Aware Sharpness-Aware Minimization (M-SAM), a model-agnostic framework that applies to many modalities and supports early and late fusion scenarios. In every iteration, M-SAM in three steps optimizes learning. \textbf{First, it identifies the dominant modality} based on modalities' contribution in the accuracy using Shapley. \textbf{Second, it decomposes the loss landscape}, or in another language, it modulates the loss to prioritize the robustness of the model in favor of the dominant modality, and \textbf{third, M-SAM updates the weights} by backpropagation of modulated gradients. This ensures robust learning for the dominant modality while enhancing contributions from others, allowing the model to explore and exploit complementary features that strengthen overall performance. Extensive experiments on four diverse datasets show that M-SAM outperforms the latest state-of-the-art optimization and gradient manipulation methods and significantly balances and improves multimodal learning.
☆ Understanding Multi-View Transformers ICCV 2025
Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. However, contrary to previous optimization-based pipelines, the inner mechanisms of multi-view transformers are unclear. Their black-box nature makes further improvements beyond data scaling challenging and complicates usage in safety- and reliability-critical applications. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers' layers. In this manner, we investigate a variant of the DUSt3R model, shedding light on the development of its latent state across blocks, the role of the individual layers, and suggest how it differs from methods with stronger inductive biases of explicit global pose. Finally, we show that the investigated variant of DUSt3R estimates correspondences that are refined with reconstructed geometry. The code used for the analysis is available at https://github.com/JulienGaubil/und3rstand .
comment: Presented at the ICCV 2025 E2E3D Workshop
☆ VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos
Although recent text-to-video generative models are getting more capable of following external camera controls, imposed by either text descriptions or camera trajectories, they still struggle to generalize to unconventional camera motions, which is crucial in creating truly original and artistic videos. The challenge lies in the difficulty of finding sufficient training videos with the intended uncommon camera motions. To address this challenge, we propose VividCam, a training paradigm that enables diffusion models to learn complex camera motions from synthetic videos, releasing the reliance on collecting realistic training videos. VividCam incorporates multiple disentanglement strategies that isolates camera motion learning from synthetic appearance artifacts, ensuring more robust motion representation and mitigating domain shift. We demonstrate that our design synthesizes a wide range of precisely controlled and complex camera motions using surprisingly simple synthetic data. Notably, this synthetic data often consists of basic geometries within a low-poly 3D scene and can be efficiently rendered by engines like Unity. Our video results can be found in https://wuqiuche.github.io/VividCamDemoPage/ .
comment: 19 pages, 9 figures
☆ Pixels to Signals: A Real-Time Framework for Traffic Demand Estimation
Traffic congestion is becoming a challenge in the rapidly growing urban cities, resulting in increasing delays and inefficiencies within urban transportation systems. To address this issue a comprehensive methodology is designed to optimize traffic flow and minimize delays. The framework is structured with three primary components: (a) vehicle detection, (b) traffic prediction, and (c) traffic signal optimization. This paper presents the first component, vehicle detection. The methodology involves analyzing multiple sequential frames from a camera feed to compute the background, i.e. the underlying roadway, by averaging pixel values over time. The computed background is then utilized to extract the foreground, where the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is applied to detect vehicles. With its computational efficiency and minimal infrastructure modification requirements, the proposed methodology offers a practical and scalable solution for real-world deployment.
☆ Proper Body Landmark Subset Enables More Accurate and 5X Faster Recognition of Isolated Signs in LIBRAS
This paper investigates the feasibility of using lightweight body landmark detection for the recognition of isolated signs in Brazilian Sign Language (LIBRAS). Although the skeleton-based approach by Alves et al. (2024) enabled substantial improvements in recognition performance, the use of OpenPose for landmark extraction hindered time performance. In a preliminary investigation, we observed that simply replacing OpenPose with the lightweight MediaPipe, while improving processing speed, significantly reduced accuracy. To overcome this limitation, we explored landmark subset selection strategies aimed at optimizing recognition performance. Experimental results showed that a proper landmark subset achieves comparable or superior performance to state-of-the-art methods while reducing processing time by more than 5X compared to Alves et al. (2024). As an additional contribution, we demonstrated that spline-based imputation effectively mitigates missing landmark issues, leading to substantial accuracy gains. These findings highlight that careful landmark selection, combined with simple imputation techniques, enables efficient and accurate isolated sign recognition, paving the way for scalable Sign Language Recognition systems.
comment: Submitted to Int. Conf. on Computer Vision Theory and Applications (VISAPP 2026)
☆ FruitProm: Probabilistic Maturity Estimation and Detection of Fruits and Vegetables
Maturity estimation of fruits and vegetables is a critical task for agricultural automation, directly impacting yield prediction and robotic harvesting. Current deep learning approaches predominantly treat maturity as a discrete classification problem (e.g., unripe, ripe, overripe). This rigid formulation, however, fundamentally conflicts with the continuous nature of the biological ripening process, leading to information loss and ambiguous class boundaries. In this paper, we challenge this paradigm by reframing maturity estimation as a continuous, probabilistic learning task. We propose a novel architectural modification to the state-of-the-art, real-time object detector, RT-DETRv2, by introducing a dedicated probabilistic head. This head enables the model to predict a continuous distribution over the maturity spectrum for each detected object, simultaneously learning the mean maturity state and its associated uncertainty. This uncertainty measure is crucial for downstream decision-making in robotics, providing a confidence score for tasks like selective harvesting. Our model not only provides a far richer and more biologically plausible representation of plant maturity but also maintains exceptional detection performance, achieving a mean Average Precision (mAP) of 85.6\% on a challenging, large-scale fruit dataset. We demonstrate through extensive experiments that our probabilistic approach offers more granular and accurate maturity assessments than its classification-based counterparts, paving the way for more intelligent, uncertainty-aware automated systems in modern agriculture
comment: Sidharth Rai, Rahul Harsha Cheppally contributed equally to this work
☆ Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation
We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning intensive settings because they don't verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics -- ACLE, ARGUE, and RAGAS -- demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.
comment: https://github.com/alexmartin1722/mirage
☆ The Generation Phases of Flow Matching: a Denoising Perspective
Flow matching has achieved remarkable success, yet the factors influencing the quality of its generation process remain poorly understood. In this work, we adopt a denoising perspective and design a framework to empirically probe the generation process. Laying down the formal connections between flow matching models and denoisers, we provide a common ground to compare their performances on generation and denoising. This enables the design of principled and controlled perturbations to influence sample generation: noise and drift. This leads to new insights on the distinct dynamical phases of the generative process, enabling us to precisely characterize at which stage of the generative process denoisers succeed or fail and why this matters.
☆ MCIHN: A Hybrid Network Model Based on Multi-path Cross-modal Interaction for Multimodal Emotion Recognition
Multimodal emotion recognition is crucial for future human-computer interaction. However, accurate emotion recognition still faces significant challenges due to differences between different modalities and the difficulty of characterizing unimodal emotional information. To solve these problems, a hybrid network model based on multipath cross-modal interaction (MCIHN) is proposed. First, adversarial autoencoders (AAE) are constructed separately for each modality. The AAE learns discriminative emotion features and reconstructs the features through a decoder to obtain more discriminative information about the emotion classes. Then, the latent codes from the AAE of different modalities are fed into a predefined Cross-modal Gate Mechanism model (CGMM) to reduce the discrepancy between modalities, establish the emotional relationship between interacting modalities, and generate the interaction features between different modalities. Multimodal fusion using the Feature Fusion module (FFM) for better emotion recognition. Experiments were conducted on publicly available SIMS and MOSI datasets, demonstrating that MCIHN achieves superior performance.
comment: The paper will be published in the MMAsia2025 conference proceedings
Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-Flash-Omni introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.
comment: 18 pages, 5 figures
☆ SafeEditor: Unified MLLM for Efficient Post-hoc T2I Safety Editing
With the rapid advancement of text-to-image (T2I) models, ensuring their safety has become increasingly critical. Existing safety approaches can be categorized into training-time and inference-time methods. While inference-time methods are widely adopted due to their cost-effectiveness, they often suffer from limitations such as over-refusal and imbalance between safety and utility. To address these challenges, we propose a multi-round safety editing framework that functions as a model-agnostic, plug-and-play module, enabling efficient safety alignment for any text-to-image model. Central to this framework is MR-SafeEdit, a multi-round image-text interleaved dataset specifically constructed for safety editing in text-to-image generation. We introduce a post-hoc safety editing paradigm that mirrors the human cognitive process of identifying and refining unsafe content. To instantiate this paradigm, we develop SafeEditor, a unified MLLM capable of multi-round safety editing on generated images. Experimental results show that SafeEditor surpasses prior safety approaches by reducing over-refusal while achieving a more favorable safety-utility balance.
♻ ☆ ADMN: A Layer-Wise Adaptive Multimodal Network for Dynamic Input Noise and Compute Resources
Multimodal deep learning systems are deployed in dynamic scenarios due to the robustness afforded by multiple sensing modalities. Nevertheless, they struggle with varying compute resource availability (due to multi-tenancy, device heterogeneity, etc.) and fluctuating quality of inputs (from sensor feed corruption, environmental noise, etc.). Statically provisioned multimodal systems cannot adapt when compute resources change over time, while existing dynamic networks struggle with strict compute budgets. Additionally, both systems often neglect the impact of variations in modality quality. Consequently, modalities suffering substantial corruption may needlessly consume resources better allocated towards other modalities. We propose ADMN, a layer-wise Adaptive Depth Multimodal Network capable of tackling both challenges: it adjusts the total number of active layers across all modalities to meet strict compute resource constraints and continually reallocates layers across input modalities according to their modality quality. Our evaluations showcase ADMN can match the accuracy of state-of-the-art networks while reducing up to 75% of their floating-point operations.
comment: Accepted to Neurips 2025
♻ ☆ MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness NeurIPS 2025
Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, invariance to perspective-preserving transformations, etc. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/
comment: Accepted to NeurIPS 2025 DB Track
♻ ☆ VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree NeurIPS 2025
Video anomaly detection (VAD) focuses on identifying anomalies in videos. Supervised methods demand substantial in-domain training data and fail to deliver clear explanations for anomalies. In contrast, training-free methods leverage the knowledge reserves and language interactivity of large pre-trained models to detect anomalies. However, the current fixed-length temporal window sampling approaches struggle to accurately capture anomalies with varying temporal spans. Therefore, we propose VADTree that utilizes a Hierarchical Granularityaware Tree (HGTree) structure for flexible sampling in VAD. VADTree leverages the knowledge embedded in a pre-trained Generic Event Boundary Detection (GEBD) model to characterize potential anomaly event boundaries. Specifically, VADTree decomposes the video into generic event nodes based on boundary confidence, and performs adaptive coarse-fine hierarchical structuring and redundancy removal to construct the HGTree. Then, the multi-dimensional priors are injected into the visual language models (VLMs) to enhance the node-wise anomaly perception, and anomaly reasoning for generic event nodes is achieved via large language models (LLMs). Finally, an inter-cluster node correlation method is used to integrate the multi-granularity anomaly scores. Extensive experiments on three challenging datasets demonstrate that VADTree achieves state-of-the-art performance in training-free settings while drastically reducing the number of sampled video segments. The code will be available at https://github.com/wenlongli10/VADTree.
comment: NeurIPS 2025 poster
♻ ☆ DWaste: Greener AI for Waste Sorting using Mobile and Edge Devices
The rise of convenience packaging has led to generation of enormous waste, making efficient waste sorting crucial for sustainable waste management. To address this, we developed DWaste, a computer vision-powered platform designed for real-time waste sorting on resource-constrained smartphones and edge devices, including offline functionality. We benchmarked various image classification models (EfficientNetV2S/M, ResNet50/101, MobileNet) and object detection (YOLOv8n, YOLOv11n) including our purposed YOLOv8n-CBAM model using our annotated dataset designed for recycling. We found a clear trade-off between accuracy and resource consumption: the best classifier, EfficientNetV2S, achieved high accuracy(~ 96%) but suffered from high latency (~ 0.22s) and elevated carbon emissions. In contrast, lightweight object detection models delivered strong performance (up to 80% mAP) with ultra-fast inference (~ 0.03s) and significantly smaller model sizes (< 7MB ), making them ideal for real-time, low-power use. Model quantization further maximized efficiency, substantially reducing model size and VRAM usage by up to 75%. Our work demonstrates the successful implementation of "Greener AI" models to support real-time, sustainable waste sorting on edge devices.
comment: 8 pages, 8 figures
♻ ☆ RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning
Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA), which takes advantage of existing pretrained large-scale vision and language models to directly generate captions with test-time adaptation. Specifically, we bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2, due to their source-code availability. The main challenge is how to enable the text generation model to be sufficiently aware of the content in a given video so as to generate corresponding captions. To address this problem, we propose using learnable tokens as a communication medium among these four frozen models GPT-2, XCLIP, CLIP, and AnglE. Different from the conventional way that trains these tokens with training data, we propose to learn these tokens with soft targets of the inference data under several carefully crafted loss functions, which enable the tokens to absorb video information catered for GPT-2. This procedure can be efficiently done in just a few iterations (we use 16 iterations in the experiments) and does not require ground truth data. Extensive experimental results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show absolute 5.1%-32.4% improvements in terms of the main metric CIDEr compared to several state-of-the-art zero-shot video captioning methods.
comment: Published in Pattern Recognition
♻ ☆ Frequency-Aware Vision Transformers for High-Fidelity Super-Resolution of Earth System Models
Super-resolution (SR) is crucial for enhancing the spatial fidelity of Earth System Model (ESM) outputs, allowing fine-scale structures vital to climate science to be recovered from coarse simulations. However, traditional deep super-resolution methods, including convolutional and transformer-based models, tend to exhibit spectral bias, reconstructing low-frequency content more readily than valuable high-frequency details. In this work, we introduce two frequency-aware frameworks: the Vision Transformer-Tuned Sinusoidal Implicit Representation (ViSIR), combining Vision Transformers and sinusoidal activations to mitigate spectral bias, and the Vision Transformer Fourier Representation Network (ViFOR), which integrates explicit Fourier-based filtering for independent low- and high-frequency learning. Evaluated on the E3SM-HR Earth system dataset across surface temperature, shortwave, and longwave fluxes, these models outperform leading CNN, GAN, and vanilla transformer baselines, with ViFOR demonstrating up to 2.6~dB improvements in PSNR and significantly higher SSIM. Detailed ablation and scaling studies highlight the benefit of full-field training, the impact of frequency hyperparameters, and the potential for generalization. The results establish ViFOR as a state-of-the-art, scalable solution for climate data downscaling. Future extensions will address temporal super-resolution, multimodal climate variables, automated parameter selection, and integration of physical conservation constraints to broaden scientific applicability.
♻ ☆ Polygonal network disorder and the turning distance
The turning distance is a well-studied metric for measuring the similarity between two polygons. This metric is constructed by taking an $L^p$ distance between step functions which track each shape's tangent angle of a path tracing its boundary. In this study, we introduce \textit{turning disorders} for polygonal planar networks, defined by averaging turning distances between network faces and "ordered" shapes (regular polygons or circles). We derive closed-form expressions of turning distances for special classes of regular polygons, related to the divisibility of $m$ and $n$, and also between regular polygons and circles. These formulas are used to show that the time for computing the 2-turning distances reduces to $O((m+n) \log(m+n))$ when both shapes are regular polygons, an improvement from $O(mn\log(mn))$ operations needed to compute distances between general polygons of $n$ and $m$ sides. We also apply these formulas to several examples of network microstructure with varying disorder. For Archimedean lattices, a class of regular tilings, we can express turning disorders with exact expressions. We also consider turning disorders applied to two examples of stochastic processes on networks: spring networks evolving under T1 moves and polygonal rupture processes. We find that the two aspects of defining different turning disorders, the choice of ordered shape and whether to apply area-weighting, can capture different notions of network disorder.
♻ ☆ Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing modalities of image, video, and text over 4K frames or 1M tokens while delivering advanced performances on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large language models and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallelism distributed inference and logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and texts during model inference. Regarding training data, Long-VITA is built on a mix of 17M samples from public datasets only and demonstrates state-of-the-art performance on various multi-modal benchmarks, compared against recent cutting-edge models with internal data. Long-VITA is fully open-source and reproducible.. By leveraging our inference designs, Long-VITA models achieve a remarkable 2x prefill speedup and 4x context length extension in a single node with 8 GPUs. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multi-modal understanding.
comment: https://github.com/VITA-MLLM/Long-VITA
♻ ☆ Towards Real Unsupervised Anomaly Detection Via Confident Meta-Learning ICCV2025
So-called unsupervised anomaly detection is better described as semi-supervised, as it assumes all training data are nominal. This assumption simplifies training but requires manual data curation, introducing bias and limiting adaptability. We propose Confident Meta-learning (CoMet), a novel training strategy that enables deep anomaly detection models to learn from uncurated datasets where nominal and anomalous samples coexist, eliminating the need for explicit filtering. Our approach integrates Soft Confident Learning, which assigns lower weights to low-confidence samples, and Meta-Learning, which stabilizes training by regularizing updates based on training validation loss covariance. This prevents overfitting and enhances robustness to noisy data. CoMet is model-agnostic and can be applied to any anomaly detection method trainable via gradient descent. Experiments on MVTec-AD, VIADUCT, and KSDD2 with two state-of-the-art models demonstrate the effectiveness of our approach, consistently improving over the baseline methods, remaining insensitive to anomalies in the training set, and setting a new state-of-the-art across all datasets. Code is available at https://github.com/aqeeelmirza/CoMet
comment: Accepted to IEEE/CVF International Conference on Computer Vision (ICCV2025)
♻ ☆ Superpowering Open-Vocabulary Object Detectors for X-ray Vision ICCV 2025
Open-vocabulary object detection (OvOD) is set to revolutionize security screening by enabling systems to recognize any item in X-ray scans. However, developing effective OvOD models for X-ray imaging presents unique challenges due to data scarcity and the modality gap that prevents direct adoption of RGB-based solutions. To overcome these limitations, we propose RAXO, a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. RAXO builds high-quality X-ray class descriptors using a dual-source retrieval strategy. It gathers relevant RGB images from the web and enriches them via a novel X-ray material transfer mechanism, eliminating the need for labeled databases. These visual descriptors replace text-based classification in OvOD, leveraging intra-modal feature distances for robust detection. Extensive experiments demonstrate that RAXO consistently improves OvOD performance, providing an average mAP increase of up to 17.0 points over base detectors. To further support research in this emerging field, we also introduce DET-COMPASS, a new benchmark featuring bounding box annotations for over 300 object categories, enabling large-scale evaluation of OvOD in X-ray. Code and dataset available at: https://github.com/PAGF188/RAXO.
comment: Accepted at ICCV 2025
♻ ☆ DRBD-Mamba for Robust and Efficient Brain Tumor Segmentation with Analytical Insights
Accurate brain tumor segmentation is significant for clinical diagnosis and treatment but remains challenging due to tumor heterogeneity. Mamba-based State Space Models have demonstrated promising performance. However, despite their computational efficiency over other neural architectures, they incur considerable overhead for this task due to their sequential feature computation across multiple spatial axes. Moreover, their robustness across diverse BraTS data partitions remains largely unexplored, leaving a critical gap in reliable evaluation. To address this, we first propose a dual-resolution bi-directional Mamba (DRBD-Mamba), an efficient 3D segmentation model that captures multi-scale long-range dependencies with minimal computational overhead. We leverage a space-filling curve to preserve spatial locality during 3D-to-1D feature mapping, thereby reducing reliance on computationally expensive multi-axial feature scans. To enrich feature representation, we propose a gated fusion module that adaptively integrates forward and reverse contexts, along with a quantization block that improves robustness. We further propose five systematic folds on BraTS2023 for rigorous evaluation of segmentation techniques under diverse conditions and present analysis of common failure scenarios. On the 20% test set used by recent methods, our model achieves Dice improvements of 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor. Evaluations on the proposed systematic folds demonstrate that our model maintains competitive whole tumor accuracy while achieving clear average Dice gains of 1.16% for tumor core and 1.68% for enhancing tumor over existing state-of-the-art. Furthermore, our model achieves a 15x efficiency improvement while maintaining high segmentation accuracy, highlighting its robustness and computational advantage over existing methods.
♻ ☆ PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer NeurIPS 2025
Video anomaly detection (VAD) is a critical yet challenging task due to the complex and diverse nature of real-world scenarios. Previous methods typically rely on domain-specific training data and manual adjustments when applying to new scenarios and unseen anomaly types, suffering from high labor costs and limited generalization. Therefore, we aim to achieve generalist VAD, \ie, automatically handle any scene and any anomaly types without training data or human involvement. In this work, we propose PANDA, an agentic AI engineer based on MLLMs. Specifically, we achieve PANDA by comprehensively devising four key capabilities: (1) self-adaptive scene-aware strategy planning, (2) goal-driven heuristic reasoning, (3) tool-augmented self-reflection, and (4) self-improving chain-of-memory. Concretely, we develop a self-adaptive scene-aware RAG mechanism, enabling PANDA to retrieve anomaly-specific knowledge for anomaly detection strategy planning. Next, we introduce a latent anomaly-guided heuristic prompt strategy to enhance reasoning precision. Furthermore, PANDA employs a progressive reflection mechanism alongside a suite of context-aware tools to iteratively refine decision-making in complex scenarios. Finally, a chain-of-memory mechanism enables PANDA to leverage historical experiences for continual performance improvement. Extensive experiments demonstrate that PANDA achieves state-of-the-art performance in multi-scenario, open-set, and complex scenario settings without training and manual involvement, validating its generalizable and robust anomaly detection capability. Code is released at https://github.com/showlab/PANDA.
comment: Accepted by NeurIPS 2025
♻ ☆ Mano Technical Report
Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decisionmaking capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
♻ ☆ OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions
In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task designed to produce synchronized verbal and non-verbal listener feedback online, based on the speaker's multimodal inputs. OMCRG captures natural dyadic interactions and introduces new challenges in aligning generated audio with listeners' facial responses. To tackle these challenges, we incorporate text as an intermediate modality to connect audio and facial responses. We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. OmniResponse leverages a pretrained LLM enhanced with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module that outputs speech synchronized with facial responses. To advance OMCRG research, we offer ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors. Comprehensive evaluations on ResponseNet demonstrate that OmniResponse outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality. Our dataset, code, and models are publicly available.
comment: 25 pages, 9 figures
♻ ☆ Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions auto-regressively in a fixed left-to-right order or attach separate MLP or diffusion heads outside the backbone, leading to fragmented information pathways and specialized training requirements that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a unified-transformer policy that models discretized action chunks with discrete diffusion. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary re-masking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pre-trained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves 96.3% avg. success rates on LIBERO, 71.2% visual matching on SimplerEnv-Fractal and 54.2% overall on SimplerEnv-Bridge, improving over autoregressive, MLP decoder and continuous diffusion baselines. These findings indicate that discrete-diffusion VLA supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets. Our project page is https://github.com/Liang-ZX/DiscreteDiffusionVLA
comment: 16 pages
♻ ☆ LongCat-Video Technical Report
Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include: Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model; Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence in the generation of minutes-long videos; Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions; Strong performance with multi-reward RLHF: Multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models. Code and model weights are publicly available to accelerate progress in the field.
♻ ☆ Multispectral State-Space Feature Fusion: Bridging Shared and Cross-Parametric Interactions for Object Detection
Modern multispectral feature fusion for object detection faces two critical limitations: (1) Excessive preference for local complementary features over cross-modal shared semantics adversely affects generalization performance; and (2) The trade-off between the receptive field size and computational complexity present critical bottlenecks for scalable feature modeling. Addressing these issues, a novel Multispectral State-Space Feature Fusion framework, dubbed MS2Fusion, is proposed based on the state space model (SSM), achieving efficient and effective fusion through a dual-path parametric interaction mechanism. More specifically, the first cross-parameter interaction branch inherits the advantage of cross-attention in mining complementary information with cross-modal hidden state decoding in SSM. The second shared-parameter branch explores cross-modal alignment with joint embedding to obtain cross-modal similar semantic features and structures through parameter sharing in SSM. Finally, these two paths are jointly optimized with SSM for fusing multispectral features in a unified framework, allowing our MS2Fusion to enjoy both functional complementarity and shared semantic space. In our extensive experiments on mainstream benchmarks including FLIR, M3FD and LLVIP, our MS2Fusion significantly outperforms other state-of-the-art multispectral object detection methods, evidencing its superiority. Moreover, MS2Fusion is general and applicable to other multispectral perception tasks. We show that, even without specific design, MS2Fusion achieves state-of-the-art results on RGB-T semantic segmentation and RGBT salient object detection, showing its generality. The source code will be available at https://github.com/61s61min/MS2Fusion.git.
comment: submitted on 30/4/2025, Accepted by Information Fusion
♻ ☆ DArFace: Deformation Aware Robustness for Low Quality Face Recognition
Facial recognition systems have achieved remarkable success by leveraging deep neural networks, advanced loss functions, and large-scale datasets. However, their performance often deteriorates in real-world scenarios involving low-quality facial images. Such degradations, common in surveillance footage or standoff imaging include low resolution, motion blur, and various distortions, resulting in a substantial domain gap from the high-quality data typically used during training. While existing approaches attempt to address robustness by modifying network architectures or modeling global spatial transformations, they frequently overlook local, non-rigid deformations that are inherently present in real-world settings. In this work, we introduce \textbf{DArFace}, a \textbf{D}eformation-\textbf{A}ware \textbf{r}obust \textbf{Face} recognition framework that enhances robustness to such degradations without requiring paired high- and low-quality training samples. Our method adversarially integrates both global transformations (e.g., rotation, translation) and local elastic deformations during training to simulate realistic low-quality conditions. Moreover, we introduce a contrastive objective to enforce identity consistency across different deformed views. Extensive evaluations on low-quality benchmarks including TinyFace, IJB-B, and IJB-C demonstrate that DArFace surpasses state-of-the-art methods, with significant gains attributed to the inclusion of local deformation modeling.
♻ ☆ MTFL: Multi-Timescale Feature Learning for Weakly-Supervised Anomaly Detection in Surveillance Videos
Detection of anomaly events is relevant for public safety and requires a combination of fine-grained motion information and contextual events at variable time-scales. To this end, we propose a Multi-Timescale Feature Learning (MTFL) method to enhance the representation of anomaly features. Short, medium, and long temporal tubelets are employed to extract spatio-temporal video features using a Video Swin Transformer. Experimental results demonstrate that MTFL outperforms state-of-the-art methods on the UCF-Crime dataset, achieving an anomaly detection performance 89.78% AUC. Moreover, it performs complementary to SotA with 95.32% AUC on the ShanghaiTech and 84.57% AP on the XD-Violence dataset. Furthermore, we generate an extended dataset of the UCF-Crime for development and evaluation on a wider range of anomalies, namely Video Anomaly Detection Dataset (VADD), involving 2,591 videos in 18 classes with extensive coverage of realistic anomalies.
♻ ☆ Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
This paper presents a novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary models is achievable. Our study provides the first comprehensive analysis of these models' capabilities and limits for this task.
comment: Under review. Both the dataset and code will be published
♻ ☆ Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
General world models represent a crucial pathway toward achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual environments to decision-making systems. Recently, the emergence of the Sora model has attained significant attention due to its remarkable simulation capabilities, which exhibits an incipient comprehension of physical laws. In this survey, we embark on a comprehensive exploration of the latest advancements in world models. Our analysis navigates through the forefront of generative methodologies in video generation, where world models stand as pivotal constructs facilitating the synthesis of highly realistic visual content. Additionally, we scrutinize the burgeoning field of autonomous-driving world models, meticulously delineating their indispensable role in reshaping transportation and urban mobility. Furthermore, we delve into the intricacies inherent in world models deployed within autonomous agents, shedding light on their profound significance in enabling intelligent interactions within dynamic environmental contexts. At last, we examine challenges and limitations of world models, and discuss their potential future directions. We hope this survey can serve as a foundational reference for the research community and inspire continued innovation. This survey will be regularly updated at: https://github.com/GigaAI-research/General-World-Models-Survey.
comment: This survey will be regularly updated at: https://github.com/GigaAI-research/General-World-Models-Survey
♻ ☆ Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs NeurIPS 2025
The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.
comment: Accepted by NeurIPS 2025 Dataset and Benchmark Track, Project page: https://liuxuannan.github.io/Video-SafetyBench.github.io/
♻ ☆ Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning NeurIPS 2025
Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities in classification and retrieval. However, these models often struggle with compositional reasoning - the ability to understand the relationships between concepts. A recent benchmark, SugarCrepe++, reveals that previous works on improving compositionality have mainly improved lexical sensitivity but neglected semantic understanding. In addition, downstream retrieval performance often deteriorates, although one would expect that improving compositionality should enhance retrieval. In this work, we introduce CLIC (Compositionally-aware Learning in CLIP), a fine-tuning method based on a novel training technique combining multiple images and their associated captions. CLIC improves compositionality across architectures as well as differently pre-trained CLIP models, both in terms of lexical and semantic understanding, and achieves consistent gains in retrieval performance. This even applies to the recent CLIPS, which achieves SOTA retrieval performance. Nevertheless, the short fine-tuning with CLIC leads to an improvement in retrieval and to the best compositional CLIP model on SugarCrepe++. All our models and code are available at https://clic-compositional-clip.github.io
comment: Accepted at NeurIPS 2025
♻ ☆ ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression NeurIPS 2025
The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
comment: Accepted at NeurIPS 2025 (oral)
♻ ☆ VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
Training vision-language models (VLMs) for complex reasoning remains a challenging task, i.a. due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leveraging them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD outperforms the baseline model significantly and improves over the state of the art by a margin. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
comment: www.walidbousselham.com/VOLD/
♻ ☆ FRBNet: Revisiting Low-Light Vision through Frequency-Domain Radial Basis Network
Low-light vision remains a fundamental challenge in computer vision due to severe illumination degradation, which significantly affects the performance of downstream tasks such as detection and segmentation. While recent state-of-the-art methods have improved performance through invariant feature learning modules, they still fall short due to incomplete modeling of low-light conditions. Therefore, we revisit low-light image formation and extend the classical Lambertian model to better characterize low-light conditions. By shifting our analysis to the frequency domain, we theoretically prove that the frequency-domain channel ratio can be leveraged to extract illumination-invariant features via a structured filtering process. We then propose a novel and end-to-end trainable module named \textbf{F}requency-domain \textbf{R}adial \textbf{B}asis \textbf{Net}work (\textbf{FRBNet}), which integrates the frequency-domain channel ratio operation with a learnable frequency domain filter for the overall illumination-invariant feature enhancement. As a plug-and-play module, FRBNet can be integrated into existing networks for low-light downstream tasks without modifying loss functions. Extensive experiments across various downstream tasks demonstrate that FRBNet achieves superior performance, including +2.2 mAP for dark object detection and +2.9 mIoU for nighttime segmentation. Code is available at: https://github.com/Sing-Forevet/FRBNet.
♻ ☆ Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation NeurIPS 2025
Recent progress in Sign Language Translation (SLT) has focussed primarily on improving the representational capacity of large language models to incorporate Sign Language features. This work explores an alternative direction: enhancing the geometric properties of skeletal representations themselves. We propose Geo-Sign, a method that leverages the properties of hyperbolic geometry to model the hierarchical structure inherent in sign language kinematics. By projecting skeletal features derived from Spatio-Temporal Graph Convolutional Networks (ST-GCNs) into the Poincar\'e ball model, we aim to create more discriminative embeddings, particularly for fine-grained motions like finger articulations. We introduce a hyperbolic projection layer, a weighted Fr\'echet mean aggregation scheme, and a geometric contrastive loss operating directly in hyperbolic space. These components are integrated into an end-to-end translation framework as a regularisation function, to enhance the representations within the language model. This work demonstrates the potential of hyperbolic geometry to improve skeletal representations for Sign Language Translation, improving on SOTA RGB methods while preserving privacy and improving computational efficiency. Code available here: https://github.com/ed-fish/geo-sign.
comment: Accepted to NeurIPS 2025
♻ ☆ Federated Learning with Partially Labeled Data: A Conditional Distillation Approach
In medical imaging, developing generalized segmentation models that can handle multiple organs and lesions is crucial. However, the scarcity of fully annotated datasets and strict privacy regulations present significant barriers to data sharing. Federated Learning (FL) allows decentralized model training, but existing FL methods often struggle with partial labeling, leading to model divergence and catastrophic forgetting. We propose ConDistFL, a novel FL framework incorporating conditional distillation to address these challenges. ConDistFL enables effective learning from partially labeled datasets, significantly improving segmentation accuracy across distributed and non-uniform datasets. In addition to its superior segmentation performance, ConDistFL maintains computational and communication efficiency, ensuring its scalability for real-world applications. Furthermore, ConDistFL demonstrates remarkable generalizability, significantly outperforming existing FL methods in out-of-federation tests, even adapting to unseen contrast phases (e.g., non-contrast CT images) in our experiments. Extensive evaluations on 3D CT and 2D chest X-ray datasets show that ConDistFL is an efficient, adaptable solution for collaborative medical image segmentation in privacy-constrained settings.
comment: This manuscript was submitted to IEEE JBHI and is currently under peer review
♻ ☆ Through the Lens: Benchmarking Deepfake Detectors Against Moiré-Induced Distortions
Deepfake detection remains a pressing challenge, particularly in real-world settings where smartphone-captured media from digital screens often introduces Moir\'e artifacts that can distort detection outcomes. This study systematically evaluates state-of-the-art (SOTA) deepfake detectors on Moir\'e-affected videos, an issue that has received little attention. We collected a dataset of 12,832 videos, spanning 35.64 hours, from the Celeb-DF, DFD, DFDC, UADFV, and FF++ datasets, capturing footage under diverse real-world conditions, including varying screens, smartphones, lighting setups, and camera angles. To further examine the influence of Moir\'e patterns on deepfake detection, we conducted additional experiments using our DeepMoir\'eFake, referred to as (DMF) dataset and two synthetic Moir\'e generation techniques. Across 15 top-performing detectors, our results show that Moir\'e artifacts degrade performance by as much as 25.4%, while synthetically generated Moir\'e patterns lead to a 21.4% drop in accuracy. Surprisingly, demoir\'eing methods, intended as a mitigation approach, instead worsened the problem, reducing accuracy by up to 17.2%. These findings underscore the urgent need for detection models that can robustly handle Moir\'e distortions alongside other realworld challenges, such as compression, sharpening, and blurring. By introducing the DMF dataset, we aim to drive future research toward closing the gap between controlled experiments and practical deepfake detection.
comment: 48 Pages, 29 Figures, 15 Tables
♻ ☆ On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
In Vision-Language-Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find (1) actions as the most fragile modality, (2) Existing visual-robust VLA do not gain robustness in other modality, and (3) pi0 demonstrates superior robustness with a diffusion-based action head. To build multi-modal robust VLAs, we propose RobustVLA against perturbations in VLA inputs and outputs. For output robustness, we perform offline robust optimization against worst-case action noise that maximizes mismatch in flow matching objective. This can be seen as adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper confidence bound algorithm to automatically identify the most harmful noise. Experiments on LIBERO demonstrate our RobustVLA delivers absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone across all 17 perturbations, achieving 50.6x faster inference than existing visual-robust VLAs, and a 10.4% gain under mixed perturbations. Our RobustVLA is particularly effective on real-world FR5 robot with limited demonstrations, showing absolute gains by 65.6% under perturbations of four modalities.
♻ ☆ CPathAgent: An Agent-based Foundation Model for Interpretable High-Resolution Pathology Image Analysis Mimicking Pathologists' Diagnostic Logic
Recent advances in computational pathology have led to the emergence of numerous foundation models. These models typically rely on general-purpose encoders with multi-instance learning for whole slide image (WSI) classification or apply multimodal approaches to generate reports directly from images. However, these models cannot emulate the diagnostic approach of pathologists, who systematically examine slides at low magnification to obtain an overview before progressively zooming in on suspicious regions to formulate comprehensive diagnoses. Instead, existing models directly output final diagnoses without revealing the underlying reasoning process. To address this gap, we introduce CPathAgent, an innovative agent-based approach that mimics pathologists' diagnostic workflow by autonomously navigating across WSI based on observed visual features, thereby generating substantially more transparent and interpretable diagnostic summaries. To achieve this, we develop a multi-stage training strategy that unifies patch-level, region-level, and WSI-level capabilities within a single model, which is essential for replicating how pathologists understand and reason across diverse image scales. Additionally, we construct PathMMU-HR2, the first expert-validated benchmark for large region analysis. This represents a critical intermediate scale between patches and whole slides, reflecting a key clinical reality where pathologists typically examine several key large regions rather than entire slides at once. Extensive experiments demonstrate that CPathAgent consistently outperforms existing approaches across benchmarks at three different image scales, validating the effectiveness of our agent-based diagnostic approach and highlighting a promising direction for computational pathology.
comment: 52 pages, 34 figures
♻ ☆ Is It Certainly a Deepfake? Reliability Analysis in Detection & Generation Ecosystem ICCV 2025
As generative models are advancing in quality and quantity for creating synthetic content, deepfakes begin to cause online mistrust. Deepfake detectors are proposed to counter this effect, however, misuse of detectors claiming fake content as real or vice versa further fuels this misinformation problem. We present the first comprehensive uncertainty analysis of deepfake detectors, systematically investigating how generative artifacts influence prediction confidence. As reflected in detectors' responses, deepfake generators also contribute to this uncertainty as their generative residues vary, so we cross the uncertainty analysis of deepfake detectors and generators. Based on our observations, the uncertainty manifold holds enough consistent information to leverage uncertainty for deepfake source detection. Our approach leverages Bayesian Neural Networks and Monte Carlo dropout to quantify both aleatoric and epistemic uncertainties across diverse detector architectures. We evaluate uncertainty on two datasets with nine generators, with four blind and two biological detectors, compare different uncertainty methods, explore region- and pixel-based uncertainty, and conduct ablation studies. We conduct and analyze binary real/fake, multi-class real/fake, source detection, and leave-one-out experiments between the generator/detector combinations to share their generalization capability, model calibration, uncertainty, and robustness against adversarial attacks. We further introduce uncertainty maps that localize prediction confidence at the pixel level, revealing distinct patterns correlated with generator-specific artifacts. Our analysis provides critical insights for deploying reliable deepfake detection systems and establishes uncertainty quantification as a fundamental requirement for trustworthy synthetic media detection.
comment: Accepted for publication at the ICCV 2025 workshop - STREAM
♻ ☆ AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300\,k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o\'s content scores by 45\% and style scores by 12\%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.
♻ ☆ Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views NeurIPS 2025
We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
comment: 10 pages, 6 figures, 2 tables. Accepted to the NeurIPS 2025 Workshop on SPACE in Vision, Language, and Embodied AI (SpaVLE). Dataset: https://huggingface.co/datasets/annadeichler/KTH-ARIA-referential
♻ ☆ A Generalized Label Shift Perspective for Cross-Domain Gaze Estimation NeurIPS 2025
Aiming to generalize the well-trained gaze estimation model to new target domains, Cross-domain Gaze Estimation (CDGE) is developed for real-world application scenarios. Existing CDGE methods typically extract the domain-invariant features to mitigate domain shift in feature space, which is proved insufficient by Generalized Label Shift (GLS) theory. In this paper, we introduce a novel GLS perspective to CDGE and modelize the cross-domain problem by label and conditional shift problem. A GLS correction framework is presented and a feasible realization is proposed, in which a importance reweighting strategy based on truncated Gaussian distribution is introduced to overcome the continuity challenges in label shift correction. To embed the reweighted source distribution to conditional invariant learning, we further derive a probability-aware estimation of conditional operator discrepancy. Extensive experiments on standard CDGE tasks with different backbone models validate the superior generalization capability across domain and applicability on various models of proposed method.
comment: NeurIPS 2025
♻ ☆ Acoustic Neural 3D Reconstruction Under Pose Drift IROS
We consider the problem of optimizing neural implicit surfaces for 3D reconstruction using acoustic images collected with drifting sensor poses. The accuracy of current state-of-the-art 3D acoustic modeling algorithms is highly dependent on accurate pose estimation; small errors in sensor pose can lead to severe reconstruction artifacts. In this paper, we propose an algorithm that jointly optimizes the neural scene representation and sonar poses. Our algorithm does so by parameterizing the 6DoF poses as learnable parameters and backpropagating gradients through the neural renderer and implicit representation. We validated our algorithm on both real and simulated datasets. It produces high-fidelity 3D reconstructions even under significant pose drift.
comment: 8 pages, 8 figures. This paper is accepted by 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
♻ ☆ Stealthy Patch-Wise Backdoor Attack in 3D Point Cloud via Curvature Awareness
Backdoor attacks pose a severe threat to deep neural networks (DNNs) by implanting hidden backdoors that can be activated with predefined triggers to manipulate model behaviors maliciously. Existing 3D point cloud backdoor attacks primarily rely on sample-wise global modifications, which suffer from low imperceptibility. Although optimization can improve stealthiness, optimizing sample-wise triggers significantly increases computational cost. To address these limitations, we propose the Stealthy Patch-Wise Backdoor Attack (SPBA), the first patch-wise backdoor attack framework for 3D point clouds. Specifically, SPBA decomposes point clouds into local patches and employs a curvature-based imperceptibility score to guide trigger injection into visually less sensitive patches. By optimizing a unified patch-wise trigger that perturbs spectral features of selected patches, SPBA significantly enhances optimization efficiency while maintaining high stealthiness. Extensive experiments on ModelNet40 and ShapeNetPart further demonstrate that SPBA surpasses prior state-of-the-art backdoor attacks in both attack effectiveness and resistance to defense methods. The code is available at https://github.com/HazardFY/SPBA.
comment: 13 pages, 6 figures, 11 tables
♻ ☆ Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images ICCV 2025
Anomaly detection in computational pathology aims to identify rare and scarce anomalies where disease-related data are often limited or missing. Existing anomaly detection methods, primarily designed for industrial settings, face limitations in pathology due to computational constraints, diverse tissue structures, and lack of interpretability. To address these challenges, we propose Ano-NAViLa, a Normal and Abnormal pathology knowledge-augmented Vision-Language model for Anomaly detection in pathology images. Ano-NAViLa is built on a pre-trained vision-language model with a lightweight trainable MLP. By incorporating both normal and abnormal pathology knowledge, Ano-NAViLa enhances accuracy and robustness to variability in pathology images and provides interpretability through image-text associations. Evaluated on two lymph node datasets from different organs, Ano-NAViLa achieves the state-of-the-art performance in anomaly detection and localization, outperforming competing models.
comment: Accepted to ICCV 2025. Code is available at: https://github.com/QuIIL/ICCV2025_Ano-NAViLa
♻ ☆ Does CLIP perceive art the same way we do?
CLIP has emerged as a powerful multimodal model capable of connecting images and text through joint embeddings, but to what extent does it 'see' the same way humans do - especially when interpreting artworks? In this paper, we investigate CLIP's ability to extract high-level semantic and stylistic information from paintings, including both human-created and AI-generated imagery. We evaluate its perception across multiple dimensions: content, scene understanding, artistic style, historical period, and the presence of visual deformations or artifacts. By designing targeted probing tasks and comparing CLIP's responses to human annotations and expert benchmarks, we explore its alignment with human perceptual and contextual understanding. Our findings reveal both strengths and limitations in CLIP's visual representations, particularly in relation to aesthetic cues and artistic intent. We further discuss the implications of these insights for using CLIP as a guidance mechanism during generative processes, such as style transfer or prompt-based image synthesis. Our work highlights the need for deeper interpretability in multimodal systems, especially when applied to creative domains where nuance and subjectivity play a central role.
♻ ☆ Caption-Driven Explainability: Probing CNNs for Bias via CLIP ICIP 2025
Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the models prediction. This explanation minimizes the risk of the standalone model falling for a covariate shift and contributes significantly towards developing robust ML models. Our code is available at .
comment: Accepted and presented at the IEEE ICIP 2025 Satellite Workshop "Generative AI for World Simulations and Communications & Celebrating 40 Years of Excellence in Education: Honoring Professor Aggelos Katsaggelos", Anchorage, Alaska, USA, September 14, 2025. Camera-ready preprint; the official IEEE Xplore publication will follow. Code is available at
♻ ☆ CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects
Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches for personalizing text-to-video generation suffer from tackling multiple subjects, which is a more challenging and practical scenario. In this work, our aim is to promote multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects. To be specific, firstly, we encourage the co-occurrence of multiple subjects via composing them in a single image. Further, upon a basic text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of diffusion model. Moreover, to help the model focus on the specific area of the object, we segment the object from given reference images and provide a corresponding object mask for attention learning. Also, we collect a multi-subject text-to-video generation dataset as a comprehensive benchmark. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method compared to previous state-of-the-art approaches. The project page is https://kyfafyd.wang/projects/customvideo.
comment: IEEE TMM 2025
♻ ☆ LiDAR Remote Sensing Meets Weak Supervision: Concepts, Methods, and Perspectives
Light detection and ranging (LiDAR) remote sensing encompasses two major directions: data interpretation and parameter inversion. However, both directions rely heavily on costly and labor-intensive labeled data and field measurements, which constrains their scalability and spatiotemporal adaptability. Weakly Supervised Learning (WSL) provides a unified framework to address these limitations. This paper departs from the traditional view that treats interpretation and inversion as separate tasks and offers a systematic review of recent advances in LiDAR remote sensing from a unified WSL perspective. We cover typical WSL settings including incomplete supervision(e.g., sparse point labels), inexact supervision (e.g., scene-level tags), inaccurate supervision (e.g., noisy labels), and cross-domain supervision (e.g., domain adaptation/generalization) and corresponding techniques such as pseudo-labeling, consistency regularization, self-training, and label refinement, which collectively enable robust learning from limited and weak annotations.We further analyze LiDAR-specific challenges (e.g., irregular geometry, data sparsity, domain heterogeneity) that require tailored weak supervision, and examine how sparse LiDAR observations can guide joint learning with other remote-sensing data for continuous surface-parameter retrieval. Finally, we highlight future directions where WSL acts as a bridge between LiDAR and foundation models to leverage large-scale multimodal datasets and reduce labeling costs, while also enabling broader WSL-driven advances in generalization, open-world adaptation, and scalable LiDAR remote sensing.
♻ ☆ Real-Time Neural Video Compression with Unified Intra and Inter Coding
Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 10.7\% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.
comment: 10 pages
♻ ☆ UMCFuse: A Unified Multiple Complex Scenes Infrared and Visible Image Fusion Framework
Infrared and visible image fusion has emerged as a prominent research area in computer vision. However, little attention has been paid to the fusion task in complex scenes, leading to sub-optimal results under interference. To fill this gap, we propose a unified framework for infrared and visible images fusion in complex scenes, termed UMCFuse. Specifically, we classify the pixels of visible images from the degree of scattering of light transmission, allowing us to separate fine details from overall intensity. Maintaining a balance between interference removal and detail preservation is essential for the generalization capacity of the proposed method. Therefore, we propose an adaptive denoising strategy for the fusion of detail layers. Meanwhile, we fuse the energy features from different modalities by analyzing them from multiple directions. Extensive fusion experiments on real and synthetic complex scenes datasets cover adverse weather conditions, noise, blur, overexposure, fire, as well as downstream tasks including semantic segmentation, object detection, salient object detection, and depth estimation, consistently indicate the superiority of the proposed method compared with the recent representative methods. Our code is available at https://github.com/ixilai/UMCFuse.
comment: Published in IEEE-TIP 2025
♻ ☆ GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning ACM MM 2025
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multi-modal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model outputs. To address these limitations, this work first proposes a Region-Aware Multimodal Chain-of-Thought (RMCoT) dataset, in which the process of producing an answer is preceded by a sequence of intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposal. The dataset is available at https://www.med-vqa.com/GEMeX/.
comment: Accepted at ACM MM 2025 (also known as GEMeX-ThinkVG)
♻ ☆ Global urban visual perception varies across demographics and personalities
Understanding people's preferences is crucial for urban planning, yet current approaches often combine responses from multi-cultural populations, obscuring demographic differences and risking amplifying biases. We conducted a largescale urban visual perception survey of streetscapes worldwide using street view imagery, examining how demographics -- including gender, age, income, education, race and ethnicity, and personality traits -- shape perceptions among 1,000 participants with balanced demographics from five countries and 45 nationalities. This dataset, Street Perception Evaluation Considering Socioeconomics (SPECS), reveals demographic- and personality-based differences across six traditional indicators -- safe, lively, wealthy, beautiful, boring, depressing -- and four new ones -- live nearby, walk, cycle, green. Location-based sentiments further shape these preferences. Machine learning models trained on existing global datasets tend to overestimate positive indicators and underestimate negative ones compared to human responses, underscoring the need for local context. Our study aspires to rectify the myopic treatment of street perception, which rarely considers demographics or personality traits.
♻ ☆ Task-Agnostic Fusion of Time Series and Imagery for Earth Observation
We propose a task-agnostic framework for multimodal fusion of time series and single timestamp images, enabling cross-modal generation and robust downstream performance. Our approach explores deterministic and learned strategies for time series quantization and then leverages a masked correlation learning objective, aligning discrete image and time series tokens in a unified representation space. Instantiated in the Earth observation domain, the pretrained model generates consistent global temperature profiles from satellite imagery and is validated through counterfactual experiments. Across downstream tasks, our task-agnostic pretraining outperforms task-specific fusion by 6% in R^2 and 2% in RMSE on average, and exceeds baseline methods by 50\% in R$^2$ and 12\% in RMSE. Finally, we analyze gradient sensitivity across modalities, providing insights into model robustness. Code, data, and weights will be released under a permissive license.
♻ ☆ GRASP: Geospatial pixel Reasoning viA Structured Policy learning
Geospatial pixel reasoning aims to generate segmentation masks in remote sensing imagery directly from natural-language instructions. Most existing approaches follow a paradigm that fine-tunes multimodal large language models under supervision with dense pixel-level masks as ground truth. While effective within the training data distribution, this design suffers from two main drawbacks: (1) the high cost of large-scale dense mask annotation, and (2) the limited generalization capability of supervised fine-tuning in out-of-domain scenarios. To address these issues, we propose GRASP, a structured policy-learning framework that integrates a multimodal large language model with a pretrained segmentation model in a cascaded manner. To enhance generalization, we introduce PRIME, a training paradigm that replaces supervised fine-tuning with reinforcement learning to better align reasoning and grounding behaviors with task objectives. To reduce annotation costs, we design BoP-Rewards, which substitutes dense mask labels with bounding box and positive points. It further verifies outputs through two complementary signals: format, which constrains the reasoning and grounding structure to remain syntactically parsable, and accuracy, which evaluates the quality of predicted boxes and points. For evaluation, we train our method and all baselines on EarthReason and GeoPixInstruct, constructing an in-domain benchmark by merging their test sets. We further release GRASP-1k, a fully out-of-domain benchmark with reasoning-intensive queries, reasoning traces, and fine-grained masks. Experimental results demonstrate state-of-the-art (SOTA) in-domain performance and up to 54\% improvement in out-of-domain scenarios, confirming that reinforcement learning with cost-aware rewards provides a robust and scalable paradigm for geospatial pixel reasoning. All code and datasets will be released publicly.
comment: 15 pages, 9 figures
♻ ☆ Switchable Token-Specific Codebook Quantization For Face Image Compression NeurIPS 2025
With the ever-increasing volume of visual data, the efficient and lossless transmission, along with its subsequent interpretation and understanding, has become a critical bottleneck in modern information systems. The emerged codebook-based solution utilize a globally shared codebook to quantize and dequantize each token, controlling the bpp by adjusting the number of tokens or the codebook size. However, for facial images, which are rich in attributes, such global codebook strategies overlook both the category-specific correlations within images and the semantic differences among tokens, resulting in suboptimal performance, especially at low bpp. Motivated by these observations, we propose a Switchable Token-Specific Codebook Quantization for face image compression, which learns distinct codebook groups for different image categories and assigns an independent codebook to each token. By recording the codebook group to which each token belongs with a small number of bits, our method can reduce the loss incurred when decreasing the size of each codebook group. This enables a larger total number of codebooks under a lower overall bpp, thereby enhancing the expressive capability and improving reconstruction performance. Owing to its generalizable design, our method can be integrated into any existing codebook-based representation learning approach and has demonstrated its effectiveness on face recognition datasets, achieving an average accuracy of 93.51% for reconstructed images at 0.05 bpp.
comment: NeurIPS 2025 accepted
♻ ☆ Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation
Referring video object segmentation (RVOS) aims to identify, track and segment the objects in a video based on language descriptions, which has received great attention in recent years. However, existing datasets remain focus on short video clips within several seconds, with salient objects visible in most frames. To advance the task towards more practical scenarios, we introduce \textbf{Long-RVOS}, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos of an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance-reappearance and shot changing. The objects are manually annotated with three different types of descriptions to individually evaluate the understanding of static attributes, motion patterns and spatiotemporal relationships. Moreover, unlike previous benchmarks that rely solely on the per-frame spatial evaluation, we introduce two new metrics to assess the temporal and spatiotemporal consistency. We benchmark 6 state-of-the-art methods on Long-RVOS. The results show that current approaches struggle severely with the long-video challenges. To address this, we further propose ReferMo, a promising baseline method that integrates motion information to expand the temporal receptive field, and employs a local-to-global architecture to capture both short-term dynamics and long-term dependencies. Despite simplicity, ReferMo achieves significant improvements over current methods in long-term scenarios. We hope that Long-RVOS and our baseline can drive future RVOS research towards tackling more realistic and long-form videos.
comment: Project Page: \url{https://isee-laboratory.github.io/Long-RVOS}
♻ ☆ GS4: Generalizable Sparse Splatting Semantic SLAM
Traditional SLAM algorithms excel at camera tracking, but typically produce incomplete and low-resolution maps that are not tightly integrated with semantics prediction. Recent work integrates Gaussian Splatting (GS) into SLAM to enable dense, photorealistic 3D mapping, yet existing GS-based SLAM methods require per-scene optimization that is slow and consumes an excessive number of Gaussians. We present GS4, the first generalizable GS-based semantic SLAM system. Compared with prior approaches, GS4 runs 10x faster, uses 10x fewer Gaussians, and achieves state-of-the-art performance across color, depth, semantic mapping and camera tracking. From an RGB-D video stream, GS4 incrementally builds and updates a set of 3D Gaussians using a feed-forward network. First, the Gaussian Prediction Model estimates a sparse set of Gaussian parameters from input frame, which integrates both color and semantic prediction with the same backbone. Then, the Gaussian Refinement Network merges new Gaussians with the existing set while avoiding redundancy. Finally, we propose to optimize GS for only 1-5 iterations that corrects drift and floaters when significant pose changes are detected. Experiments on the real-world ScanNet and ScanNet++ benchmarks demonstrate state-of-the-art semantic SLAM performance, with strong generalization capability shown through zero-shot transfer to the NYUv2 and TUM RGB-D datasets.
comment: 17 pages, 6 figures
♻ ☆ Bridging the gap to real-world language-grounded visual concept learning
Human intelligence effortlessly interprets visual scenes along a rich spectrum of semantic dimensions. However, existing approaches to language-grounded visual concept learning are limited to a few predefined primitive axes, such as color and shape, and are typically explored in synthetic datasets. In this work, we propose a scalable framework that adaptively identifies image-related concept axes and grounds visual concepts along these axes in real-world scenes. Leveraging a pretrained vision-language model and our universal prompting strategy, our framework identifies a diverse image-related axes without any prior knowledge. Our universal concept encoder adaptively binds visual features to the discovered axes without introducing additional model parameters for each concept. To ground visual concepts along the discovered axes, we optimize a compositional anchoring objective, which ensures that each axis can be independently manipulated without affecting others. We demonstrate the effectiveness of our framework on subsets of ImageNet, CelebA-HQ, and AFHQ, showcasing superior editing capabilities across diverse real-world concepts that are too varied to be manually predefined. Our method also exhibits strong compositional generalization, outperforming existing visual concept learning and text-based editing methods. The code is available at https://github.com/whieya/Language-grounded-VCL.
♻ ☆ CalFuse: Multi-Modal Continual Learning via Feature Calibration and Parameter Fusion
With the proliferation of multi-modal data in large-scale visual recognition systems, enabling models to continuously acquire knowledge from evolving data streams while preserving prior information has become increasingly critical. Class-Continual Learning (CCL) addresses this challenge by incrementally incorporating new class knowledge without revisiting historical data, making it essential for real-world big data applications. While traditional CCL methods rely solely on visual features, recent advances in Vision-Language Models (VLMs) such as CLIP demonstrate significant potential for CCL by leveraging pre-trained multi-modal knowledge. However, existing approaches face challenges in mitigating catastrophic forgetting while maintaining the cross-modal generalization capabilities of VLMs. To address these limitations, we propose CalFuse, a framework that synergizes feature Calibration with parameter Fusion to enable effective multi-modal knowledge integration in continual learning scenarios. CalFuse introduces a dynamic feature calibration mechanism that adaptively balances original CLIP visual representations with task-specific features, preserving the model's intrinsic cross-modal generalization while adapting to new classes. Concurrently, a QR decomposition-based parameter fusion strategy progressively integrates newly acquired knowledge with historical task parameters, maintaining equilibrium between learning new class representations and retaining prior knowledge across sequential tasks. Extensive experiments on benchmark datasets validate the effectiveness of our approach in large-scale multi-modal continual learning settings, demonstrating superior performance over state-of-the-art methods in both average accuracy and final task retention.
♻ ☆ MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning
Remote sensing images contain complex spatial patterns and semantic structures, which makes the captioning model difficult to accurately describe. Encoder-decoder architectures have become the widely used approach for RSIC by translating visual content into descriptive text. However, many existing methods rely on a single-stream architecture, which weakens the model to accurately describe the image. Such single-stream architectures typically struggle to extract diverse spatial features or capture complex semantic relationships, limiting their effectiveness in scenes with high intraclass similarity or contextual ambiguity. In this work, we propose a novel Multi-stream Encoder-decoder Framework (MsEdF) which improves the performance of RSIC by optimizing both the spatial representation and language generation of encoder-decoder architecture. The encoder fuses information from two complementary image encoders, thereby promoting feature diversity through the integration of multiscale and structurally distinct cues. To improve the capture of context-aware descriptions, we refine the input sequence's semantic modeling on the decoder side using a stacked GRU architecture with an element-wise aggregation scheme. Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.
♻ ☆ MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs
Video large language models (Video-LLMs) have made significant progress in understanding videos. However, processing multiple frames leads to lengthy visual token sequences, presenting challenges such as the limited context length cannot accommodate the entire video, and the inclusion of irrelevant frames hinders visual perception. Hence, effective frame selection is crucial. This paper emphasizes that frame selection should follow three key principles: query relevance, list-wise diversity, and sequentiality. Existing methods, such as uniform frame sampling and query-frame matching, do not capture all of these principles. Thus, we propose Markov decision determinantal point process with dynamic programming (MDP3) for frame selection, a training-free and model-agnostic method that can be seamlessly integrated into existing Video-LLMs. Our method first estimates frame similarities conditioned on the query using a conditional Gaussian kernel within the reproducing kernel Hilbert space~(RKHS). We then apply the determinantal point process~(DPP) to the similarity matrix to capture both query relevance and list-wise diversity. To incorporate sequentiality, we segment the video and apply DPP within each segment, conditioned on the preceding segment selection, modeled as a Markov decision process~(MDP) for allocating selection sizes across segments. Theoretically, MDP3 provides a \((1 - 1/e)\)-approximate solution to the NP-hard list-wise frame selection problem with pseudo-polynomial time complexity, demonstrating its efficiency. Empirically, MDP3 significantly outperforms existing methods, verifying its effectiveness and robustness.
comment: 26 pages, 14 figures
♻ ☆ IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model's capacity and limiting adaptability to downstream tasks. In this paper, we propose InstanceGrounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering through only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations with a novel data curation pipeline.
comment: https://github.com/lifuguan/IGGT_official
♻ ☆ VSA: Faster Video Diffusion with Trainable Sparse Attention
Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at \emph{both} training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight \emph{critical tokens}; a fine stage computes token-level attention only inside those tiles subjecting to block computing layout to ensure hard efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85\% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53$\times$ with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6$\times$ and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models. Code will be available at https://github.com/hao-ai-lab/FastVideo.
comment: Accepted by Neurips 2025
♻ ☆ Riemannian-Geometric Fingerprints of Generative Models ICCV 2025
Recent breakthroughs and rapid integration of generative models (GMs) have sparked interest in the problem of model attribution and their fingerprints. For instance, service providers need reliable methods of authenticating their models to protect their IP, while users and law enforcement seek to verify the source of generated content for accountability and trust. In addition, a growing threat of model collapse is arising, as more model-generated data are being fed back into sources (e.g., YouTube) that are often harvested for training ("regurgitative training"), heightening the need to differentiate synthetic from human data. Yet, a gap still exists in understanding generative models' fingerprints, we believe, stemming from the lack of a formal framework that can define, represent, and analyze the fingerprints in a principled way. To address this gap, we take a geometric approach and propose a new definition of artifact and fingerprint of GMs using Riemannian geometry, which allows us to leverage the rich theory of differential geometry. Our new definition generalizes previous work (Song et al., 2024) to non-Euclidean manifolds by learning Riemannian metrics from data and replacing the Euclidean distances and nearest-neighbor search with geodesic distances and kNN-based Riemannian center of mass. We apply our theory to a new gradient-based algorithm for computing the fingerprints in practice. Results show that it is more effective in distinguishing a large array of GMs, spanning across 4 different datasets in 2 different resolutions (64 by 64, 256 by 256), 27 model architectures, and 2 modalities (Vision, Vision-Language). Using our proposed definition significantly improves the performance on model attribution, as well as a generalization to unseen datasets, model types, and modalities, suggesting its practical efficacy.
comment: ICCV 2025 Highlight paper
♻ ☆ InstanceAssemble: Layout-Aware Image Generation via Instance Assembling Attention NeurIPS 2025
Diffusion models have demonstrated remarkable capabilities in generating high-quality images. Recent advancements in Layout-to-Image (L2I) generation have leveraged positional conditions and textual descriptions to facilitate precise and controllable image synthesis. Despite overall progress, current L2I methods still exhibit suboptimal performance. Therefore, we propose InstanceAssemble, a novel architecture that incorporates layout conditions via instance-assembling attention, enabling position control with bounding boxes (bbox) and multimodal content control including texts and additional visual content. Our method achieves flexible adaption to existing DiT-based T2I models through light-weighted LoRA modules. Additionally, we propose a Layout-to-Image benchmark, Denselayout, a comprehensive benchmark for layout-to-image generation, containing 5k images with 90k instances in total. We further introduce Layout Grounding Score (LGS), an interpretable evaluation metric to more precisely assess the accuracy of L2I generation. Experiments demonstrate that our InstanceAssemble method achieves state-of-the-art performance under complex layout conditions, while exhibiting strong compatibility with diverse style LoRA modules. The code and pretrained models are publicly available at https://github.com/FireRedTeam/InstanceAssemble.
comment: Accepted in NeurIPS 2025
♻ ☆ MoPFormer: Motion-Primitive Transformer for Wearable-Sensor Activity Recognition NeurIPS 2025
Human Activity Recognition (HAR) with wearable sensors is challenged by limited interpretability, which significantly impacts cross-dataset generalization. To address this challenge, we propose Motion-Primitive Transformer (MoPFormer), a novel self-supervised framework that enhances interpretability by tokenizing inertial measurement unit signals into semantically meaningful motion primitives and leverages a Transformer architecture to learn rich temporal representations. MoPFormer comprises two stages. The first stage is to partition multi-channel sensor streams into short segments and quantize them into discrete ``motion primitive'' codewords, while the second stage enriches those tokenized sequences through a context-aware embedding module and then processes them with a Transformer encoder. The proposed MoPFormer can be pre-trained using a masked motion-modeling objective that reconstructs missing primitives, enabling it to develop robust representations across diverse sensor configurations. Experiments on six HAR benchmarks demonstrate that MoPFormer not only outperforms state-of-the-art methods but also successfully generalizes across multiple datasets. More importantly, the learned motion primitives significantly enhance both interpretability and cross-dataset performance by capturing fundamental movement patterns that remain consistent across similar activities, regardless of dataset origin.
comment: Accepted by NeurIPS 2025
♻ ☆ Learning to See and Act: Task-Aware View Planning for Robotic Manipulation
Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-Aware View Planning (TAVP), a framework designed to overcome these challenges by integrating active view planning with task-specific representation learning. TAVP employs an efficient exploration policy, accelerated by a novel pseudo-environment, to actively acquire informative views. Furthermore, we introduce a Mixture-of-Experts (MoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization. By learning to see the world in a task-aware way, TAVP generates more complete and discriminative visual representations, demonstrating significantly enhanced action prediction across a wide array of manipulation challenges. Extensive experiments on RLBench tasks show that our proposed TAVP model achieves superior performance over state-of-the-art fixed-view approaches. Visual results and code are provided at: https://hcplab-sysu.github.io/TAVP.
comment: 14 pages, 8 figures, project page: https://hcplab-sysu.github.io/TAVP
♻ ☆ Unveiling Concept Attribution in Diffusion Models
Diffusion models have shown remarkable abilities in generating realistic and high-quality images from text prompts. However, a trained model remains largely black-box; little do we know about the roles of its components in exhibiting a concept such as objects or styles. Recent works employ causal tracing to localize knowledge-storing layers in generative models without showing how other layers contribute to the target concept. In this work, we approach diffusion models' interpretability problem from a more general perspective and pose a question: \textit{``How do model components work jointly to demonstrate knowledge?''}. To answer this question, we decompose diffusion models using component attribution, systematically unveiling the importance of each component (specifically the model parameter) in generating a concept. The proposed framework, called \textbf{C}omponent \textbf{A}ttribution for \textbf{D}iffusion Model (CAD), discovers the localization of concept-inducing (positive) components, while interestingly uncovers another type of components that contribute negatively to generating a concept, which is missing in the previous knowledge localization work. Based on this holistic understanding of diffusion models, we introduce two fast, inference-time model editing algorithms, CAD-Erase and CAD-Amplify; in particular, CAD-Erase enables erasure and CAD-Amplify allows amplification of a generated concept by ablating the positive and negative components, respectively, while retaining knowledge of other concepts. Extensive experimental results validate the significance of both positive and negative components pinpointed by our framework, demonstrating the potential of providing a complete view of interpreting generative models. Our code is available \href{https://github.com/mail-research/CAD-attribution4diffusion}{here}.
♻ ☆ TraceTrans: Translation and Spatial Tracing for Surgical Prediction
Image-to-image translation models have achieved notable success in converting images across visual domains and are increasingly used for medical tasks such as predicting post-operative outcomes and modeling disease progression. However, most existing methods primarily aim to match the target distribution and often neglect spatial correspondences between the source and translated images. This limitation can lead to structural inconsistencies and hallucinations, undermining the reliability and interpretability of the predictions. These challenges are accentuated in clinical applications by the stringent requirement for anatomical accuracy. In this work, we present TraceTrans, a novel deformable image translation model designed for post-operative prediction that generates images aligned with the target distribution while explicitly revealing spatial correspondences with the pre-operative input. The framework employs an encoder for feature extraction and dual decoders for predicting spatial deformations and synthesizing the translated image. The predicted deformation field imposes spatial constraints on the generated output, ensuring anatomical consistency with the source. Extensive experiments on medical cosmetology and brain MRI datasets demonstrate that TraceTrans delivers accurate and interpretable post-operative predictions, highlighting its potential for reliable clinical deployment.
♻ ☆ From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes NeurIPS 2025
3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,886 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, individual objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require a more comprehensive spatial reasoning ability, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best-performing models, Google Gemini-2.5-Pro and OpenAI o3, achieve just around 30% accuracy on space-level tasks and around 40% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models' capacity to understand and reason about 3D scenes beyond object-level semantics.
comment: Update v3 of the NeurIPS 2025 Datasets and Benchmarks paper (v2), including additional evaluations of state-of-the-art multimodal large language models. Project page: https://anywhere-3d.github.io/
♻ ☆ Unsupervised Monocular Depth Estimation Based on Hierarchical Feature-Guided Diffusion
Unsupervised monocular depth estimation has received widespread attention because of its capability to train without ground truth. In real-world scenarios, the images may be blurry or noisy due to the influence of weather conditions and inherent limitations of the camera. Therefore, it is particularly important to develop a robust depth estimation model. Benefiting from the training strategies of generative networks, generative-based methods often exhibit enhanced robustness. In light of this, we employ a well-converging diffusion model among generative networks for unsupervised monocular depth estimation. Additionally, we propose a hierarchical feature-guided denoising module. This model significantly enriches the model's capacity for learning and interpreting depth distribution by fully leveraging image features to guide the denoising process. Furthermore, we explore the implicit depth within reprojection and design an implicit depth consistency loss. This loss function serves to enhance the performance of the model and ensure the scale consistency of depth within a video sequence. We conduct experiments on the KITTI, Make3D, and our self-collected SIMIT datasets. The results indicate that our approach stands out among generative-based models, while also showcasing remarkable robustness.
♻ ☆ Navigation with VLM framework: Towards Going to Any Language
Navigating towards fully open language goals and exploring open scenes in an intelligent way have always raised significant challenges. Recently, Vision Language Models (VLMs) have demonstrated remarkable capabilities to reason with both language and visual data. Although many works have focused on leveraging VLMs for navigation in open scenes, they often require high computational cost, rely on object-centric approaches, or depend on environmental priors in detailed human instructions. We introduce Navigation with VLM (NavVLM), a training-free framework that harnesses open-source VLMs to enable robots to navigate effectively, even for human-friendly language goal such as abstract places, actions, or specific objects in open scenes. NavVLM leverages the VLM as its cognitive core to perceive environmental information and constantly provides exploration guidance achieving intelligent navigation with only a neat target rather than a detailed instruction with environment prior. We evaluated and validated NavVLM in both simulation and real-world experiments. In simulation, our framework achieves state-of-the-art performance in Success weighted by Path Length (SPL) on object-specifc tasks in richly detailed environments from Matterport 3D (MP3D), Habitat Matterport 3D (HM3D) and Gibson. With navigation episode reported, NavVLM demonstrates the capabilities to navigate towards any open-set languages. In real-world validation, we validated our framework's effectiveness in real-world robot at indoor scene.
comment: under review
♻ ☆ GaussianFusion: Gaussian-Based Multi-Sensor Fusion for End-to-End Autonomous Driving NeurIPS2025
Multi-sensor fusion is crucial for improving the performance and robustness of end-to-end autonomous driving systems. Existing methods predominantly adopt either attention-based flatten fusion or bird's eye view fusion through geometric transformations. However, these approaches often suffer from limited interpretability or dense computational overhead. In this paper, we introduce GaussianFusion, a Gaussian-based multi-sensor fusion framework for end-to-end autonomous driving. Our method employs intuitive and compact Gaussian representations as intermediate carriers to aggregate information from diverse sensors. Specifically, we initialize a set of 2D Gaussians uniformly across the driving scene, where each Gaussian is parameterized by physical attributes and equipped with explicit and implicit features. These Gaussians are progressively refined by integrating multi-modal features. The explicit features capture rich semantic and spatial information about the traffic scene, while the implicit features provide complementary cues beneficial for trajectory planning. To fully exploit rich spatial and semantic information in Gaussians, we design a cascade planning head that iteratively refines trajectory predictions through interactions with Gaussians. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate the effectiveness and robustness of the proposed GaussianFusion framework. The source code will be released at https://github.com/Say2L/GaussianFusion.
comment: Accepted at NeurIPS2025 (Spotlight)
♻ ☆ PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection
Multimodal large language models (MLLMs) have achieved remarkable progress on vision-language tasks, yet their reasoning processes remain sometimes unreliable. We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
♻ ☆ Radar and Event Camera Fusion for Agile Robot Ego-Motion Estimation
Achieving reliable ego motion estimation for agile robots, e.g., aerobatic aircraft, remains challenging because most robot sensors fail to respond timely and clearly to highly dynamic robot motions, often resulting in measurement blurring, distortion, and delays. In this paper, we propose an IMU-free and feature-association-free framework to achieve aggressive ego-motion velocity estimation of a robot platform in highly dynamic scenarios by combining two types of exteroceptive sensors, an event camera and a millimeter wave radar, First, we used instantaneous raw events and Doppler measurements to derive rotational and translational velocities directly. Without a sophisticated association process between measurement frames, the proposed method is more robust in texture-less and structureless environments and is more computationally efficient for edge computing devices. Then, in the back-end, we propose a continuous-time state-space model to fuse the hybrid time-based and event-based measurements to estimate the ego-motion velocity in a fixed-lagged smoother fashion. In the end, we validate our velometer framework extensively in self-collected experiment datasets. The results indicate that our IMU-free and association-free ego motion estimation framework can achieve reliable and efficient velocity output in challenging environments. The source code, illustrative video and dataset are available at https://github.com/ZzhYgwh/TwistEstimator.
comment: 2025.10.28 version v2 for TwistEstimator
♻ ☆ Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model IROS 2025
Omnidirectional depth perception is essential for mobile robotics applications that require scene understanding across a full 360{\deg} field of view. Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps without relying on expensive active sensing. However, existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, due to the scarcity of real-world data. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture. We introduce a dedicated two-stage training strategy to utilize the relative monocular depth features for our omnidirectional stereo matching before scale-invariant fine-tuning. DFI-OmniStereo achieves state-of-the-art results on the real-world Helvipad dataset, reducing disparity MAE by approximately 16% compared to the previous best omnidirectional stereo method.
comment: Accepted at IROS 2025. Project page: https://vita-epfl.github.io/DFI-OmniStereo-website/
♻ ☆ CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data
True intelligence hinges on the ability to uncover and leverage hidden causal relations. Despite significant progress in AI and computer vision (CV), there remains a lack of benchmarks for assessing models' abilities to infer latent causality from complex visual data. In this paper, we introduce \textsc{\textbf{Causal3D}}, a novel and comprehensive benchmark that integrates structured data (tables) with corresponding visual representations (images) to evaluate causal reasoning. Designed within a systematic framework, Causal3D comprises 19 3D-scene datasets capturing diverse causal relations, views, and backgrounds, enabling evaluations across scenes of varying complexity. We assess multiple state-of-the-art methods, including classical causal discovery, causal representation learning, and large/vision-language models (LLMs/VLMs). Our experiments show that as causal structures grow more complex without prior knowledge, performance declines significantly, highlighting the challenges even advanced methods face in complex causal scenarios. Causal3D serves as a vital resource for advancing causal reasoning in CV and fostering trustworthy AI in critical domains.
♻ ☆ GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation
Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning$\unicode{x2014}$a prerequisite for many applications. Empirically, we find that a dataset produced by a current training data generation pipeline has a 57.6% human validation rate. These rates stem from current limitations: single-image 3D reconstruction introduces cascading modeling errors and requires wide answer tolerances, while caption-based methods require hyper-detailed annotations and suffer from generative hallucinations. We present GRAID, built on the key insight that qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. By operating exclusively on 2D bounding boxes from standard object detectors, GRAID avoids both 3D reconstruction errors and generative hallucinations, resulting in datasets that are of higher quality than existing tools that produce similar datasets as validated by human evaluations. We apply our framework to the BDD100k, NuImages, and Waymo datasets, generating over 8.5 million high-quality VQA pairs creating questions spanning spatial relations, counting, ranking, and size comparisons. We evaluate one of the datasets and find it achieves 91.16% human-validated accuracy$\unicode{x2014}$compared to 57.6% on a dataset generated by recent work. Critically, we demonstrate that when trained on GRAID data, models learn spatial reasoning concepts that generalize: models fine-tuned on 6 question types improve on over 10 held-out types, with accuracy gains of 47.5% on BDD and 37.9% on NuImages for Llama 3.2B 11B, and when trained on all questions types, achieve improvements on several existing benchmarks such as BLINK. The GRAID framework, datasets, and additional information can be found $\href{this https URL}{here}$.
comment: 22 pages, 3 figures, 3 tables, project page: https://ke7.github.io/graid/
♻ ☆ MambaCAFU: Hybrid Multi-Scale and Multi-Attention Model with Mamba-Based Fusion for Medical Image Segmentation
In recent years, deep learning has shown near-expert performance in segmenting complex medical tissues and tumors. However, existing models are often task-specific, with performance varying across modalities and anatomical regions. Balancing model complexity and performance remains challenging, particularly in clinical settings where both accuracy and efficiency are critical. To address these issues, we propose a hybrid segmentation architecture featuring a three-branch encoder that integrates CNNs, Transformers, and a Mamba-based Attention Fusion (MAF) mechanism to capture local, global, and long-range dependencies. A multi-scale attention-based CNN decoder reconstructs fine-grained segmentation maps while preserving contextual consistency. Additionally, a co-attention gate enhances feature selection by emphasizing relevant spatial and semantic information across scales during both encoding and decoding, improving feature interaction and cross-scale communication. Extensive experiments on multiple benchmark datasets show that our approach outperforms state-of-the-art methods in accuracy and generalization, while maintaining comparable computational complexity. By effectively balancing efficiency and effectiveness, our architecture offers a practical and scalable solution for diverse medical imaging tasks. Source code and trained models will be publicly released upon acceptance to support reproducibility and further research.
♻ ☆ Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models
Physical reasoning remains a significant challenge for Vision-Language Models (VLMs). This limitation arises from an inability to translate learned knowledge into predictions about physical behavior. Although continual fine-tuning can mitigate this issue, it is expensive for large models and impractical to perform repeatedly for every task. This necessitates the creation of modular and scalable ways to teach VLMs about physical reasoning. To that end, we introduce Physics Context Builders (PCBs), a modular framework where specialized smaller VLMs are fine-tuned to generate detailed physical scene descriptions. These can be used as physical contexts to enhance the reasoning capabilities of larger VLMs. PCBs enable the separation of visual perception from reasoning, allowing us to analyze their relative contributions to physical understanding. We perform experiments on CLEVRER and on Falling Tower, a stability detection dataset with both simulated and real-world scenes, to demonstrate that PCBs provide substantial performance improvements, increasing average accuracy by up to 13.8% on complex physical reasoning tasks. Notably, PCBs also show strong Sim2Real transfer, successfully generalizing from simulated training data to real-world scenes.
♻ ☆ SignMouth: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion
Sign language translation (SLT) aims to translate natural language from sign language videos, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches mainly focus on manual signals (hand gestures) and tend to overlook non-manual cues like mouthing. In fact, mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignClip, a novel framework to improve the accuracy of sign language translation. It fuses manual and non-manual cues, specifically spatial gesture and lip movement features. Besides, SignClip introduces a hierarchical contrastive learning framework with multi-level alignment objectives, ensuring semantic consistency across sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T, in the Gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71, and ROUGE from 46.57 to 48.38.
♻ ☆ Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape
Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference.
comment: author comment: This version was previously removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission. The authors have now obtained the necessary permissions, and the paper is resubmitted accordingly
♻ ☆ Hyperparameters in Continual Learning: A Reality Check
Continual learning (CL) aims to train a model on a sequence of tasks (i.e., a CL scenario) while balancing the trade-off between plasticity (learning new tasks) and stability (retaining prior knowledge). The dominantly adopted conventional evaluation protocol for CL algorithms selects the best hyperparameters (e.g., learning rate, mini-batch size, regularization strengths, etc.) within a given scenario and then evaluates the algorithms using these hyperparameters in the same scenario. However, this protocol has significant shortcomings: it overestimates the CL capacity of algorithms and relies on unrealistic hyperparameter tuning, which is not feasible for real-world applications. From the fundamental principles of evaluation in machine learning, we argue that the evaluation of CL algorithms should focus on assessing the generalizability of their CL capacity to unseen scenarios. Based on this, we propose the Generalizable Two-phase Evaluation Protocol (GTEP) consisting of hyperparameter tuning and evaluation phases. Both phases share the same scenario configuration (e.g., number of tasks) but are generated from different datasets. Hyperparameters of CL algorithms are tuned in the first phase and applied in the second phase to evaluate the algorithms. We apply this protocol to class-incremental learning, both with and without pretrained models. Across more than 8,000 experiments, our results show that most state-of-the-art algorithms fail to replicate their reported performance, highlighting that their CL capacity has been significantly overestimated in the conventional evaluation protocol. Our implementation can be found in https://github.com/csm9493/GTEP.
comment: TMLR 2025 camera ready version
♻ ☆ Cyst-X: A Federated AI System Outperforms Clinical Guidelines to Detect Pancreatic Cancer Precursors and Reduce Unnecessary Surgery
Pancreatic cancer is projected to be the second-deadliest cancer by 2030, making early detection critical. Intraductal papillary mucinous neoplasms (IPMNs), key cancer precursors, present a clinical dilemma, as current guidelines struggle to stratify malignancy risk, leading to unnecessary surgeries or missed diagnoses. Here, we developed Cyst-X, an AI framework for IPMN risk prediction trained on a unique, multi-center dataset of 1,461 MRI scans from 764 patients. Cyst-X achieves significantly higher accuracy (AUC = 0.82) than both the established Kyoto guidelines (AUC = 0.75) and expert radiologists, particularly in correct identification of high-risk lesions. Clinically, this translates to a 20% increase in cancer detection sensitivity (87.8% vs. 64.1%) for high-risk lesions. We demonstrate that this performance is maintained in a federated learning setting, allowing for collaborative model training without compromising patient privacy. To accelerate research in early pancreatic cancer detection, we publicly release the Cyst-X dataset and models, providing the first large-scale, multi-center MRI resource for pancreatic cyst analysis.
♻ ☆ PSScreen V2: Partially Supervised Multiple Retinal Disease Screening
In this work, we propose PSScreen V2, a partially supervised self-training framework for multiple retinal disease screening. Unlike previous methods that rely on fully labelled or single-domain datasets, PSScreen V2 is designed to learn from multiple partially labelled datasets with different distributions, addressing both label absence and domain shift challenges. To this end, PSScreen V2 adopts a three-branch architecture with one teacher and two student networks. The teacher branch generates pseudo labels from weakly augmented images to address missing labels, while the two student branches introduce novel feature augmentation strategies: Low-Frequency Dropout (LF-Dropout), which enhances domain robustness by randomly discarding domain-related low-frequency components, and Low-Frequency Uncertainty (LF-Uncert), which estimates uncertain domain variability via adversarially learned Gaussian perturbations of low-frequency statistics. Extensive experiments on multiple in-domain and out-of-domain fundus datasets demonstrate that PSScreen V2 achieves state-of-the-art performance and superior domain generalization ability. Furthermore, compatibility tests with diverse backbones, including the vision foundation model DINOv2, as well as evaluations on chest X-ray datasets, highlight the universality and adaptability of the proposed framework. The codes are available at https://github.com/boyiZheng99/PSScreen_V2.
♻ ☆ Explicitly Modeling Subcortical Vision with a Neuro-Inspired Front-End Improves CNN Robustness
Convolutional neural networks (CNNs) trained on object recognition achieve high task performance but continue to exhibit vulnerability under a range of visual perturbations and out-of-domain images, when compared with biological vision. Prior work has demonstrated that coupling a standard CNN with a front-end (VOneBlock) that mimics the primate primary visual cortex (V1) can improve overall model robustness. Expanding on this, we introduce Early Vision Networks (EVNets), a new class of hybrid CNNs that combine the VOneBlock with a novel SubcorticalBlock, whose architecture draws from computational models in neuroscience and is parameterized to maximize alignment with subcortical responses reported across multiple experimental studies. Without being optimized to do so, the assembly of the SubcorticalBlock with the VOneBlock improved V1 alignment across most standard V1 benchmarks, and better modeled extra-classical receptive field phenomena. In addition, EVNets exhibit stronger emergent shape bias and outperform the base CNN architecture by 9.3% on an aggregate benchmark of robustness evaluations, including adversarial perturbations, common corruptions, and domain shifts. Finally, we show that EVNets can be further improved when paired with a state-of-the-art data augmentation technique, surpassing the performance of the isolated data augmentation approach by 6.2% on our robustness benchmark. This result reveals complementary benefits between changes in architecture to better mimic biology and training-based machine learning approaches.
♻ ☆ VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
Immediate damage assessment is essential after natural catastrophes; yet, conventional hand evaluation techniques are sluggish and perilous. Although satellite and unmanned aerial vehicle (UAV) photos offer extensive perspectives of impacted regions, current computer vision methodologies generally yield just classification labels or segmentation masks, so constraining their capacity to deliver a thorough situational comprehension. We introduce the Vision Language Caption Enhancer (VLCE), a multimodal system designed to produce comprehensive, contextually-informed explanations of disaster imagery. VLCE employs a dual-architecture approach: a CNN-LSTM model with a ResNet50 backbone pretrained on EuroSat satellite imagery for the xBD dataset, and a Vision Transformer (ViT) model pretrained on UAV pictures for the RescueNet dataset. Both systems utilize external semantic knowledge from ConceptNet and WordNet to expand vocabulary coverage and improve description accuracy. We assess VLCE in comparison to leading vision-language models (LLaVA and QwenVL) utilizing CLIPScore for semantic alignment and InfoMetIC for caption informativeness. Experimental findings indicate that VLCE markedly surpasses baseline models, attaining a maximum of 95.33% on InfoMetIC while preserving competitive semantic alignment. Our dual-architecture system demonstrates significant potential for improving disaster damage assessment by automating the production of actionable, information-dense descriptions from satellite and drone photos.
comment: 29 pages, 40 figures, 3 algorithms
♻ ☆ Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
comment: Version v1.1
Machine Learning 150
☆ Generative View Stitching
Autoregressive video diffusion models are capable of long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce Omni Guidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersv\"ard's Impossible Staircase. Results are best viewed as videos at https://andrewsonga.github.io/gvs.
comment: Project website: https://andrewsonga.github.io/gvs
☆ A Single-Loop First-Order Algorithm for Linearly Constrained Bilevel Optimization NeurIPS 2025
We study bilevel optimization problems where the lower-level problems are strongly convex and have coupled linear constraints. To overcome the potential non-smoothness of the hyper-objective and the computational challenges associated with the Hessian matrix, we utilize penalty and augmented Lagrangian methods to reformulate the original problem as a single-level one. Especially, we establish a strong theoretical connection between the reformulated function and the original hyper-objective by characterizing the closeness of their values and derivatives. Based on this reformulation, we propose a single-loop, first-order algorithm for linearly constrained bilevel optimization (SFLCB). We provide rigorous analyses of its non-asymptotic convergence rates, showing an improvement over prior double-loop algorithms -- form $O(\epsilon^{-3}\log(\epsilon^{-1}))$ to $O(\epsilon^{-3})$. The experiments corroborate our theoretical findings and demonstrate the practical efficiency of the proposed SFLCB algorithm. Simulation code is provided at https://github.com/ShenGroup/SFLCB.
comment: NeurIPS 2025
☆ Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers? NeurIPS 2025
Object binding, the brain's ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term IsSameObject. We decode IsSameObject from patch embeddings across ViT layers using a similarity probe, which reaches over 90% accuracy. Crucially, this object-binding capability emerges reliably in self-supervised ViTs (DINO, MAE, CLIP), but markedly weaker in ImageNet-supervised models, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. Our findings challenge the view that ViTs lack object binding and highlight how symbolic knowledge of "which parts belong together" emerges naturally in a connectionist system.
comment: Accepted as a Spotlight at NeurIPS 2025
☆ Tongyi DeepResearch Technical Report
We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
comment: https://tongyi-agent.github.io/blog
☆ Greedy Sampling Is Provably Efficient for RLHF NeurIPS 2025
Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for post-training large language models. Despite its empirical success, the theoretical understanding of RLHF is still limited, as learning the KL-regularized target with only preference feedback poses additional challenges compared with canonical RL. Existing works mostly study the reward-based Bradley-Terry (BT) preference model, and extend classical designs utilizing optimism or pessimism. This work, instead, considers the general preference model (whose practical relevance has been observed recently) and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates (i.e., greedy sampling), as opposed to constructing optimistic or pessimistic estimates in previous works. This insight has a deep root in the unique structural property of the optimal policy class under the KL-regularized target, and we further specialize it to the BT model, highlighting the surprising sufficiency of greedy sampling in RLHF.
comment: NeurIPS 2025
☆ AgentFold: Long-Horizon Web Agents with Proactive Context Management
LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management. Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories, while methods that fixedly summarize the full history at each step risk the irreversible loss of critical details. Addressing these, we introduce AgentFold, a novel agent paradigm centered on proactive context management, inspired by the human cognitive process of retrospective consolidation. AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled. At each step, it learns to execute a `folding' operation, which manages its historical trajectory at multiple scales: it can perform granular condensations to preserve vital, fine-grained details, or deep consolidations to abstract away entire multi-step sub-tasks. The results on prominent benchmarks are striking: with simple supervised fine-tuning (without continual pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or matches open-source models of a dramatically larger scale, such as the DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like OpenAI's o4-mini.
comment: 26 pages, 9 figures
☆ Learning to Drive Safely with Hybrid Options
Out of the many deep reinforcement learning approaches for autonomous driving, only few make use of the options (or skills) framework. That is surprising, as this framework is naturally suited for hierarchical control applications in general, and autonomous driving tasks in specific. Therefore, in this work the options framework is applied and tailored to autonomous driving tasks on highways. More specifically, we define dedicated options for longitudinal and lateral manoeuvres with embedded safety and comfort constraints. This way, prior domain knowledge can be incorporated into the learning process and the learned driving behaviour can be constrained more easily. We propose several setups for hierarchical control with options and derive practical algorithms following state-of-the-art reinforcement learning techniques. By separately selecting actions for longitudinal and lateral control, the introduced policies over combined and hybrid options obtain the same expressiveness and flexibility that human drivers have, while being easier to interpret than classical policies over continuous actions. Of all the investigated approaches, these flexible policies over hybrid options perform the best under varying traffic conditions, outperforming the baseline policies over actions.
☆ Eigenfunction Extraction for Ordered Representation Learning
Recent advances in representation learning reveal that widely used objectives, such as contrastive and non-contrastive, implicitly perform spectral decomposition of a contextual kernel, induced by the relationship between inputs and their contexts. Yet, these methods recover only the linear span of top eigenfunctions of the kernel, whereas exact spectral decomposition is essential for understanding feature ordering and importance. In this work, we propose a general framework to extract ordered and identifiable eigenfunctions, based on modular building blocks designed to satisfy key desiderata, including compatibility with the contextual kernel and scalability to modern settings. We then show how two main methodological paradigms, low-rank approximation and Rayleigh quotient optimization, align with this framework for eigenfunction extraction. Finally, we validate our approach on synthetic kernels and demonstrate on real-world image datasets that the recovered eigenvalues act as effective importance scores for feature selection, enabling principled efficiency-accuracy tradeoffs via adaptive-dimensional representations.
☆ Pearl: A Foundation Model for Placing Every Atom in the Right Location
Accurately predicting the three-dimensional structures of protein-ligand complexes remains a fundamental challenge in computational drug discovery that limits the pace and success of therapeutic design. Deep learning methods have recently shown strong potential as structural prediction tools, achieving promising accuracy across diverse biomolecular systems. However, their performance and utility are constrained by scarce experimental data, inefficient architectures, physically invalid poses, and the limited ability to exploit auxiliary information available at inference. To address these issues, we introduce Pearl (Placing Every Atom in the Right Location), a foundation model for protein-ligand cofolding at scale. Pearl addresses these challenges with three key innovations: (1) training recipes that include large-scale synthetic data to overcome data scarcity; (2) architectures that incorporate an SO(3)-equivariant diffusion module to inherently respect 3D rotational symmetries, improving generalization and sample efficiency, and (3) controllable inference, including a generalized multi-chain templating system supporting both protein and non-polymeric components as well as dual unconditional/conditional modes. Pearl establishes a new state-of-the-art performance in protein-ligand cofolding. On the key metric of generating accurate (RMSD < 2 \r{A}) and physically valid poses, Pearl surpasses AlphaFold 3 and other open source baselines on the public Runs N' Poses and PoseBusters benchmarks, delivering 14.5% and 14.2% improvements, respectively, over the next best model. In the pocket-conditional cofolding regime, Pearl delivers $3.6\times$ improvement on a proprietary set of challenging, real-world drug targets at the more rigorous RMSD < 1 \r{A} threshold. Finally, we demonstrate that model performance correlates directly with synthetic dataset size used in training.
☆ The Cost of Robustness: Tighter Bounds on Parameter Complexity for Robust Memorization in ReLU Nets NeurIPS 2025
We study the parameter complexity of robust memorization for $\mathrm{ReLU}$ networks: the number of parameters required to interpolate any given dataset with $\epsilon$-separation between differently labeled points, while ensuring predictions remain consistent within a $\mu$-ball around each training sample. We establish upper and lower bounds on the parameter count as a function of the robustness ratio $\rho = \mu / \epsilon$. Unlike prior work, we provide a fine-grained analysis across the entire range $\rho \in (0,1)$ and obtain tighter upper and lower bounds that improve upon existing results. Our findings reveal that the parameter complexity of robust memorization matches that of non-robust memorization when $\rho$ is small, but grows with increasing $\rho$.
comment: Accepted to NeurIPS 2025, 72 pages, 8 figures
☆ Causal Ordering for Structure Learning From Time Series
Predicting causal structure from time series data is crucial for understanding complex phenomena in physiology, brain connectivity, climate dynamics, and socio-economic behaviour. Causal discovery in time series is hindered by the combinatorial complexity of identifying true causal relationships, especially as the number of variables and time points grow. A common approach to simplify the task is the so-called ordering-based methods. Traditional ordering methods inherently limit the representational capacity of the resulting model. In this work, we fix this issue by leveraging multiple valid causal orderings, instead of a single one as standard practice. We propose DOTS (Diffusion Ordered Temporal Structure), using diffusion-based causal discovery for temporal data. By integrating multiple orderings, DOTS effectively recovers the transitive closure of the underlying directed acyclic graph, mitigating spurious artifacts inherent in single-ordering approaches. We formalise the problem under standard assumptions such as stationarity and the additive noise model, and leverage score matching with diffusion processes to enable efficient Hessian estimation. Extensive experiments validate the approach. Empirical evaluations on synthetic and real-world datasets demonstrate that DOTS outperforms state-of-the-art baselines, offering a scalable and robust approach to temporal causal discovery. On synthetic benchmarks ($d{=}\!3-\!6$ variables, $T{=}200\!-\!5{,}000$ samples), DOTS improves mean window-graph $F1$ from $0.63$ (best baseline) to $0.81$. On the CausalTime real-world benchmark ($d{=}20\!-\!36$), while baselines remain the best on individual datasets, DOTS attains the highest average summary-graph $F1$ while halving runtime relative to graph-optimisation methods. These results establish DOTS as a scalable and accurate solution for temporal causal discovery.
comment: 32 pages
☆ Symbolic Snapshot Ensembles
Inductive logic programming (ILP) is a form of logical machine learning. Most ILP algorithms learn a single hypothesis from a single training run. Ensemble methods train an ILP algorithm multiple times to learn multiple hypotheses. In this paper, we train an ILP algorithm only once and save intermediate hypotheses. We then combine the hypotheses using a minimum description length weighting scheme. Our experiments on multiple benchmarks, including game playing and visual reasoning, show that our approach improves predictive accuracy by 4% with less than 1% computational overhead.
☆ Coreset for Robust Geometric Median: Eliminating Size Dependency on Outliers NeurIPS 2025
We study the robust geometric median problem in Euclidean space $\mathbb{R}^d$, with a focus on coreset construction.A coreset is a compact summary of a dataset $P$ of size $n$ that approximates the robust cost for all centers $c$ within a multiplicative error $\varepsilon$. Given an outlier count $m$, we construct a coreset of size $\tilde{O}(\varepsilon^{-2} \cdot \min\{\varepsilon^{-2}, d\})$ when $n \geq 4m$, eliminating the $O(m)$ dependency present in prior work [Huang et al., 2022 & 2023]. For the special case of $d = 1$, we achieve an optimal coreset size of $\tilde{\Theta}(\varepsilon^{-1/2} + \frac{m}{n} \varepsilon^{-1})$, revealing a clear separation from the vanilla case studied in [Huang et al., 2023; Afshani and Chris, 2024]. Our results further extend to robust $(k,z)$-clustering in various metric spaces, eliminating the $m$-dependence under mild data assumptions. The key technical contribution is a novel non-component-wise error analysis, enabling substantial reduction of outlier influence, unlike prior methods that retain them.Empirically, our algorithms consistently outperform existing baselines in terms of size-accuracy tradeoffs and runtime, even when data assumptions are violated across a wide range of datasets.
comment: This paper has been accepted by NeurIPS 2025
☆ Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation
With the release of new large language models (LLMs) like Llama and Mistral, zero-shot cross-lingual transfer has become increasingly feasible due to their multilingual pretraining and strong generalization capabilities. However, adapting these decoder-only LLMs to new tasks across languages remains challenging. While parameter-efficient fine-tuning (PeFT) techniques like Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as soft prompt tuning, prefix tuning, and Llama Adapter are less explored, especially for zero-shot transfer in decoder-only models. We present a comprehensive study of three prefix-based methods for zero-shot cross-lingual transfer from English to 35+ high- and low-resource languages. Our analysis further explores transfer across linguistic families and scripts, as well as the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix methods outperform LoRA-baselines by up to 6% on the Belebele benchmark. Similar improvements were observed with Mistral v0.3 7B as well. Despite using only 1.23M learning parameters with prefix tuning, we achieve consistent improvements across diverse benchmarks. These findings highlight the potential of prefix-based techniques as an effective and scalable alternative to LoRA, particularly in low-resource multilingual settings.
comment: 12 Pages
☆ Statistical physics of deep learning: Optimal learning of a multi-layer perceptron near interpolation
For three decades statistical physics has been providing a framework to analyse neural networks. A long-standing question remained on its capacity to tackle deep learning models capturing rich feature learning effects, thus going beyond the narrow networks or kernel methods analysed until now. We positively answer through the study of the supervised learning of a multi-layer perceptron. Importantly, (i) its width scales as the input dimension, making it more prone to feature learning than ultra wide networks, and more expressive than narrow ones or with fixed embedding layers; and (ii) we focus on the challenging interpolation regime where the number of trainable parameters and data are comparable, which forces the model to adapt to the task. We consider the matched teacher-student setting. It provides the fundamental limits of learning random deep neural network targets and helps in identifying the sufficient statistics describing what is learnt by an optimally trained network as the data budget increases. A rich phenomenology emerges with various learning transitions. With enough data optimal performance is attained through model's "specialisation" towards the target, but it can be hard to reach for training algorithms which get attracted by sub-optimal solutions predicted by the theory. Specialisation occurs inhomogeneously across layers, propagating from shallow towards deep ones, but also across neurons in each layer. Furthermore, deeper targets are harder to learn. Despite its simplicity, the Bayesian-optimal setting provides insights on how the depth, non-linearity and finite (proportional) width influence neural networks in the feature learning regime that are potentially relevant way beyond it.
comment: 30 pages, 19 figures + appendix. This submission supersedes both arXiv:2505.24849 and arXiv:2501.18530
☆ Semi-supervised and unsupervised learning for health indicator extraction from guided waves in aerospace composite structures
Health indicators (HIs) are central to diagnosing and prognosing the condition of aerospace composite structures, enabling efficient maintenance and operational safety. However, extracting reliable HIs remains challenging due to variability in material properties, stochastic damage evolution, and diverse damage modes. Manufacturing defects (e.g., disbonds) and in-service incidents (e.g., bird strikes) further complicate this process. This study presents a comprehensive data-driven framework that learns HIs via two learning approaches integrated with multi-domain signal processing. Because ground-truth HIs are unavailable, a semi-supervised and an unsupervised approach are proposed: (i) a diversity deep semi-supervised anomaly detection (Diversity-DeepSAD) approach augmented with continuous auxiliary labels used as hypothetical damage proxies, which overcomes the limitation of prior binary labels that only distinguish healthy and failed states while neglecting intermediate degradation, and (ii) a degradation-trend-constrained variational autoencoder (DTC-VAE), in which the monotonicity criterion is embedded via an explicit trend constraint. Guided waves with multiple excitation frequencies are used to monitor single-stiffener composite structures under fatigue loading. Time, frequency, and time-frequency representations are explored, and per-frequency HIs are fused via unsupervised ensemble learning to mitigate frequency dependence and reduce variance. Using fast Fourier transform features, the augmented Diversity-DeepSAD model achieved 81.6% performance, while DTC-VAE delivered the most consistent HIs with 92.3% performance, outperforming existing baselines.
☆ Comparison of generalised additive models and neural networks in applications: A systematic review
Neural networks have become a popular tool in predictive modelling, more commonly associated with machine learning and artificial intelligence than with statistics. Generalised Additive Models (GAMs) are flexible non-linear statistical models that retain interpretability. Both are state-of-the-art in their own right, with their respective advantages and disadvantages. This paper analyses how these two model classes have performed on real-world tabular data. Following PRISMA guidelines, we conducted a systematic review of papers that performed empirical comparisons of GAMs and neural networks. Eligible papers were identified, yielding 143 papers, with 430 datasets. Key attributes at both paper and dataset levels were extracted and reported. Beyond summarising comparisons, we analyse reported performance metrics using mixed-effects modelling to investigate potential characteristics that can explain and quantify observed differences, including application area, study year, sample size, number of predictors, and neural network complexity. Across datasets, no consistent evidence of superiority was found for either GAMs or neural networks when considering the most frequently reported metrics (RMSE, $R^2$, and AUC). Neural networks tended to outperform in larger datasets and in those with more predictors, but this advantage narrowed over time. Conversely, GAMs remained competitive, particularly in smaller data settings, while retaining interpretability. Reporting of dataset characteristics and neural network complexity was incomplete in much of the literature, limiting transparency and reproducibility. This review highlights that GAMs and neural networks should be viewed as complementary approaches rather than competitors. For many tabular applications, the performance trade-off is modest, and interpretability may favour GAMs.
☆ A Novel XAI-Enhanced Quantum Adversarial Networks for Velocity Dispersion Modeling in MaNGA Galaxies
Current quantum machine learning approaches often face challenges balancing predictive accuracy, robustness, and interpretability. To address this, we propose a novel quantum adversarial framework that integrates a hybrid quantum neural network (QNN) with classical deep learning layers, guided by an evaluator model with LIME-based interpretability, and extended through quantum GAN and self-supervised variants. In the proposed model, an adversarial evaluator concurrently guides the QNN by computing feedback loss, thereby optimizing both prediction accuracy and model explainability. Empirical evaluations show that the Vanilla model achieves RMSE = 0.27, MSE = 0.071, MAE = 0.21, and R^2 = 0.59, delivering the most consistent performance across regression metrics compared to adversarial counterparts. These results demonstrate the potential of combining quantum-inspired methods with classical architectures to develop lightweight, high-performance, and interpretable predictive models, advancing the applicability of QML beyond current limitations.
☆ Physics-Informed Extreme Learning Machine (PIELM): Opportunities and Challenges
We are very delighted to see the fast development of physics-informed extreme learning machine (PIELM) in recent years for higher computation efficiency and accuracy in physics-informed machine learning. As a summary or review on PIELM is currently not available, we would like to take this opportunity to show our perspective and experience for this promising research direction. We can see many efforts are made to solve PDEs with sharp gradients, nonlinearities, high-frequency behavior, hard constraints, uncertainty, multiphysics coupling. Despite the success, many urgent challenges remain to be tackled, which also provides us opportunities to develop more robust, interpretable, and generalizable PIELM frameworks with applications in science and engineering.
☆ DistDF: Time-Series Forecasting Needs Joint-Distribution Wasserstein Alignment
Training time-series forecast models requires aligning the conditional distribution of model forecasts with that of the label sequence. The standard direct forecast (DF) approach resorts to minimize the conditional negative log-likelihood of the label sequence, typically estimated using the mean squared error. However, this estimation proves to be biased in the presence of label autocorrelation. In this paper, we propose DistDF, which achieves alignment by alternatively minimizing a discrepancy between the conditional forecast and label distributions. Because conditional discrepancies are difficult to estimate from finite time-series observations, we introduce a newly proposed joint-distribution Wasserstein discrepancy for time-series forecasting, which provably upper bounds the conditional discrepancy of interest. This discrepancy admits tractable, differentiable estimation from empirical samples and integrates seamlessly with gradient-based training. Extensive experiments show that DistDF improves the performance diverse forecast models and achieves the state-of-the-art forecasting performance. Code is available at https://anonymous.4open.science/r/DistDF-F66B.
☆ LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis
With the widespread adoption of LLMs, LoRA has become a dominant method for PEFT, and its initialization methods have attracted increasing attention. However, existing methods have notable limitations: many methods do not incorporate target-domain data, while gradient-based methods exploit data only at a shallow level by relying on one-step gradient decomposition, which remains unsatisfactory due to the weak empirical performance of the one-step fine-tuning model that serves as their basis, as well as the fact that these methods either lack a rigorous theoretical foundation or depend heavily on restrictive isotropic assumptions. In this paper, we establish a theoretical framework for data-aware LoRA initialization based on asymptotic analysis. Starting from a general optimization objective that minimizes the expectation of the parameter discrepancy between the fine-tuned and target models, we derive an optimization problem with two components: a bias term, which is related to the parameter distance between the fine-tuned and target models, and is approximated using a Fisher-gradient formulation to preserve anisotropy; and a variance term, which accounts for the uncertainty introduced by sampling stochasticity through the Fisher information. By solving this problem, we obtain an optimal initialization strategy for LoRA. Building on this theoretical framework, we develop an efficient algorithm, LoRA-DA, which estimates the terms in the optimization problem from a small set of target domain samples and obtains the optimal LoRA initialization. Empirical results across multiple benchmarks demonstrate that LoRA-DA consistently improves final accuracy over existing initialization methods. Additional studies show faster, more stable convergence, robustness across ranks, and only a small initialization overhead for LoRA-DA. The source code will be released upon publication.
☆ Enforcing boundary conditions for physics-informed neural operators
Machine-learning based methods like physics-informed neural networks and physics-informed neural operators are becoming increasingly adept at solving even complex systems of partial differential equations. Boundary conditions can be enforced either weakly by penalizing deviations in the loss function or strongly by training a solution structure that inherently matches the prescribed values and derivatives. The former approach is easy to implement but the latter can provide benefits with respect to accuracy and training times. However, previous approaches to strongly enforcing Neumann or Robin boundary conditions require a domain with a fully $C^1$ boundary and, as we demonstrate, can lead to instability if those boundary conditions are posed on a segment of the boundary that is piecewise $C^1$ but only $C^0$ globally. We introduce a generalization of the approach by Sukumar \& Srivastava (doi: 10.1016/j.cma.2021.114333), and a new approach based on orthogonal projections that overcome this limitation. The performance of these new techniques is compared against weakly and semi-weakly enforced boundary conditions for the scalar Darcy flow equation and the stationary Navier-Stokes equations.
☆ Dual-Mind World Models: A General Framework for Learning in Dynamic Wireless Networks
Despite the popularity of reinforcement learning (RL) in wireless networks, existing approaches that rely on model-free RL (MFRL) and model-based RL (MBRL) are data inefficient and short-sighted. Such RL-based solutions cannot generalize to novel network states since they capture only statistical patterns rather than the underlying physics and logic from wireless data. These limitations become particularly challenging in complex wireless networks with high dynamics and long-term planning requirements. To address these limitations, in this paper, a novel dual-mind world model-based learning framework is proposed with the goal of optimizing completeness-weighted age of information (CAoI) in a challenging mmWave V2X scenario. Inspired by cognitive psychology, the proposed dual-mind world model encompasses a pattern-driven System 1 component and a logic-driven System 2 component to learn dynamics and logic of the wireless network, and to provide long-term link scheduling over reliable imagined trajectories. Link scheduling is learned through end-to-end differentiable imagined trajectories with logical consistency over an extended horizon rather than relying on wireless data obtained from environment interactions. Moreover, through imagination rollouts, the proposed world model can jointly reason network states and plan link scheduling. During intervals without observations, the proposed method remains capable of making efficient decisions. Extensive experiments are conducted on a realistic simulator based on Sionna with real-world physical channel, ray-tracing, and scene objects with material properties. Simulation results show that the proposed world model achieves a significant improvement in data efficiency and achieves strong generalization and adaptation to unseen environments, compared to the state-of-the-art RL baselines, and the world model approach with only System 1.
☆ Unsupervised Machine-Learning Pipeline for Data-Driven Defect Detection and Characterisation: Application to Displacement Cascades
Neutron irradiation produces, within a few picoseconds, displacement cascades that are sequences of atomic collisions generating point and extended defects which subsequently affects the long-term evolution of materials. The diversity of these defects, characterized morphologically and statistically, defines what is called the "primary damage". In this work, we present a fully unsupervised machine learning (ML) workflow that detects and classifies these defects directly from molecular dynamics data. Local environments are encoded by the Smooth Overlap of Atomic Positions (SOAP) vector, anomalous atoms are isolated with autoencoder neural networks (AE), embedded with Uniform Manifold Approximation and Projection (UMAP) and clustered using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). Applied to 80 keV displacement cascades in Ni, Fe$_7$0Ni$_{10}$Cr$_{20}$, and Zr, the AE successfully identify the small fraction of outlier atoms that participate in defect formation. HDBSCAN then partitions the UMAP latent space of AE-flagged SOAP descriptors into well defined groups representing vacancy- and interstitial-dominated regions and, within each, separates small from large aggregates, assigning 99.7 % of outliers to compact physical motifs. A signed cluster-identification score confirms this separation, and cluster size scales with net defect counts (R2 > 0.89). Statistical cross analyses between the ML outlier map and several conventional detectors (centrosymmetry, dislocation extraction, etc.) reveal strong overlap and complementary coverage, all achieved without template or threshold tuning. This ML workflow thus provides an efficient tool for the quantitative mapping of structural anomalies in materials, particularly those arising from irradiation damage in displacement cascades.
comment: 22 pages, 1 graphical abstract, 7 figures, 4 tables
☆ Local Performance vs. Out-of-Distribution Generalization: An Empirical Analysis of Personalized Federated Learning in Heterogeneous Data Environments
In the context of Federated Learning with heterogeneous data environments, local models tend to converge to their own local model optima during local training steps, deviating from the overall data distributions. Aggregation of these local updates, e.g., with FedAvg, often does not align with the global model optimum (client drift), resulting in an update that is suboptimal for most clients. Personalized Federated Learning approaches address this challenge by exclusively focusing on the average local performances of clients' models on their own data distribution. Generalization to out-of-distribution samples, which is a substantial benefit of FedAvg and represents a significant component of robustness, appears to be inadequately incorporated into the assessment and evaluation processes. This study involves a thorough evaluation of Federated Learning approaches, encompassing both their local performance and their generalization capabilities. Therefore, we examine different stages within a single communication round to enable a more nuanced understanding of the considered metrics. Furthermore, we propose and incorporate a modified approach of FedAvg, designated as Federated Learning with Individualized Updates (FLIU), extending the algorithm by a straightforward individualization step with an adaptive personalization factor. We evaluate and compare the approaches empirically using MNIST and CIFAR-10 under various distributional conditions, including benchmark IID and pathological non-IID, as well as additional novel test environments with Dirichlet distribution specifically developed to stress the algorithms on complex data heterogeneity.
☆ MIMIC-Sepsis: A Curated Benchmark for Modeling and Learning from Sepsis Trajectories in the ICU
Sepsis is a leading cause of mortality in intensive care units (ICUs), yet existing research often relies on outdated datasets, non-reproducible preprocessing pipelines, and limited coverage of clinical interventions. We introduce MIMIC-Sepsis, a curated cohort and benchmark framework derived from the MIMIC-IV database, designed to support reproducible modeling of sepsis trajectories. Our cohort includes 35,239 ICU patients with time-aligned clinical variables and standardized treatment data, including vasopressors, fluids, mechanical ventilation and antibiotics. We describe a transparent preprocessing pipeline-based on Sepsis-3 criteria, structured imputation strategies, and treatment inclusion-and release it alongside benchmark tasks focused on early mortality prediction, length-of-stay estimation, and shock onset classification. Empirical results demonstrate that incorporating treatment variables substantially improves model performance, particularly for Transformer-based architectures. MIMIC-Sepsis serves as a robust platform for evaluating predictive and sequential models in critical care research.
☆ Sample-efficient and Scalable Exploration in Continuous-Time RL
Reinforcement learning algorithms are typically designed for discrete-time dynamics, even though the underlying real-world control systems are often continuous in time. In this paper, we study the problem of continuous-time reinforcement learning, where the unknown system dynamics are represented using nonlinear ordinary differential equations (ODEs). We leverage probabilistic models, such as Gaussian processes and Bayesian neural networks, to learn an uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily maximizes a weighted sum of the extrinsic reward and model epistemic uncertainty. This yields a scalable and sample-efficient approach to continuous-time model-based RL. We show that COMBRL achieves sublinear regret in the reward-driven setting, and in the unsupervised RL setting (i.e., without extrinsic rewards), we provide a sample complexity bound. In our experiments, we evaluate COMBRL in both standard and unsupervised RL settings and demonstrate that it scales better, is more sample-efficient than prior methods, and outperforms baselines across several deep RL tasks.
comment: 26 pages, 6 figures, 6 tables
☆ Methodology for Comparing Machine Learning Algorithms for Survival Analysis
This study presents a comparative methodological analysis of six machine learning models for survival analysis (MLSA). Using data from nearly 45,000 colorectal cancer patients in the Hospital-Based Cancer Registries of S\~ao Paulo, we evaluated Random Survival Forest (RSF), Gradient Boosting for Survival Analysis (GBSA), Survival SVM (SSVM), XGBoost-Cox (XGB-Cox), XGBoost-AFT (XGB-AFT), and LightGBM (LGBM), capable of predicting survival considering censored data. Hyperparameter optimization was performed with different samplers, and model performance was assessed using the Concordance Index (C-Index), C-Index IPCW, time-dependent AUC, and Integrated Brier Score (IBS). Survival curves produced by the models were compared with predictions from classification algorithms, and predictor interpretation was conducted using SHAP and permutation importance. XGB-AFT achieved the best performance (C-Index = 0.7618; IPCW = 0.7532), followed by GBSA and RSF. The results highlight the potential and applicability of MLSA to improve survival prediction and support decision making.
☆ Non-Singularity of the Gradient Descent map for Neural Networks with Piecewise Analytic Activations
The theory of training deep networks has become a central question of modern machine learning and has inspired many practical advancements. In particular, the gradient descent (GD) optimization algorithm has been extensively studied in recent years. A key assumption about GD has appeared in several recent works: the \emph{GD map is non-singular} -- it preserves sets of measure zero under preimages. Crucially, this assumption has been used to prove that GD avoids saddle points and maxima, and to establish the existence of a computable quantity that determines the convergence to global minima (both for GD and stochastic GD). However, the current literature either assumes the non-singularity of the GD map or imposes restrictive assumptions, such as Lipschitz smoothness of the loss (for example, Lipschitzness does not hold for deep ReLU networks with the cross-entropy loss) and restricts the analysis to GD with small step-sizes. In this paper, we investigate the neural network map as a function on the space of weights and biases. We also prove, for the first time, the non-singularity of the gradient descent (GD) map on the loss landscape of realistic neural network architectures (with fully connected, convolutional, or softmax attention layers) and piecewise analytic activations (which includes sigmoid, ReLU, leaky ReLU, etc.) for almost all step-sizes. Our work significantly extends the existing results on the convergence of GD and SGD by guaranteeing that they apply to practical neural network settings and has the potential to unlock further exploration of learning dynamics.
☆ ARIMA_PLUS: Large-scale, Accurate, Automatic and Interpretable In-Database Time Series Forecasting and Anomaly Detection in Google BigQuery
Time series forecasting and anomaly detection are common tasks for practitioners in industries such as retail, manufacturing, advertising and energy. Two unique challenges stand out: (1) efficiently and accurately forecasting time series or detecting anomalies in large volumes automatically; and (2) ensuring interpretability of results to effectively incorporate business insights. We present ARIMA_PLUS, a novel framework to overcome these two challenges by a unique combination of (a) accurate and interpretable time series models and (b) scalable and fully managed system infrastructure. The model has a sequential and modular structure to handle different components of the time series, including holiday effects, seasonality, trend, and anomalies, which enables high interpretability of the results. Novel enhancements are made to each module, and a unified framework is established to address both forecasting and anomaly detection tasks simultaneously. In terms of accuracy, its comprehensive benchmark on the 42 public datasets in the Monash forecasting repository shows superior performance over not only well-established statistical alternatives (such as ETS, ARIMA, TBATS, Prophet) but also newer neural network models (such as DeepAR, N-BEATS, PatchTST, TimeMixer). In terms of infrastructure, it is directly built into the query engine of BigQuery in Google Cloud. It uses a simple SQL interface and automates tedious technicalities such as data cleaning and model selection. It automatically scales with managed cloud computational and storage resources, making it possible to forecast 100 million time series using only 1.5 hours with a throughput of more than 18000 time series per second. In terms of interpretability, we present several case studies to demonstrate time series insights it generates and customizability it offers.
☆ Nearest Neighbor Matching as Least Squares Density Ratio Estimation and Riesz Regression
This study proves that Nearest Neighbor (NN) matching can be interpreted as an instance of Riesz regression for automatic debiased machine learning. Lin et al. (2023) shows that NN matching is an instance of density-ratio estimation with their new density-ratio estimator. Chernozhukov et al. (2024) develops Riesz regression for automatic debiased machine learning, which directly estimates the Riesz representer (or equivalently, the bias-correction term) by minimizing the mean squared error. In this study, we first prove that the density-ratio estimation method proposed in Lin et al. (2023) is essentially equivalent to Least-Squares Importance Fitting (LSIF) proposed in Kanamori et al. (2009) for direct density-ratio estimation. Furthermore, we derive Riesz regression using the LSIF framework. Based on these results, we derive NN matching from Riesz regression. This study is based on our work Kato (2025a) and Kato (2025b).
☆ Fill in the Blanks: Accelerating Q-Learning with a Handful of Demonstrations in Sparse Reward Settings
Reinforcement learning (RL) in sparse-reward environments remains a significant challenge due to the lack of informative feedback. We propose a simple yet effective method that uses a small number of successful demonstrations to initialize the value function of an RL agent. By precomputing value estimates from offline demonstrations and using them as targets for early learning, our approach provides the agent with a useful prior over promising actions. The agent then refines these estimates through standard online interaction. This hybrid offline-to-online paradigm significantly reduces the exploration burden and improves sample efficiency in sparse-reward settings. Experiments on benchmark tasks demonstrate that our method accelerates convergence and outperforms standard baselines, even with minimal or suboptimal demonstration data.
☆ Attack on a PUF-based Secure Binary Neural Network
Binarized Neural Networks (BNNs) deployed on memristive crossbar arrays provide energy-efficient solutions for edge computing but are susceptible to physical attacks due to memristor nonvolatility. Recently, Rajendran et al. (IEEE Embedded Systems Letter 2025) proposed a Physical Unclonable Function (PUF)-based scheme to secure BNNs against theft attacks. Specifically, the weight and bias matrices of the BNN layers were secured by swapping columns based on device's PUF key bits. In this paper, we demonstrate that this scheme to secure BNNs is vulnerable to PUF-key recovery attack. As a consequence of our attack, we recover the secret weight and bias matrices of the BNN. Our approach is motivated by differential cryptanalysis and reconstructs the PUF key bit-by-bit by observing the change in model accuracy, and eventually recovering the BNN model parameters. Evaluated on a BNN trained on the MNIST dataset, our attack could recover 85% of the PUF key, and recover the BNN model up to 93% classification accuracy compared to the original model's 96% accuracy. Our attack is very efficient and it takes a couple of minutes to recovery the PUF key and the model parameters.
comment: Accepted at VLSID 2026. To be published in IEEE Xplore
☆ APEX: Approximate-but-exhaustive search for ultra-large combinatorial synthesis libraries
Make-on-demand combinatorial synthesis libraries (CSLs) like Enamine REAL have significantly enabled drug discovery efforts. However, their large size presents a challenge for virtual screening, where the goal is to identify the top compounds in a library according to a computational objective (e.g., optimizing docking score) subject to computational constraints under a limited computational budget. For current library sizes -- numbering in the tens of billions of compounds -- and scoring functions of interest, a routine virtual screening campaign may be limited to scoring fewer than 0.1% of the available compounds, leaving potentially many high scoring compounds undiscovered. Furthermore, as constraints (and sometimes objectives) change during the course of a virtual screening campaign, existing virtual screening algorithms typically offer little room for amortization. We propose the approximate-but-exhaustive search protocol for CSLs, or APEX. APEX utilizes a neural network surrogate that exploits the structure of CSLs in the prediction of objectives and constraints to make full enumeration on a consumer GPU possible in under a minute, allowing for exact retrieval of approximate top-$k$ sets. To demonstrate APEX's capabilities, we develop a benchmark CSL comprised of more than 10 million compounds, all of which have been annotated with their docking scores on five medically relevant targets along with physicohemical properties measured with RDKit such that, for any objective and set of constraints, the ground truth top-$k$ compounds can be identified and compared against the retrievals from any virtual screening algorithm. We show APEX's consistently strong performance both in retrieval accuracy and runtime compared to alternative methods.
☆ A Comprehensive Evaluation Framework for Synthetic Trip Data Generation in Public Transport
Synthetic data offers a promising solution to the privacy and accessibility challenges of using smart card data in public transport research. Despite rapid progress in generative modeling, there is limited attention to comprehensive evaluation, leaving unclear how reliable, safe, and useful synthetic data truly are. Existing evaluations remain fragmented, typically limited to population-level representativeness or record-level privacy, without considering group-level variations or task-specific utility. To address this gap, we propose a Representativeness-Privacy-Utility (RPU) framework that systematically evaluates synthetic trip data across three complementary dimensions and three hierarchical levels (record, group, population). The framework integrates a consistent set of metrics to quantify similarity, disclosure risk, and practical usefulness, enabling transparent and balanced assessment of synthetic data quality. We apply the framework to benchmark twelve representative generation methods, spanning conventional statistical models, deep generative networks, and privacy-enhanced variants. Results show that synthetic data do not inherently guarantee privacy and there is no "one-size-fits-all" model, the trade-off between privacy and representativeness/utility is obvious. Conditional Tabular generative adversarial network (CTGAN) provide the most balanced trade-off and is suggested for practical applications. The RPU framework provides a systematic and reproducible basis for researchers and practitioners to compare synthetic data generation techniques and select appropriate methods in public transport applications.
☆ Filtering instances and rejecting predictions to obtain reliable models in healthcare
Machine Learning (ML) models are widely used in high-stakes domains such as healthcare, where the reliability of predictions is critical. However, these models often fail to account for uncertainty, providing predictions even with low confidence. This work proposes a novel two-step data-centric approach to enhance the performance of ML models by improving data quality and filtering low-confidence predictions. The first step involves leveraging Instance Hardness (IH) to filter problematic instances during training, thereby refining the dataset. The second step introduces a confidence-based rejection mechanism during inference, ensuring that only reliable predictions are retained. We evaluate our approach using three real-world healthcare datasets, demonstrating its effectiveness at improving model reliability while balancing predictive performance and rejection rate. Additionally, we use alternative criteria - influence values for filtering and uncertainty for rejection - as baselines to evaluate the efficiency of the proposed method. The results demonstrate that integrating IH filtering with confidence-based rejection effectively enhances model performance while preserving a large proportion of instances. This approach provides a practical method for deploying ML systems in safety-critical applications.
comment: This paper is under review at Machine Learning (Springer)
☆ Perception Learning: A Formal Separation of Sensory Representation Learning from Decision Learning
We introduce Perception Learning (PeL), a paradigm that optimizes an agent's sensory interface $f_\phi:\mathcal{X}\to\mathcal{Z}$ using task-agnostic signals, decoupled from downstream decision learning $g_\theta:\mathcal{Z}\to\mathcal{Y}$. PeL directly targets label-free perceptual properties, such as stability to nuisances, informativeness without collapse, and controlled geometry, assessed via objective representation-invariant metrics. We formalize the separation of perception and decision, define perceptual properties independent of objectives or reparameterizations, and prove that PeL updates preserving sufficient invariants are orthogonal to Bayes task-risk gradients. Additionally, we provide a suite of task-agnostic evaluation metrics to certify perceptual quality.
☆ What do vision-language models see in the context? Investigating multimodal in-context learning
In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. Although it has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic study of ICL in VLMs, evaluating seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL. To our knowledge, we are the first to analyze how attention patterns in VLMs vary with an increasing number of in-context demonstrations. Our results reveal that training on imag-text interleaved data enhances ICL performance but does not imply effective integration of visual and textual information from demonstration examples. In contrast, instruction tuning improves instruction-following but can reduce reliance on in-context demonstrations, suggesting a trade-off between instruction alignment and in-context adaptation. Attention analyses further show that current VLMs primarily focus on textual cues and fail to leverage visual information, suggesting a limited capacity for multimodal integration. These findings highlight key limitations in the ICL abilities of current VLMs and provide insights for enhancing their ability to learn from multimodal in-context examples.
Transformers can do Bayesian Clustering
Bayesian clustering accounts for uncertainty but is computationally demanding at scale. Furthermore, real-world datasets often contain missing values, and simple imputation ignores the associated uncertainty, resulting in suboptimal results. We present Cluster-PFN, a Transformer-based model that extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. Trained entirely on synthetic datasets generated from a finite Gaussian Mixture Model (GMM) prior, Cluster-PFN learns to estimate the posterior distribution over both the number of clusters and the cluster assignments. Our method estimates the number of clusters more accurately than handcrafted model selection procedures such as AIC, BIC and Variational Inference (VI), and achieves clustering quality competitive with VI while being orders of magnitude faster. Cluster-PFN can be trained on complex priors that include missing data, outperforming imputation-based baselines on real-world genomic datasets, at high missingness. These results show that the Cluster-PFN can provide scalable and flexible Bayesian clustering.
☆ EDC: Equation Discovery for Classification
Equation Discovery techniques have shown considerable success in regression tasks, where they are used to discover concise and interpretable models (\textit{Symbolic Regression}). In this paper, we propose a new ED-based binary classification framework. Our proposed method EDC finds analytical functions of manageable size that specify the location and shape of the decision boundary. In extensive experiments on artificial and real-life data, we demonstrate how EDC is able to discover both the structure of the target equation as well as the value of its parameters, outperforming the current state-of-the-art ED-based classification methods in binary classification and achieving performance comparable to the state of the art in binary classification. We suggest a grammar of modest complexity that appears to work well on the tested datasets but argue that the exact grammar -- and thus the complexity of the models -- is configurable, and especially domain-specific expressions can be included in the pattern language, where that is required. The presented grammar consists of a series of summands (additive terms) that include linear, quadratic and exponential terms, as well as products of two features (producing hyperbolic curves ideal for capturing XOR-like dependencies). The experiments demonstrate that this grammar allows fairly flexible decision boundaries while not so rich to cause overfitting.
comment: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in Lecture Notes in Computer Science, and is available online at https://doi.org/10.1007/978-3-032-05461-6_9
☆ Problem-Parameter-Free Decentralized Bilevel Optimization NeurIPS 2025
Decentralized bilevel optimization has garnered significant attention due to its critical role in solving large-scale machine learning problems. However, existing methods often rely on prior knowledge of problem parameters-such as smoothness, convexity, or communication network topologies-to determine appropriate stepsizes. In practice, these problem parameters are typically unavailable, leading to substantial manual effort for hyperparameter tuning. In this paper, we propose AdaSDBO, a fully problem-parameter-free algorithm for decentralized bilevel optimization with a single-loop structure. AdaSDBO leverages adaptive stepsizes based on cumulative gradient norms to update all variables simultaneously, dynamically adjusting its progress and eliminating the need for problem-specific hyperparameter tuning. Through rigorous theoretical analysis, we establish that AdaSDBO achieves a convergence rate of $\widetilde{\mathcal{O}}\left(\frac{1}{T}\right)$, matching the performance of well-tuned state-of-the-art methods up to polylogarithmic factors. Extensive numerical experiments demonstrate that AdaSDBO delivers competitive performance compared to existing decentralized bilevel optimization methods while exhibiting remarkable robustness across diverse stepsize configurations.
comment: Accepted by NeurIPS 2025
☆ Towards actionable hypotension prediction -- predicting catecholamine therapy initiation in the intensive care unit
Hypotension in critically ill ICU patients is common and life-threatening. Escalation to catecholamine therapy marks a key management step, with both undertreatment and overtreatment posing risks. Most machine learning (ML) models predict hypotension using fixed MAP thresholds or MAP forecasting, overlooking the clinical decision behind treatment escalation. Predicting catecholamine initiation, the start of vasoactive or inotropic agent administration offers a more clinically actionable target reflecting real decision-making. Using the MIMIC-III database, we modeled catecholamine initiation as a binary event within a 15-minute prediction window. Input features included statistical descriptors from a two-hour sliding MAP context window, along with demographics, biometrics, comorbidities, and ongoing treatments. An Extreme Gradient Boosting (XGBoost) model was trained and interpreted via SHapley Additive exPlanations (SHAP). The model achieved an AUROC of 0.822 (0.813-0.830), outperforming the hypotension baseline (MAP < 65, AUROC 0.686 [0.675-0.699]). SHAP analysis highlighted recent MAP values, MAP trends, and ongoing treatments (e.g., sedatives, electrolytes) as dominant predictors. Subgroup analysis showed higher performance in males, younger patients (<53 years), those with higher BMI (>32), and patients without comorbidities or concurrent medications. Predicting catecholamine initiation based on MAP dynamics, treatment context, and patient characteristics supports the critical decision of when to escalate therapy, shifting focus from threshold-based alarms to actionable decision support. This approach is feasible across a broad ICU cohort under natural event imbalance. Future work should enrich temporal and physiological context, extend label definitions to include therapy escalation, and benchmark against existing hypotension prediction systems.
comment: 27 pages, 8 figures, source code under https://github.com/AI4HealthUOL/actionable-hypotension
☆ HergNet: a Fast Neural Surrogate Model for Sound Field Predictions via Superposition of Plane Waves
We present a novel neural network architecture for the efficient prediction of sound fields in two and three dimensions. The network is designed to automatically satisfy the Helmholtz equation, ensuring that the outputs are physically valid. Therefore, the method can effectively learn solutions to boundary-value problems in various wave phenomena, such as acoustics, optics, and electromagnetism. Numerical experiments show that the proposed strategy can potentially outperform state-of-the-art methods in room acoustics simulation, in particular in the range of mid to high frequencies.
☆ SALS: Sparse Attention in Latent Space for KV cache Compression
Large Language Models capable of handling extended contexts are in high demand, yet their inference remains challenging due to substantial Key-Value cache size and high memory bandwidth requirements. Previous research has demonstrated that KV cache exhibits low-rank characteristics within the hidden dimension, suggesting the potential for effective compression. However, due to the widely adopted Rotary Position Embedding mechanism in modern LLMs, naive low-rank compression suffers severe accuracy degradation or creates a new speed bottleneck, as the low-rank cache must first be reconstructed in order to apply RoPE. In this paper, we introduce two key insights: first, the application of RoPE to the key vectors increases their variance, which in turn results in a higher rank; second, after the key vectors are transformed into the latent space, they largely maintain their representation across most layers. Based on these insights, we propose the Sparse Attention in Latent Space framework. SALS projects the KV cache into a compact latent space via low-rank projection, and performs sparse token selection using RoPE-free query-key interactions in this space. By reconstructing only a small subset of important tokens, it avoids the overhead of full KV cache reconstruction. We comprehensively evaluate SALS on various tasks using two large-scale models: LLaMA2-7b-chat and Mistral-7b, and additionally verify its scalability on the RULER-128k benchmark with LLaMA3.1-8B-Instruct. Experimental results demonstrate that SALS achieves SOTA performance by maintaining competitive accuracy. Under different settings, SALS achieves 6.4-fold KV cache compression and 5.7-fold speed-up in the attention operator compared to FlashAttention2 on the 4K sequence. For the end-to-end throughput performance, we achieves 1.4-fold and 4.5-fold improvement compared to GPT-fast on 4k and 32K sequences, respectively.
☆ UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation NeurIPS 2025
Data augmentation using generative models has emerged as a powerful paradigm for enhancing performance in computer vision tasks. However, most existing augmentation approaches primarily focus on optimizing intrinsic data attributes -- such as fidelity and diversity -- to generate visually high-quality synthetic data, while often neglecting task-specific requirements. Yet, it is essential for data generators to account for the needs of downstream tasks, as training data requirements can vary significantly across different tasks and network architectures. To address these limitations, we propose UtilGen, a novel utility-centric data augmentation framework that adaptively optimizes the data generation process to produce task-specific, high-utility training data via downstream task feedback. Specifically, we first introduce a weight allocation network to evaluate the task-specific utility of each synthetic sample. Guided by these evaluations, UtilGen iteratively refines the data generation process using a dual-level optimization strategy to maximize the synthetic data utility: (1) model-level optimization tailors the generative model to the downstream task, and (2) instance-level optimization adjusts generation policies -- such as prompt embeddings and initial noise -- at each generation round. Extensive experiments on eight benchmark datasets of varying complexity and granularity demonstrate that UtilGen consistently achieves superior performance, with an average accuracy improvement of 3.87% over previous SOTA. Further analysis of data influence and distribution reveals that UtilGen produces more impactful and task-relevant synthetic data, validating the effectiveness of the paradigm shift from visual characteristics-centric to task utility-centric data augmentation.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ From Memorization to Reasoning in the Spectrum of Loss Curvature
We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on the loss landscape curvature. This insight is based on prior theoretical and empirical work showing that the curvature for memorized training points is much sharper than non memorized, meaning ordering weight components from high to low curvature can reveal a distinction without explicit labels. This motivates a weight editing procedure that suppresses far more recitation of untargeted memorized data more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the basis of curvature has a natural interpretation for shared structure in model weights, we analyze the editing procedure extensively on its effect on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently negatively affected, even though open book fact retrieval and general logical reasoning is conserved. We posit these tasks rely heavily on specialized directions in weight space rather than general purpose mechanisms, regardless of whether those individual datapoints are memorized. We support this by showing a correspondence between task data's activation strength with low curvature components that we edit out, and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly-used structures involved in solving tasks like math and fact retrieval.
☆ Forecasting precipitation in the Arctic using probabilistic machine learning informed by causal climate drivers
Understanding and forecasting precipitation events in the Arctic maritime environments, such as Bear Island and Ny-{\AA}lesund, is crucial for assessing climate risk and developing early warning systems in vulnerable marine regions. This study proposes a probabilistic machine learning framework for modeling and predicting the dynamics and severity of precipitation. We begin by analyzing the scale-dependent relationships between precipitation and key atmospheric drivers (e.g., temperature, relative humidity, cloud cover, and air pressure) using wavelet coherence, which captures localized dependencies across time and frequency domains. To assess joint causal influences, we employ Synergistic-Unique-Redundant Decomposition, which quantifies the impact of interaction effects among each variable on future precipitation dynamics. These insights inform the development of data-driven forecasting models that incorporate both historical precipitation and causal climate drivers. To account for uncertainty, we employ the conformal prediction method, which enables the generation of calibrated non-parametric prediction intervals. Our results underscore the importance of utilizing a comprehensive framework that combines causal analysis with probabilistic forecasting to enhance the reliability and interpretability of precipitation predictions in Arctic marine environments.
☆ Enabling Near-realtime Remote Sensing via Satellite-Ground Collaboration of Large Vision-Language Models
Large vision-language models (LVLMs) have recently demonstrated great potential in remote sensing (RS) tasks (e.g., disaster monitoring) conducted by low Earth orbit (LEO) satellites. However, their deployment in real-world LEO satellite systems remains largely unexplored, hindered by limited onboard computing resources and brief satellite-ground contacts. We propose Grace, a satellite-ground collaborative system designed for near-realtime LVLM inference in RS tasks. Accordingly, we deploy compact LVLM on satellites for realtime inference, but larger ones on ground stations (GSs) to guarantee end-to-end performance. Grace is comprised of two main phases that are asynchronous satellite-GS Retrieval-Augmented Generation (RAG), and a task dispatch algorithm. Firstly, we still the knowledge archive of GS RAG to satellite archive with tailored adaptive update algorithm during limited satellite-ground data exchange period. Secondly, propose a confidence-based test algorithm that either processes the task onboard the satellite or offloads it to the GS. Extensive experiments based on real-world satellite orbital data show that Grace reduces the average latency by 76-95% compared to state-of-the-art methods, without compromising inference accuracy.
comment: 15 pages, 11 figures
☆ Temporal Knowledge Graph Hyperedge Forecasting: Exploring Entity-to-Category Link Prediction
Temporal Knowledge Graphs have emerged as a powerful way of not only modeling static relationships between entities but also the dynamics of how relations evolve over time. As these informational structures can be used to store information from a real-world setting, such as a news flow, predicting future graph components to a certain extent equates predicting real-world events. Most of the research in this field focuses on embedding-based methods, often leveraging convolutional neural net architectures. These solutions act as black boxes, limiting insight. In this paper, we explore an extension to an established rule-based framework, TLogic, that yields a high accuracy in combination with explainable predictions. This offers transparency and allows the end-user to critically evaluate the rules applied at the end of the prediction stage. The new rule format incorporates entity category as a key component with the purpose of limiting rule application only to relevant entities. When categories are unknown for building the graph, we propose a data-driven method to generate them with an LLM-based approach. Additionally, we investigate the choice of aggregation method for scores of retrieved entities when performing category prediction.
☆ PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling
Reward models (RMs) are central to reinforcement learning from human feedback (RLHF), providing the critical supervision signals that align large language models (LLMs) with human preferences. While generative reward models (GRMs) offer greater interpretability than traditional scalar RMs, current training paradigms remain limited. Pair-wise methods rely on binary good-versus-bad labels, which cause mismatches for point-wise inference and necessitate complex pairing strategies for effective application in RLHF. On the other hand, point-wise methods require more elaborate absolute labeling with rubric-driven criteria, resulting in poor adaptability and high annotation costs. In this work, we propose the Preference-Aware Task-Adaptive Reward Model (PaTaRM), a unified framework that integrates a preference-aware reward (PAR) mechanism with dynamic rubric adaptation. PaTaRM leverages relative preference information from pairwise data to construct robust point-wise training signals, eliminating the need for explicit point-wise labels. Simultaneously, it employs a task-adaptive rubric system that flexibly generates evaluation criteria for both global task consistency and instance-specific fine-grained reasoning. This design enables efficient, generalizable, and interpretable reward modeling for RLHF. Extensive experiments show that PaTaRM achieves an average relative improvement of 4.7% on RewardBench and RMBench across Qwen3-8B and Qwen3-14B models. Furthermore, PaTaRM boosts downstream RLHF performance, with an average improvement of 13.6% across IFEval and InFoBench benchmarks, confirming its effectiveness and robustness. Our code is available at https://github.com/JaneEyre0530/PaTaRM.
☆ Sparse Optimistic Information Directed Sampling
Many high-dimensional online decision-making problems can be modeled as stochastic sparse linear bandits. Most existing algorithms are designed to achieve optimal worst-case regret in either the data-rich regime, where polynomial dependence on the ambient dimension is unavoidable, or the data-poor regime, where dimension-independence is possible at the cost of worse dependence on the number of rounds. In contrast, the sparse Information Directed Sampling (IDS) algorithm satisfies a Bayesian regret bound that has the optimal rate in both regimes simultaneously. In this work, we explore the use of Sparse Optimistic Information Directed Sampling (SOIDS) to achieve the same adaptivity in the worst-case setting, without Bayesian assumptions. Through a novel analysis that enables the use of a time-dependent learning rate, we show that SOIDS can optimally balance information and regret. Our results extend the theoretical guarantees of IDS, providing the first algorithm that simultaneously achieves optimal worst-case regret in both the data-rich and data-poor regimes. We empirically demonstrate the good performance of SOIDS.
☆ PRIVET: Privacy Metric Based on Extreme Value Theory
Deep generative models are often trained on sensitive data, such as genetic sequences, health data, or more broadly, any copyrighted, licensed or protected content. This raises critical concerns around privacy-preserving synthetic data, and more specifically around privacy leakage, an issue closely tied to overfitting. Existing methods almost exclusively rely on global criteria to estimate the risk of privacy failure associated to a model, offering only quantitative non interpretable insights. The absence of rigorous evaluation methods for data privacy at the sample-level may hinder the practical deployment of synthetic data in real-world applications. Using extreme value statistics on nearest-neighbor distances, we propose PRIVET, a generic sample-based, modality-agnostic algorithm that assigns an individual privacy leak score to each synthetic sample. We empirically demonstrate that PRIVET reliably detects instances of memorization and privacy leakage across diverse data modalities, including settings with very high dimensionality, limited sample sizes such as genetic data and even under underfitting regimes. We compare our method to existing approaches under controlled settings and show its advantage in providing both dataset level and sample level assessments through qualitative and quantitative outputs. Additionally, our analysis reveals limitations in existing computer vision embeddings to yield perceptually meaningful distances when identifying near-duplicate samples.
☆ A comparison between joint and dual UKF implementations for state estimation and leak localization in water distribution networks
The sustainability of modern cities highly depends on efficient water distribution management, including effective pressure control and leak detection and localization. Accurate information about the network hydraulic state is therefore essential. This article presents a comparison between two data-driven state estimation methods based on the Unscented Kalman Filter (UKF), fusing pressure, demand and flow data for head and flow estimation. One approach uses a joint state vector with a single estimator, while the other uses a dual-estimator scheme. We analyse their main characteristics, discussing differences, advantages and limitations, and compare them theoretically in terms of accuracy and complexity. Finally, we show several estimation results for the L-TOWN benchmark, allowing to discuss their properties in a real implementation.
comment: This work has been submitted to ECC2026 for review. It has 7 pages and 2 figures
☆ Closing Gaps: An Imputation Analysis of ICU Vital Signs
As more Intensive Care Unit (ICU) data becomes available, the interest in developing clinical prediction models to improve healthcare protocols increases. However, the lack of data quality still hinders clinical prediction using Machine Learning (ML). Many vital sign measurements, such as heart rate, contain sizeable missing segments, leaving gaps in the data that could negatively impact prediction performance. Previous works have introduced numerous time-series imputation techniques. Nevertheless, more comprehensive work is needed to compare a representative set of methods for imputing ICU vital signs and determine the best practice. In reality, ad-hoc imputation techniques that could decrease prediction accuracy, like zero imputation, are still used. In this work, we compare established imputation techniques to guide researchers in improving the performance of clinical prediction models by selecting the most accurate imputation technique. We introduce an extensible and reusable benchmark with currently 15 imputation and 4 amputation methods, created for benchmarking on major ICU datasets. We hope to provide a comparative basis and facilitate further ML development to bring more models into clinical practice.
comment: Preprint
☆ Unlocking Out-of-Distribution Generalization in Dynamics through Physics-Guided Augmentation
In dynamical system modeling, traditional numerical methods are limited by high computational costs, while modern data-driven approaches struggle with data scarcity and distribution shifts. To address these fundamental limitations, we first propose SPARK, a physics-guided quantitative augmentation plugin. Specifically, SPARK utilizes a reconstruction autoencoder to integrate physical parameters into a physics-rich discrete state dictionary. This state dictionary then acts as a structured dictionary of physical states, enabling the creation of new, physically-plausible training samples via principled interpolation in the latent space. Further, for downstream prediction, these augmented representations are seamlessly integrated with a Fourier-enhanced Graph ODE, a combination designed to robustly model the enriched data distribution while capturing long-term temporal dependencies. Extensive experiments on diverse benchmarks demonstrate that SPARK significantly outperforms state-of-the-art baselines, particularly in challenging out-of-distribution scenarios and data-scarce regimes, proving the efficacy of our physics-guided augmentation paradigm.
☆ What Can Be Recovered Under Sparse Adversarial Corruption? Assumption-Free Theory for Linear Measurements
Let \(\bm{A} \in \mathbb{R}^{m \times n}\) be an arbitrary, known matrix and \(\bm{e}\) a \(q\)-sparse adversarial vector. Given \(\bm{y} = \bm{A} x^* + \bm{e}\) and \(q\), we seek the smallest set containing \(x^*\)-hence the one conveying maximal information about \(x^*\)-that is uniformly recoverable from \(\bm{y}\) without knowing \(\bm{e}\). While exact recovery of \(x^*\) via strong (and often impractical) structural assumptions on \(\bm{A}\) or \(x^*\) (for example, restricted isometry, sparsity) is well studied, recoverability for arbitrary \(\bm{A}\) and \(x^*\) remains open. Our main result shows that the best that one can hope to recover is \(x^* + \ker(\bm{U})\), where \(\bm{U}\) is the unique projection matrix onto the intersection of rowspaces of all possible submatrices of \(\bm{A}\) obtained by deleting \(2q\) rows. Moreover, we prove that every \(x\) that minimizes the \(\ell\_0\)-norm of \(\bm{y} - \bm{A} x\) lies in \(x^* + \ker(\bm{U})\), which then gives a constructive approach to recover this set.
☆ Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment
Large Language Models (LLMs) encode vast amounts of knowledge in their massive parameters, which is accessible to locate, trace, and analyze. Despite advances in neural interpretability, it is still not clear how to transfer knowledge in a fine-grained manner, namely parametric knowledge transfer (PKT). A key problem is enabling effective and efficient knowledge transfer across LLMs of different scales, which is essential for achieving greater flexibility and broader applicability in transferring knowledge between LLMs. Due to neural incompatibility, referring to the architectural and parametric differences between LLMs of varying scales, existing methods that directly reuse layer parameters are severely limited. In this paper, we identify the semantic alignment in latent space as the fundamental prerequisite for LLM cross-scale knowledge transfer. Instead of directly using the layer parameters, our approach takes activations as the medium of layer-wise knowledge transfer. Leveraging the semantics in latent space, our approach is simple and outperforms prior work, better aligning model behaviors across varying scales. Evaluations on four benchmarks demonstrate the efficacy of our method. Further analysis reveals the key factors easing cross-scale knowledge transfer and provides insights into the nature of latent semantic alignment.
comment: an early-stage version
☆ SPEAR++: Scaling Gradient Inversion via Sparsely-Used Dictionary Learning NeurIPS 2025
Federated Learning has seen an increased deployment in real-world scenarios recently, as it enables the distributed training of machine learning models without explicit data sharing between individual clients. Yet, the introduction of the so-called gradient inversion attacks has fundamentally challenged its privacy-preserving properties. Unfortunately, as these attacks mostly rely on direct data optimization without any formal guarantees, the vulnerability of real-world systems remains in dispute and requires tedious testing for each new federated deployment. To overcome these issues, recently the SPEAR attack was introduced, which is based on a theoretical analysis of the gradients of linear layers with ReLU activations. While SPEAR is an important theoretical breakthrough, the attack's practicality was severely limited by its exponential runtime in the batch size b. In this work, we fill this gap by applying State-of-the-Art techniques from Sparsely-Used Dictionary Learning to make the problem of gradient inversion on linear layers with ReLU activations tractable. Our experiments demonstrate that our new attack, SPEAR++, retains all desirable properties of SPEAR, such as robustness to DP noise and FedAvg aggregation, while being applicable to 10x bigger batch sizes.
comment: Published at the Workshop on Regulatable ML at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ Blindfolded Experts Generalize Better: Insights from Robotic Manipulation and Videogames
Behavioral cloning is a simple yet effective technique for learning sequential decision-making from demonstrations. Recently, it has gained prominence as the core of foundation models for the physical world, where achieving generalization requires countless demonstrations of a multitude of tasks. Typically, a human expert with full information on the task demonstrates a (nearly) optimal behavior. In this paper, we propose to hide some of the task's information from the demonstrator. This ``blindfolded'' expert is compelled to employ non-trivial exploration to solve the task. We show that cloning the blindfolded expert generalizes better to unseen tasks than its fully-informed counterpart. We conduct experiments of real-world robot peg insertion tasks with (limited) human demonstrations, alongside videogames from the Procgen benchmark. Additionally, we support our findings with theoretical analysis, which confirms that the generalization error scales with $\sqrt{I/m}$, where $I$ measures the amount of task information available to the demonstrator, and $m$ is the number of demonstrated tasks. Both theory and practice indicate that cloning blindfolded experts generalizes better with fewer demonstrated tasks. Project page with videos and code: https://sites.google.com/view/blindfoldedexperts/home
☆ Self-Concordant Perturbations for Linear Bandits
We study the adversarial linear bandits problem and present a unified algorithmic framework that bridges Follow-the-Regularized-Leader (FTRL) and Follow-the-Perturbed-Leader (FTPL) methods, extending the known connection between them from the full-information setting. Within this framework, we introduce self-concordant perturbations, a family of probability distributions that mirror the role of self-concordant barriers previously employed in the FTRL-based SCRiBLe algorithm. Using this idea, we design a novel FTPL-based algorithm that combines self-concordant regularization with efficient stochastic exploration. Our approach achieves a regret of $O(d\sqrt{n \ln n})$ on both the $d$-dimensional hypercube and the Euclidean ball. On the Euclidean ball, this matches the rate attained by existing self-concordant FTRL methods. For the hypercube, this represents a $\sqrt{d}$ improvement over these methods and matches the optimal bound up to logarithmic factors.
☆ V-SAT: Video Subtitle Annotation Tool
The surge of audiovisual content on streaming platforms and social media has heightened the demand for accurate and accessible subtitles. However, existing subtitle generation methods primarily speech-based transcription or OCR-based extraction suffer from several shortcomings, including poor synchronization, incorrect or harmful text, inconsistent formatting, inappropriate reading speeds, and the inability to adapt to dynamic audio-visual contexts. Current approaches often address isolated issues, leaving post-editing as a labor-intensive and time-consuming process. In this paper, we introduce V-SAT (Video Subtitle Annotation Tool), a unified framework that automatically detects and corrects a wide range of subtitle quality issues. By combining Large Language Models(LLMs), Vision-Language Models (VLMs), Image Processing, and Automatic Speech Recognition (ASR), V-SAT leverages contextual cues from both audio and video. Subtitle quality improved, with the SUBER score reduced from 9.6 to 3.54 after resolving all language mode issues and F1-scores of ~0.80 for image mode issues. Human-in-the-loop validation ensures high-quality results, providing the first comprehensive solution for robust subtitle annotation.
☆ EddyFormer: Accelerated Neural Simulations of Three-Dimensional Turbulence at Scale NeurIPS 2025
Computationally resolving turbulence remains a central challenge in fluid dynamics due to its multi-scale interactions. Fully resolving large-scale turbulence through direct numerical simulation (DNS) is computationally prohibitive, motivating data-driven machine learning alternatives. In this work, we propose EddyFormer, a Transformer-based spectral-element (SEM) architecture for large-scale turbulence simulation that combines the accuracy of spectral methods with the scalability of the attention mechanism. We introduce an SEM tokenization that decomposes the flow into grid-scale and subgrid-scale components, enabling capture of both local and global features. We create a new three-dimensional isotropic turbulence dataset and train EddyFormer to achieves DNS-level accuracy at 256^3 resolution, providing a 30x speedup over DNS. When applied to unseen domains up to 4x larger than in training, EddyFormer preserves accuracy on physics-invariant metrics-energy spectra, correlation functions, and structure functions-showing domain generalization. On The Well benchmark suite of diverse turbulent flows, EddyFormer resolves cases where prior ML models fail to converge, accurately reproducing complex dynamics across a wide range of physical conditions.
comment: NeurIPS 2025
☆ Identifiable learning of dissipative dynamics
Complex dissipative systems appear across science and engineering, from polymers and active matter to learning algorithms. These systems operate far from equilibrium, where energy dissipation and time irreversibility are key to their behavior, but are difficult to quantify from data. Learning accurate and interpretable models of such dynamics remains a major challenge: the models must be expressive enough to describe diverse processes, yet constrained enough to remain physically meaningful and mathematically identifiable. Here, we introduce I-OnsagerNet, a neural framework that learns dissipative stochastic dynamics directly from trajectories while ensuring both interpretability and uniqueness. I-OnsagerNet extends the Onsager principle to guarantee that the learned potential is obtained from the stationary density and that the drift decomposes cleanly into time-reversible and time-irreversible components, as dictated by the Helmholtz decomposition. Our approach enables us to calculate the entropy production and to quantify irreversibility, offering a principled way to detect and quantify deviations from equilibrium. Applications to polymer stretching in elongational flow and to stochastic gradient Langevin dynamics reveal new insights, including super-linear scaling of barrier heights and sub-linear scaling of entropy production rates with the strain rate, and the suppression of irreversibility with increasing batch size. I-OnsagerNet thus establishes a general, data-driven framework for discovering and interpreting non-equilibrium dynamics.
Self-supervised Synthetic Pretraining for Inference of Stellar Mass Embedded in Dense Gas NeurIPS 2025
Stellar mass is a fundamental quantity that determines the properties and evolution of stars. However, estimating stellar masses in star-forming regions is challenging because young stars are obscured by dense gas and the regions are highly inhomogeneous, making spherical dynamical estimates unreliable. Supervised machine learning could link such complex structures to stellar mass, but it requires large, high-quality labeled datasets from high-resolution magneto-hydrodynamical (MHD) simulations, which are computationally expensive. We address this by pretraining a vision transformer on one million synthetic fractal images using the self-supervised framework DINOv2, and then applying the frozen model to limited high-resolution MHD simulations. Our results demonstrate that synthetic pretraining improves frozen-feature regression stellar mass predictions, with the pretrained model performing slightly better than a supervised model trained on the same limited simulations. Principal component analysis of the extracted features further reveals semantically meaningful structures, suggesting that the model enables unsupervised segmentation of star-forming regions without the need for labeled data or fine-tuning.
comment: 6 pages, 3 figures, 1 table, accepted for NeurIPS 2025 ML4PS workshop
☆ Fixed Point Neural Acceleration and Inverse Surrogate Model for Battery Parameter Identification
The rapid expansion of electric vehicles has intensified the need for accurate and efficient diagnosis of lithium-ion batteries. Parameter identification of electrochemical battery models is widely recognized as a powerful method for battery health assessment. However, conventional metaheuristic approaches suffer from high computational cost and slow convergence, and recent machine learning methods are limited by their reliance on constant current data, which may not be available in practice. To overcome these challenges, we propose deep learning-based framework for parameter identification of electrochemical battery models. The proposed framework combines a neural surrogate model of the single particle model with electrolyte (NeuralSPMe) and a deep learning-based fixed-point iteration method. NeuralSPMe is trained on realistic EV load profiles to accurately predict lithium concentration dynamics under dynamic operating conditions while a parameter update network (PUNet) performs fixed-point iterative updates to significantly reduce both the evaluation time per sample and the overall number of iterations required for convergence. Experimental evaluations demonstrate that the proposed framework accelerates the parameter identification by more than 2000 times, achieves superior sample efficiency and more than 10 times higher accuracy compared to conventional metaheuristic algorithms, particularly under dynamic load scenarios encountered in practical applications.
comment: 31 pages, 11 figures, submitted to Applied Energy
☆ Causal Convolutional Neural Networks as Finite Impulse Response Filters
This study investigates the behavior of Causal Convolutional Neural Networks (CNNs) with quasi-linear activation functions when applied to time-series data characterized by multimodal frequency content. We demonstrate that, once trained, such networks exhibit properties analogous to Finite Impulse Response (FIR) filters, particularly when the convolutional kernels are of extended length exceeding those typically employed in standard CNN architectures. Causal CNNs are shown to capture spectral features both implicitly and explicitly, offering enhanced interpretability for tasks involving dynamic systems. Leveraging the associative property of convolution, we further show that the entire network can be reduced to an equivalent single-layer filter resembling an FIR filter optimized via least-squares criteria. This equivalence yields new insights into the spectral learning behavior of CNNs trained on signals with sparse frequency content. The approach is validated on both simulated beam dynamics and real-world bridge vibration datasets, underlining its relevance for modeling and identifying physical systems governed by dynamic responses.
comment: 14 pages, 19 figures, Under review
☆ Graph-Guided Concept Selection for Efficient Retrieval-Augmented Generation
Graph-based RAG constructs a knowledge graph (KG) from text chunks to enhance retrieval in Large Language Model (LLM)-based question answering. It is especially beneficial in domains such as biomedicine, law, and political science, where effective retrieval often involves multi-hop reasoning over proprietary documents. However, these methods demand numerous LLM calls to extract entities and relations from text chunks, incurring prohibitive costs at scale. Through a carefully designed ablation study, we observe that certain words (termed concepts) and their associated documents are more important. Based on this insight, we propose Graph-Guided Concept Selection (G2ConS). Its core comprises a chunk selection method and an LLM-independent concept graph. The former selects salient document chunks to reduce KG construction costs; the latter closes knowledge gaps introduced by chunk selection at zero cost. Evaluations on multiple real-world datasets show that G2ConS outperforms all baselines in construction cost, retrieval effectiveness, and answering quality.
☆ HistoLens: An Interactive XAI Toolkit for Verifying and Mitigating Flaws in Vision-Language Models for Histopathology
For doctors to truly trust artificial intelligence, it can't be a black box. They need to understand its reasoning, almost as if they were consulting a colleague. We created HistoLens1 to be that transparent, collaborative partner. It allows a pathologist to simply ask a question in plain English about a tissue slide--just as they would ask a trainee. Our system intelligently translates this question into a precise query for its AI engine, which then provides a clear, structured report. But it doesn't stop there. If a doctor ever asks, "Why?", HistoLens can instantly provide a 'visual proof' for any finding--a heatmap that points to the exact cells and regions the AI used for its analysis. We've also ensured the AI focuses only on the patient's tissue, just like a trained pathologist would, by teaching it to ignore distracting background noise. The result is a workflow where the pathologist remains the expert in charge, using a trustworthy AI assistant to verify their insights and make faster, more confident diagnoses.
☆ Taming the Tail: NoI Topology Synthesis for Mixed DL Workloads on Chiplet-Based Accelerators
Heterogeneous chiplet-based systems improve scaling by disag-gregating CPUs/GPUs and emerging technologies (HBM/DRAM).However this on-package disaggregation introduces a latency inNetwork-on-Interposer(NoI). We observe that in modern large-modelinference, parameters and activations routinely move backand forth from HBM/DRAM, injecting large, bursty flows into theinterposer. These memory-driven transfers inflate tail latency andviolate Service Level Agreements (SLAs) across k-ary n-cube base-line NoI topologies. To address this gap we introduce an InterferenceScore (IS) that quantifies worst-case slowdown under contention.We then formulate NoI synthesis as a multi-objective optimization(MOO) problem. We develop PARL (Partition-Aware ReinforcementLearner), a topology generator that balances throughput, latency,and power. PARL-generated topologies reduce contention at the memory cut, meet SLAs, and cut worst-case slowdown to 1.2 times while maintaining competitive mean throughput relative to link-rich meshes. Overall, this reframes NoI design for heterogeneouschiplet accelerators with workload-aware objectives.
☆ Enhancing Pre-trained Representation Classifiability can Boost its Interpretability ICLR 2025
The visual representation of a pre-trained model prioritizes the classifiability on downstream tasks, while the widespread applications for pre-trained visual models have posed new requirements for representation interpretability. However, it remains unclear whether the pre-trained representations can achieve high interpretability and classifiability simultaneously. To answer this question, we quantify the representation interpretability by leveraging its correlation with the ratio of interpretable semantics within the representations. Given the pre-trained representations, only the interpretable semantics can be captured by interpretations, whereas the uninterpretable part leads to information loss. Based on this fact, we propose the Inherent Interpretability Score (IIS) that evaluates the information loss, measures the ratio of interpretable semantics, and quantifies the representation interpretability. In the evaluation of the representation interpretability with different classifiability, we surprisingly discover that the interpretability and classifiability are positively correlated, i.e., representations with higher classifiability provide more interpretable semantics that can be captured in the interpretations. This observation further supports two benefits to the pre-trained representations. First, the classifiability of representations can be further improved by fine-tuning with interpretability maximization. Second, with the classifiability improvement for the representations, we obtain predictions based on their interpretations with less accuracy degradation. The discovered positive correlation and corresponding applications show that practitioners can unify the improvements in interpretability and classifiability for pre-trained vision models. Codes are available at https://github.com/ssfgunner/IIS.
comment: ICLR 2025 (Spotlight)
☆ Learning Parameterized Skills from Demonstrations
We present DEPS, an end-to-end algorithm for discovering parameterized skills from expert demonstrations. Our method learns parameterized skill policies jointly with a meta-policy that selects the appropriate discrete skill and continuous parameters at each timestep. Using a combination of temporal variational inference and information-theoretic regularization methods, we address the challenge of degeneracy common in latent variable models, ensuring that the learned skills are temporally extended, semantically meaningful, and adaptable. We empirically show that learning parameterized skills from multitask expert demonstrations significantly improves generalization to unseen tasks. Our method outperforms multitask as well as skill learning baselines on both LIBERO and MetaWorld benchmarks. We also demonstrate that DEPS discovers interpretable parameterized skills, such as an object grasping skill whose continuous arguments define the grasp location.
comment: Neurips 2025
☆ Information-Theoretic Discrete Diffusion NeurIPS 2025
We present an information-theoretic framework for discrete diffusion models that yields principled estimators of log-likelihood using score-matching losses. Inspired by the I-MMSE identity for the Gaussian setup, we derive analogous results for the discrete setting. Specifically, we introduce the Information-Minimum Denoising Score Entropy (I-MDSE) relation, which links mutual information between data and its diffused version to the minimum denoising score entropy (DSE) loss. We extend this theory to masked diffusion and establish the Information-Minimum Denoising Cross-Entropy (I-MDCE) relation, connecting cross-entropy losses to mutual information in discrete masked processes. These results provide a time-integral decomposition of the log-likelihood of the data in terms of optimal score-based losses, showing that commonly used losses such as DSE and DCE are not merely variational bounds but tight and principled estimators of log-likelihood. The I-MDCE decomposition further enables practical extensions, including time-free formula, conditional likelihood estimation in prompt-response tasks, and coupled Monte Carlo estimation of likelihood ratios. Experiments on synthetic and real-world data confirm the accuracy, variance stability, and utility of our estimators. The code is publicly available at https://github.com/Dongjae0324/infodis.
comment: Accepted at NeurIPS 2025
♻ ☆ Physics-Informed Latent Neural Operator for Real-time Predictions of time-dependent parametric PDEs
Deep operator network (DeepONet) has shown significant promise as surrogate models for systems governed by partial differential equations (PDEs), enabling accurate mappings between infinite-dimensional function spaces. However, when applied to systems with high-dimensional input-output mappings arising from large numbers of spatial and temporal collocation points, these models often require heavily overparameterized networks, leading to long training times. Latent DeepONet addresses some of these challenges by introducing a two-step approach: first learning a reduced latent space using a separate model, followed by operator learning within this latent space. While efficient, this method is inherently data-driven and lacks mechanisms for incorporating physical laws, limiting its robustness and generalizability in data-scarce settings. In this work, we propose PI-Latent-NO, a physics-informed latent neural operator framework that integrates governing physics directly into the learning process. Our architecture features two coupled DeepONets trained end-to-end: a Latent-DeepONet that learns a low-dimensional representation of the solution, and a Reconstruction-DeepONet that maps this latent representation back to the physical space. By embedding PDE constraints into the training via automatic differentiation, our method eliminates the need for labeled training data and ensures physics-consistent predictions. The proposed framework is both memory and compute-efficient, exhibiting near-constant scaling with problem size and demonstrating significant speedups over traditional physics-informed operator models. We validate our approach on a range of parametric PDEs, showcasing its accuracy, scalability, and suitability for real-time prediction in complex physical systems.
♻ ☆ DeltaPhi: Physical States Residual Learning for Neural Operators in Data-Limited PDE Solving
The limited availability of high-quality training data poses a major obstacle in data-driven PDE solving, where expensive data collection and resolution constraints severely impact the ability of neural operator networks to learn and generalize the underlying physical system. To address this challenge, we propose DeltaPhi, a novel learning framework that transforms the PDE solving task from learning direct input-output mappings to learning the residuals between similar physical states, a fundamentally different approach to neural operator learning. This reformulation provides implicit data augmentation by exploiting the inherent stability of physical systems where closer initial states lead to closer evolution trajectories. DeltaPhi is architecture-agnostic and can be seamlessly integrated with existing neural operators to enhance their performance. Extensive experiments demonstrate consistent and significant improvements across diverse physical systems including regular and irregular domains, different neural architectures, multiple training data amount, and cross-resolution scenarios, confirming its effectiveness as a general enhancement for neural operators in data-limited PDE solving.
comment: Neurips 2025
♻ ☆ Datasheets for Machine Learning Sensors
Machine learning (ML) is becoming prevalent in embedded AI sensing systems. These "ML sensors" enable context-sensitive, real-time data collection and decision-making across diverse applications ranging from anomaly detection in industrial settings to wildlife tracking for conservation efforts. As such, there is a need to provide transparency in the operation of such ML-enabled sensing systems through comprehensive documentation. This is needed to enable their reproducibility, to address new compliance and auditing regimes mandated in regulation and industry-specific policy, and to verify and validate the responsible nature of their operation. To address this gap, we introduce the datasheet for ML sensors framework. We provide a comprehensive template, collaboratively developed in academia-industry partnerships, that captures the distinct attributes of ML sensors, including hardware specifications, ML model and dataset characteristics, end-to-end performance metrics, and environmental impacts. Our framework addresses the continuous streaming nature of sensor data, real-time processing requirements, and embeds benchmarking methodologies that reflect real-world deployment conditions, ensuring practical viability. Aligned with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability), our approach enhances the transparency and reusability of ML sensor documentation across academic, industrial, and regulatory domains. To show the application of our approach, we present two datasheets: the first for an open-source ML sensor designed in-house and the second for a commercial ML sensor developed by industry collaborators, both performing computer vision-based person detection.
♻ ☆ ADMN: A Layer-Wise Adaptive Multimodal Network for Dynamic Input Noise and Compute Resources
Multimodal deep learning systems are deployed in dynamic scenarios due to the robustness afforded by multiple sensing modalities. Nevertheless, they struggle with varying compute resource availability (due to multi-tenancy, device heterogeneity, etc.) and fluctuating quality of inputs (from sensor feed corruption, environmental noise, etc.). Statically provisioned multimodal systems cannot adapt when compute resources change over time, while existing dynamic networks struggle with strict compute budgets. Additionally, both systems often neglect the impact of variations in modality quality. Consequently, modalities suffering substantial corruption may needlessly consume resources better allocated towards other modalities. We propose ADMN, a layer-wise Adaptive Depth Multimodal Network capable of tackling both challenges: it adjusts the total number of active layers across all modalities to meet strict compute resource constraints and continually reallocates layers across input modalities according to their modality quality. Our evaluations showcase ADMN can match the accuracy of state-of-the-art networks while reducing up to 75% of their floating-point operations.
comment: Accepted to Neurips 2025
♻ ☆ SGFusion: Stochastic Geographic Gradient Fusion in Federated Learning
This paper proposes Stochastic Geographic Gradient Fusion (SGFusion), a novel training algorithm to leverage the geographic information of mobile users in Federated Learning (FL). SGFusion maps the data collected by mobile devices onto geographical zones and trains one FL model per zone, which adapts well to the data and behaviors of users in that zone. SGFusion models the local data-based correlation among geographical zones as a hierarchical random graph (HRG) optimized by Markov Chain Monte Carlo sampling. At each training step, every zone fuses its local gradient with gradients derived from a small set of other zones sampled from the HRG. This approach enables knowledge fusion and sharing among geographical zones in a probabilistic and stochastic gradient fusion process with self-attention weights, such that "more similar" zones have "higher probabilities" of sharing gradients with "larger attention weights." SGFusion remarkably improves model utility without introducing undue computational cost. Extensive theoretical and empirical results using a heart-rate prediction dataset collected across 6 countries show that models trained with SGFusion converge with upper-bounded expected errors and significantly improve utility in all countries compared to existing approaches without notable cost in system scalability.
♻ ☆ Hybrid Deep Learning Model to Estimate Cognitive Effort from fNIRS Signals
This study estimates cognitive effort based on functional near-infrared spectroscopy data and performance scores using a hybrid DeepNet model. The estimation of cognitive effort enables educators to modify material to enhance learning effectiveness and student engagement. In this study, we collected oxygenated hemoglobin using functional near-infrared spectroscopy during an educational quiz game. Participants (n=16) responded to 16 questions in a Unity-based educational game, each within a 30-second response time limit. We used DeepNet models to predict the performance score from the oxygenated hemoglobin, and compared traditional machine learning and DeepNet models to determine which approach provides better accuracy in predicting performance scores. The result shows that the proposed CNN-GRU gives better performance with 73% than other models. After the prediction, we used the predicted score and the oxygenated hemoglobin to observe cognitive effort by calculating relative neural efficiency and involvement in our test cases. Our result shows that even with moderate accuracy, the predicted cognitive effort closely follow the actual trends. This findings can be helpful in designing and improving learning environments and provide valuable insights into learning materials.
♻ ☆ Global Optimization of Gaussian Process Acquisition Functions Using a Piecewise-Linear Kernel Approximation
Bayesian optimization relies on iteratively constructing and optimizing an acquisition function. The latter turns out to be a challenging, non-convex optimization problem itself. Despite the relative importance of this step, most algorithms employ sampling- or gradient-based methods, which do not provably converge to global optima. This work investigates mixed-integer programming (MIP) as a paradigm for global acquisition function optimization. Specifically, our Piecewise-linear Kernel Mixed Integer Quadratic Programming (PK-MIQP) formulation introduces a piecewise-linear approximation for Gaussian process kernels and admits a corresponding MIQP representation for acquisition functions. The proposed method is applicable to uncertainty-based acquisition functions for any stationary or dot-product kernel. We analyze the theoretical regret bounds of the proposed approximation, and empirically demonstrate the framework on synthetic functions, constrained benchmarks, and a hyperparameter tuning task.
comment: 18 pages, 4 figures, 5 tables
♻ ☆ Says Who? Effective Zero-Shot Annotation of Focalization
Focalization describes the way in which access to narrative information is restricted or controlled based on the knowledge available to knowledge of the narrator. It is encoded via a wide range of lexico-grammatical features and is subject to reader interpretation. Even trained annotators frequently disagree on correct labels, suggesting this task is both qualitatively and computationally challenging. In this work, we test how well five contemporary large language model (LLM) families and two baselines perform when annotating short literary excerpts for focalization. Despite the challenging nature of the task, we find that LLMs show comparable performance to trained human annotators, with GPT-4o achieving an average F1 of 84.79%. Further, we demonstrate that the log probabilities output by GPT-family models frequently reflect the difficulty of annotating particular excerpts. Finally, we provide a case study analyzing sixteen Stephen King novels, demonstrating the usefulness of this approach for computational literary studies and the insights gleaned from examining focalization at scale.
comment: Accepted at CHR 2025
♻ ☆ TableTime: Reformulating Time Series Classification as Training-Free Table Understanding with Large Language Models
Large language models (LLMs) have demonstrated their effectiveness in multivariate time series classification (MTSC). Effective adaptation of LLMs for MTSC necessitates informative data representations. Existing LLM-based methods directly encode embeddings for time series within the latent space of LLMs from scratch to align with semantic space of LLMs. Despite their effectiveness, we reveal that these methods conceal three inherent bottlenecks: (1) they struggle to encode temporal and channel-specific information in a lossless manner, both of which are critical components of multivariate time series; (2) it is much difficult to align the learned representation space with the semantic space of the LLMs; (3) they require task-specific retraining, which is both computationally expensive and labor-intensive. To bridge these gaps, we propose TableTime, which reformulates MTSC as a table understanding task. Specifically, TableTime introduces the following strategies: (1) convert multivariate time series into a tabular form, thus minimizing information loss to the greatest extent; (2) represent tabular time series in text format to achieve natural alignment with the semantic space of LLMs; (3) design a reasoning framework that integrates contextual text information, neighborhood assistance, multi-path inference and problem decomposition to enhance the reasoning ability of LLMs and realize zero-shot classification. Extensive experiments performed on 10 publicly representative datasets from UEA archive verify the superiorities of the TableTime.
♻ ☆ FedMAP: Personalised Federated Learning for Real Large-Scale Healthcare Systems
Federated learning (FL) promises to enable collaborative machine learning across healthcare sites whilst preserving data privacy. Practical deployment remains limited by statistical heterogeneity arising from differences in patient demographics, treatments, and outcomes, and infrastructure constraints. We introduce FedMAP, a personalised FL (PFL) framework that addresses heterogeneity through local Maximum a Posteriori (MAP) estimation with Input Convex Neural Network priors. These priors represent global knowledge gathered from other sites that guides the model while adapting to local data, and we provide a formal proof of convergence. Unlike many PFL methods that rely on fixed regularisation, FedMAP's prior adaptively learns patterns that capture complex inter-site relationships. We demonstrate improved performance compared to local training, FedAvg, and several PFL methods across three large-scale clinical datasets: 10-year cardiovascular risk prediction (CPRD, 387 general practitioner practices, 258,688 patients), iron deficiency detection (INTERVAL, 4 donor centres, 31,949 blood donors), and mortality prediction (eICU, 150 hospitals, 44,842 patients). FedMAP incorporates a three-tier design that enables participation across healthcare sites with varying infrastructure and technical capabilities, from full federated training to inference-only deployment. Geographical analysis reveals substantial equity improvements, with underperforming regions achieving up to 14.3% performance gains. This framework provides the first practical pathway for large-scale healthcare FL deployment, which ensures clinical sites at all scales can benefit, equity is enhanced, and privacy is retained.
♻ ☆ GST-UNet: A Neural Framework for Spatiotemporal Causal Inference with Time-Varying Confounding NeurIPS 2025
Estimating causal effects from spatiotemporal observational data is essential in public health, environmental science, and policy evaluation, where randomized experiments are often infeasible. Existing approaches, however, either rely on strong structural assumptions or fail to handle key challenges such as interference, spatial confounding, temporal carryover, and time-varying confounding -- where covariates are influenced by past treatments and, in turn, affect future ones. We introduce GST-UNet (G-computation Spatio-Temporal UNet), a theoretically grounded neural framework that combines a U-Net-based spatiotemporal encoder with regression-based iterative G-computation to estimate location-specific potential outcomes under complex intervention sequences. GST-UNet explicitly adjusts for time-varying confounders and captures non-linear spatial and temporal dependencies, enabling valid causal inference from a single observed trajectory in data-scarce settings. We validate its effectiveness in synthetic experiments and in a real-world analysis of wildfire smoke exposure and respiratory hospitalizations during the 2018 California Camp Fire. Together, these results position GST-UNet as a principled and ready-to-use framework for spatiotemporal causal inference, advancing reliable estimation in policy-relevant and scientific domains.
comment: 29 pages, 6 figures, 6 tables, NeurIPS 2025
♻ ☆ Adaptive Anomaly Detection in Network Flows with Low-Rank Tensor Decompositions and Deep Unrolling
Anomaly detection (AD) is increasingly recognized as a key component for ensuring the resilience of future communication systems. While deep learning has shown state-of-the-art AD performance, its application in critical systems is hindered by concerns regarding training data efficiency, domain adaptation and interpretability. This work considers AD in network flows using incomplete measurements, leveraging a robust tensor decomposition approach and deep unrolling techniques to address these challenges. We first propose a novel block-successive convex approximation algorithm based on a regularized model-fitting objective where the normal flows are modeled as low-rank tensors and anomalies as sparse. An augmentation of the objective is introduced to decrease the computational cost. We apply deep unrolling to derive a novel deep network architecture based on our proposed algorithm, treating the regularization parameters as learnable weights. Inspired by Bayesian approaches, we extend the model architecture to perform online adaptation to per-flow and per-time-step statistics, improving AD performance while maintaining a low parameter count and preserving the problem's permutation equivariances. To optimize the deep network weights for detection performance, we employ a homotopy optimization approach based on an efficient approximation of the area under the receiver operating characteristic curve. Extensive experiments on synthetic and real-world data demonstrate that our proposed deep network architecture exhibits a high training data efficiency, outperforms reference methods, and adapts seamlessly to varying network topologies.
comment: 18 pages, 7 figures
♻ ☆ Robust Uncertainty Quantification for Self-Evolving Large Language Models via Continual Domain Pretraining
Continual Learning (CL) is essential for enabling self-evolving large language models (LLMs) to adapt and remain effective amid rapid knowledge growth. Yet, despite its importance, little attention has been given to establishing statistical reliability guarantees for LLMs under CL, particularly in the setting of continual domain pretraining (CDP). Conformal Prediction (CP) has shown promise in offering correctness guarantees for LLMs, but it faces major challenges in CDP: testing data often stems from unknown or shifting domain distributions, under which CP may no longer provide valid guarantees. Moreover, when high coverage is required, CP can yield excessively large prediction sets for unanswerable queries, reducing informativeness. To address these challenges, we introduce an adaptive rejection and non-exchangeable CP framework. Our method first estimates the distribution of questions across domains in the test set using transformer-based clustering, then reweights or resamples the calibration data accordingly. Building on this, adaptive rejection CP allows the LLM to selectively abstain from answering when its confidence or competence shifts significantly. Extensive experiments demonstrate that our framework enhances both the effectiveness and reliability of CP under CDP scenarios. Our code is available at: https://anonymous.4open.science/r/CPCL-8C12/
♻ ☆ GraSS: Scalable Data Attribution with Gradient Sparsification and Sparse Projection NeurIPS 2025
Gradient-based data attribution methods, such as influence functions, are critical for understanding the impact of individual training samples without requiring repeated model retraining. However, their scalability is often limited by the high computational and memory costs associated with per-sample gradient computation. In this work, we propose GraSS, a novel gradient compression algorithm and its variants FactGraSS for linear layers specifically, that explicitly leverage the inherent sparsity of per-sample gradients to achieve sub-linear space and time complexity. Extensive experiments demonstrate the effectiveness of our approach, achieving substantial speedups while preserving data influence fidelity. In particular, FactGraSS achieves up to 165% faster throughput on billion-scale models compared to the previous state-of-the-art baselines. Our code is publicly available at https://github.com/TRAIS-Lab/GraSS.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Online (Non-)Convex Learning via Tempered Optimism
Optimistic Online Learning aims to exploit experts conveying reliable information to predict the future. However, such implicit optimism may be challenged when it comes to practical crafting of such experts. A fundamental example consists in approximating a minimiser of the current problem and use it as expert. In the context of dynamic environments, such an expert only conveys partially relevant information as it may lead to overfitting. To tackle this issue, we introduce in this work the \emph{optimistically tempered} (OT) online learning framework designed to handle such imperfect experts. As a first contribution, we show that tempered optimism is a fruitful paradigm for Online Non-Convex Learning by proposing simple, yet powerful modification of Online Gradient and Mirror Descent. Second, we derive a second OT algorithm for convex losses and third, evaluate the practical efficiency of tempered optimism on real-life datasets and a toy experiment.
♻ ☆ Towards Real Unsupervised Anomaly Detection Via Confident Meta-Learning ICCV2025
So-called unsupervised anomaly detection is better described as semi-supervised, as it assumes all training data are nominal. This assumption simplifies training but requires manual data curation, introducing bias and limiting adaptability. We propose Confident Meta-learning (CoMet), a novel training strategy that enables deep anomaly detection models to learn from uncurated datasets where nominal and anomalous samples coexist, eliminating the need for explicit filtering. Our approach integrates Soft Confident Learning, which assigns lower weights to low-confidence samples, and Meta-Learning, which stabilizes training by regularizing updates based on training validation loss covariance. This prevents overfitting and enhances robustness to noisy data. CoMet is model-agnostic and can be applied to any anomaly detection method trainable via gradient descent. Experiments on MVTec-AD, VIADUCT, and KSDD2 with two state-of-the-art models demonstrate the effectiveness of our approach, consistently improving over the baseline methods, remaining insensitive to anomalies in the training set, and setting a new state-of-the-art across all datasets. Code is available at https://github.com/aqeeelmirza/CoMet
comment: Accepted to IEEE/CVF International Conference on Computer Vision (ICCV2025)
♻ ☆ $β$-DQN: Improving Deep Q-Learning By Evolving the Behavior
While many sophisticated exploration methods have been proposed, their lack of generality and high computational cost often lead researchers to favor simpler methods like $\epsilon$-greedy. Motivated by this, we introduce $\beta$-DQN, a simple and efficient exploration method that augments the standard DQN with a behavior function $\beta$. This function estimates the probability that each action has been taken at each state. By leveraging $\beta$, we generate a population of diverse policies that balance exploration between state-action coverage and overestimation bias correction. An adaptive meta-controller is designed to select an effective policy for each episode, enabling flexible and explainable exploration. $\beta$-DQN is straightforward to implement and adds minimal computational overhead to the standard DQN. Experiments on both simple and challenging exploration domains show that $\beta$-DQN outperforms existing baseline methods across a wide range of tasks, providing an effective solution for improving exploration in deep reinforcement learning.
comment: aamas 2025
♻ ☆ Uni-LoRA: One Vector is All You Need NeurIPS 2025
Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient fine-tuning (PEFT) method for large language models (LLMs) by constraining weight updates to low-rank matrices. Recent works such as Tied-LoRA, VeRA, and VB-LoRA push efficiency further by introducing additional constraints to reduce the trainable parameter space. In this paper, we show that the parameter space reduction strategies employed by these LoRA variants can be formulated within a unified framework, Uni-LoRA, where the LoRA parameter space, flattened as a high-dimensional vector space $R^D$, can be reconstructed through a projection from a subspace R^d, with $d \ll D$. We demonstrate that the fundamental difference among various LoRA methods lies in the choice of the projection matrix, $P \in R^{D \times d}$.Most existing LoRA variants rely on layer-wise or structure-specific projections that limit cross-layer parameter sharing, thereby compromising parameter efficiency. In light of this, we introduce an efficient and theoretically grounded projection matrix that is isometric, enabling global parameter sharing and reducing computation overhead. Furthermore, under the unified view of Uni-LoRA, this design requires only a single trainable vector to reconstruct LoRA parameters for the entire LLM - making Uni-LoRA both a unified framework and a "one-vector-only" solution. Extensive experiments on GLUE, mathematical reasoning, and instruction tuning benchmarks demonstrate that Uni-LoRA achieves state-of-the-art parameter efficiency while outperforming or matching prior approaches in predictive performance. Our code is available at https://github.com/KaiyangLi1992/Uni-LoRA.
comment: NeurIPS 2025 Spotlight
♻ ☆ Group-in-Group Policy Optimization for LLM Agent Training NeurIPS 2025
Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to multi-turn LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) At the episode-level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step-level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on challenging agent benchmarks, including ALFWorld and WebShop, as well as tool-integrated reasoning on search-augmented QA tasks, using Qwen2.5-1.5B/3B/7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals, achieves performance gains of > 12% on ALFWorld and > 9% on WebShop over GRPO, and obtains superior performance on QA tasks (42.1% on 3B and 47.2% on 7B): all while maintaining the same GPU memory overhead, identical LLM rollout, and incurring little to no additional time cost.
comment: NeurIPS 2025
♻ ☆ TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising NeurIPS 2025
Dark matter makes up approximately 85% of total matter in our universe, yet it has never been directly observed in any laboratory on Earth. The origin of dark matter is one of the most important questions in contemporary physics, and a convincing detection of dark matter would be a Nobel-Prize-level breakthrough in fundamental science. The ABRACADABRA experiment was specifically designed to search for dark matter. Although it has not yet made a discovery, ABRACADABRA has produced several dark matter search results widely endorsed by the physics community. The experiment generates ultra-long time-series data at a rate of 10 million samples per second, where the dark matter signal would manifest itself as a sinusoidal oscillation mode within the ultra-long time series. In this paper, we present the TIDMAD -- a comprehensive data release from the ABRACADABRA experiment including three key components: an ultra-long time series dataset divided into training, validation, and science subsets; a carefully-designed denoising score for direct model benchmarking; and a complete analysis framework which produces a community-standard dark matter search result suitable for publication as a physics paper. This data release enables core AI algorithms to extract the dark matter signal and produce real physics results thereby advancing fundamental science. The data downloading and associated analysis scripts are available at https://github.com/jessicafry/TIDMAD
comment: Accepted by NeurIPS 2025 (Spotlight)
♻ ☆ Mirror Descent and Novel Exponentiated Gradient Algorithms Using Trace-Form Entropies and Deformed Logarithms
This paper introduces a broad class of Mirror Descent (MD) and Generalized Exponentiated Gradient (GEG) algorithms derived from trace-form entropies defined via deformed logarithms. Leveraging these generalized entropies yields MD \& GEG algorithms with improved convergence behavior, robustness to vanishing and exploding gradients, and inherent adaptability to non-Euclidean geometries through mirror maps. We establish deep connections between these methods and Amari's natural gradient, revealing a unified geometric foundation for additive, multiplicative, and natural gradient updates. Focusing on the Tsallis, Kaniadakis, Sharma--Taneja--Mittal, and Kaniadakis--Lissia--Scarfone entropy families, we show that each entropy induces a distinct Riemannian metric on the parameter space, leading to GEG algorithms that preserve the natural statistical geometry. The tunable parameters of deformed logarithms enable adaptive geometric selection, providing enhanced robustness and convergence over classical Euclidean optimization. Overall, our framework unifies key first-order MD optimization methods under a single information-geometric perspective based on generalized Bregman divergences, where the choice of entropy determines the underlying metric and dual geometric structure.
comment: 22 pages, 9 figures
♻ ☆ Multimodal Dreaming: A Global Workspace Approach to World Model-Based Reinforcement Learning
Humans leverage rich internal models of the world to reason about the future, imagine counterfactuals, and adapt flexibly to new situations. In Reinforcement Learning (RL), world models aim to capture how the environment evolves in response to the agent's actions, facilitating planning and generalization. However, typical world models directly operate on the environment variables (e.g. pixels, physical attributes), which can make their training slow and cumbersome; instead, it may be advantageous to rely on high-level latent dimensions that capture relevant multimodal variables. Global Workspace (GW) Theory offers a cognitive framework for multimodal integration and information broadcasting in the brain, and recent studies have begun to introduce efficient deep learning implementations of GW. Here, we evaluate the capabilities of an RL system combining GW with a world model. We compare our GW-Dreamer with various versions of the standard PPO and the original Dreamer algorithms. We show that performing the dreaming process (i.e., mental simulation) inside the GW latent space allows for training with fewer environment steps. As an additional emergent property, the resulting model (but not its comparison baselines) displays strong robustness to the absence of one of its observation modalities (images or simulation attributes). We conclude that the combination of GW with World Models holds great potential for improving decision-making in RL agents.
comment: Under review
♻ ☆ Long-Term Mapping of the Douro River Plume with Multi-Agent Reinforcement Learning
We study the problem of long-term (multiple days) mapping of a river plume using multiple autonomous underwater vehicles (AUVs), focusing on the Douro river representative use-case. We propose an energy - and communication - efficient multi-agent reinforcement learning approach in which a central coordinator intermittently communicates with the AUVs, collecting measurements and issuing commands. Our approach integrates spatiotemporal Gaussian process regression (GPR) with a multi-head Q-network controller that regulates direction and speed for each AUV. Simulations using the Delft3D ocean model demonstrate that our method consistently outperforms both single- and multi-agent benchmarks, with scaling the number of agents both improving mean squared error (MSE) and operational endurance. In some instances, our algorithm demonstrates that doubling the number of AUVs can more than double endurance while maintaining or improving accuracy, underscoring the benefits of multi-agent coordination. Our learned policies generalize across unseen seasonal regimes over different months and years, demonstrating promise for future developments of data-driven long-term monitoring of dynamic plume environments.
♻ ☆ Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
BERT and its variants are extensively explored for automated scoring. However, a limit of 512 tokens for these encoder-based models showed the deficiency in automated scoring of long essays. Thus, this research explores generative language models for automated scoring of long essays via summarization and prompting. The results revealed great improvement of scoring accuracy with QWK increased from 0.822 to 0.8878 for the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
comment: 19 pages, 5 Tables 7 Figures, Presentation at Artificial Intelligence in Measurement and Education Conference (AIME-Con)
♻ ☆ AutoJudge: Judge Decoding Without Manual Annotation
We introduce AutoJudge, a method that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify which of the generated tokens affect the downstream quality of the response, relaxing the distribution match guarantee so that the "unimportant" tokens can be generated faster. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft models should be corrected to preserve quality and which ones may be skipped. We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without compromising the final answer quality. We evaluate the effectiveness of AutoJudge with multiple draft/target model pairs on mathematical reasoning and programming benchmarks, achieving significant speedups at the cost of a minor accuracy reduction. Notably, on GSM8k with the Llama 3.1 70B target model, our approach achieves up to $\approx2\times$ speedup over speculative decoding at the cost of $\le 1\%$ drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge automatically detects programming-specific important tokens, accepting $\ge 25$ tokens per speculation cycle at $2\%$ drop in Pass@1. Our approach requires no human annotation and is easy to integrate with modern LLM inference frameworks.
♻ ☆ Towards Personalized Treatment Plan: Geometrical Model-Agnostic Approach to Counterfactual Explanations
In our article, we describe a method for generating counterfactual explanations in high-dimensional spaces using four steps that involve fitting our dataset to a model, finding the decision boundary, determining constraints on the problem, and computing the closest point (counterfactual explanation) from that boundary. We propose a discretized approach where we find many discrete points on the boundary and then identify the closest feasible counterfactual explanation. This method, which we later call $\textit{Segmented Sampling for Boundary Approximation}$ (SSBA), applies binary search to find decision boundary points and then searches for the closest boundary point. Across four datasets of varying dimensionality, we show that our method can outperform current methods for counterfactual generation with reductions in distance between $5\%$ to $50\%$ in terms of the $L_2$ norm. Our method can also handle real-world constraints by restricting changes to immutable and categorical features, such as age, gender, sex, height, and other related characteristics such as the case for a health-based dataset. In terms of runtime, the SSBA algorithm generates decision boundary points on multiple orders of magnitude in the same given time when we compare to a grid-based approach. In general, our method provides a simple and effective model-agnostic method that can compute nearest feasible (i.e. realistic with constraints) counterfactual explanations. All of our results and code are available at: https://github.com/dsin85691/SSBA_For_Counterfactuals
comment: This paper is 15 pages long consisting of multiple sections including an abstract, introduction, related works, methodology, results, ablation studies, conclusion, future works, and an appendix section. There are 10 figures and 5 tables in total
♻ ☆ Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions auto-regressively in a fixed left-to-right order or attach separate MLP or diffusion heads outside the backbone, leading to fragmented information pathways and specialized training requirements that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a unified-transformer policy that models discretized action chunks with discrete diffusion. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary re-masking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pre-trained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves 96.3% avg. success rates on LIBERO, 71.2% visual matching on SimplerEnv-Fractal and 54.2% overall on SimplerEnv-Bridge, improving over autoregressive, MLP decoder and continuous diffusion baselines. These findings indicate that discrete-diffusion VLA supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets. Our project page is https://github.com/Liang-ZX/DiscreteDiffusionVLA
comment: 16 pages
♻ ☆ The Importance of Being Discrete: Measuring the Impact of Discretization in End-to-End Differentially Private Synthetic Data
Differentially Private (DP) generative marginal models are often used in the wild to release synthetic tabular datasets in lieu of sensitive data while providing formal privacy guarantees. These models approximate low-dimensional marginals or query workloads; crucially, they require the training data to be pre-discretized, i.e., continuous values need to first be partitioned into bins. However, as the range of values (or their domain) is often inferred directly from the training data, with the number of bins and bin edges typically defined arbitrarily, this approach can ultimately break end-to-end DP guarantees and may not always yield optimal utility. In this paper, we present an extensive measurement study of four discretization strategies in the context of DP marginal generative models. More precisely, we design DP versions of three discretizers (uniform, quantile, and k-means) and reimplement the PrivTree algorithm. We find that optimizing both the choice of discretizer and bin count can improve utility, on average, by almost 30% across six DP marginal models, compared to the default strategy and number of bins, with PrivTree being the best-performing discretizer in the majority of cases. We demonstrate that, while DP generative models with non-private discretization remain vulnerable to membership inference attacks, applying DP during discretization effectively mitigates this risk. Finally, we improve on an existing approach for automatically selecting the optimal number of bins, and achieve high utility while reducing both privacy budget consumption and computational overhead.
♻ ☆ The Formalism-Implementation Gap in Reinforcement Learning Research
The last decade has seen an upswing in interest and adoption of reinforcement learning (RL) techniques, in large part due to its demonstrated capabilities at performing certain tasks at "super-human levels". This has incentivized the community to prioritize research that demonstrates RL agent performance, often at the expense of research aimed at understanding their learning dynamics. Performance-focused research runs the risk of overfitting on academic benchmarks -- thereby rendering them less useful -- which can make it difficult to transfer proposed techniques to novel problems. Further, it implicitly diminishes work that does not push the performance-frontier, but aims at improving our understanding of these techniques. This paper argues two points: (i) RL research should stop focusing solely on demonstrating agent capabilities, and focus more on advancing the science and understanding of reinforcement learning; and (ii) we need to be more precise on how our benchmarks map to the underlying mathematical formalisms. We use the popular Arcade Learning Environment (ALE; Bellemare et al., 2013) as an example of a benchmark that, despite being increasingly considered "saturated", can be effectively used for developing this understanding, and facilitating the deployment of RL techniques in impactful real-world problems.
♻ ☆ Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training
Diffusion models have achieved remarkable success across a wide range of generative tasks. A key challenge is understanding the mechanisms that prevent their memorization of training data and allow generalization. In this work, we investigate the role of the training dynamics in the transition from generalization to memorization. Through extensive experiments and theoretical analysis, we identify two distinct timescales: an early time $\tau_\mathrm{gen}$ at which models begin to generate high-quality samples, and a later time $\tau_\mathrm{mem}$ beyond which memorization emerges. Crucially, we find that $\tau_\mathrm{mem}$ increases linearly with the training set size $n$, while $\tau_\mathrm{gen}$ remains constant. This creates a growing window of training times with $n$ where models generalize effectively, despite showing strong memorization if training continues beyond it. It is only when $n$ becomes larger than a model-dependent threshold that overfitting disappears at infinite training times. These findings reveal a form of implicit dynamical regularization in the training dynamics, which allow to avoid memorization even in highly overparameterized settings. Our results are supported by numerical experiments with standard U-Net architectures on realistic and synthetic datasets, and by a theoretical analysis using a tractable random features model studied in the high-dimensional limit.
comment: Accepted as an oral at Neurips 2025. 40 pages, 15 figures
♻ ☆ RWKV-edge: Deeply Compressed RWKV for Resource-Constrained Devices
To deploy LLMs on resource-contained platforms such as mobile robots and smartphones, non-transformers LLMs have achieved major breakthroughs. Recently, a novel RNN-based LLM family, Repentance Weighted Key Value (RWKV) has shown strong computational efficiency; nevertheless, RWKV models still have high parameter counts which limited their deployment. In this paper, we propose a suite of compression techniques, ranging from model architecture optimizations to post-training compression, tailored to the RWKV architecture. Combined, our techniques reduce the memory footprint of RWKV models by 3.4x -- 5x with only negligible degradation in accuracy; compared to transformer LLMs with similar accuracy, our models require 4x less memory footprint.
♻ ☆ Data Fusion of Deep Learned Molecular Embeddings for Property Prediction
Data-driven approaches such as deep learning can result in predictive models for material properties with exceptional accuracy and efficiency. However, in many applications, data is sparse, severely limiting their accuracy and applicability. To improve predictions, techniques such as transfer learning and multitask learning have been used. The performance of multitask learning models depends on the strength of the underlying correlations between tasks and the completeness of the data set. Standard multitask models tend to underperform when trained on sparse data sets with weakly correlated properties. To address this gap, we fuse deep-learned embeddings generated by independent pretrained single-task models, resulting in a multitask model that inherits rich, property-specific representations. By reusing (rather than retraining) these embeddings, the resulting fused model outperforms standard multitask models and can be extended with fewer trainable parameters. We demonstrate this technique on a widely used benchmark data set of quantum chemistry data for small molecules as well as a newly compiled sparse data set of experimental data collected from literature and our own quantum chemistry and thermochemical calculations.
comment: J. Chem. Inf. Model. 2025
♻ ☆ UniCrossFi: A Unified Framework For Cross-Domain Wi-Fi-based Gesture Recognition
Wi-Fi sensing systems are severely hindered by cross domain problem when deployed in unseen real-world environments. Existing methods typically design separate frameworks for either domain adaptation or domain generalization, often relying on extensive labeled data. Existing methods that designed for domain generalization is often relying on extensive labeled data. However, real-world scenarios are far more complex, where the deployed model must be capable of handling generalization under limited labeled source data. To this end, we propose UniCrossFi, a unified framework designed to mitigate performance drop in CSI-based sensing across diverse deployment settings. Our framework not only extends conventional Domain Generalization (DG) to a more practical Semi-Supervised Domain Generalization (SSDG) setting, where only partially labeled source data are available, but also introduces a physics-informed data augmentation strategy, Antenna Response Consistency (ARC). ARC mitigates the risk of learning superficial shortcuts by exploiting the intrinsic spatial diversity of multi-antenna systems, treating signals from different antennas as naturally augmented views of the same event. In addition, we design a Unified Contrastive Objective to prevent conventional contrastive learning from pushing apart samples from different domains that share the same class. We conduct extensive experiments on the public Widar and CSIDA datasets. The results demonstrate that UniCrossFi consistently establishes a new state-of-the-art, significantly outperforming existing methods across all unsupervised domain adaptation, DG, and SSDG benchmarks. UniCrossFi provides a principled and practical solution to the domain shift challenge, advancing the feasibility of robust, real-world Wi-Fi sensing systems that can operate effectively with limited labeled data.
♻ ☆ Telegrapher's Generative Model via Kac Flows
We break the mold in flow-based generative modeling by proposing a new model based on the damped wave equation, also known as telegrapher's equation. Similar to the diffusion equation and Brownian motion, there is a Feynman-Kac type relation between the telegrapher's equation and the stochastic Kac process in 1D. The Kac flow evolves stepwise linearly in time, so that the probability flow is Lipschitz continuous in the Wasserstein distance and, in contrast to diffusion flows, the norm of the velocity is globally bounded. Furthermore, the Kac model has the diffusion model as its asymptotic limit. We extend these considerations to a multi-dimensional stochastic process which consists of independent 1D Kac processes in each spatial component. We show that this process gives rise to an absolutely continuous curve in the Wasserstein space and compute the conditional velocity field starting in a Dirac point analytically. Using the framework of flow matching, we train a neural network that approximates the velocity field and use it for sample generation. Our numerical experiments demonstrate the scalability of our approach, and show its advantages over diffusion models.
comment: Update V2: We added CIFAR experiments. Update V3: The old FID scores & CIFAR images of the Kac model corresponded to the schedule g(t) = t. We now updated them with both schedules t and t^2. Update V4: We corrected a minor implementation error and updated the CIFAR images/table
♻ ☆ Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm
IIn this paper we propose and investigate a new class of Generalized Exponentiated Gradient (GEG) algorithms using Mirror Descent (MD) updates, and applying the Bregman divergence with a two--parameter deformation of the logarithm as a link function. This link function (referred here to as the Euler logarithm) is associated with a relatively wide class of trace--form entropies. In order to derive novel GEG/MD updates, we estimate a deformed exponential function, which closely approximates the inverse of the Euler two--parameter deformed logarithm. The characteristic shape and properties of the Euler logarithm and its inverse--deformed exponential functions, are tuned by two hyperparameters. By learning these hyperparameters, we can adapt to the distribution of training data and adjust them to achieve desired properties of gradient descent algorithms. In the literature, there exist nowadays more than fifty mathematically well-established entropic functionals and associated deformed logarithms, so it is impossible to investigate all of them in one research paper. Therefore, we focus here on a class of trace-form entropies and the associated deformed two--parameters logarithms.
comment: 10 pages, preprint of Journal paper
♻ ☆ Linear regression with overparameterized linear neural networks: Tight upper and lower bounds for implicit $\ell^1$-regularization
Modern machine learning models are often trained in a setting where the number of parameters exceeds the number of training samples. To understand the implicit bias of gradient descent in such overparameterized models, prior work has studied diagonal linear neural networks in the regression setting. These studies have shown that, when initialized with small weights, gradient descent tends to favor solutions with minimal $\ell^1$-norm - an effect known as implicit regularization. In this paper, we investigate implicit regularization in diagonal linear neural networks of depth $D\ge 2$ for overparameterized linear regression problems. We focus on analyzing the approximation error between the limit point of gradient flow trajectories and the solution to the $\ell^1$-minimization problem. By deriving tight upper and lower bounds on the approximation error, we precisely characterize how the approximation error depends on the scale of initialization $\alpha$. Our results reveal a qualitative difference between depths: for $D \ge 3$, the error decreases linearly with $\alpha$, whereas for $D=2$, it decreases at rate $\alpha^{1-\varrho}$, where the parameter $\varrho \in [0,1)$ can be explicitly characterized. Interestingly, this parameter is closely linked to so-called null space property constants studied in the sparse recovery literature. We demonstrate the asymptotic tightness of our bounds through explicit examples. Numerical experiments corroborate our theoretical findings and suggest that deeper networks, i.e., $D \ge 3$, may lead to better generalization, particularly for realistic initialization scales.
♻ ☆ Assessing the robustness of heterogeneous treatment effects in survival analysis under informative censoring
Dropout is common in clinical studies, with up to half of patients leaving early due to side effects or other reasons. When dropout is informative (i.e., dependent on survival time), it introduces censoring bias, because of which treatment effect estimates are also biased. In this paper, we propose an assumption-lean framework to assess the robustness of conditional average treatment effect (CATE) estimates in survival analysis when facing censoring bias. Unlike existing works that rely on strong assumptions, such as non-informative censoring, to obtain point estimation, we use partial identification to derive informative bounds on the CATE. Thereby, our framework helps to identify patient subgroups where treatment is effective despite informative censoring. We further develop a novel meta-learner that estimates the bounds using arbitrary machine learning models and with favorable theoretical properties, including double robustness and quasi-oracle efficiency. We demonstrate the practical value of our meta-learner through numerical experiments and in an application to a cancer drug trial. Together, our framework offers a practical tool for assessing the robustness of estimated treatment effects in the presence of censoring and thus promotes the reliable use of survival data for evidence generation in medicine and epidemiology.
♻ ☆ Diffusion Models Meet Contextual Bandits
Efficient online decision-making in contextual bandits is challenging, as methods without informative priors often suffer from computational or statistical inefficiencies. In this work, we leverage pre-trained diffusion models as expressive priors to capture complex action dependencies and develop a practical algorithm that efficiently approximates posteriors under such priors, enabling both fast updates and sampling. Empirical results demonstrate the effectiveness and versatility of our approach across diverse contextual bandit settings.
comment: Neurips 2025
♻ ☆ Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning NeurIPS 2025
Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities in classification and retrieval. However, these models often struggle with compositional reasoning - the ability to understand the relationships between concepts. A recent benchmark, SugarCrepe++, reveals that previous works on improving compositionality have mainly improved lexical sensitivity but neglected semantic understanding. In addition, downstream retrieval performance often deteriorates, although one would expect that improving compositionality should enhance retrieval. In this work, we introduce CLIC (Compositionally-aware Learning in CLIP), a fine-tuning method based on a novel training technique combining multiple images and their associated captions. CLIC improves compositionality across architectures as well as differently pre-trained CLIP models, both in terms of lexical and semantic understanding, and achieves consistent gains in retrieval performance. This even applies to the recent CLIPS, which achieves SOTA retrieval performance. Nevertheless, the short fine-tuning with CLIC leads to an improvement in retrieval and to the best compositional CLIP model on SugarCrepe++. All our models and code are available at https://clic-compositional-clip.github.io
comment: Accepted at NeurIPS 2025
♻ ☆ Provable Scaling Laws for the Test-Time Compute of Large Language Models NeurIPS 2025
We propose two simple, principled and practical algorithms that enjoy provable scaling laws for the test-time compute of large language models (LLMs). The first one is a two-stage knockout-style algorithm: given an input problem, it first generates multiple candidate solutions, and then aggregate them via a knockout tournament for the final output. Assuming that the LLM can generate a correct solution with non-zero probability and do better than a random guess in comparing a pair of correct and incorrect solutions, we prove theoretically that the failure probability of this algorithm decays to zero exponentially or by a power law (depending on the specific way of scaling) as its test-time compute grows. The second one is a two-stage league-style algorithm, where each candidate is evaluated by its average win rate against multiple opponents, rather than eliminated upon loss to a single opponent. Under analogous but more robust assumptions, we prove that its failure probability also decays to zero exponentially with more test-time compute. Both algorithms require a black-box LLM and nothing else (e.g., no verifier or reward model) for a minimalistic implementation, which makes them appealing for practical applications and easy to adapt for different tasks. Through extensive experiments with diverse models and datasets, we validate the proposed theories and demonstrate the outstanding scaling properties of both algorithms.
comment: NeurIPS 2025 camera-ready version
♻ ☆ Robust Point Cloud Reinforcement Learning via PCA-Based Canonicalization
Reinforcement Learning (RL) from raw visual input has achieved impressive successes in recent years, yet it remains fragile to out-of-distribution variations such as changes in lighting, color, and viewpoint. Point Cloud Reinforcement Learning (PC-RL) offers a promising alternative by mitigating appearance-based brittleness, but its sensitivity to camera pose mismatches continues to undermine reliability in realistic settings. To address this challenge, we propose PCA Point Cloud (PPC), a canonicalization framework specifically tailored for downstream robotic control. PPC maps point clouds under arbitrary rigid-body transformations to a unique canonical pose, aligning observations to a consistent frame, thereby substantially decreasing viewpoint-induced inconsistencies. In our experiments, we show that PPC improves robustness to unseen camera poses across challenging robotic tasks, providing a principled alternative to domain randomization.
♻ ☆ Offline Learning and Forgetting for Reasoning with Large Language Models
Leveraging inference-time search in large language models has proven effective in further enhancing a trained model's capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it on unpaired successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. A key challenge we identify is that naive fine-tuning can degrade the model's search capability; we show this can be mitigated with a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown arithmetic puzzles show that, replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180$\times$. On top of this, our learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.
comment: Published in Transactions on Machine Learning Research (TMLR), 2025. Code: https://github.com/twni2016/llm-reasoning-uft
♻ ☆ FraudTransformer: Time-Aware GPT for Transaction Fraud Detection
Detecting payment fraud in real-world banking streams requires models that can exploit both the order of events and the irregular time gaps between them. We introduce FraudTransformer, a sequence model that augments a vanilla GPT-style architecture with (i) a dedicated time encoder that embeds either absolute timestamps or inter-event values, and (ii) a learned positional encoder that preserves relative order. Experiments on a large industrial dataset -- tens of millions of transactions and auxiliary events -- show that FraudTransformer surpasses four strong classical baselines (Logistic Regression, XGBoost and LightGBM) as well as transformer ablations that omit either the time or positional component. On the held-out test set it delivers the highest AUROC and PRAUC.
comment: Accepted in AI-FIND ICAIF'25 (https://sites.google.com/view/icaif-fraud-detection-workshop/home)
♻ ☆ ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression NeurIPS 2025
The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
comment: Accepted at NeurIPS 2025 (oral)
♻ ☆ Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning
We introduce a simple, yet novel entropy-based framework to drive token efficiency in large language models during reasoning tasks. Our approach uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, achieving 25-50% computational savings while maintaining task accuracy. Crucially, we demonstrate that entropy-based confidence calibration represents an emergent property of advanced post-training optimization present in modern reasoning models but notably absent in standard instruction-tuned and pre-trained models (Llama 3.3 70B). We show that the entropy threshold to stop reasoning varies from model to model but can be calculated easily in one shot using only a few examples from existing reasoning datasets. Our results indicate that advanced reasoning models often know that they've gotten a correct answer early on, and that this emergent confidence awareness can be exploited to save tokens and reduce latency. The framework demonstrates consistent performance across reasoning-optimized model families with 25-50% computational cost reduction while preserving accuracy, revealing that confidence mechanisms represent a distinguishing characteristic of modern post-trained reasoning systems versus their predecessors.
♻ ☆ LittleBit: Ultra Low-Bit Quantization via Latent Factorization NeurIPS 2025
Deploying large language models (LLMs) often faces challenges from substantial memory and computational costs. Quantization offers a solution, yet performance degradation in the sub-1-bit regime remains particularly difficult. This paper introduces LittleBit, a novel method for extreme LLM compression. It targets levels like 0.1 bits per weight (BPW), achieving nearly 31$\times$ memory reduction, e.g., Llama2-13B to under 0.9 GB. LittleBit represents weights in a low-rank form using latent matrix factorization, subsequently binarizing these factors. To counteract information loss from this extreme precision, it integrates a multi-scale compensation mechanism. This includes row, column, and an additional latent dimension that learns per-rank importance. Two key contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and integrated Residual Compensation to mitigate errors. Extensive experiments confirm LittleBit's superiority in sub-1-bit quantization: e.g., its 0.1 BPW performance on Llama2-7B surpasses the leading method's 0.7 BPW. LittleBit establishes a new, viable size-performance trade-off--unlocking a potential 11.6$\times$ speedup over FP16 at the kernel level--and makes powerful LLMs practical for resource-constrained environments.
comment: Accepted to NeurIPS 2025. Banseok Lee and Dongkyu Kim contributed equally
♻ ☆ Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation NeurIPS 2025
Recent progress in Sign Language Translation (SLT) has focussed primarily on improving the representational capacity of large language models to incorporate Sign Language features. This work explores an alternative direction: enhancing the geometric properties of skeletal representations themselves. We propose Geo-Sign, a method that leverages the properties of hyperbolic geometry to model the hierarchical structure inherent in sign language kinematics. By projecting skeletal features derived from Spatio-Temporal Graph Convolutional Networks (ST-GCNs) into the Poincar\'e ball model, we aim to create more discriminative embeddings, particularly for fine-grained motions like finger articulations. We introduce a hyperbolic projection layer, a weighted Fr\'echet mean aggregation scheme, and a geometric contrastive loss operating directly in hyperbolic space. These components are integrated into an end-to-end translation framework as a regularisation function, to enhance the representations within the language model. This work demonstrates the potential of hyperbolic geometry to improve skeletal representations for Sign Language Translation, improving on SOTA RGB methods while preserving privacy and improving computational efficiency. Code available here: https://github.com/ed-fish/geo-sign.
comment: Accepted to NeurIPS 2025
♻ ☆ Trade-offs in Data Memorization via Strong Data Processing Inequalities COLT 2025
Recent research demonstrated that training large language models involves memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sensitive user data and thus motivates the study of data memorization's role in learning. In this work, we develop a general approach for proving lower bounds on excess data memorization, that relies on a new connection between strong data processing inequalities and data memorization. We then demonstrate that several simple and natural binary classification problems exhibit a trade-off between the number of samples available to a learning algorithm, and the amount of information about the training data that a learning algorithm needs to memorize to be accurate. In particular, $\Omega(d)$ bits of information about the training data need to be memorized when $O(1)$ $d$-dimensional examples are available, which then decays as the number of examples grows at a problem-specific rate. Further, our lower bounds are generally matched (up to logarithmic factors) by simple learning algorithms. We also extend our lower bounds to more general mixture-of-clusters models. Our definitions and results build on the work of Brown et al. (2021) and address several limitations of the lower bounds in their work.
comment: Appeared in COLT 2025; this revision includes an improved upper bound in Theorem 3.1, as well as several minor clarifications and modifications
♻ ☆ Acoustic and Machine Learning Methods for Speech-Based Suicide Risk Assessment: A Systematic Review
Suicide remains a public health challenge, necessitating improved detection methods to facilitate timely intervention and treatment. This systematic review evaluates the role of Artificial Intelligence (AI) and Machine Learning (ML) in assessing suicide risk through acoustic analysis of speech. Following PRISMA guidelines, we analyzed 33 articles selected from PubMed, Cochrane, Scopus, and Web of Science databases. The last search was conducted in February 2025. Risk of bias was assessed using the PROBAST tool. Studies analyzing acoustic features between individuals at risk of suicide (RS) and those not at risk (NRS) were included, while studies lacking acoustic data, a suicide-related focus, or sufficient methodological details were excluded. Sample sizes varied widely and were reported in terms of participants or speech segments, depending on the study. Results were synthesized narratively based on acoustic features and classifier performance. Findings consistently showed significant acoustic feature variations between RS and NRS populations, particularly involving jitter, fundamental frequency (F0), Mel-frequency cepstral coefficients (MFCC), and power spectral density (PSD). Classifier performance varied based on algorithms, modalities, and speech elicitation methods, with multimodal approaches integrating acoustic, linguistic, and metadata features demonstrating superior performance. Among the 29 classifier-based studies, reported AUC values ranged from 0.62 to 0.985 and accuracies from 60% to 99.85%. Most datasets were imbalanced in favor of NRS, and performance metrics were rarely reported separately by group, limiting clear identification of direction of effect.
comment: Preprint version of a manuscript submitted to the Journal of Affective Disorders
♻ ☆ MH-GIN: Multi-scale Heterogeneous Graph-based Imputation Network for AIS Data (Extended Version) VLDB 2026
Location-tracking data from the Automatic Identification System, much of which is publicly available, plays a key role in a range of maritime safety and monitoring applications. However, the data suffers from missing values that hamper downstream applications. Imputing the missing values is challenging because the values of different heterogeneous attributes are updated at diverse rates, resulting in the occurrence of multi-scale dependencies among attributes. Existing imputation methods that assume similar update rates across attributes are unable to capture and exploit such dependencies, limiting their imputation accuracy. We propose MH-GIN, a Multi-scale Heterogeneous Graph-based Imputation Network that aims improve imputation accuracy by capturing multi-scale dependencies. Specifically, MH-GIN first extracts multi-scale temporal features for each attribute while preserving their intrinsic heterogeneous characteristics. Then, it constructs a multi-scale heterogeneous graph to explicitly model dependencies between heterogeneous attributes to enable more accurate imputation of missing values through graph propagation. Experimental results on two real-world datasets find that MH-GIN is capable of an average 57% reduction in imputation errors compared to state-of-the-art methods, while maintaining computational efficiency. The source code and implementation details of MH-GIN are publicly available https://github.com/hyLiu1994/MH-GIN.
comment: 18 pages, 4 figures; This paper is accepted by PVLDB 2026
♻ ☆ Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach
While several high profile video games have served as testbeds for Deep Reinforcement Learning (DRL), this technique has rarely been employed by the game industry for crafting authentic AI behaviors. Previous research focuses on training super-human agents with large models, which is impractical for game studios with limited resources aiming for human-like agents. This paper proposes a sample-efficient DRL method tailored for training and fine-tuning agents in industrial settings such as the video game industry. Our method improves sample efficiency of value-based DRL by leveraging pre-collected data and increasing network plasticity. We evaluate our method training a goalkeeper agent in EA SPORTS FC 25, one of the best-selling football simulations today. Our agent outperforms the game's built-in AI by 10% in ball saving rate. Ablation studies show that our method trains agents 50% faster compared to standard DRL methods. Finally, qualitative evaluation from domain experts indicates that our approach creates more human-like gameplay compared to hand-crafted agents. As a testimony of the impact of the approach, the method is intended to replace the hand-crafted counterpart in next iterations of the series.
♻ ☆ The Logical Expressiveness of Temporal GNNs via Two-Dimensional Product Logics
In recent years, the expressive power of various neural architectures -- including graph neural networks (GNNs), transformers, and recurrent neural networks -- has been characterised using tools from logic and formal language theory. As the capabilities of basic architectures are becoming well understood, increasing attention is turning to models that combine multiple architectural paradigms. Among them particularly important, and challenging to analyse, are temporal extensions of GNNs, which integrate both spatial (graph-structure) and temporal (evolution over time) dimensions. In this paper, we initiate the study of logical characterisation of temporal GNNs by connecting them to two-dimensional product logics. We show that the expressive power of temporal GNNs depends on how graph and temporal components are combined. In particular, temporal GNNs that apply static GNNs recursively over time can capture all properties definable in the product logic of (past) propositional temporal logic PTL and the modal logic K. In contrast, architectures such as graph-and-time TGNNs and global TGNNs can only express restricted fragments of this logic, where the interaction between temporal and spatial operators is syntactically constrained. These provide us with the first results on the logical expressiveness of temporal GNNs.
♻ ☆ MixAT: Combining Continuous and Discrete Adversarial Training for LLMs NeurIPS 2025
Despite recent efforts in Large Language Model (LLM) safety and alignment, current adversarial attacks on frontier LLMs can still consistently force harmful generations. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. At the same time, despite their effectiveness and generalization capabilities, training with continuous perturbations does not always capture the full spectrum of vulnerabilities exploited by discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at https://github.com/insait-institute/MixAT.
comment: Published at 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ SeeDNorm: Self-Rescaled Dynamic Normalization
Normalization layer constitutes an essential component in neural networks. In transformers, the predominantly used RMSNorm constrains vectors to a unit hypersphere, followed by dimension-wise rescaling through a learnable scaling coefficient $\gamma$ to maintain the representational capacity of the model. However, RMSNorm discards the input norm information in forward pass and a static scaling factor $\gamma$ may be insufficient to accommodate the wide variability of input data and distributional shifts, thereby limiting further performance improvements, particularly in zero-shot scenarios that large language models routinely encounter. To address this limitation, we propose SeeDNorm, which enhances the representational capability of the model by dynamically adjusting the scaling coefficient based on the current input, thereby preserving the input norm information and enabling data-dependent, self-rescaled dynamic normalization. During backpropagation, SeeDNorm retains the ability of RMSNorm to dynamically adjust gradient according to the input norm. We provide a detailed analysis of the training optimization for SeedNorm and proposed corresponding solutions to address potential instability issues that may arise when applying SeeDNorm. We validate the effectiveness of SeeDNorm across models of varying sizes in large language model pre-training as well as supervised and unsupervised computer vision tasks. By introducing a minimal number of parameters and with neglligible impact on model efficiency, SeeDNorm achieves consistently superior performance compared to previously commonly used normalization layers such as RMSNorm and LayerNorm, as well as element-wise activation alternatives to normalization layers like DyT.
comment: 31 pages, 14 figures, 18 tables
♻ ☆ DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment NeurIPS 2025
How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through the overlaying of multiple model variants quantized to different bitwidths. Meanwhile, an important question still remains open-ended: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer dynamically changes across decoding steps. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.
comment: NeurIPS 2025
♻ ☆ Minimax Optimal Transfer Learning for Kernel-based Nonparametric Regression
In recent years, transfer learning has garnered significant attention in the machine learning community. Its ability to leverage knowledge from related studies to improve generalization performance in a target study has made it highly appealing. This paper focuses on investigating the transfer learning problem within the context of nonparametric regression over a reproducing kernel Hilbert space. The aim is to bridge the gap between practical effectiveness and theoretical guarantees. We specifically consider two scenarios: one where the transferable sources are known and another where they are unknown. For the known transferable source case, we propose a two-step kernel-based estimator by solely using kernel ridge regression. For the unknown case, we develop a novel method based on an efficient aggregation algorithm, which can automatically detect and alleviate the effects of negative sources. This paper provides the statistical properties of the desired estimators and establishes the minimax optimal rate. Through extensive numerical experiments on synthetic data and real examples, we validate our theoretical findings and demonstrate the effectiveness of our proposed method.
♻ ☆ Is It Certainly a Deepfake? Reliability Analysis in Detection & Generation Ecosystem ICCV 2025
As generative models are advancing in quality and quantity for creating synthetic content, deepfakes begin to cause online mistrust. Deepfake detectors are proposed to counter this effect, however, misuse of detectors claiming fake content as real or vice versa further fuels this misinformation problem. We present the first comprehensive uncertainty analysis of deepfake detectors, systematically investigating how generative artifacts influence prediction confidence. As reflected in detectors' responses, deepfake generators also contribute to this uncertainty as their generative residues vary, so we cross the uncertainty analysis of deepfake detectors and generators. Based on our observations, the uncertainty manifold holds enough consistent information to leverage uncertainty for deepfake source detection. Our approach leverages Bayesian Neural Networks and Monte Carlo dropout to quantify both aleatoric and epistemic uncertainties across diverse detector architectures. We evaluate uncertainty on two datasets with nine generators, with four blind and two biological detectors, compare different uncertainty methods, explore region- and pixel-based uncertainty, and conduct ablation studies. We conduct and analyze binary real/fake, multi-class real/fake, source detection, and leave-one-out experiments between the generator/detector combinations to share their generalization capability, model calibration, uncertainty, and robustness against adversarial attacks. We further introduce uncertainty maps that localize prediction confidence at the pixel level, revealing distinct patterns correlated with generator-specific artifacts. Our analysis provides critical insights for deploying reliable deepfake detection systems and establishes uncertainty quantification as a fundamental requirement for trustworthy synthetic media detection.
comment: Accepted for publication at the ICCV 2025 workshop - STREAM
♻ ☆ Clustering-Based Low-Rank Matrix Approximation for Medical Image Compression
Medical images are inherently high-resolution and contain locally varying structures crucial for diagnosis. Efficient compression must preserve diagnostic fidelity while minimizing redundancy. Low-rank matrix approximation (LoRMA) techniques have shown strong potential for image compression by capturing global correlations; however, they often fail to adapt to local structural variations across regions of interest. To address this, we introduce an adaptive LoRMA, which partitions a medical image into overlapping patches, groups structurally similar patches into clusters using k-means, and performs SVD within each cluster. We derive the overall compression factor accounting for patch overlap and analyze how patch size influences compression efficiency and computational cost. While applicable to any data with high local variation, we focus on medical imaging due to its pronounced local variability. We evaluate and compare our adaptive LoRMA against global SVD across four imaging modalities: MRI, ultrasound, CT scan, and chest X-ray. Results demonstrate that adaptive LoRMA effectively preserves structural integrity, edge details, and diagnostic relevance, measured by PSNR, SSIM, MSE, IoU, and EPI. Adaptive LoRMA minimizes block artifacts and residual errors, particularly in pathological regions, consistently outperforming global SVD in PSNR, SSIM, IoU, EPI, and achieving lower MSE. It prioritizes clinically salient regions while allowing aggressive compression in non-critical regions, optimizing storage efficiency. Although adaptive LoRMA requires higher processing time, its diagnostic fidelity justifies the overhead for high-compression applications.
♻ ☆ MARS-M: When Variance Reduction Meets Matrices
Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). On the other hand, recent benchmarks on optimizers for LLM pre-training have demonstrated that variance-reduction techniques such as MARS can achieve substantial speedups over standard optimizers that do not employ variance reduction. In this paper, to achieve the best of both worlds, we introduce MARS-M, a new optimizer that integrates the variance reduction technique in MARS with Muon. Under standard regularity conditions, we prove that Muon-M converges to a first-order stationary point at a rate of $\tilde{\mathcal{O}}(T^{-1/3})$, which improves upon $\tilde{\mathcal{O}}(T^{-1/4})$ rate attained by Muon. Our empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at https://github.com/AGI-Arena/MARS/tree/main/MARS_M.
♻ ☆ Geometric Mixture Models for Electrolyte Conductivity Prediction
Accurate prediction of ionic conductivity in electrolyte systems is crucial for advancing numerous scientific and technological applications. While significant progress has been made, current research faces two fundamental challenges: (1) the lack of high-quality standardized benchmarks, and (2) inadequate modeling of geometric structure and intermolecular interactions in mixture systems. To address these limitations, we first reorganize and enhance the CALiSol and DiffMix electrolyte datasets by incorporating geometric graph representations of molecules. We then propose GeoMix, a novel geometry-aware framework that preserves Set-SE(3) equivariance-an essential but challenging property for mixture systems. At the heart of GeoMix lies the Geometric Interaction Network (GIN), an equivariant module specifically designed for intermolecular geometric message passing. Comprehensive experiments demonstrate that GeoMix consistently outperforms diverse baselines (including MLPs, GNNs, and geometric GNNs) across both datasets, validating the importance of cross-molecular geometric interactions and equivariant message passing for accurate property prediction. This work not only establishes new benchmarks for electrolyte research but also provides a general geometric learning framework that advances modeling of mixture systems in energy materials, pharmaceutical development, and beyond.
♻ ☆ Taxonomy and Trends in Reinforcement Learning for Robotics and Control Systems: A Structured Review
Reinforcement learning (RL) has become a foundational approach for enabling intelligent robotic behavior in dynamic and uncertain environments. This work presents an in-depth review of RL principles, advanced deep reinforcement learning (DRL) algorithms, and their integration into robotic and control systems. Beginning with the formalism of Markov Decision Processes (MDPs), the study outlines essential elements of the agent-environment interaction and explores core algorithmic strategies including actor-critic methods, value-based learning, and policy gradients. Emphasis is placed on modern DRL techniques such as DDPG, TD3, PPO, and SAC, which have shown promise in solving high-dimensional, continuous control tasks. A structured taxonomy is introduced to categorize RL applications across domains such as locomotion, manipulation, multi-agent coordination, and human-robot interaction, along with training methodologies and deployment readiness levels. The review synthesizes recent research efforts, highlighting technical trends, design patterns, and the growing maturity of RL in real-world robotics. Overall, this work aims to bridge theoretical advances with practical implementations, providing a consolidated perspective on the evolving role of RL in autonomous robotic systems.
♻ ☆ URB -- Urban Routing Benchmark for RL-equipped Connected Autonomous Vehicles NeurIPS 2025
Connected Autonomous Vehicles (CAVs) promise to reduce congestion in future urban networks, potentially by optimizing their routing decisions. Unlike for human drivers, these decisions can be made with collective, data-driven policies, developed using machine learning algorithms. Reinforcement learning (RL) can facilitate the development of such collective routing strategies, yet standardized and realistic benchmarks are missing. To that end, we present URB: Urban Routing Benchmark for RL-equipped Connected Autonomous Vehicles. URB is a comprehensive benchmarking environment that unifies evaluation across 29 real-world traffic networks paired with realistic demand patterns. URB comes with a catalog of predefined tasks, multi-agent RL (MARL) algorithm implementations, three baseline methods, domain-specific performance metrics, and a modular configuration scheme. Our results show that, despite the lengthy and costly training, state-of-the-art MARL algorithms rarely outperformed humans. The experimental results reported in this paper initiate the first leaderboard for MARL in large-scale urban routing optimization. They reveal that current approaches struggle to scale, emphasizing the urgent need for advancements in this domain.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), Datasets and Benchmarks Track
♻ ☆ Do Language Models Use Their Depth Efficiently? NeurIPS 2025
Modern LLMs are increasingly deep, and depth correlates with performance, albeit with diminishing returns. However, do these models use their depth efficiently? Do they compose more features to create higher-order computations that are impossible in shallow models, or do they merely spread the same kinds of computation out over more layers? To address these questions, we analyze the residual stream of the Llama 3.1, Qwen 3, and OLMo 2 family of models. We find: First, comparing the output of the sublayers to the residual stream reveals that layers in the second half contribute much less than those in the first half, with a clear phase transition between the two halves. Second, skipping layers in the second half has a much smaller effect on future computations and output predictions. Third, for multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults in examples involving many hops. Fourth, we seek to directly address whether deeper models are using their additional layers to perform new kinds of computation. To do this, we train linear maps from the residual stream of a shallow model to a deeper one. We find that layers with the same relative depth map best to each other, suggesting that the larger model simply spreads the same computations out over its many layers. All this evidence suggests that deeper models are not using their depth to learn new kinds of computation, but only using the greater depth to perform more fine-grained adjustments to the residual. This may help explain why increasing scale leads to diminishing returns for stacked Transformer architectures.
comment: Accepted to NeurIPS 2025
♻ ☆ Unlearning Comparator: A Visual Analytics System for Comparative Evaluation of Machine Unlearning Methods
Machine Unlearning (MU) aims to remove target training data from a trained model so that the removed data no longer influences the model's behavior, fulfilling "right to be forgotten" obligations under data privacy laws. Yet, we observe that researchers in this rapidly emerging field face challenges in analyzing and understanding the behavior of different MU methods, especially in terms of three fundamental principles in MU: accuracy, efficiency, and privacy. Consequently, they often rely on aggregate metrics and ad-hoc evaluations, making it difficult to accurately assess the trade-offs between methods. To fill this gap, we introduce a visual analytics system, Unlearning Comparator, designed to facilitate the systematic evaluation of MU methods. Our system supports two important tasks in the evaluation process: model comparison and attack simulation. First, it allows the user to compare the behaviors of two models, such as a model generated by a certain method and a retrained baseline, at class-, instance-, and layer-levels to better understand the changes made after unlearning. Second, our system simulates membership inference attacks (MIAs) to evaluate the privacy of a method, where an attacker attempts to determine whether specific data samples were part of the original training set. We evaluate our system through a case study visually analyzing prominent MU methods and demonstrate that it helps the user not only understand model behaviors but also gain insights that can inform the improvement of MU methods. The source code is publicly available at https://github.com/gnueaj/Machine-Unlearning-Comparator.
comment: Submitted to IEEE Transactions on Visualization and Computer Graphics (TVCG), under review. 15 pages. This work has been submitted to the IEEE for possible publication
♻ ☆ Two-Stage Learning of Stabilizing Neural Controllers via Zubov Sampling and Iterative Domain Expansion NeurIPS 2025
Learning-based neural network (NN) control policies have shown impressive empirical performance. However, obtaining stability guarantees and estimates of the region of attraction of these learned neural controllers is challenging due to the lack of stable and scalable training and verification algorithms. Although previous works in this area have achieved great success, much conservatism remains in their frameworks. In this work, we propose a novel two-stage training framework to jointly synthesize a controller and a Lyapunov function for continuous-time systems. By leveraging a Zubov-inspired region of attraction characterization to directly estimate stability boundaries, we propose a novel training-data sampling strategy and a domain-updating mechanism that significantly reduces the conservatism in training. Moreover, unlike existing works on continuous-time systems that rely on an SMT solver to formally verify the Lyapunov condition, we extend state-of-the-art neural network verifier $\alpha,\!\beta$-CROWN with the capability of performing automatic bound propagation through the Jacobian of dynamical systems and a novel verification scheme that avoids expensive bisection. To demonstrate the effectiveness of our approach, we conduct numerical experiments by synthesizing and verifying controllers on several challenging nonlinear systems across multiple dimensions. We show that our training can yield region of attractions with volume $5 - 1.5\cdot 10^{5}$ times larger compared to the baselines, and our verification on continuous systems can be up to $40-10{,}000$ times faster compared to the traditional SMT solver dReal. Our code is available at https://github.com/Verified-Intelligence/Two-Stage_Neural_Controller_Training.
comment: NeurIPS 2025
♻ ☆ Pairwise Optimal Transports for Training All-to-All Flow-Based Condition Transfer Model NeurIPS 2025
In this paper, we propose a flow-based method for learning all-to-all transfer maps among conditional distributions that approximates pairwise optimal transport. The proposed method addresses the challenge of handling the case of continuous conditions, which often involve a large set of conditions with sparse empirical observations per condition. We introduce a novel cost function that enables simultaneous learning of optimal transports for all pairs of conditional distributions. Our method is supported by a theoretical guarantee that, in the limit, it converges to the pairwise optimal transports among infinite pairs of conditional distributions. The learned transport maps are subsequently used to couple data points in conditional flow matching. We demonstrate the effectiveness of this method on synthetic and benchmark datasets, as well as on chemical datasets in which continuous physical properties are defined as conditions. The code for this project can be found at https://github.com/kotatumuri-room/A2A-FM
comment: Accepted at NeurIPS 2025, 32 pages, 18 figures
♻ ☆ Scalable Exploration via Ensemble++ NeurIPS 2025
Thompson Sampling is a principled method for balancing exploration and exploitation, but its real-world adoption faces computational challenges in large-scale or non-conjugate settings. While ensemble-based approaches offer partial remedies, they typically require prohibitively large ensemble sizes. We propose Ensemble++, a scalable exploration framework using a novel shared-factor ensemble architecture with random linear combinations. For linear bandits, we provide theoretical guarantees showing that Ensemble++ achieves regret comparable to exact Thompson Sampling with only $\Theta(d \log T)$ ensemble sizes--significantly outperforming prior methods. Crucially, this efficiency holds across both compact and finite action sets with either time-invariant or time-varying contexts without configuration changes. We extend this theoretical foundation to nonlinear rewards by replacing fixed features with learnable neural representations while preserving the same incremental update principle, effectively bridging theory and practice for real-world tasks. Comprehensive experiments across linear, quadratic, neural, and GPT-based contextual bandits validate our theoretical findings and demonstrate Ensemble++'s superior regret-computation tradeoff versus state-of-the-art methods.
comment: NeurIPS 2025
♻ ☆ FragFM: Hierarchical Framework for Efficient Molecule Generation via Fragment-Level Discrete Flow Matching
We introduce FragFM, a novel hierarchical framework via fragment-level discrete flow matching for efficient molecular graph generation. FragFM generates molecules at the fragment level, leveraging a coarse-to-fine autoencoder to reconstruct details at the atom level. Together with a stochastic fragment bag strategy to effectively handle an extensive fragment space, our framework enables more efficient and scalable molecular generation. We demonstrate that our fragment-based approach achieves better property control than the atom-based method and additional flexibility through conditioning the fragment bag. We also propose a Natural Product Generation benchmark (NPGen) to evaluate modern molecular graph generative models' ability to generate natural product-like molecules. Since natural products are biologically prevalidated and differ from typical drug-like molecules, our benchmark provides a more challenging yet meaningful evaluation relevant to drug discovery. We conduct a FragFM comparative study against various models on diverse molecular generation benchmarks, including NPGen, demonstrating superior performance. The results highlight the potential of fragment-based generative modeling for large-scale, property-aware molecular design, paving the way for more efficient exploration of chemical space.
comment: 49 pages, 29 figures, under review
♻ ☆ PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning NeurIPS 2025
In recent years, unlearning techniques, which are methods for inducing a model to "forget" previously learned information, have attracted attention as a way to address privacy and copyright concerns in large language models (LLMs) and large multimodal models (LMMs). While several unlearning benchmarks have been established for LLMs, a practical evaluation framework for unlearning in LMMs has been less explored. Specifically, existing unlearning benchmark for LMMs considers only scenarios in which the model is required to unlearn fine-tuned knowledge through a single unlearning operation. In this study, we introduce PULSE protocol for realistic unlearning scenarios for LMMs by introducing two critical perspectives: (i) Pre-trained knowledge Unlearning for analyzing the effect across different knowledge acquisition phases and (ii) Long-term Sustainability Evaluation to address sequential requests. We then evaluate existing unlearning methods along these dimensions. Our results reveal that, although some techniques can successfully unlearn knowledge acquired through fine-tuning, they struggle to eliminate information learned during pre-training. Moreover, methods that effectively unlearn a batch of target data in a single operation exhibit substantial performance degradation when the same data are split and unlearned sequentially.
comment: Accepted at NeurIPS 2025 Workshop: Evaluating the Evolving LLM Lifecycle
♻ ☆ Causal Spatio-Temporal Prediction: An Effective and Efficient Multi-Modal Approach
Spatio-temporal prediction plays a crucial role in intelligent transportation, weather forecasting, and urban planning. While integrating multi-modal data has shown potential for enhancing prediction accuracy, key challenges persist: (i) inadequate fusion of multi-modal information, (ii) confounding factors that obscure causal relations, and (iii) high computational complexity of prediction models. To address these challenges, we propose E^2-CSTP, an Effective and Efficient Causal multi-modal Spatio-Temporal Prediction framework. E^2-CSTP leverages cross-modal attention and gating mechanisms to effectively integrate multi-modal data. Building on this, we design a dual-branch causal inference approach: the primary branch focuses on spatio-temporal prediction, while the auxiliary branch mitigates bias by modeling additional modalities and applying causal interventions to uncover true causal dependencies. To improve model efficiency, we integrate GCN with the Mamba architecture for accelerated spatio-temporal encoding. Extensive experiments on 4 real-world datasets show that E^2-CSTP significantly outperforms 9 state-of-the-art methods, achieving up to 9.66% improvements in accuracy as well as 17.37%-56.11% reductions in computational overhead.
♻ ☆ Rademacher Meets Colors: More Expressivity, but at What Cost ?
The expressive power of graph neural networks (GNNs) is typically understood through their correspondence with graph isomorphism tests such as the Weisfeiler-Leman (WL) hierarchy. While more expressive GNNs can distinguish a richer set of graphs, they are also observed to suffer from higher generalization error. This work provides a theoretical explanation for this trade-off by linking expressivity and generalization through the lens of coloring algorithms. Specifically, we show that the number of equivalence classes induced by WL colorings directly bounds the GNNs Rademacher complexity -- a key data-dependent measure of generalization. Our analysis reveals that greater expressivity leads to higher complexity and thus weaker generalization guarantees. Furthermore, we prove that the Rademacher complexity is stable under perturbations in the color counts across different samples, ensuring robustness to sampling variability across datasets. Importantly, our framework is not restricted to message-passing GNNs or 1-WL, but extends to arbitrary GNN architectures and expressivity measures that partition graphs into equivalence classes. These results unify the study of expressivity and generalization in GNNs, providing a principled understanding of why increasing expressive power often comes at the cost of generalization.
♻ ☆ Interpretable Clustering with Adaptive Heterogeneous Causal Structure Learning in Mixed Observational Data
Understanding causal heterogeneity is essential for scientific discovery in domains such as biology and medicine. However, existing methods lack causal awareness, with insufficient modeling of heterogeneity, confounding, and observational constraints, leading to poor interpretability and difficulty distinguishing true causal heterogeneity from spurious associations. We propose an unsupervised framework, HCL (Interpretable Causal Mechanism-Aware Clustering with Adaptive Heterogeneous Causal Structure Learning), that jointly infers latent clusters and their associated causal structures from mixed-type observational data without requiring temporal ordering, environment labels, interventions or other prior knowledge. HCL relaxes the homogeneity and sufficiency assumptions by introducing an equivalent representation that encodes both structural heterogeneity and confounding. It further develops a bi-directional iterative strategy to alternately refine causal clustering and structure learning, along with a self-supervised regularization that balance cross-cluster universality and specificity. Together, these components enable convergence toward interpretable, heterogeneous causal patterns. Theoretically, we show identifiability of heterogeneous causal structures under mild conditions. Empirically, HCL achieves superior performance in both clustering and structure learning tasks, and recovers biologically meaningful mechanisms in real-world single-cell perturbation data, demonstrating its utility for discovering interpretable, mechanism-level causal heterogeneity.
♻ ☆ CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
The exponential growth in demand for GPU computing resources has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization that employs a novel contrastive RL algorithm. CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on A100, it delivers an average speedup of x3.12 with a median speedup of x1.42 against default baselines over across all 250 CUDA kernels of KernelBench, with peak speedups reaching x120. In addition to the default baseline provided by KernelBench, CUDA-L1 demonstrates x2.77 over Torch Compile, x2.88 over Torch Compile with reduce overhead, x2.81 over CUDA Graph implementations, and remarkably x7.72 over cuDNN libraries. Furthermore, the model also demonstrates portability across different GPU architectures. Beyond these benchmark results, CUDA-L1 demonstrates several properties: it 1) discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) uncovers fundamental principles of CUDA optimization, such as the multiplicative nature of optimizations; 3) identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that actually harm performance. The capabilities demonstrate that, RL can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources.
comment: Project Page: https://deepreinforce-ai.github.io/cudal1_blog/
♻ ☆ Mixture-of-Experts Meets In-Context Reinforcement Learning
In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in fully harnessing in-context learning within RL domains: the intrinsic multi-modality of the state-action-reward data and the diverse, heterogeneous nature of decision tasks. To tackle these challenges, we propose T2MIR (Token- and Task-wise MoE for In-context RL), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models. T2MIR substitutes the feedforward layer with two parallel layers: a token-wise MoE that captures distinct semantics of input tokens across multiple modalities, and a task-wise MoE that routes diverse tasks to specialized experts for managing a broad task distribution with alleviated gradient conflicts. To enhance task-wise routing, we introduce a contrastive learning method that maximizes the mutual information between the task and its router representation, enabling more precise capture of task-relevant information. The outputs of two MoE components are concatenated and fed into the next layer. Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines. We bring the potential and promise of MoE to ICRL, offering a simple and scalable architectural enhancement to advance ICRL one step closer toward achievements in language and vision communities. Our code is available at https://github.com/NJU-RL/T2MIR.
comment: 28 pages, 13 figures, 17 tables
♻ ☆ An unsupervised tour through the hidden pathways of deep neural networks
The goal of this thesis is to improve our understanding of the internal mechanisms by which deep artificial neural networks create meaningful representations and are able to generalize. We focus on the challenge of characterizing the semantic content of the hidden representations with unsupervised learning tools, partially developed by us and described in this thesis, which allow harnessing the low-dimensional structure of the data. Chapter 2. introduces Gride, a method that allows estimating the intrinsic dimension of the data as an explicit function of the scale without performing any decimation of the data set. Our approach is based on rigorous distributional results that enable the quantification of uncertainty of the estimates. Moreover, our method is simple and computationally efficient since it relies only on the distances among nearest data points. In Chapter 3, we study the evolution of the probability density across the hidden layers in some state-of-the-art deep neural networks. We find that the initial layers generate a unimodal probability density getting rid of any structure irrelevant to classification. In subsequent layers, density peaks arise in a hierarchical fashion that mirrors the semantic hierarchy of the concepts. This process leaves a footprint in the probability density of the output layer, where the topography of the peaks allows reconstructing the semantic relationships of the categories. In Chapter 4, we study the problem of generalization in deep neural networks: adding parameters to a network that interpolates its training data will typically improve its generalization performance, at odds with the classical bias-variance trade-off. We show that wide neural networks learn redundant representations instead of overfitting to spurious correlation and that redundant neurons appear only if the network is regularized and the training error is zero.
comment: PhD thesis
♻ ☆ Global urban visual perception varies across demographics and personalities
Understanding people's preferences is crucial for urban planning, yet current approaches often combine responses from multi-cultural populations, obscuring demographic differences and risking amplifying biases. We conducted a largescale urban visual perception survey of streetscapes worldwide using street view imagery, examining how demographics -- including gender, age, income, education, race and ethnicity, and personality traits -- shape perceptions among 1,000 participants with balanced demographics from five countries and 45 nationalities. This dataset, Street Perception Evaluation Considering Socioeconomics (SPECS), reveals demographic- and personality-based differences across six traditional indicators -- safe, lively, wealthy, beautiful, boring, depressing -- and four new ones -- live nearby, walk, cycle, green. Location-based sentiments further shape these preferences. Machine learning models trained on existing global datasets tend to overestimate positive indicators and underestimate negative ones compared to human responses, underscoring the need for local context. Our study aspires to rectify the myopic treatment of street perception, which rarely considers demographics or personality traits.
♻ ☆ A High-Dimensional Statistical Method for Optimizing Transfer Quantities in Multi-Source Transfer Learning NeurIPS 2025
Multi-source transfer learning provides an effective solution to data scarcity in real-world supervised learning scenarios by leveraging multiple source tasks. In this field, existing works typically use all available samples from sources in training, which constrains their training efficiency and may lead to suboptimal results. To address this, we propose a theoretical framework that answers the question: what is the optimal quantity of source samples needed from each source task to jointly train the target model? Specifically, we introduce a generalization error measure based on K-L divergence, and minimize it based on high-dimensional statistical analysis to determine the optimal transfer quantity for each source task. Additionally, we develop an architecture-agnostic and data-efficient algorithm OTQMS to implement our theoretical results for target model training in multi-source transfer learning. Experimental studies on diverse architectures and two real-world benchmark datasets show that our proposed algorithm significantly outperforms state-of-the-art approaches in both accuracy and data efficiency. The code and supplementary materials are available in https://github.com/zqy0126/OTQMS.
comment: NeurIPS 2025 Poster
♻ ☆ PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models
Post-training quantization (PTQ) of large language models (LLMs) to extremely low bit-widths remains challenging due to the fundamental trade-off between computational efficiency and model expressiveness. While existing ultra-low-bit PTQ methods rely on binary approximations or complex compensation mechanisms, they suffer from either limited representational capacity or computational overhead that undermines their efficiency gains. We introduce PTQ to Trit-Planes (PTQTP), the first ternary-weight PTQ framework that decomposes weight matrices into structured ternary {-1, 0, 1} trit-planes using 2x1.58-bit representation. PTQTP achieves multiplication-free inference, identical to 1-bit quantization, while maintaining superior expressiveness through its novel structured decomposition. Our approach provides: (1) a theoretically grounded progressive approximation algorithm ensuring global weight consistency; (2) model-agnostic deployment across diverse modern LLMs without architectural modifications; and (3) uniform ternary operations that eliminate the need for mixed-precision or compensation schemes. Comprehensive experiments across LLaMA3.x and Qwen3 model families (0.6B-70B parameters) demonstrate that PTQTP significantly outperforms existing low-bit PTQ methods, achieving 82.4% mathematical reasoning retention versus 0% for competing approaches. PTQTP approaches and sometimes surpasses 1.58-bit quantization-aware training performance while requiring only single-hour quantization compared to 10-14 GPU days for training-based methods. These results establish PTQTP as a practical solution for efficient LLM deployment in resource-constrained environments. The code will be available at https://github.com/HeXiao-55/PTQTP.
comment: under review
Human-Computer Interaction 32
☆ ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?
Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.
☆ Eye-Tracking, Mouse Tracking, Stimulus Tracking,and Decision-Making Datasets in Digital Pathology
Interpretation of giga-pixel whole-slide images (WSIs) is an important but difficult task for pathologists. Their diagnostic accuracy is estimated to average around 70%. Adding a second pathologist does not substantially improve decision consistency. The field lacks adequate behavioral data to explain diagnostic errors and inconsistencies. To fill in this gap, we present PathoGaze1.0, a comprehensive behavioral dataset capturing the dynamic visual search and decision-making processes of the full diagnostic workflow during cancer diagnosis. The dataset comprises 18.69 hours of eye-tracking, mouse interaction, stimulus tracking, viewport navigation, and diagnostic decision data (EMSVD) collected from 19 pathologists interpreting 397 WSIs. The data collection process emphasizes ecological validity through an application-grounded testbed, called PTAH. In total, we recorded 171,909 fixations, 263,320 saccades, and 1,867,362 mouse interaction events. In addition, such data could also be used to improve the training of both pathologists and AI systems that might support human experts. All experiments were preregistered at https://osf.io/hj9a7, and the complete dataset along with analysis code is available at https://go.osu.edu/pathogaze.
comment: 16 pages, 9 figures, submitted to Nature Scientific Data
☆ What Does It Take? Developing a Smartphone App that Motivates Older Adults to be Physically Active
Maintaining physical activity is essential for older adults' health and well-being, yet participation remains low. Traditional paper-based and in-person interventions have been effective but face scalability issues. Smartphone apps offer a potential solution, but their effectiveness in real-world use remains underexplored. Most prior studies take place in controlled environments, use specialized hardware, or rely on in-person training sessions or researcher-led setup. This study examines the feasibility and engagement of Senior Fit, a standalone mobile fitness app designed for older adults. We conducted continuous testing with 25 participants aged 65-85, refining the app based on their feedback to improve usability and accessibility. Our findings underscore both the potential and key challenges in designing digital health interventions. Older adults valued features such as video demonstrations and reminders that made activity feel accessible and motivating, yet some expressed frustration with manual logging and limited personalization. The Facebook group provided encouragement for some but excluded others unfamiliar with the platform. These results highlight the need for fitness apps that integrate flexible tracking, clear feedback, and low-barrier social support. We contribute design recommendations for creating inclusive mobile fitness tools that align with older adults' routines and capabilities, offering insights for future long-term, real-world deployments.
comment: 28 pages
☆ Detecting the Use of Generative AI in Crowdsourced Surveys: Implications for Data Integrity SC
The widespread adoption of generative AI (GenAI) has introduced new challenges in crowdsourced data collection, particularly in survey-based research. While GenAI offers powerful capabilities, its unintended use in crowdsourcing, such as generating automated survey responses, threatens the integrity of empirical research and complicates efforts to understand public opinion and behavior. In this study, we investigate and evaluate two approaches for detecting AI-generated responses in online surveys: LLM-based detection and signature-based detection. We conducted experiments across seven survey studies, comparing responses collected before 2022 with those collected after the release of ChatGPT. Our findings reveal a significant increase in AI-generated responses in the post-2022 studies, highlighting how GenAI may silently distort crowdsourced data. This work raises broader concerns about evolving landscape of data integrity, where GenAI can compromise data quality, mislead researchers, and influence downstream findings in fields such as health, politics, and social behavior. By surfacing detection strategies and empirical evidence of GenAI's impact, we aim to contribute to ongoing conversation about safeguarding research integrity and supporting scholars navigating these methodological and ethical challenges.
comment: Accepted by CSCW 2025 workshop Beyond Information: Online Participatory Culture and Information Disorder
☆ OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents.
comment: work in progress
☆ AI for a Planet Under Pressure
Artificial intelligence (AI) is already driving scientific breakthroughs in a variety of research fields, ranging from the life sciences to mathematics. This raises a critical question: can AI be applied both responsibly and effectively to address complex and interconnected sustainability challenges? This report is the result of a collaboration between the Stockholm resilience Centre (Stockholm University), the Potsdam Institute for Climate Impact Research (PIK), and Google DeepMind. Our work explores the potential and limitations of using AI as a research method to help tackle eight broad sustainability challenges. The results build on iterated expert dialogues and assessments, a systematic AI-supported literature overview including over 8,500 academic publications, and expert deep-dives into eight specific issue areas. The report also includes recommendations to sustainability scientists, research funders, the private sector, and philanthropies.
comment: 88 pages, 8 figures, 1 table
☆ Developer Productivity with GenAI
Generative AI (GenAI) tools are increasingly being adopted in software development as productivity aids. However, evidence regarding where and when these tools actually enhance productivity is unclear. In this paper, we investigate how GenAI adoption affects different dimensions of developer productivity. We surveyed 415 software practitioners to capture their perceptions of productivity changes associated with AI-assisted development using the SPACE framework - Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow. Our results, disaggregated by frequency of AI usage, reveal limited overall productivity change, highlighting the productivity paradox in which developers become faster but do not necessarily create better software or feel more fulfilled.
☆ Advancing Interdisciplinary Approaches to Online Safety Research
The growing prevalence of negative experiences in online spaces demands urgent attention from the human-computer interaction (HCI) community. However, research on online safety remains fragmented across different HCI subfields, with limited communication and collaboration between disciplines. This siloed approach risks creating ineffective responses, including design solutions that fail to meet the diverse needs of users, and policy efforts that overlook critical usability concerns. This workshop aims to foster interdisciplinary dialogue on online safety by bringing together researchers from within and beyond HCI - including but not limited to Social Computing, Digital Design, Internet Policy, Cybersecurity, Ethics, and Social Sciences. By uniting researchers, policymakers, industry practitioners, and community advocates we aim to identify shared challenges in online safety research, highlight gaps in current knowledge, and establish common research priorities. The workshop will support the development of interdisciplinary research plans and establish collaborative environments - both within and beyond Australia - to action them.
☆ Building AI Literacy at Home: How Families Navigate Children's Self-Directed Learning with AI
As generative AI becomes embedded in children's learning spaces, families face new challenges in guiding its use. Middle childhood (ages 7-13) is a critical stage where children seek autonomy even as parental influence remains strong. Using self-directed learning (SDL) as a lens, we examine how parents perceive and support children's developing AI literacy through focus groups with 13 parent-child pairs. Parents described evolving phases of engagement driven by screen time, self-motivation, and growing knowledge. While many framed AI primarily as a study tool, few considered its non-educational roles or risks, such as privacy and infrastructural embedding. Parents also noted gaps in their own AI understanding, often turning to joint exploration and engagement as a form of co-learning. Our findings reveal how families co-construct children's AI literacy, exposing tensions between practical expectations and critical literacies, and provide design implications that foster SDL while balancing autonomy and oversight.
☆ VR-Assisted Guide Dog Training: A 360° PanoHaptic System for Right-Hand Commands Analysis
This paper presents a VR-based guide dog training system designed to assist novice trainers in understanding guide dog behavior and issuing appropriate training commands. Guide dogs play a vital role in supporting independent mobility for visually impaired individuals, yet the limited number of skilled trainers restricts their availability. Training is highly demanding, requiring accurate observation of the dog's status and precise command issuance, especially through right-hand gestures. While the trainer's left hand holds the harness to perceive haptic cues, the right hand is used to indicate directions, maintain attention, and provide comfort, with motion patterns varying by scenario and the dog's progress. Currently, novices learn mainly by observing experts or watching videos, which lacks immersion and makes it difficult to adopt the trainer's perspective for understanding behavior or synchronizing command timing. To address these limitations, the proposed system introduces a VR-based assistive platform integrating panoramic visuals and haptic feedback to create an immersive training environment. The visual module provides contextual guidance, including cues for command execution and real-time comparison of the user's posture with standard actions, while the haptic module delivers tactile feedback for command gestures. Users can re-experience training sessions across diverse scenarios and dog proficiency levels, allowing independent and repeated practice. By improving the timing, accuracy, and expressiveness of right-hand commands, the system aims to accelerate skill acquisition, enhance training quality, and mitigate the shortage of qualified trainers, ultimately increasing the availability of guide dogs for visually impaired individuals.
comment: 9 pages, 9 figures
☆ Understanding Reader Perception Shifts upon Disclosure of AI Authorship
As AI writing support becomes ubiquitous, how disclosing its use affects reader perception remains a critical, underexplored question. We conducted a study with 261 participants to examine how revealing varying levels of AI involvement shifts author impressions across six distinct communicative acts. Our analysis of 990 responses shows that disclosure generally erodes perceptions of trustworthiness, caring, competence, and likability, with the sharpest declines in social and interpersonal writing. A thematic analysis of participants' feedback links these negative shifts to a perceived loss of human sincerity, diminished author effort, and the contextual inappropriateness of AI. Conversely, we find that higher AI literacy mitigates these negative perceptions, leading to greater tolerance or even appreciation for AI use. Our results highlight the nuanced social dynamics of AI-mediated authorship and inform design implications for creating transparent, context-sensitive writing systems that better preserve trust and authenticity.
☆ Modeling Object Attention in Mobile AR for Intrinsic Cognitive Security
We study attention in mobile Augmented Reality (AR) using object recall as a proxy outcome. We observe that the ability to recall an object (physical or virtual) that was encountered in a mobile AR experience depends on many possible impact factors and attributes, with some objects being readily recalled while others are not, and some people recalling objects overall much better or worse than others. This opens up a potential cognitive attack in which adversaries might create conditions that make an AR user not recall certain potentially mission-critical objects. We explore whether a calibrated predictor of object recall can help shield against such cognitive attacks. We pool data from four mobile AR studies (with a total of 1,152 object recall probes) and fit a Partial Least Squares Structural Equation Model (PLS-SEM) with formative Object, Scene, and User State composites predicting recall, also benchmarking against Random Forest and multilayer perceptron classifiers. PLS-SEM attains the best F1 score in three of four studies. Additionally, path estimates identify lighting, augmentation density, AR registration stability, cognitive load, and AR familiarity as primary drivers. The model outputs per-object recall probabilities that can drive interface adjustments when predicted recall falls. Overall, PLS-SEM provides competitive accuracy with interpretable levers for design and evaluation in mobile AR.
comment: Conference Paper, 5 pages. Published at the 2025 ACM the International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing (MobiHoc)
☆ Toward Socially-Aware LLMs: A Survey of Multimodal Approaches to Human Behavior Understanding
LLM-powered multimodal systems are increasingly used to interpret human social behavior, yet how researchers apply the models' 'social competence' remains poorly understood. This paper presents a systematic literature review of 176 publications across different application domains (e.g., healthcare, education, and entertainment). Using a four-dimensional coding framework (application, technical, evaluative, and ethical), we find (1) frequent use of pattern recognition and information extraction from multimodal sources, but limited support for adaptive, interactive reasoning; (2) a dominant 'modality-to-text' pipeline that privileges language over rich audiovisual cues, striping away nuanced social cues; (3) evaluation practices reliant on static benchmarks, with socially grounded, human-centered assessments rare; and (4) Ethical discussions focused mainly on legal and rights-related risks (e.g., privacy), leaving societal risks (e.g., deception) overlooked--or at best acknowledged but left unaddressed. We outline a research agenda for evaluating socially competent, ethically informed, and interaction-aware multi-modal systems.
☆ Towards Human-AI Synergy in Requirements Engineering: A Framework and Preliminary Study
The future of Requirements Engineering (RE) is increasingly driven by artificial intelligence (AI), reshaping how we elicit, analyze, and validate requirements. Traditional RE is based on labor-intensive manual processes prone to errors and complexity. AI-powered approaches, specifically large language models (LLMs), natural language processing (NLP), and generative AI, offer transformative solutions and reduce inefficiencies. However, the use of AI in RE also brings challenges like algorithmic bias, lack of explainability, and ethical concerns related to automation. To address these issues, this study introduces the Human-AI RE Synergy Model (HARE-SM), a conceptual framework that integrates AI-driven analysis with human oversight to improve requirements elicitation, analysis, and validation. The model emphasizes ethical AI use through transparency, explainability, and bias mitigation. We outline a multi-phase research methodology focused on preparing RE datasets, fine-tuning AI models, and designing collaborative human-AI workflows. This preliminary study presents the conceptual framework and early-stage prototype implementation, establishing a research agenda and practical design direction for applying intelligent data science techniques to semi-structured and unstructured RE data in collaborative environments.
comment: Accepted at the 2025 Sixth International Conference on Intelligent Data Science Technologies and Applications (IDSTA 2025),8 pages, 4 figures. Published in IEEE
☆ OrchVis: Hierarchical Multi-Agent Orchestration for Human Oversight
We introduce OrchVis, a multi-agent orchestration framework that visualizes, verifies, and coordinates goal-driven collaboration among LLM-based agents. Through hierarchical goal alignment, task assignment, and conflict resolution, OrchVis enables humans to supervise complex multi-agent workflows without micromanaging each step. The system parses user intent into structured goals, monitors execution via automated verification, and exposes inter-agent dependencies through an interactive planning panel. When conflicts arise, users can explore system-proposed alternatives and selectively replan. OrchVis advances human-centered design for multi-agent systems by combining transparent visualization with adaptive autonomy.
☆ Efficiency Without Cognitive Change: Evidence from Human Interaction with Narrow AI Systems
The growing integration of artificial intelligence (AI) into human cognition raises a fundamental question: does AI merely improve efficiency, or does it alter how we think? This study experimentally tested whether short-term exposure to narrow AI tools enhances core cognitive abilities or simply optimizes task performance. Thirty young adults completed standardized neuropsychological assessments embedded in a seven-week protocol with a four-week online intervention involving problem-solving and verbal comprehension tasks, either with or without AI support (ChatGPT). While AI-assisted participants completed several tasks faster and more accurately, no significant pre-post differences emerged in standardized measures of problem solving or verbal comprehension. These results demonstrate efficiency gains without cognitive change, suggesting that current narrow AI systems serve as cognitive scaffolds extending performance without transforming underlying mental capacities. The findings highlight the need for ethical and educational frameworks that promote critical and autonomous thinking in an increasingly AI-augmented cognitive ecology.
comment: 30 pages, 8 figures. Preprint submitted for peer review (not yet accepted or published)
☆ The Narrative Continuity Test: A Conceptual Framework for Evaluating Identity Persistence in AI Systems
Artificial intelligence systems based on large language models (LLMs) can now generate coherent text, music, and images, yet they operate without a persistent state: each inference reconstructs context from scratch. This paper introduces the Narrative Continuity Test (NCT) -- a conceptual framework for evaluating identity persistence and diachronic coherence in AI systems. Unlike capability benchmarks that assess task performance, the NCT examines whether an LLM remains the same interlocutor across time and interaction gaps. The framework defines five necessary axes -- Situated Memory, Goal Persistence, Autonomous Self-Correction, Stylistic & Semantic Stability, and Persona/Role Continuity -- and explains why current architectures systematically fail to support them. Case analyses (Character.AI, Grok, Replit, Air Canada) show predictable continuity failures under stateless inference. The NCT reframes AI evaluation from performance to persistence, outlining conceptual requirements for future benchmarks and architectural designs that could sustain long-term identity and goal coherence in generative models.
comment: 35 pages, 127 references
♻ ☆ Datasheets for Machine Learning Sensors
Machine learning (ML) is becoming prevalent in embedded AI sensing systems. These "ML sensors" enable context-sensitive, real-time data collection and decision-making across diverse applications ranging from anomaly detection in industrial settings to wildlife tracking for conservation efforts. As such, there is a need to provide transparency in the operation of such ML-enabled sensing systems through comprehensive documentation. This is needed to enable their reproducibility, to address new compliance and auditing regimes mandated in regulation and industry-specific policy, and to verify and validate the responsible nature of their operation. To address this gap, we introduce the datasheet for ML sensors framework. We provide a comprehensive template, collaboratively developed in academia-industry partnerships, that captures the distinct attributes of ML sensors, including hardware specifications, ML model and dataset characteristics, end-to-end performance metrics, and environmental impacts. Our framework addresses the continuous streaming nature of sensor data, real-time processing requirements, and embeds benchmarking methodologies that reflect real-world deployment conditions, ensuring practical viability. Aligned with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability), our approach enhances the transparency and reusability of ML sensor documentation across academic, industrial, and regulatory domains. To show the application of our approach, we present two datasheets: the first for an open-source ML sensor designed in-house and the second for a commercial ML sensor developed by industry collaborators, both performing computer vision-based person detection.
♻ ☆ Complementary Human-AI Clinical Reasoning in Ophthalmology
Vision impairment and blindness are a major global health challenge where gaps in the ophthalmology workforce limit access to specialist care. We evaluate AMIE, a medically fine-tuned conversational system based on Gemini with integrated web search and self-critique reasoning, using real-world clinical vignettes that reflect scenarios a general ophthalmologist would be expected to manage. We conducted two complementary evaluations: (1) a human-AI interactive diagnostic reasoning study in which ophthalmologists recorded initial differentials and plans, then reviewed AMIE's structured output and revised their answers; and (2) a masked preference and quality study comparing AMIE's narrative outputs with case author reference answers using a predefined rubric. AMIE showed standalone diagnostic performance comparable to clinicians at baseline. Crucially, after reviewing AMIE's responses, ophthalmologists tended to rank the correct diagnosis higher, reached greater agreement with one another, and enriched their investigation and management plans. Improvements were observed even when AMIE's top choice differed from or underperformed the clinician baseline, consistent with a complementary effect in which structured reasoning support helps clinicians re-rank rather than simply accept the model output. Preferences varied by clinical grade, suggesting opportunities to personalise responses by experience. Without ophthalmology-specific fine-tuning, AMIE matched clinician baseline and augmented clinical reasoning at the point of need, motivating multi-axis evaluation, domain adaptation, and prospective multimodal studies in real-world settings.
♻ ☆ FieldGen: From Teleoperated Pre-Manipulation Trajectories to Field-Guided Data Generation
Large-scale and diverse datasets are vital for training robust robotic manipulation policies, yet existing data collection methods struggle to balance scale, diversity, and quality. Simulation offers scalability but suffers from sim-to-real gaps, while teleoperation yields high-quality demonstrations with limited diversity and high labor cost. We introduce FieldGen, a field-guided data generation framework that enables scalable, diverse, and high-quality real-world data collection with minimal human supervision. FieldGen decomposes manipulation into two stages: a pre-manipulation phase, allowing trajectory diversity, and a fine manipulation phase requiring expert precision. Human demonstrations capture key contact and pose information, after which an attraction field automatically generates diverse trajectories converging to successful configurations. This decoupled design combines scalable trajectory diversity with precise supervision. Moreover, FieldGen-Reward augments generated data with reward annotations to further enhance policy learning. Experiments demonstrate that policies trained with FieldGen achieve higher success rates and improved stability compared to teleoperation-based baselines, while significantly reducing human effort in long-term real-world data collection. Webpage is available at https://fieldgen.github.io/.
comment: Webpage: https://fieldgen.github.io/
♻ ☆ A Scoping Review of Functional Near-Infrared Spectroscopy (fNIRS) Applications in Game-integrated Learning Systems
Functional Near-Infrared Spectroscopy (fNIRS) has emerged as a valuable tool to investigate cognitive and emotional processes during learning. We focus specifically on game-integrated learning systems as the context for fNIRS-based brain data analysis. We selected game-integrated learning systems because such systems make learning more engaging, interactive, and immersive, all of which are critical features for adaptive learning design. The goal of this scoping review is to help researchers understand how fNIRS has been used so far to study brain activity in game-integrated learning systems. We also aim to show how brain data captured through fNIRS can support the development of adaptive learning systems by monitoring learners' cognitive states. Using the PRISMA-ScR framework, 1300 papers were screened, and 21 empirical studies were selected for in-depth analysis. Studies were categorized as affective/cognitive response studies or comparative studies, and further analyzed by learning platform, game device, fNIRS configuration, outcome measures, and study design. The findings reveal that game-integrated learning systems can be as effective as traditional methods in improving engagement and involvement. The findings also show that fNIRS offers valuable insights into cognitive states, but it has not yet been widely implemented in real-time adaptive systems. We identify key challenges in standardization and data interpretation and highlight the potential of fNIRS for developing brain-aware, interactive learning environments. This review offers insights to guide future research on using brain data to support adaptive learning and intelligent system design.
comment: 28 pages, 3 figures
♻ ☆ Hybrid Deep Learning Model to Estimate Cognitive Effort from fNIRS Signals
This study estimates cognitive effort based on functional near-infrared spectroscopy data and performance scores using a hybrid DeepNet model. The estimation of cognitive effort enables educators to modify material to enhance learning effectiveness and student engagement. In this study, we collected oxygenated hemoglobin using functional near-infrared spectroscopy during an educational quiz game. Participants (n=16) responded to 16 questions in a Unity-based educational game, each within a 30-second response time limit. We used DeepNet models to predict the performance score from the oxygenated hemoglobin, and compared traditional machine learning and DeepNet models to determine which approach provides better accuracy in predicting performance scores. The result shows that the proposed CNN-GRU gives better performance with 73% than other models. After the prediction, we used the predicted score and the oxygenated hemoglobin to observe cognitive effort by calculating relative neural efficiency and involvement in our test cases. Our result shows that even with moderate accuracy, the predicted cognitive effort closely follow the actual trends. This findings can be helpful in designing and improving learning environments and provide valuable insights into learning materials.
♻ ☆ Understanding AI Trustworthiness: A Scoping Review of AIES & FAccT Articles
Background: Trustworthy AI serves as a foundational pillar for two major AI ethics conferences: AIES and FAccT. However, current research often adopts techno-centric approaches, focusing primarily on technical attributes such as reliability, robustness, and fairness, while overlooking the sociotechnical dimensions critical to understanding AI trustworthiness in real-world contexts. Objectives: This scoping review aims to examine how the AIES and FAccT communities conceptualize, measure, and validate AI trustworthiness, identifying major gaps and opportunities for advancing a holistic understanding of trustworthy AI systems. Methods: We conduct a scoping review of AIES and FAccT conference proceedings to date, systematically analyzing how trustworthiness is defined, operationalized, and applied across different research domains. Our analysis focuses on conceptualization approaches, measurement methods, verification and validation techniques, application areas, and underlying values. Results: While significant progress has been made in defining technical attributes such as transparency, accountability, and robustness, our findings reveal critical gaps. Current research often predominantly emphasizes technical precision at the expense of social and ethical considerations. The sociotechnical nature of AI systems remains less explored and trustworthiness emerges as a contested concept shaped by those with the power to define it. Conclusions: An interdisciplinary approach combining technical rigor with social, cultural, and institutional considerations is essential for advancing trustworthy AI. We propose actionable measures for the AI ethics community to adopt holistic frameworks that genuinely address the complex interplay between AI systems and society, ultimately promoting responsible technological development that benefits all stakeholders.
comment: Submitted to Journal of Artificial Intelligence Research (JAIR)
♻ ☆ OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions
In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task designed to produce synchronized verbal and non-verbal listener feedback online, based on the speaker's multimodal inputs. OMCRG captures natural dyadic interactions and introduces new challenges in aligning generated audio with listeners' facial responses. To tackle these challenges, we incorporate text as an intermediate modality to connect audio and facial responses. We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. OmniResponse leverages a pretrained LLM enhanced with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module that outputs speech synchronized with facial responses. To advance OMCRG research, we offer ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors. Comprehensive evaluations on ResponseNet demonstrate that OmniResponse outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality. Our dataset, code, and models are publicly available.
comment: 25 pages, 9 figures
♻ ☆ Privacy Perspectives and Practices of Chinese Smart Home Product Teams
Previous research has explored the privacy needs and concerns of device owners, primary users, and different bystander groups with regard to smart home devices like security cameras, smart speakers, and hubs, but little is known about the privacy views and practices of smart home product teams, particularly those in non-Western contexts. This paper presents findings from 27 semi-structured interviews with Chinese smart home product team members, including product/project managers, software/hardware engineers, user experience (UX) designers, legal/privacy experts, and marketers/operation specialists. We examine their privacy perspectives, practices, and risk mitigation strategies. Our results show that participants emphasized compliance with Chinese data privacy laws, which typically prioritized national security over individual privacy rights. China-specific cultural, social, and legal factors also influenced participants' ethical considerations and attitudes toward balancing user privacy and security with convenience. Drawing on our findings, we propose a set of recommendations for smart home product teams, along with socio-technical and legal interventions to address smart home privacy issues-especially those belonging to at-risk groups-in Chinese multi-user smart homes.
comment: Paper accepted for presentation at IEEE Symposium on Security and Privacy (S&P) 2026
♻ ☆ The Software Diversity Card: A Framework for Reporting Diversity in Software Projects
Context: Interest in diversity in software development has significantly increased in recent years. Reporting on diversity in software projects can enhance user trust and assist regulators in evaluating adoption. Recent AI directives include clauses that mandate diversity information during development, highlighting the growing interest of public regulators. However, current documentation often neglects diversity in favor of technical features, partly due to a lack of tools for its description and annotation. Objectives: This work introduces the Software Diversity Card, a structured approach for documenting and sharing diversity-related aspects within software projects. It aims to profile the various teams involved in software development and governance, including user groups in testing and software adaptations for diverse social groups. Methods: We conducted a literature review on diversity and inclusion in software development and analyzed 1,000 top-starred Open Source Software (OSS) repositories on GitHub to identify diversity-related information. Moreover, we present a diversity modeling language, a toolkit for generating cards using it, and a study of its application in two real-world software projects. Results: Despite the growing awareness of diversity in the research community, our analysis found a notable lack of diversity reporting in OSS projects. Applying the card to real-world examples highlighted challenges such as balancing anonymity and transparency, managing sensitive data, and ensuring authenticity. Conclusion: Our proposal can enhance diversity practices in software development, support public administrations in software assessment, and help businesses promote diversity as a key asset.
comment: To appear
♻ ☆ Unlearning Comparator: A Visual Analytics System for Comparative Evaluation of Machine Unlearning Methods
Machine Unlearning (MU) aims to remove target training data from a trained model so that the removed data no longer influences the model's behavior, fulfilling "right to be forgotten" obligations under data privacy laws. Yet, we observe that researchers in this rapidly emerging field face challenges in analyzing and understanding the behavior of different MU methods, especially in terms of three fundamental principles in MU: accuracy, efficiency, and privacy. Consequently, they often rely on aggregate metrics and ad-hoc evaluations, making it difficult to accurately assess the trade-offs between methods. To fill this gap, we introduce a visual analytics system, Unlearning Comparator, designed to facilitate the systematic evaluation of MU methods. Our system supports two important tasks in the evaluation process: model comparison and attack simulation. First, it allows the user to compare the behaviors of two models, such as a model generated by a certain method and a retrained baseline, at class-, instance-, and layer-levels to better understand the changes made after unlearning. Second, our system simulates membership inference attacks (MIAs) to evaluate the privacy of a method, where an attacker attempts to determine whether specific data samples were part of the original training set. We evaluate our system through a case study visually analyzing prominent MU methods and demonstrate that it helps the user not only understand model behaviors but also gain insights that can inform the improvement of MU methods. The source code is publicly available at https://github.com/gnueaj/Machine-Unlearning-Comparator.
comment: Submitted to IEEE Transactions on Visualization and Computer Graphics (TVCG), under review. 15 pages. This work has been submitted to the IEEE for possible publication
♻ ☆ Scalable Exploration via Ensemble++ NeurIPS 2025
Thompson Sampling is a principled method for balancing exploration and exploitation, but its real-world adoption faces computational challenges in large-scale or non-conjugate settings. While ensemble-based approaches offer partial remedies, they typically require prohibitively large ensemble sizes. We propose Ensemble++, a scalable exploration framework using a novel shared-factor ensemble architecture with random linear combinations. For linear bandits, we provide theoretical guarantees showing that Ensemble++ achieves regret comparable to exact Thompson Sampling with only $\Theta(d \log T)$ ensemble sizes--significantly outperforming prior methods. Crucially, this efficiency holds across both compact and finite action sets with either time-invariant or time-varying contexts without configuration changes. We extend this theoretical foundation to nonlinear rewards by replacing fixed features with learnable neural representations while preserving the same incremental update principle, effectively bridging theory and practice for real-world tasks. Comprehensive experiments across linear, quadratic, neural, and GPT-based contextual bandits validate our theoretical findings and demonstrate Ensemble++'s superior regret-computation tradeoff versus state-of-the-art methods.
comment: NeurIPS 2025
♻ ☆ MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning
Remote sensing images contain complex spatial patterns and semantic structures, which makes the captioning model difficult to accurately describe. Encoder-decoder architectures have become the widely used approach for RSIC by translating visual content into descriptive text. However, many existing methods rely on a single-stream architecture, which weakens the model to accurately describe the image. Such single-stream architectures typically struggle to extract diverse spatial features or capture complex semantic relationships, limiting their effectiveness in scenes with high intraclass similarity or contextual ambiguity. In this work, we propose a novel Multi-stream Encoder-decoder Framework (MsEdF) which improves the performance of RSIC by optimizing both the spatial representation and language generation of encoder-decoder architecture. The encoder fuses information from two complementary image encoders, thereby promoting feature diversity through the integration of multiscale and structurally distinct cues. To improve the capture of context-aware descriptions, we refine the input sequence's semantic modeling on the decoder side using a stacked GRU architecture with an element-wise aggregation scheme. Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.
♻ ☆ NeuroPilot: A Realtime Brain-Computer Interface system to enhance concentration of students in online learning
The prevalence of online learning poses a vital challenge in real-time monitoring of students' concentration. Traditional methods such as questionnaire assessments require manual intervention, and webcam-based monitoring fails to provide accurate insights about learners' mental focus as it is deceived by mere screen fixation without cognitive engagement. Existing BCI-based approaches lack real-time validation and evaluation procedures. To address these limitations, a Brain-Computer Interface (BCI) system is developed using a non-invasive Electroencephalogram (EEG) headband, FocusCalm, to record brainwave activity under attentive and non-attentive states. 20 minutes of data were collected from each of 20 participants watching a pre-recorded educational video. The data validation employed a novel intra-video questionnaire assessment. Subsequently, collected signals were segmented (sliding window), filtered (Butterworth bandpass), and cleaned (removal of high-amplitude and EOG artifacts such as eye blinks). Time, frequency, wavelet, and statistical features were extracted, followed by recursive feature elimination (RFE) with support vector machines (SVMs) to classify attention and non-attention states. The leave-one-subject-out (LOSO) cross-validation accuracy was found to be 88.77%. The system provides feedback alerts upon detection of a non-attention state and maintains focus profile logs. A pilot study was conducted to evaluate the effectiveness of real-time feedback. Five participants underwent a 10-minute session comprising a 5-minute baseline phase devoid of feedback, succeeded by a 5-minute feedback phase, during which alerts were activated if participants exhibited inattention for approximately 8 consecutive seconds. A paired t-test (t = 5.73, p = 0.007) indicated a statistically significant improvement in concentration during the feedback phase.
comment: V2: Author list updated with consent from all authors. Minor grammatical errors fixed. This work has been submitted to the IEEE for possible publication
♻ ☆ Security Implications of User Non-compliance Behavior to Software Updates: A Risk Assessment Study
Software updates are essential to enhance security, fix bugs, and add better features to the existing software. While some users accept software updates, non-compliance remains a widespread issue. While some users accept software updates, non-compliance remains a widespread issue. End users' systems remain vulnerable to security threats when security updates are not installed or are installed with a delay. Despite research efforts, users' noncompliance behavior with software updates is still prevalent. In this study, we explored how psychological factors influence users' perception and behavior toward software updates. In addition, we investigated how information about potential vulnerabilities and risk scores influences their behavior. Next, we proposed a model that utilizes attributes from the National Vulnerability Database (NVD) to effectively assess the overall risk score associated with delaying software updates. Next, we conducted a user study with Windows OS users, showing that providing a risk score for not updating their systems and information about vulnerabilities significantly increased users' willingness to update their systems. Additionally, we examined the influence of demographic factors, gender, on users' decision-making regarding software updates. Our results show no statistically significant difference in male and female users' responses in terms of concerns about securing their systems. The implications of this study are relevant for software developers and manufacturers as they can use this information to design more effective software update notification messages. The communication of the potential risks and their corresponding risk scores may motivate users to take action and update their systems in a timely manner, which can ultimately improve the overall security of the system.
comment: 15 page, 10 tables, 6 figures
♻ ☆ Reciprocity Deficits: Observing AI in the street with everyday publics
The street has emerged as a primary site where everyday publics are confronted with AI as an infrastructural phenomenon, as machine learning-based systems are now commonly deployed in this setting in the form of automated cars, facial recognition, smart billboards and the like. While these deployments of AI in the street have attracted significant media attention and public controversy in recent years, the presence of AI in the street often remains inscrutable, and many everyday publics are unaware of it. In this paper, we explore the challenges and possibilities of everyday public engagement with AI in the situated environment of city streets under these paradoxical conditions. Combining perspectives and approaches from social and cultural studies of AI, Design Research and Science and Technology Studies (STS), we explore the affordances of the street as a site for 'material participation' in AI through design-based interventions: the creation of 'everyday AI observatories.' We narrate and reflect on our participatory observations of AI in five city streets in the UK and Australia and highlight a set of tensions that emerged from them: 1) the framing of the street as a transactional environment, 2) the designed invisibility of AI and its publics in the street 3) the stratification of street environments through statistical governance. Based on this discussion and drawing on Jane Jacobs' notion of "eyes on the street," we put forward the relational notion of "reciprocity deficits" between AI infrastructures and everyday publics in the street. The conclusion reflects on the consequences of this form of social invisibility of AI for situated engagement with AI by everyday publics in the street and for public trust in urban governance.
comment: 13 pages, 4 Figures
Software Engineering 29
☆ ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?
Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.
☆ The Divine Software Engineering Comedy -- Inferno: The Okinawa Files
In June 2024 I co-organized the FUture of Software Engineering symposium in Okinawa, Japan. Me, Andrian Marcus, Takashi Kobayashi and Shinpei Hayashi were general chairs, Nicole Novielli, Kevin Moran, Yutaro Kashiwa and Masanari Kondo were program chairs, some members of my group, Carmen Armenti, Stefano Campanella, Roberto Minelli, were the tables, can't have a room with only chairs, after all. We invited a crowd of people to discuss what future software engineering has. FUSE became a 3-day marathon on whether there is actually a future at all for SE. This essay is a slightly dark take about what I saw at that event, very loosely based on the discussions that took place, adding some healthy sarcasm and cynicism, the intellectual salt and pepper I never seem to run out of. I listened to the brilliant people who gathered to talk about where we're headed, and distilled three nightmares headed in our direction: software makers who don't know what they're doing, but get the job done anyway, a field moving so fast it can't remember its own lessons, and technologies multiplying like rabbits in Spring. So, let's start. The future, eh? The future of software engineering looks like a car crash in slow motion: you can see it coming but you can't look away. The thing is...
☆ Affordance Representation and Recognition for Autonomous Agents
The autonomy of software agents is fundamentally dependent on their ability to construct an actionable internal world model from the structured data that defines their digital environment, such as the Document Object Model (DOM) of web pages and the semantic descriptions of web services. However, constructing this world model from raw structured data presents two critical challenges: the verbosity of raw HTML makes it computationally intractable for direct use by foundation models, while the static nature of hardcoded API integrations prevents agents from adapting to evolving services. This paper introduces a pattern language for world modeling from structured data, presenting two complementary architectural patterns. The DOM Transduction Pattern addresses the challenge of web page complexity by distilling} a verbose, raw DOM into a compact, task-relevant representation or world model optimized for an agent's reasoning core. Concurrently, the Hypermedia Affordances Recognition Pattern enables the agent to dynamically enrich its world model by parsing standardized semantic descriptions to discover and integrate the capabilities of unknown web services at runtime. Together, these patterns provide a robust framework for engineering agents that can efficiently construct and maintain an accurate world model, enabling scalable, adaptive, and interoperable automation across the web and its extended resources.
☆ CodeWiki: Automated Repository-Level Documentation at Scale
Developers spend nearly 58% of their time understanding codebases, yet maintaining comprehensive documentation remains challenging due to complexity and manual effort. While recent Large Language Models (LLMs) show promise for function-level documentation, they fail at the repository level, where capturing architectural patterns and cross-module interactions is essential. We introduce CodeWiki, the first open-source framework for holistic repository-level documentation across seven programming languages. CodeWiki employs three innovations: (i) hierarchical decomposition that preserves architectural context, (ii) recursive agentic processing with dynamic delegation, and (iii) synthesis of textual and visual artifacts including architecture diagrams and data flows. We also present CodeWikiBench, the first repository-level documentation benchmark with multi-level rubrics and agentic assessment. CodeWiki achieves 68.79% quality score with proprietary models and 64.80% with open-source alternatives, outperforming existing closed-source systems and demonstrating scalable, accurate documentation for real-world repositories.
☆ LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead
The rapid integration of Large Language Models (LLMs) into software engineering (SE) has revolutionized tasks like code generation, producing a massive volume of software artifacts. This surge has exposed a critical bottleneck: the lack of scalable, reliable methods to evaluate these outputs. Human evaluation is costly and time-consuming, while traditional automated metrics like BLEU fail to capture nuanced quality aspects. In response, the LLM-as-a-Judge paradigm - using LLMs for automated evaluation - has emerged. This approach leverages the advanced reasoning of LLMs, offering a path toward human-like nuance at automated scale. However, LLM-as-a-Judge research in SE is still in its early stages. This forward-looking SE 2030 paper aims to steer the community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts. We provide a literature review of existing SE studies, analyze their limitations, identify key research gaps, and outline a detailed roadmap. We envision these frameworks as reliable, robust, and scalable human surrogates capable of consistent, multi-faceted artifact evaluation by 2030. Our work aims to foster research and adoption of LLM-as-a-Judge frameworks, ultimately improving the scalability of software artifact evaluation.
☆ Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
Recent advances in code agents have enabled automated software development at the project level, supported by large language models (LLMs) and widely adopted tools. However, existing benchmarks for code agent evaluation face two major limitations: high annotation cost and expertise requirements, and rigid evaluation metrics that rely primarily on unit tests. To address these challenges, we propose an agent-driven benchmark construction pipeline that leverages human supervision to efficiently generate diverse and challenging project-level tasks. Based on this approach, we introduce PRDBench, a novel benchmark comprising 50 real-world Python projects across 20 domains, each with structured Product Requirement Document (PRD) requirements, comprehensive evaluation criteria, and reference implementations. PRDBench features rich data sources, high task complexity, and flexible metrics. We further employ an Agent-as-a-Judge paradigm to score agent outputs, enabling the evaluation of various test types beyond unit tests. Extensive experiments on PRDBench demonstrate its effectiveness in assessing the capabilities of both code agents and evaluation agents, providing a scalable and robust framework for annotation and evaluation.
☆ Developer Productivity with GenAI
Generative AI (GenAI) tools are increasingly being adopted in software development as productivity aids. However, evidence regarding where and when these tools actually enhance productivity is unclear. In this paper, we investigate how GenAI adoption affects different dimensions of developer productivity. We surveyed 415 software practitioners to capture their perceptions of productivity changes associated with AI-assisted development using the SPACE framework - Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow. Our results, disaggregated by frequency of AI usage, reveal limited overall productivity change, highlighting the productivity paradox in which developers become faster but do not necessarily create better software or feel more fulfilled.
☆ MAGNET: A Multi-Graph Attentional Network for Code Clone Detection
Code clone detection is a fundamental task in software engineering that underpins refactoring, debugging, plagiarism detection, and vulnerability analysis. Existing methods often rely on singular representations such as abstract syntax trees (ASTs), control flow graphs (CFGs), and data flow graphs (DFGs), which capture only partial aspects of code semantics. Hybrid approaches have emerged, but their fusion strategies are typically handcrafted and ineffective. In this study, we propose MAGNET, a multi-graph attentional framework that jointly leverages AST, CFG, and DFG representations to capture syntactic and semantic features of source code. MAGNET integrates residual graph neural networks with node-level self-attention to learn both local and long-range dependencies, introduces a gated cross-attention mechanism for fine-grained inter-graph interactions, and employs Set2Set pooling to fuse multi-graph embeddings into unified program-level representations. Extensive experiments on BigCloneBench and Google Code Jam demonstrate that MAGNET achieves state-of-the-art performance with an overall F1 score of 96.5\% and 99.2\% on the two datasets, respectively. Ablation studies confirm the critical contributions of multi-graph fusion and each attentional component. Our code is available at https://github.com/ZixianReid/Multigraph_match
☆ Investigating Software Aging in LLM-Generated Software Systems
Automatically generated software, especially code produced by Large Language Models (LLMs), is increasingly adopted to accelerate development and reduce manual effort. However, little is known about the long-term reliability of such systems under sustained execution. In this paper, we experimentally investigate the phenomenon of software aging in applications generated by LLM-based tools. Using the Bolt platform and standardized prompts from Baxbench, we generated four service-oriented applications and subjected them to 50-hour load tests. Resource usage, response time, and throughput were continuously monitored to detect degradation patterns. The results reveal significant evidence of software aging, including progressive memory growth, increased response time, and performance instability across all applications. Statistical analyzes confirm these trends and highlight variability in the severity of aging according to the type of application. Our findings show the need to consider aging in automatically generated software and provide a foundation for future studies on mitigation strategies and long-term reliability evaluation.
comment: Presented at the 17th International Workshop on Software Aging and Rejuvenation (WoSAR), 2025
☆ Monitoring and Observability of Machine Learning Systems: Current Practices and Gaps
Production machine learning (ML) systems fail silently -- not with crashes, but through wrong decisions. While observability is recognized as critical for ML operations, there is a lack empirical evidence of what practitioners actually capture. This study presents empirical results on ML observability in practice through seven focus group sessions in several domains. We catalog the information practitioners systematically capture across ML systems and their environment and map how they use it to validate models, detect and diagnose faults, and explain observed degradations. Finally, we identify gaps in current practice and outline implications for tooling design and research to establish ML observability practices.
☆ Lifecycle-Aware code generation: Leveraging Software Engineering Phases in LLMs
Recent progress in large language models (LLMs) has advanced automatic code generation, yet most approaches rely on direct, single-step translation from problem descriptions to code, disregarding structured software engineering practices. We introduce a lifecycle-aware framework that systematically incorporates intermediate artifacts such as requirements analysis, state machine modeling, and pseudocode into both the training and inference stages. This design aligns code generation with standard software development phases and enables more structured reasoning. Experiments show that lifecycle-level fine-tuning improves code correctness by up to 75% over the same model before fine-tuning, with performance gains compounding across intermediate stages. Multi-step inference consistently surpasses single-step generation, demonstrating the effectiveness of intermediate scaffolding. Notably, open-source LLMs, once fine-tuned under our framework, match or slightly outperform models pretrained on code. When applied to DeepSeek-Coder-1.3B, our framework yields relative CodeBLEU improvements of 34.3%, 20.0%, 11.2%, and 22.3% over ChatGPT-3.5, ChatGPT-4o-mini, DeepSeek-R1, and LLaMA-8B, respectively. Our pipeline also proves robust with up to 80\% less training data, confirming its resilience. Ablation studies further reveal that each intermediate artifact contributes distinctly to final code quality, with state machine modeling yielding the most substantial impact. Our source code and detailed experimental data are available at https://anonymous.4open.science/r/Lifecycle-Aware-3CCB.
☆ Validating Alerts in Cloud-Native Observability SP'25
Observability and alerting form the backbone of modern reliability engineering. Alerts help teams catch faults early before they turn into production outages and serve as first clues for troubleshooting. However, designing effective alerts is challenging. They need to strike a fine balance between catching issues early and minimizing false alarms. On top of this, alerts often cover uncommon faults, so the code is rarely executed and therefore rarely checked. To address these challenges, several industry practitioners advocate for testing alerting code with the same rigor as application code. Still, there's a lack of tools that support such systematic design and validation of alerts. This paper introduces a new alerting extension for the observability experimentation tool OXN. It lets engineers experiment with alerts early during development. With OXN, engineers can now tune rules at design time and routinely validate the firing behavior of their alerts, avoiding future problems at runtime.
comment: 16th Symposium on Software Performance (SSP'25)
☆ A Pragmatic Way to Measure Chain-of-Thought Monitorability
While Chain-of-Thought (CoT) monitoring offers a unique opportunity for AI safety, this opportunity could be lost through shifts in training practices or model architecture. To help preserve monitorability, we propose a pragmatic way to measure two components of it: legibility (whether the reasoning can be followed by a human) and coverage (whether the CoT contains all the reasoning needed for a human to also produce the final output). We implement these metrics with an autorater prompt that enables any capable LLM to compute the legibility and coverage of existing CoTs. After sanity-checking our prompted autorater with synthetic CoT degradations, we apply it to several frontier models on challenging benchmarks, finding that they exhibit high monitorability. We present these metrics, including our complete autorater prompt, as a tool for developers to track how design decisions impact monitorability. While the exact prompt we share is still a preliminary version under ongoing development, we are sharing it now in the hopes that others in the community will find it useful. Our method helps measure the default monitorability of CoT - it should be seen as a complement, not a replacement, for the adversarial stress-testing needed to test robustness against deliberately evasive models.
comment: The first two authors contributed equally
☆ Automating Benchmark Design
The rapid progress and widespread deployment of LLMs and LLM-powered agents has outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but these quickly become saturated. In contrast, dynamic benchmarks evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment design principles to automate the process of dynamic benchmark design. BeTaL works by parameterizing key design choices in base benchmark templates and uses LLMs to reason through the resulting parameter space to obtain target properties (such as difficulty and realism) in a cost-efficient manner. We validate this approach on its ability to create benchmarks with desired difficulty levels. Using BeTaL, we create two new benchmarks and extend a popular agentic benchmark $\tau$-bench. Extensive evaluation on these three tasks and multiple target difficulty levels shows that BeTaL produces benchmarks much closer to the desired difficulty, with average deviations ranging from 5.3% to 13.2% -- a 2-4x improvement over the baselines.
☆ Towards Human-AI Synergy in Requirements Engineering: A Framework and Preliminary Study
The future of Requirements Engineering (RE) is increasingly driven by artificial intelligence (AI), reshaping how we elicit, analyze, and validate requirements. Traditional RE is based on labor-intensive manual processes prone to errors and complexity. AI-powered approaches, specifically large language models (LLMs), natural language processing (NLP), and generative AI, offer transformative solutions and reduce inefficiencies. However, the use of AI in RE also brings challenges like algorithmic bias, lack of explainability, and ethical concerns related to automation. To address these issues, this study introduces the Human-AI RE Synergy Model (HARE-SM), a conceptual framework that integrates AI-driven analysis with human oversight to improve requirements elicitation, analysis, and validation. The model emphasizes ethical AI use through transparency, explainability, and bias mitigation. We outline a multi-phase research methodology focused on preparing RE datasets, fine-tuning AI models, and designing collaborative human-AI workflows. This preliminary study presents the conceptual framework and early-stage prototype implementation, establishing a research agenda and practical design direction for applying intelligent data science techniques to semi-structured and unstructured RE data in collaborative environments.
comment: Accepted at the 2025 Sixth International Conference on Intelligent Data Science Technologies and Applications (IDSTA 2025),8 pages, 4 figures. Published in IEEE
☆ VeriStruct: AI-assisted Automated Verification of Data-Structure Modules in Verus
We introduce VeriStruct, a novel framework that extends AI-assisted automated verification from single functions to more complex data structure modules in Verus. VeriStruct employs a planner module to orchestrate the systematic generation of abstractions, type invariants, specifications, and proof code. To address the challenge that LLMs often misunderstand Verus' annotation syntax and verification-specific semantics, VeriStruct embeds syntax guidance within prompts and includes a repair stage to automatically correct annotation errors. In an evaluation on eleven Rust data structure modules, VeriStruct succeeds on ten of the eleven, successfully verifying 128 out of 129 functions (99.2%) in total. These results represent an important step toward the goal of automatic AI-assisted formal verification.
☆ Trust Dynamics in Strategic Coopetition: Computational Foundations for Requirements Engineering in Multi-Agent Systems
Requirements engineering increasingly occurs in multi-stakeholder environments where organizations simultaneously cooperate and compete, creating coopetitive relationships in which trust evolves dynamically based on observed behavior over repeated interactions. While conceptual modeling languages like i* represent trust relationships qualitatively, they lack computational mechanisms for analyzing how trust changes with behavioral evidence. Conversely, computational trust models from multi-agent systems provide algorithmic updating but lack grounding in requirements engineering contexts and conceptual models. This technical report bridges this gap by developing a computational trust model that extends game-theoretic foundations for strategic coopetition with dynamic trust evolution. We introduce trust as a two-layer system with immediate trust responding to current behavior and reputation tracking violation history. Trust evolves through asymmetric updating where cooperation builds trust gradually while violations erode it sharply, creating hysteresis effects and trust ceilings that constrain relationship recovery. We develop a structured translation framework enabling requirements engineers to instantiate computational trust models from i* dependency networks and organizational contexts. Comprehensive experimental validation across 78,125 parameter configurations establishes robust emergence of negativity bias, hysteresis effects, and cumulative damage amplification. Empirical validation using the Renault-Nissan Alliance case study (1999-2025) achieves 49 out of 60 validation points (81.7%), successfully reproducing documented trust evolution across five distinct relationship phases including crisis and recovery periods. This technical report builds upon its foundational companion work in arXiv:2510.18802.
comment: 62 pages, 20 figures, This technical report is the second in a research program and should be read in conjunction with its foundational companion work arXiv:2510.18802. It builds on the frameworks established in that prior work and also adapts and extends material on trustworthiness first presented in the doctoral dissertation 'Modeling Strategic Coopetition' (Pant, 2021, University of Toronto)
☆ A Roadmap for Tamed Interactions with Large Language Models
We are witnessing a bloom of AI-powered software driven by Large Language Models (LLMs). Although the applications of these LLMs are impressive and seemingly countless, their unreliability hinders adoption. In fact, the tendency of LLMs to produce faulty or hallucinated content makes them unsuitable for automating workflows and pipelines. In this regard, Software Engineering (SE) provides valuable support, offering a wide range of formal tools to specify, verify, and validate software behaviour. Such SE tools can be applied to define constraints over LLM outputs and, consequently, offer stronger guarantees on the generated content. In this paper, we argue that the development of a Domain Specific Language (DSL) for scripting interactions with LLMs using an LLM Scripting Language (LSL) may be key to improve AI-based applications. Currently, LLMs and LLM-based software still lack reliability, robustness, and trustworthiness, and the tools or frameworks to cope with these issues suffer from fragmentation. In this paper, we present our vision of LSL. With LSL, we aim to address the limitations above by exploring ways to control LLM outputs, enforce structure in interactions, and integrate these aspects with verification, validation, and explainability. Our goal is to make LLM interaction programmable and decoupled from training or implementation.
♻ ☆ Galapagos: Automated N-Version Programming with LLMs
N-Version Programming is a well-known methodology for developing fault-tolerant systems. It achieves fault detection and correction at runtime by adding diverse redundancy into programs, minimizing fault mode overlap between redundant program variants. In this work, we propose the automated generation of program variants using large language models. We design, develop and evaluate Gal\'apagos: a tool for generating program variants using LLMs, validating their correctness and equivalence, and using them to assemble N-Version binaries. We evaluate Gal\'apagos by creating N-Version components of real-world C code. Our original results show that Gal\'apagos can produce program variants that are proven to be functionally equivalent, even when the variants are written in a different programming language. Our systematic diversity measurement indicates that functionally equivalent variants produced by Gal\'apagos, are statically different after compilation, and present diverging internal behavior at runtime. We demonstrate that the variants produced by Gal\'apagos can protect C code against real miscompilation bugs which affect the Clang compiler. Overall, our paper shows that producing N-Version software can be drastically automated by advanced usage of practical formal verification and generative language models.
♻ ☆ Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality
Benchmarks are essential for unified evaluation and reproducibility. The rapid rise of Artificial Intelligence for Software Engineering (AI4SE) has produced numerous benchmarks for tasks such as code generation and bug repair. However, this proliferation has led to major challenges: (1) fragmented knowledge across tasks, (2) difficulty in selecting contextually relevant benchmarks, (3) lack of standardization in benchmark creation, and (4) flaws that limit utility. Addressing these requires a dual approach: systematically mapping existing benchmarks for informed selection and defining unified guidelines for robust, adaptable benchmark development. We conduct a review of 247 studies, identifying 273 AI4SE benchmarks since 2014. We categorize them, analyze limitations, and expose gaps in current practices. Building on these insights, we introduce BenchScout, an extensible semantic search tool for locating suitable benchmarks. BenchScout employs automated clustering with contextual embeddings of benchmark-related studies, followed by dimensionality reduction. In a user study with 22 participants, BenchScout achieved usability, effectiveness, and intuitiveness scores of 4.5, 4.0, and 4.1 out of 5. To improve benchmarking standards, we propose BenchFrame, a unified framework for enhancing benchmark quality. Applying BenchFrame to HumanEval yielded HumanEvalNext, featuring corrected errors, improved language conversion, higher test coverage, and greater difficulty. Evaluating 10 state-of-the-art code models on HumanEval, HumanEvalPlus, and HumanEvalNext revealed average pass-at-1 drops of 31.22% and 19.94%, respectively, underscoring the need for continuous benchmark refinement. We further examine BenchFrame's scalability through an agentic pipeline and confirm its generalizability on the MBPP dataset. All review data, user study materials, and enhanced benchmarks are publicly released.
♻ ☆ The Software Diversity Card: A Framework for Reporting Diversity in Software Projects
Context: Interest in diversity in software development has significantly increased in recent years. Reporting on diversity in software projects can enhance user trust and assist regulators in evaluating adoption. Recent AI directives include clauses that mandate diversity information during development, highlighting the growing interest of public regulators. However, current documentation often neglects diversity in favor of technical features, partly due to a lack of tools for its description and annotation. Objectives: This work introduces the Software Diversity Card, a structured approach for documenting and sharing diversity-related aspects within software projects. It aims to profile the various teams involved in software development and governance, including user groups in testing and software adaptations for diverse social groups. Methods: We conducted a literature review on diversity and inclusion in software development and analyzed 1,000 top-starred Open Source Software (OSS) repositories on GitHub to identify diversity-related information. Moreover, we present a diversity modeling language, a toolkit for generating cards using it, and a study of its application in two real-world software projects. Results: Despite the growing awareness of diversity in the research community, our analysis found a notable lack of diversity reporting in OSS projects. Applying the card to real-world examples highlighted challenges such as balancing anonymity and transparency, managing sensitive data, and ensuring authenticity. Conclusion: Our proposal can enhance diversity practices in software development, support public administrations in software assessment, and help businesses promote diversity as a key asset.
comment: To appear
♻ ☆ Selective Parallel Loading of Large-Scale Compressed Graphs with ParaGrapher
Comprehensive evaluation is one of the basis of experimental science. In High-Performance Graph Processing, a thorough evaluation of contributions becomes more achievable by supporting common input formats over different frameworks. However, each framework creates its specific format, which may not support reading large-scale real-world graph datasets. This shows a demand for high-performance libraries capable of loading graphs to (i) accelerate designing new graph algorithms, (ii) to evaluate the contributions on a wide range of graph algorithms, and (iii) to facilitate easy and fast comparison over different graph frameworks. To that end, we present ParaGrapher, a high-performance API and library for loading large-scale and compressed graphs. ParaGrapher supports different types of requests for accessing graphs in shared- and distributed-memory and out-of-core graph processing. We explain the design of ParaGrapher and present a performance model of graph decompression, which is used for evaluation of ParaGrapher over three storage types. Our evaluation shows that by decompressing compressed graphs in WebGraph format, ParaGrapher delivers up to 3.2 times speedup in loading and up to 5.2 times speedup in end-to-end execution (i.e., through interleaved loading and execution) in comparison to the binary and textual formats. ParaGrapher is available online on https://blogs.qub.ac.uk/DIPSA/ParaGrapher/.
♻ ☆ MATCH: Task-Driven Code Evaluation through Contrastive Learning
AI-based code generation is increasingly prevalent, with GitHub Copilot estimated to generate 46% of the code on GitHub. Accurately evaluating how well generated code aligns with developer intent remains a critical challenge. Traditional evaluation methods, such as unit tests, are often unscalable and costly. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code functionality, and metrics like CodeBERTScore require reference code, which is not always available. To address the gap in reference-free evaluation, with few alternatives such as ICE-Score, this paper introduces MATCH, a novel reference-free metric. MATCH uses Contrastive Learning to generate meaningful embeddings for code and natural language task descriptions, enabling similarity scoring that reflects how well generated code implements the task. We show that MATCH achieves stronger correlations with functional correctness and human preference than existing metrics across multiple programming languages.
♻ ☆ CITADEL: Context Similarity Based Deep Learning Framework Bug Finding
With the application of deep learning technology, tools of DL framework testing are in high demand. Existing DL framework testing tools have limited coverage of bug types. For example, they lack the capability of effectively finding performance bugs, which are critical for DL models regarding performance, economics, and the environment. Moreover, existing tools are inefficient, generating hundreds of test cases with few trigger bugs. In this paper, we propose Citadel, a method that accelerates bug finding in terms of efficiency and effectiveness. We observe that many DL framework bugs are similar due to the similarity of operators and algorithms belonging to the same family. Orthogonal to existing bug-finding tools, Citadel aims to find new bugs that are similar to reported ones that have known test oracles. Citadel defines context similarity to measure the similarity of DL framework API pairs and automatically generates test cases with oracles for APIs that are similar to the problematic APIs in existing bug reports. Citadel effectively detects 58 and 66 API bugs on PyTorch and TensorFlow (excluding those rejected by developers or duplicates of prior reports), many of which, e.g., 13 performance bugs, cannot be detected by existing tools. Moreover, 35.40% of test cases generated by Citadel can trigger bugs significantly transcending the state-of-the-art method (3.90%).
comment: 28 pages, 9 figures
♻ ☆ 10 quick tips for making your software outlive your job
Loss of key personnel has always been a risk for research software projects. Key members of the team may have to step away due to illness or burnout, to care for a family member, from a loss of financial support, or because their career is going in a new direction. Today, though, political and financial changes are putting large numbers of researchers out of work simultaneously, potentially leaving large amounts of research software abandoned. This article presents ten tips to help researchers ensure that the software they have built will continue to be usable after they have left their present job -- whether in the course of voluntary career moves or researcher mobility, but particularly in cases of involuntary departure due to political or institutional changes.
comment: 14 pages; 1 figure. New version uploaded to remove inaccurate, most likely hallucinated (!) refs in Rule 7, which we did not catch or suspect on previous submission
♻ ☆ A Systematic Literature Review of the Use of GenAI Assistants for Code Comprehension: Implications for Computing Education Research and Practice
The ability to comprehend code has long been recognized as an essential skill in software engineering. As programmers lean more heavily on generative artificial intelligence (GenAI) assistants to develop code solutions, it is becoming increasingly important for programmers to comprehend GenAI solutions so that they can verify their appropriateness and properly integrate them into existing code. At the same time, GenAI tools are increasingly being enlisted to provide programmers with tailored explanations of code written both by GenAI and humans. Thus, in computing education, GenAI presents new challenges and opportunities for learners who are trying to comprehend computer programs. To provide computing educators with evidence-based guidance on the use of GenAI to facilitate code comprehension and to identify directions for future research, we present a systematic literature review (SLR) of state-of-the-art approaches and tools that leverage GenAI to enhance code comprehension. Our SLR focuses on 31 studies published between 2022 and 2024. Despite their potential, GenAI assistants often yield inaccurate or unclear explanations, and novice programmers frequently struggle to craft effective prompts, thereby impeding their ability to leverage GenAI to aid code comprehension. Our review classifies GenAI-based approaches and tools, identifies methods used to study them, and summarizes the empirical evaluations of their effectiveness. We consider the implications of our findings for computing education research and practice, and identify directions for future research.
♻ ☆ LLM-Guided Scenario-based GUI Testing
The assurance of mobile app GUIs has become increasingly important, as the GUI serves as the primary medium of interaction between users and apps. Although numerous automated GUI testing approaches have been developed with diverse strategies, a substantial gap remains between these approaches and the underlying app business logic. Most existing approaches focus on general exploration rather than the completion of specific testing scenarios, often missing critical functionalities. Inspired by manual testing, which treats business logic-driven scenarios as the fundamental unit of testing, this paper introduces an approach that leverages large language models to comprehend GUI semantics and contextual relevance to given scenarios. Building on this capability, we propose ScenGen, an LLM-guided scenario-based GUI testing framework employing multi-agent collaboration to simulate and automate manual testing phases. Specifically, ScenGen integrates five agents: the Observer, Decider, Executor, Supervisor, and Recorder. The Observer perceives the app GUI state by extracting and structuring GUI widgets and layouts, interpreting semantic information. This is passed to the Decider, which makes scenario-driven decisions with LLM guidance to identify target widgets and determine actions toward fulfilling specific goals. The Executor performs these operations, while the Supervisor verifies alignment with intended scenario completion, ensuring traceability and consistency. Finally, the Recorder logs GUI operations into context memory as a knowledge base for subsequent decision-making and monitors runtime bugs. Comprehensive evaluations demonstrate that ScenGen effectively generates scenario-based GUI tests guided by LLM collaboration, achieving higher relevance to business logic and improving the completeness of automated GUI testing.
♻ ☆ BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills
High quality bugs are key to training the next generation of language model based software engineering (SWE) agents. We introduce a novel method for synthetic generation of difficult and diverse bugs. Our method instructs SWE Agents to introduce a feature into the codebase whereby they may unintentionally break tests, resulting in bugs. Prior approaches often induce an out-of-distribution effect by generating bugs intentionally (e.g. by introducing local perturbation to existing code), which does not reflect realistic development processes. We perform qualitative analysis to demonstrate that our approach for generating bugs more closely reflects the patterns found in human-authored edits. Through extensive experiments, we demonstrate that our bugs provide more efficient training data for supervised fine-tuning, outperforming other bug datasets by 2% with half the training data (1.2k vs. 3k bugs). We train on our newly generated bugs in addition to existing bug datasets to get FrogBoss a state-of-the-art 32B parameter model on SWE-bench Verified with a pass@1 of 54.6% and FrogMini a state-of-the-art 14B model on SWE-bench Verified with a pass@1 of 45.3% on SWE-bench Verified averaged over three seeds.
♻ ☆ Security Implications of User Non-compliance Behavior to Software Updates: A Risk Assessment Study
Software updates are essential to enhance security, fix bugs, and add better features to the existing software. While some users accept software updates, non-compliance remains a widespread issue. While some users accept software updates, non-compliance remains a widespread issue. End users' systems remain vulnerable to security threats when security updates are not installed or are installed with a delay. Despite research efforts, users' noncompliance behavior with software updates is still prevalent. In this study, we explored how psychological factors influence users' perception and behavior toward software updates. In addition, we investigated how information about potential vulnerabilities and risk scores influences their behavior. Next, we proposed a model that utilizes attributes from the National Vulnerability Database (NVD) to effectively assess the overall risk score associated with delaying software updates. Next, we conducted a user study with Windows OS users, showing that providing a risk score for not updating their systems and information about vulnerabilities significantly increased users' willingness to update their systems. Additionally, we examined the influence of demographic factors, gender, on users' decision-making regarding software updates. Our results show no statistically significant difference in male and female users' responses in terms of concerns about securing their systems. The implications of this study are relevant for software developers and manufacturers as they can use this information to design more effective software update notification messages. The communication of the potential risks and their corresponding risk scores may motivate users to take action and update their systems in a timely manner, which can ultimately improve the overall security of the system.
comment: 15 page, 10 tables, 6 figures
Systems and Control 30
☆ Feature Matching-Based Gait Phase Prediction for Obstacle Crossing Control of Powered Transfemoral Prosthesis
For amputees with powered transfemoral prosthetics, navigating obstacles or complex terrain remains challenging. This study addresses this issue by using an inertial sensor on the sound ankle to guide obstacle-crossing movements. A genetic algorithm computes the optimal neural network structure to predict the required angles of the thigh and knee joints. A gait progression prediction algorithm determines the actuation angle index for the prosthetic knee motor, ultimately defining the necessary thigh and knee angles and gait progression. Results show that when the standard deviation of Gaussian noise added to the thigh angle data is less than 1, the method can effectively eliminate noise interference, achieving 100\% accuracy in gait phase estimation under 150 Hz, with thigh angle prediction error being 8.71\% and knee angle prediction error being 6.78\%. These findings demonstrate the method's ability to accurately predict gait progression and joint angles, offering significant practical value for obstacle negotiation in powered transfemoral prosthetics.
comment: 6 pages, conference
☆ Learning to Drive Safely with Hybrid Options
Out of the many deep reinforcement learning approaches for autonomous driving, only few make use of the options (or skills) framework. That is surprising, as this framework is naturally suited for hierarchical control applications in general, and autonomous driving tasks in specific. Therefore, in this work the options framework is applied and tailored to autonomous driving tasks on highways. More specifically, we define dedicated options for longitudinal and lateral manoeuvres with embedded safety and comfort constraints. This way, prior domain knowledge can be incorporated into the learning process and the learned driving behaviour can be constrained more easily. We propose several setups for hierarchical control with options and derive practical algorithms following state-of-the-art reinforcement learning techniques. By separately selecting actions for longitudinal and lateral control, the introduced policies over combined and hybrid options obtain the same expressiveness and flexibility that human drivers have, while being easier to interpret than classical policies over continuous actions. Of all the investigated approaches, these flexible policies over hybrid options perform the best under varying traffic conditions, outperforming the baseline policies over actions.
☆ Efficient Network Reconfiguration by Randomized Switching
We present an algorithm that efficiently computes nearly-optimal solutions to a class of combinatorial reconfiguration problems on weighted, undirected graphs. Inspired by societally relevant applications in networked infrastructure systems, these problems consist of simultaneously finding an unreweighted sparsified graph and nodal potentials that satisfy fixed demands, where the objective is to minimize some congestion criterion, e.g., a Laplacian quadratic form. These are mixed-integer nonlinear programming problems that are NP-hard in general. To circumvent these challenges, instead of solving for a single best configuration, the proposed randomized switching algorithm seeks to design a distribution of configurations that, when sampled, ensures that congestion concentrates around its optimum. We show that the proposed congestion metric is a generalized self-concordant function in the space of switching probabilities, which enables the use of efficient and simple conditional gradient methods. We implement our algorithm and show that it outperforms a state-of-the-art commercial mixed-integer second-order cone programming (MISOCP) solver by orders of magnitude over a large range of problem sizes.
☆ Flatness-based trajectory planning for 3D overhead cranes with friction compensation and collision avoidance
This paper presents an optimal trajectory generation method for 3D overhead cranes by leveraging differential flatness. This framework enables the direct inclusion of complex physical and dynamic constraints, such as nonlinear friction and collision avoidance for both payload and rope. Our approach allows for aggressive movements by constraining payload swing only at the final point. A comparative simulation study validates our approach, demonstrating that neglecting dry friction leads to actuator saturation and collisions. The results show that friction modeling is a fundamental requirement for fast and safe crane trajectories.
comment: 8 pages, 11 figures
☆ Analyzing Parametric Oscillator Ising Machines through the Kuramoto Lens
Networks of coupled nonlinear oscillators are emerging as powerful physical platforms for implementing Ising machines. Yet the relationship between parametric-oscillator implementations and traditional oscillator-based Ising machines remains underexplored. In this work, we develop a Kuramoto-style, canonical phase description of parametric oscillator Ising machines by starting from the Stuart-Landau oscillator model -- the canonical normal form near a Hopf bifurcation, and a natural reduced description for many parametric oscillator implementations such as the degenerate optical parametric oscillator (DOPO) among others. The resulting phase dynamics combine the usual phase-difference coupling observed in the standard Kuramoto model along with an intrinsic phase sum term that is generated when conjugate coupling is considered. Moreover, our formulation helps explain why explicit second-harmonic driving is unnecessary in parametric oscillators and also reveals how quasi-steady amplitude heterogeneity scales the original strength of the spin interaction with potentially adverse impacts on the solution quality. Our work helps develop a unifying view of the oscillator-based approach to designing Ising machines.
☆ Contributions to Semialgebraic-Set-Based Stability Verification of Dynamical Systems with Neural-Network-Based Controllers
Neural-network-based controllers (NNCs) can represent complex, highly nonlinear control laws, but verifying the closed-loop stability of dynamical systems using them remains challenging. This work presents contributions to a state-of-the-art stability verification procedure for NNC-controlled systems which relies on semialgebraic-set-based input-output modeling to pose the search for a Lyapunov function as an optimization problem. Specifically, this procedure's conservatism when analyzing NNCs using transcendental activation functions and the restriction to feedforward NNCs are addressed by a) introducing novel semialgebraic activation functions that preserve key properties of common transcendental activations and b) proving compatibility of NNCs from the broader class of recurrent equilibrium networks (RENs) with this procedure. Furthermore, the indirect optimization of a local region of attraction (RoA) estimate using a restricted set of candidate Lyapunov functions is greatly improved via c) the introduction of a richer parameterization of candidate Lyapunov functions than previously reported and d) the formulation of novel semidefinite programs (SDPs) that directly optimize the resulting RoA estimate. The value of these contributions is highlighted in two numerical examples.
comment: Submitted to the IEEE for possible publication, 16 pages, 6 figures
☆ Development of a Digital Twin for an Electric Vehicle Emulator Modeling, Control, and Experimental Validation
This paper presents the development and validation of a digital twin for a scaled-down electric vehicle (EV) emulator, designed to replicate longitudinal vehicle dynamics under diverse operating conditions. The emulator integrates a separately excited DC motor (SEDCM), a four-quadrant DC-DC converter, a battery emulator, and a mechanical load emulator. The system models tractive effort, aerodynamic drag, and gradient resistance using Newton's second law. In contrast to conventional graphical modeling tools (e.g., block diagrams and bond graphs), the adopted Energetic Macroscopic Representation (EMR) framework offers clear advantages by explicitly representing energy interactions and facilitating the systematic derivation of control structures. A control strategy developed within this framework governs energy flow across the powertrain, enabling accurate speed control via armature voltage regulation. Experimental tests conducted on a Lucas-Nulle test bench show strong correlation with simulation results. The study also introduces a methodology to compute the maximum admissible vehicle mass - determined to be 13.5 kg for a 180 W motor operating at 1900 rpm - based on acceleration and slope constraints. Furthermore, a switching algorithm for the bidirectional converter ensures reliable four quadrant operation. Overall, the proposed framework provides a scalable and effective approach for EV emulation, control design, and energy management validation.
comment: 6 pages, Accepted at CODIT 2025 (Conference on Decision and Control in Intelligent Technology)
☆ Mechanism-Guided Residual Lifting and Control Consistent Modeling for Pneumatic Drying Processes
Pneumatic drying processes in industries such as agriculture, chemicals,and pharmaceuticals are notoriously difficult to model and control due to multi-source disturbances,coupled stage dynamics, and significant measurement delays. Traditional modeling paradigms often fail to simultaneously deliver accuracy, interpretability, and closed-loop applicability. To address this challenge, this paper introduces a unified hybrid modeling framework, termed Physics-Guided Residual Lifting with Control-Consistent Correction,which integrates a transient mechanistic model with a stability-constrained data-driven component. The framework covers the complete process chain of drying, transport, and winnowing. On the mechanistic level, the model unifies mass transfer dynamics using the partial pressure difference of water vapor, incorporates water activity clamping and latent heat corrections for bound water, and ensures energy closure with moisture-dependent specific heat. On the data-driven level,we propose an orthogonal residual learning scheme. It leverages intermediate states from the mechanistic model as proxy variables to construct a physics-inspired dictionary, preventing parameter compensation and overfitting during ridge regression. Furthermore, to ensure suitability for predictive control, a Control-Consistent Extended Dynamic Mode Decomposition with stability constraints is employed to learn the residual dynamics, for which we provide boundedness proofs and stability guarantees. The framework was validated on 10 industrial batches, comprising 63,000 samples. On unseen test data, the hybrid model achieved a Mean Absolute Error of 0.016% for outlet moisture and 0.015 {\deg}C for outlet temperature, with values improving to 0.986 and 0.995, respectively. The resulting prediction residuals exhibit white-noise characteristics, with significantly reduced spectral energy at low frequencies.
comment: 6figs,4tables
☆ An N-of-1 Artificial Intelligence Ecosystem for Precision Medicine
Artificial intelligence in medicine is built to serve the average patient. By minimizing error across large datasets, most systems deliver strong aggregate accuracy yet falter at the margins: patients with rare variants, multimorbidity, or underrepresented demographics. This average patient fallacy erodes both equity and trust. We propose a different design: a multi-agent ecosystem for N-of-1 decision support. In this environment, agents clustered by organ systems, patient populations, and analytic modalities draw on a shared library of models and evidence synthesis tools. Their results converge in a coordination layer that weighs reliability, uncertainty, and data density before presenting the clinician with a decision-support packet: risk estimates bounded by confidence ranges, outlier flags, and linked evidence. Validation shifts from population averages to individual reliability, measured by error in low-density regions, calibration in the small, and risk--coverage trade-offs. Anticipated challenges include computational demands, automation bias, and regulatory fit, addressed through caching strategies, consensus checks, and adaptive trial frameworks. By moving from monolithic models to orchestrated intelligence, this approach seeks to align medical AI with the first principle of medicine: care that is transparent, equitable, and centered on the individual.
comment: This study has been supported by grants from the National Institutes of Health: The National Institute on Aging R01AG074372 and The National Institute of Allergy and Infectious Diseases R01AI165535
Survey and Tutorial of Reinforcement Learning Methods in Process Systems Engineering
Sequential decision making under uncertainty is central to many Process Systems Engineering (PSE) challenges, where traditional methods often face limitations related to controlling and optimizing complex and stochastic systems. Reinforcement Learning (RL) offers a data-driven approach to derive control policies for such challenges. This paper presents a survey and tutorial on RL methods, tailored for the PSE community. We deliver a tutorial on RL, covering fundamental concepts and key algorithmic families including value-based, policy-based and actor-critic methods. Subsequently, we survey existing applications of these RL techniques across various PSE domains, such as in fed-batch and continuous process control, process optimization, and supply chains. We conclude with PSE focused discussion of specialized techniques and emerging directions. By synthesizing the current state of RL algorithm development and implications for PSE this work identifies successes, challenges, trends, and outlines avenues for future research at the interface of these fields.
☆ A comparison between joint and dual UKF implementations for state estimation and leak localization in water distribution networks
The sustainability of modern cities highly depends on efficient water distribution management, including effective pressure control and leak detection and localization. Accurate information about the network hydraulic state is therefore essential. This article presents a comparison between two data-driven state estimation methods based on the Unscented Kalman Filter (UKF), fusing pressure, demand and flow data for head and flow estimation. One approach uses a joint state vector with a single estimator, while the other uses a dual-estimator scheme. We analyse their main characteristics, discussing differences, advantages and limitations, and compare them theoretically in terms of accuracy and complexity. Finally, we show several estimation results for the L-TOWN benchmark, allowing to discuss their properties in a real implementation.
comment: This work has been submitted to ECC2026 for review. It has 7 pages and 2 figures
☆ Sample-based Moving Horizon Estimation
In this paper, we propose a sample-based moving horizon estimation (MHE) scheme for general nonlinear systems to estimate the current system state using irregularly and/or infrequently available measurements. The cost function of the MHE optimization problem is suitably designed to accommodate these irregular output sequences. We also establish that, under a suitable sample-based detectability condition known as sample-based incremental input/output-to-state stability (i-IOSS), the proposed sample-based MHE achieves robust global exponential stability (RGES). Additionally, for the case of linear systems, we draw connections between sample-based observability and sample-based i-IOSS. This demonstrates that previously established conditions for linear systems to be sample-based observable can be utilized to verify or design sampling strategies that satisfy the conditions to guarantee RGES of the sample-based MHE. Finally, the effectiveness of the proposed sample-based MHE is illustrated through a simulation example.
☆ Improved Accuracy of Robot Localization Using 3-D LiDAR in a Hippocampus-Inspired Model
Boundary Vector Cells (BVCs) are a class of neurons in the brains of vertebrates that encode environmental boundaries at specific distances and allocentric directions, playing a central role in forming place fields in the hippocampus. Most computational BVC models are restricted to two-dimensional (2D) environments, making them prone to spatial ambiguities in the presence of horizontal symmetries in the environment. To address this limitation, we incorporate vertical angular sensitivity into the BVC framework, thereby enabling robust boundary detection in three dimensions, and leading to significantly more accurate spatial localization in a biologically-inspired robot model. The proposed model processes LiDAR data to capture vertical contours, thereby disambiguating locations that would be indistinguishable under a purely 2D representation. Experimental results show that in environments with minimal vertical variation, the proposed 3D model matches the performance of a 2D baseline; yet, as 3D complexity increases, it yields substantially more distinct place fields and markedly reduces spatial aliasing. These findings show that adding a vertical dimension to BVC-based localization can significantly enhance navigation and mapping in real-world 3D spaces while retaining performance parity in simpler, near-planar scenarios.
comment: 8 pages, 9 figures, Presented at the 2025 International Joint Conference on Neural Networks, Rome, July 2025
☆ Defect Mitigation for Robot Arm-based Additive Manufacturing Utilizing Intelligent Control and IOT
This paper presents an integrated robotic fused deposition modeling additive manufacturing system featuring closed-loop thermal control and intelligent in-situ defect correction using a 6-degree of freedom robotic arm and an Oak-D camera. The robot arm end effector was modified to mount an E3D hotend thermally regulated by an IoT microcontroller, enabling precise temperature control through real-time feedback. Filament extrusion system was synchronized with robotic motion, coordinated via ROS2, ensuring consistent deposition along complex trajectories. A vision system based on OpenCV detects layer-wise defects position, commanding autonomous re-extrusion at identified sites. Experimental validation demonstrated successful defect mitigation in printing operations. The integrated system effectively addresses challenges real-time quality assurance. Inverse kinematics were used for motion planning, while homography transformations corrected camera perspectives for accurate defect localization. The intelligent system successfully mitigated surface anomalies without interrupting the print process. By combining real-time thermal regulation, motion control, and intelligent defect detection & correction, this architecture establishes a scalable and adaptive robotic additive manufacturing framework suitable for aerospace, biomedical, and industrial applications.
comment: This Paper Has Accepted at ASME 2025 International Mechanical Engineering Congress and Exposition (IMECE 2025)
☆ A Hamilton-Jacobi Reachability Framework with Soft Constraints for Safety-Critical Systems
Traditional reachability methods provide formal guarantees of safety under bounded disturbances. However, they strictly enforce state constraints as inviolable, which can result in overly conservative or infeasible solutions in complex operational scenarios. Many constraints encountered in practice, such as bounds on battery state of charge in electric vehicles, recommended speed envelopes, and comfort constraints in passenger-carrying vehicles, are inherently soft. Soft constraints allow temporary violations within predefined safety margins to accommodate uncertainty and competing operational demands, albeit at a cost such as increased wear or higher operational expenses. This paper introduces a novel soft-constrained reachability framework that extends Hamilton-Jacobi reachability analysis for the formal verification of safety-critical systems subject to both hard and soft constraints. Specifically, the framework characterizes a subset of the state space, referred to as the soft-constrained reach-avoid set, from which the system is guaranteed to reach a desired set safely, under worst-case disturbances, while ensuring that cumulative soft-constraint violations remain within a user-specified budget. The framework comprises two principal components: (i) an augmented-state model with an auxiliary budget state that tracks soft-constraint violations, and (ii) a regularization-based approximation of the discontinuous Hamilton-Jacobi value function associated with the reach-avoid differential game studied herein. The effectiveness of the proposed framework is demonstrated through numerical examples involving the landing of a simple point-mass model and a fixed-wing aircraft executing an emergency descent, both under wind disturbances. The simulation results validate the framework's ability to simultaneously manage both hard and soft constraints in safety-critical settings
☆ Delay Tolerant Control for Autonomous Driving Using CDOB
With the rapid growth of autonomous vehicle technologies, effective path-tracking control has become a critical component in ensuring safety and efficiency in complex traffic scenarios. When a high level decision making agent generates a collision free path, a robust low level controller is required to precisely follow this trajectory. However, connected autonomous vehicles (CAV) are inherently affected by communication delays and computation delays, which significantly degrade the performance of conventional controllers such as PID or other more advanced controllers like disturbance observers (DOB). While DOB-based designs have shown effectiveness in rejecting disturbances under nominal conditions, their performance deteriorates considerably in the presence of unknown time delays. To address this challenge, this paper proposes a delay-tolerant communication disturbance observer (CDOB) framework for path-tracking control in delayed systems. The proposed CDOB compensates for the adverse effects of time delays, maintaining accurate trajectory tracking even under uncertain and varying delay conditions. It is shown through a simulation study that the proposed control architecture maintains close alignment with the reference trajectory across various scenarios, including single lane change, double-= lane change, and Elastic Band generated collision avoidance paths under various time delays. Simulation results further demonstrate that the proposed method outperforms conventional approaches in both tracking accuracy and delay robustness, making it well suited for autonomous driving applications.
☆ Decentralized Merging Control of Connected and Automated Vehicles to Enhance Safety and Energy Efficiency using Control Barrier Functions
This paper presents a decentralized Control Barrier Function (CBF) based approach for highway merging of Connected and Automated Vehicles (CAVs). In this control algorithm, each "host" vehicle negotiates with other agents in a control zone of the highway network, and enacts its own action, to perform safe and energy-efficient merge maneuvers. It uses predictor-corrector loops within the robust CBF setting for negotiation and to reconcile disagreements that may arise. There is no explicit order of vehicles and no priority. A notable feature is absence of gridlocks due to instability of the inter-agent system. Results from Monte Carlo simulations show significant improvement in the system-wide energy efficiency and traffic flow compared to a first-in-first-out approach, as well as enhanced robustness of the proposed decentralized controller compared to its centralized counterpart.
comment: This work has been submitted to a conference for possible publication and is under review. Paper summary: 8 pages, 5 figures, 2 tables
☆ Deep Reinforcement Learning Approach to QoSAware Load Balancing in 5G Cellular Networks under User Mobility and Observation Uncertainty
Efficient mobility management and load balancing are critical to sustaining Quality of Service (QoS) in dense, highly dynamic 5G radio access networks. We present a deep reinforcement learning framework based on Proximal Policy Optimization (PPO) for autonomous, QoS-aware load balancing implemented end-to-end in a lightweight, pure-Python simulation environment. The control problem is formulated as a Markov Decision Process in which the agent periodically adjusts Cell Individual Offset (CIO) values to steer user-cell associations. A multi-objective reward captures key performance indicators (aggregate throughput, latency, jitter, packet loss rate, Jain's fairness index, and handover count), so the learned policy explicitly balances efficiency and stability under user mobility and noisy observations. The PPO agent uses an actor-critic neural network trained from trajectories generated by the Python simulator with configurable mobility (e.g., Gauss-Markov) and stochastic measurement noise. Across 500+ training episodes and stress tests with increasing user density, the PPO policy consistently improves KPI trends (higher throughput and fairness, lower delay, jitter, packet loss, and handovers) and exhibits rapid, stable convergence. Comparative evaluations show that PPO outperforms rule-based ReBuHa and A3 as well as the learning-based CDQL baseline across all KPIs while maintaining smoother learning dynamics and stronger generalization as load increases. These results indicate that PPO's clipped policy updates and advantage-based training yield robust, deployable control for next-generation RAN load balancing using an entirely Python-based toolchain.
♻ ☆ NeuroDOB: A Deep Neural Observer-Based Controller for Vehicle Lateral Dynamics
This paper proposes NeuroDOB, a deep neural network based observer controller for vehicle lateral dynamics, which replaces the conventional disturbance observer (DOB) with a deep neural network (DNN) to enhance personalized lateral control. Unlike conventional DOBs that compensate for general disturbances such as road friction variation and crosswind, NeuroDOB explicitly addresses unmodeled vehicle dynamics and driver-specific behaviors by learning the steering compensation signal from driver-in-the-loop simulations using CarSim's embedded controller as a surrogate driver. The proposed architecture integrates NeuroDOB with a linear quadratic regulator (LQR), where the DNN outputs a delta error correction added to the baseline LQR steering input to produce the final control command. Input features to the DNN include lateral position and yaw angle errors, and the LQR control input. Experimental validation using a lateral dynamic bicycle model within CarSim demonstrates that NeuroDOB effectively adapts to individual driving habits, improving lateral control performance beyond what conventional LQR controllers achieve. The results indicate the potential of deep neural network based observer to enable personalized and adaptive autonomous vehicle control. In cognitive terms, the proposed architecture can be viewed as a dual-system control structure. The baseline LQR corresponds to System 1, a model-based, fast, and analytic reasoning layer ensuring stability. The NeuroDOB acts as System 2, a reflective, data-driven layer that learns compensation from experience and corrects the analytical bias of System 1. Together, they form an integrated decision process analogous to human intuition-reflection interaction, enabling both stability and adaptability in lateral control.
comment: 12 pages, 16 figures
♻ ☆ Long-Term Mapping of the Douro River Plume with Multi-Agent Reinforcement Learning
We study the problem of long-term (multiple days) mapping of a river plume using multiple autonomous underwater vehicles (AUVs), focusing on the Douro river representative use-case. We propose an energy - and communication - efficient multi-agent reinforcement learning approach in which a central coordinator intermittently communicates with the AUVs, collecting measurements and issuing commands. Our approach integrates spatiotemporal Gaussian process regression (GPR) with a multi-head Q-network controller that regulates direction and speed for each AUV. Simulations using the Delft3D ocean model demonstrate that our method consistently outperforms both single- and multi-agent benchmarks, with scaling the number of agents both improving mean squared error (MSE) and operational endurance. In some instances, our algorithm demonstrates that doubling the number of AUVs can more than double endurance while maintaining or improving accuracy, underscoring the benefits of multi-agent coordination. Our learned policies generalize across unseen seasonal regimes over different months and years, demonstrating promise for future developments of data-driven long-term monitoring of dynamic plume environments.
♻ ☆ Assessment of Power System Stability Considering Multiple Time-Scale Dynamics: Insights into Hopf Bifurcations in Presence of GFL and GFM IBRs
Real power systems exhibit dynamics that evolve across a wide range of time scales, from very fast to very slow phenomena. Historically, incorporating these wide-ranging dynamics into a single model has been impractical. As a result, power engineers rely on time-scale decomposition to simplify models. When fast phenomena are evaluated, slow dynamics are neglected (assumed stable), and vice versa. This paper challenges this paradigm by showing the importance of assessing power system stability while considering multiple time scales simultaneously. Using the concept of Hopf bifurcations, it exemplifies instability issues that would be missed if multi-time-scale dynamics are not considered. Although this work employs both grid-following and grid-forming inverter-based resource models, it is not a direct comparison. Instead, it presents a case study demonstrating how one technology can complement the other from a multi time-scale dynamics perspective.
comment: 7 pages
♻ ☆ Two-Stage Learning of Stabilizing Neural Controllers via Zubov Sampling and Iterative Domain Expansion NeurIPS 2025
Learning-based neural network (NN) control policies have shown impressive empirical performance. However, obtaining stability guarantees and estimates of the region of attraction of these learned neural controllers is challenging due to the lack of stable and scalable training and verification algorithms. Although previous works in this area have achieved great success, much conservatism remains in their frameworks. In this work, we propose a novel two-stage training framework to jointly synthesize a controller and a Lyapunov function for continuous-time systems. By leveraging a Zubov-inspired region of attraction characterization to directly estimate stability boundaries, we propose a novel training-data sampling strategy and a domain-updating mechanism that significantly reduces the conservatism in training. Moreover, unlike existing works on continuous-time systems that rely on an SMT solver to formally verify the Lyapunov condition, we extend state-of-the-art neural network verifier $\alpha,\!\beta$-CROWN with the capability of performing automatic bound propagation through the Jacobian of dynamical systems and a novel verification scheme that avoids expensive bisection. To demonstrate the effectiveness of our approach, we conduct numerical experiments by synthesizing and verifying controllers on several challenging nonlinear systems across multiple dimensions. We show that our training can yield region of attractions with volume $5 - 1.5\cdot 10^{5}$ times larger compared to the baselines, and our verification on continuous systems can be up to $40-10{,}000$ times faster compared to the traditional SMT solver dReal. Our code is available at https://github.com/Verified-Intelligence/Two-Stage_Neural_Controller_Training.
comment: NeurIPS 2025
♻ ☆ MathBode: Understanding LLM Reasoning with Dynamical Systems
This paper presents MathBode, a dynamic diagnostic for mathematical reasoning in large language models (LLMs). Instead of one-shot accuracy, MathBode treats each parametric problem as a system: we drive a single parameter sinusoidally and fit first-harmonic responses of model outputs and exact solutions. This yields interpretable, frequency-resolved metrics -- gain (amplitude tracking) and phase (lag) -- that form Bode-style fingerprints. Across five closed-form families (linear solve, ratio/saturation, compound interest, 2x2 linear systems, similar triangles), the diagnostic surfaces systematic low-pass behavior and growing phase lag that accuracy alone obscures. We compare several models against a symbolic baseline that calibrates the instrument ($G \approx 1$, $\phi \approx 0$). Results separate frontier from mid-tier models on dynamics, providing a compact, reproducible protocol that complements standard benchmarks with actionable measurements of reasoning fidelity and consistency. We open-source the dataset and code to enable further research and adoption.
♻ ☆ Learning Wireless Interference Patterns: Decoupled GNN for Throughput Prediction in Heterogeneous Multi-Hop p-CSMA Networks
The p-persistent CSMA protocol is central to random-access MAC analysis, but predicting saturation throughput in heterogeneous multi-hop wireless networks remains a hard problem. Simplified models that assume a single, shared interference domain can underestimate throughput by 48-62% in sparse topologies. Exact Markov-chain analyses are accurate but scale exponentially in computation time, making them impractical for large networks. These computational barriers motivate structural machine learning approaches like GNNs for scalable throughput prediction in general network topologies. Yet off-the-shelf GNNs struggle here: a standard GCN yields 63.94% normalized mean absolute error (NMAE) on heterogeneous networks because symmetric normalization conflates a node's direct interference with higher-order, cascading effects that pertain to how interference propagates over the network graph. Building on these insights, we propose the Decoupled Graph Convolutional Network (D-GCN), a novel architecture that explicitly separates processing of a node's own transmission probability from neighbor interference effects. D-GCN replaces mean aggregation with learnable attention, yielding interpretable, per-neighbor contribution weights while capturing complex multihop interference patterns. D-GCN attains 3.3% NMAE, outperforms strong baselines, remains tractable even when exact analytical methods become computationally infeasible, and enables gradient-based network optimization that achieves within 1% of theoretical optima.
♻ ☆ Flight-Ready Precise and Robust Carrier-Phase GNSS Navigation Software for Distributed Space Systems
This paper presents the full requirements analysis, design, development, and testing of high-precision navigation flight software for Distributed Space Systems (DSS) using Carrier Phase Differential GNSS (CDGNSS). Five main contributions are made. First, a survey of flown and upcoming DSS missions with stringent precision requirements is conducted, from which a thorough requirements analysis is distilled to guide development and testing. Second, a real-time navigation functional architecture is designed, and adopts a sparse and regularized Consider Kalman Filter with options for numerical stability in-flight. The filter rigorously accounts for uncertainties in process noise, measurement noise, and biases. It tracks float ambiguities with integer resolution where possible. The covariance correlation structure is preserved under all navigation modes, including contingencies and outages. Third, a lightweight, memoryless Fault Detection, Isolation, and Recovery (FDIR) module is developed to guard against anomalous measurements, providing statistical screening and ensuring robust navigation. Fourth, the software architecture is proposed for ease of integration, with strategies presented for modularity and computational efficiency tailored to constrained flight systems. Fifth, a comprehensive test campaign is conducted, mapped to a requirements verification matrix, spanning unit, interface, software-in-the-loop, and real-time hardware-in-the-loop tests, emphasizing gradual test fidelity for efficient fault isolation. Finally, flight-like results are demonstrated using the VISORS mission, due to the generalizability of the VISORS navigation operations, and the stringency which demands sub-centimeter relative position and sub-millimeter-per-second velocity accuracy. This architecture aims to serve as a reference for next-generation DSS missions adopting CDGNSS.
♻ ☆ A Volumetric Privacy Measure for Dynamical Systems With Bounded Disturbance
In this paper, we first present a volumetric privacy measure for dynamical systems with bounded disturbances, wherein the states of the system contain private information and an adversary with access to sensor measurements attempts to infer the set of potential values of the private information. Under the proposed privacy measure, the volume of the uncertainty set of the adversary given the sensor measurements is considered as the privacy level of the system. We next characteristic the time evolution of the proposed privacy measure and study its properties for a particular system with both public and private states, where a set containing the public state is shared as the observation. Approximate set-membership estimation techniques are developed to compute the private-state uncertainty set, and the properties of the privacy measure are analyzed, demonstrating that the uncertainty reduction of the adversary is bounded by the information gain from the observation set. Furthermore, an optimization-based privacy filter design problem is formulated, employing randomization and linear programming to enhance the privacy level. The effectiveness of the proposed approach is validated through a production-inventory case study. Results show that the optimal privacy filter significantly improves robustness against inference attacks and outperforms two baseline mechanisms based on additive noise and quantization.
♻ ☆ A cutting-surface consensus approach for distributed robust optimization of multi-agent systems
A novel and fully distributed optimization method is proposed for the distributed robust convex program (DRCP) over a time-varying unbalanced directed network under the uniformly jointly strongly connected (UJSC) assumption. Firstly, an approximated DRCP (ADRCP) is introduced by discretizing the semi-infinite constraints into a finite number of inequality constraints to ensure tractability and restricting the right-hand side of the constraints with a positive parameter to ensure a feasible solution for (DRCP) can be obtained. This problem is iteratively solved by a distributed projected gradient algorithm proposed in this paper, which is based on epigraphic reformulation and gradient projected operations. Secondly, a cutting-surface consensus approach is proposed for locating an approximately optimal consensus solution of the DRCP with guaranteed local feasibility for each agent. This approach is based on iteratively approximating the DRCP by successively reducing the restriction parameter of the right-hand constraints and adding the cutting-surfaces into the existing finite set of constraints. Thirdly, to ensure finite-time termination of the distributed optimization, a distributed termination algorithm is developed based on consensus and zeroth-order stopping conditions under UJSC graphs. Fourthly, it is proved that the cutting-surface consensus approach terminates finitely and yields a feasible and approximate optimal solution for each agent. Finally, the effectiveness of the approach is illustrated through a numerical example.
comment: 16 pages, 8 figures, published to IEEE TAC
♻ ☆ Online Adaptation for Flying Quadrotors in Tight Formations
The task of flying in tight formations is challenging for teams of quadrotors because the complex aerodynamic wake interactions can destabilize individual team members as well as the team. Furthermore, these aerodynamic effects are highly nonlinear and fast-paced, making them difficult to model and predict. To overcome these challenges, we present L1 KNODE-DW MPC, an adaptive, mixed expert learning based control framework that allows individual quadrotors to accurately track trajectories while adapting to time-varying aerodynamic interactions during formation flights. We evaluate L1 KNODE-DW MPC in two different three-quadrotor formations and show that it outperforms several MPC baselines. Our results show that the proposed framework is capable of enabling the three-quadrotor team to remain vertically aligned in close proximity throughout the flight. These findings show that the L1 adaptive module compensates for unmodeled disturbances most effectively when paired with an accurate dynamics model. A video showcasing our framework and the physical experiments is available here: https://youtu.be/9QX1Q5Ut9Rs
comment: 10 pages, 4 figures
♻ ☆ Operational Risks in Grid Integration of Large Data Center Loads: Characteristics, Stability Assessments, and Sensitivity Studies
This paper investigates the dynamic interactions between large-scale data centers and the power grid, focusing on reliability challenges arising from sudden fluctuations in demand. With the rapid growth of AI-driven workloads, such fluctuations, along with fast ramp patterns, are expected to exacerbate stressed grid conditions and system instabilities. We consider a few open-source AI data center consumption profiles from the MIT supercloud datasets, along with generating a few experimental HPC job-distribution-based inference profiles. Subsequently, we develop analytical methodologies for real-time assessment of grid stability, focusing on both transient and small-signal stability assessments. Energy-flow-like metrics for nonlinear transient stability, formulated by computing localized data center bus kinetic-like flows and coupling interactions with neighboring buses over varying time windows, help provide operators with real-time assessments of the regional grid stress in the data center hubs. On the other hand, small-signal stability metrics, constructed from analytical state matrices under variable operating conditions during a fast ramping period, enable snapshot-based assessments of data center load fluctuations and provide enhanced observability into evolving grid conditions. By quantifying the stability impacts of large data center clusters, studies conducted in the modified IEEE benchmark $68-$bus model support improved operator situational awareness to capture risks in reliable integration of large data center loads.
comment: 13 pages, 8 figures, 3 tables
♻ ☆ Efficient Path Planning and Task Allocation Algorithm for Boolean Specifications
This paper presents a novel path-planning and task assignment algorithm for multi-robot systems that should fulfill a global Boolean specification. The proposed method is based on Integer Linear Programming (ILP) formulations, which are combined with structural insights from Petri nets to improve scalability and computational efficiency. By proving that the \emph{constraint matrix} is totally unimodular (TU) for certain classes of problems, the ILP formulation can be relaxed into a Linear Programming (LP) problem without losing the integrality of the solution. This relaxation eliminates complex combinatorial techniques, significantly reducing computational overhead and thus ensuring scalability for large-scale systems. Using the approach proposed in this paper, we can solve path-planning problems for teams made up to 500 robots. The method guarantees computational tractability, handles collision avoidance and reduces computational demands through iterative LP optimization techniques. Case studies demonstrate the efficiency of the algorithm in generating scalable, collision-free paths for large robot teams navigating in complex environments. While the conservative nature of collision avoidance introduces additional constraints, and thus, computational requirements, the solution remains practical and impactful for diverse applications. The algorithm is particularly applicable to real-world scenarios, including warehouse logistics where autonomous robots must efficiently coordinate tasks or search-and-rescue operations in various environments. This work contributes both theoretically and practically to scalable multi-robot path planning and task allocation, offering an efficient framework for coordinating autonomous agents in shared environments.
Graphics 2
☆ Fast and accurate neural reflectance transformation imaging through knowledge distillation
Reflectance Transformation Imaging (RTI) is very popular for its ability to visually analyze surfaces by enhancing surface details through interactive relighting, starting from only a few tens of photographs taken with a fixed camera and variable illumination. Traditional methods like Polynomial Texture Maps (PTM) and Hemispherical Harmonics (HSH) are compact and fast, but struggle to accurately capture complex reflectance fields using few per-pixel coefficients and fixed bases, leading to artifacts, especially in highly reflective or shadowed areas. The NeuralRTI approach, which exploits a neural autoencoder to learn a compact function that better approximates the local reflectance as a function of light directions, has been shown to produce superior quality at comparable storage cost. However, as it performs interactive relighting with custom decoder networks with many parameters, the rendering step is computationally expensive and not feasible at full resolution for large images on limited hardware. Earlier attempts to reduce costs by directly training smaller networks have failed to produce valid results. For this reason, we propose to reduce its computational cost through a novel solution based on Knowledge Distillation (DisK-NeuralRTI). ...
comment: 18 pages
♻ ☆ Procedural Generation of Articulated Simulation-Ready Assets
We introduce Infinigen-Articulated, a toolkit for generating realistic, procedurally generated articulated assets for robotics simulation. We include procedural generators for 18 common articulated object categories along with high-level utilities for use creating custom articulated assets in Blender. We also provide an export pipeline to integrate the resulting assets along with their physical properties into common robotics simulators. Experiments demonstrate that assets sampled from these generators are effective for movable object segmentation, training generalizable reinforcement learning policies, and sim-to-real transfer of imitation learning policies.
comment: Updated to include information on newly implemented assets, new experimental results (both simulation and real world), and additional features including material and dynamics parameters
Computation and Language 149
☆ MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task
In this paper, we present our submissions to the unified WMT25 Translation Evaluation Shared Task. For the Quality Score Prediction subtask, we create a new generation of MetricX with improvements in the input format and the training protocol, while for the Error Span Detection subtask we develop a new model, GemSpanEval, trained to predict error spans along with their severities and categories. Both systems are based on the state-of-the-art multilingual open-weights model Gemma 3, fine-tuned on publicly available WMT data. We demonstrate that MetricX-25, adapting Gemma 3 to an encoder-only architecture with a regression head on top, can be trained to effectively predict both MQM and ESA quality scores, and significantly outperforms its predecessor. Our decoder-only GemSpanEval model, on the other hand, we show to be competitive in error span detection with xCOMET, a strong encoder-only sequence-tagging baseline. With error span detection formulated as a generative task, we instruct the model to also output the context for each predicted error span, thus ensuring that error spans are identified unambiguously.
comment: Accepted to WMT25
☆ ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?
Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.
☆ Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data protocol (ADP), a light-weight representation language that serves as an "interlingua" between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per-dataset level. In experiments, we unified a broad collection of 13 existing agent training datasets into ADP format, and converted the standardized ADP data into training-ready formats for multiple agent frameworks. We performed SFT on these data, and demonstrated an average performance gain of ~20% over corresponding base models, and delivers state-of-the-art or near-SOTA performance on standard coding, browsing, tool use, and research benchmarks, without domain-specific tuning. All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.
☆ Tongyi DeepResearch Technical Report
We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
comment: https://tongyi-agent.github.io/blog
☆ ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking
Parallel thinking expands exploration breadth, complementing the deep exploration of information-seeking (IS) agents to further enhance problem-solving capability. However, conventional parallel thinking faces two key challenges in this setting: inefficiency from repeatedly rolling out from scratch, and difficulty in integrating long-horizon reasoning trajectories during answer generation, as limited context capacity prevents full consideration of the reasoning process. To address these issues, we propose ParallelMuse, a two-stage paradigm designed for deep IS agents. The first stage, Functionality-Specified Partial Rollout, partitions generated sequences into functional regions and performs uncertainty-guided path reuse and branching to enhance exploration efficiency. The second stage, Compressed Reasoning Aggregation, exploits reasoning redundancy to losslessly compress information relevant to answer derivation and synthesize a coherent final answer. Experiments across multiple open-source agents and benchmarks demonstrate up to 62% performance improvement with a 10--30% reduction in exploratory token consumption.
☆ AgentFold: Long-Horizon Web Agents with Proactive Context Management
LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management. Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories, while methods that fixedly summarize the full history at each step risk the irreversible loss of critical details. Addressing these, we introduce AgentFold, a novel agent paradigm centered on proactive context management, inspired by the human cognitive process of retrospective consolidation. AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled. At each step, it learns to execute a `folding' operation, which manages its historical trajectory at multiple scales: it can perform granular condensations to preserve vital, fine-grained details, or deep consolidations to abstract away entire multi-step sub-tasks. The results on prominent benchmarks are striking: with simple supervised fine-tuning (without continual pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or matches open-source models of a dramatically larger scale, such as the DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like OpenAI's o4-mini.
comment: 26 pages, 9 figures
☆ WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking
Large Language Model (LLM)-based agents have emerged as a transformative approach for open-ended problem solving, with information seeking (IS) being a core capability that enables autonomous reasoning and decision-making. While prior research has largely focused on improving retrieval depth, we observe that current IS agents often suffer from low search efficiency, which in turn constrains overall performance. A key factor underlying this inefficiency is the sparsity of target entities in training tasks, which limits opportunities for agents to learn and generalize efficient search behaviors. To address these challenges, we propose WebLeaper, a framework for constructing high-coverage IS tasks and generating efficient solution trajectories. We formulate IS as a tree-structured reasoning problem, enabling a substantially larger set of target entities to be embedded within a constrained context. Leveraging curated Wikipedia tables, we propose three variants for synthesizing IS tasks, Basic, Union, and Reverse-Union, to systematically increase both IS efficiency and efficacy. Finally, we curate training trajectories by retaining only those that are simultaneously accurate and efficient, ensuring that the model is optimized for both correctness and search performance. Extensive experiments on both basic and comprehensive settings, conducted on five IS benchmarks, BrowserComp, GAIA, xbench-DeepSearch, WideSearch, and Seal-0, demonstrate that our method consistently achieves improvements in both effectiveness and efficiency over strong baselines.
☆ AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis
Training large language model agents on tasks at the frontier of their capabilities is key to unlocking advanced reasoning. We introduce a data synthesis approach inspired by the educational theory of the Zone of Proximal Development (ZPD), which defines this frontier as tasks an LLM cannot solve alone but can master with guidance. To operationalize this, we present the AgentFrontier Engine, an automated pipeline that synthesizes high-quality, multidisciplinary data situated precisely within the LLM's ZPD. This engine supports both continued pre-training with knowledge-intensive data and targeted post-training on complex reasoning tasks. From the same framework, we derive the ZPD Exam, a dynamic and automated benchmark designed to evaluate agent capabilities on these frontier tasks. We train AgentFrontier-30B-A3B model on our synthesized data, which achieves state-of-the-art results on demanding benchmarks like Humanity's Last Exam, even surpassing some leading proprietary agents. Our work demonstrates that a ZPD-guided approach to data synthesis offers a scalable and effective path toward building more capable LLM agents.
comment: https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/
☆ Repurposing Synthetic Data for Fine-grained Search Agent Supervision
LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples-those with substantially correct reasoning but a flawed final answer-from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.
☆ STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence that is defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks where caption-only answering reduces accuracy slightly, STAR-Bench induces far larger drops (-31.5\% temporal, -35.2\% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
comment: Homepage: https://internlm.github.io/StarBench/
☆ SPICE: Self-Play In Corpus Environments Improves Reasoning
Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner's capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods that offer more limited benefits, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis reveals how document grounding is a key ingredient in SPICE to continuously generate its own increasingly challenging goals and achieve them, enabling sustained self-improvement.
☆ Dissecting Role Cognition in Medical LLMs via Neuronal Ablation
Large language models (LLMs) have gained significant traction in medical decision support systems, particularly in the context of medical question answering and role-playing simulations. A common practice, Prompt-Based Role Playing (PBRP), instructs models to adopt different clinical roles (e.g., medical students, residents, attending physicians) to simulate varied professional behaviors. However, the impact of such role prompts on model reasoning capabilities remains unclear. This study introduces the RP-Neuron-Activated Evaluation Framework(RPNA) to evaluate whether role prompts induce distinct, role-specific cognitive processes in LLMs or merely modify linguistic style. We test this framework on three medical QA datasets, employing neuron ablation and representation analysis techniques to assess changes in reasoning pathways. Our results demonstrate that role prompts do not significantly enhance the medical reasoning abilities of LLMs. Instead, they primarily affect surface-level linguistic features, with no evidence of distinct reasoning pathways or cognitive differentiation across clinical roles. Despite superficial stylistic changes, the core decision-making mechanisms of LLMs remain uniform across roles, indicating that current PBRP methods fail to replicate the cognitive complexity found in real-world medical practice. This highlights the limitations of role-playing in medical AI and emphasizes the need for models that simulate genuine cognitive processes rather than linguistic imitation.We have released the related code in the following repository:https: //github.com/IAAR-Shanghai/RolePlay_LLMDoctor
comment: 15 pages, 9 figures
☆ InteractComp: Evaluating Search Agents With Ambiguous Queries
Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of easy to verify, interact to disambiguate, we construct 210 expert-curated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals striking failure: the best model achieves only 13.73% accuracy despite 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability current strategies fail to engage. Longitudinal analysis shows interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents. The code is available at https://github.com/FoundationAgents/InteractComp.
☆ MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation
Human evaluation of machine translation is in an arms race with translation model quality: as our models get better, our evaluation methods need to be improved to ensure that quality gains are not lost in evaluation noise. To this end, we experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM), which we call MQM re-annotation. In this setup, an MQM annotator reviews and edits a set of pre-existing MQM annotations, that may have come from themselves, another human annotator, or an automatic MQM annotation system. We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.
☆ Evolving Diagnostic Agents in a Virtual Clinical Environment
In this paper, we present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static case summaries, our method acquires diagnostic strategies through interactive exploration and outcome-based feedback. Our contributions are fourfold: (i) We present DiagGym, a diagnostics world model trained with electronic health records that emits examination outcomes conditioned on patient history and recommended examination, serving as a virtual clinical environment for realistic diagnosis training and evaluation; (ii) We train DiagAgent via end-to-end, multi-turn reinforcement learning to learn diagnostic policies that optimize both information yield and diagnostic accuracy; (iii) We introduce DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated examination recommendations and 99 cases annotated with 973 physician-written rubrics on diagnosis process; (iv) we demonstrate superior performance across diverse diagnostic settings. DiagAgent significantly outperforms 10 state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34% higher diagnostic accuracy and 44.03% improvement in examination recommendation hit ratio. In end-to-end settings, it delivers 15.12% increase in diagnostic accuracy and 23.09% boost in examination recommendation F1 score. In rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers dynamic and clinically meaningful diagnostic management abilities unattainable through passive training alone.
☆ Optimizing Retrieval for RAG via Reinforced Contrastive Learning
As retrieval-augmented generation (RAG) becomes increasingly widespread, the role of information retrieval (IR) is shifting from retrieving information for human users to retrieving contextual knowledge for artificial intelligence (AI) systems, where relevance becomes difficult to define or annotate beforehand. To address this challenge, we propose R3, a Retrieval framework optimized for RAG through trialand-feedback Reinforced contrastive learning. Unlike prior approaches that rely on annotated or synthetic data for supervised fine-tuning, R3 enables the retriever to dynamically explore and optimize relevance within the RAG environment. During training, the retrieved results interact with the environment to produce contrastive signals that automatically guide the retriever's self-improvement. Extensive experiments across diverse tasks demonstrate that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving comparable results to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. It is both efficient and practical, requiring only 4 GPUs and completing training within a single day.
☆ Quantifying the Effects of Word Length, Frequency, and Predictability on Dyslexia
We ask where, and under what conditions, dyslexic reading costs arise in a large-scale naturalistic reading dataset. Using eye-tracking aligned to word-level features (word length, frequency, and predictability), we model how each feature influences dyslexic time costs. We find that all three features robustly change reading times in both typical and dyslexic readers, and that dyslexic readers show stronger sensitivities to each, especially predictability. Counterfactual manipulations of these features substantially narrow the dyslexic-control gap by about one third, with predictability showing the strongest effect, followed by length and frequency. These patterns align with dyslexia theories that posit heightened demands on linguistic working memory and phonological encoding, and they motivate further work on lexical complexity and parafoveal preview benefits to explain the remaining gap. In short, we quantify when extra dyslexic costs arise, how large they are, and offer actionable guidance for interventions and computational models for dyslexics.
☆ OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning
Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and long-form tasks, where evaluating correctness requires grounding beyond the model's internal knowledge. This limitation hinders them from reliably discriminating subtle quality differences, especially when external evidence is necessary. To address this, we introduce OpenRM, a tool-augmented long-form reward model that systematically judges open-ended responses by invoking external tools to gather relevant evidence. We train OpenRM with Group Relative Policy Optimization (GRPO) on over 27K synthesized pairwise examples generated through a controllable data synthesis framework. The training objective jointly supervises intermediate tool usage and final outcome accuracy, incentivizing our reward model to learn effective evidence-based judgment strategies. Extensive experiments on three newly-collected datasets and two widely-used benchmarks demonstrate that OpenRM substantially outperforms existing reward modeling approaches. As a further step, we integrate OpenRM into both inference-time response selection and training-time data selection. This yields consistent gains in downstream LLM alignment tasks, highlighting the potential of tool-augmented reward models for scaling reliable long-form evaluation.
☆ "Mm, Wat?" Detecting Other-initiated Repair Requests in Dialogue
Maintaining mutual understanding is a key component in human-human conversation to avoid conversation breakdowns, in which repair, particularly Other-Initiated Repair (OIR, when one speaker signals trouble and prompts the other to resolve), plays a vital role. However, Conversational Agents (CAs) still fail to recognize user repair initiation, leading to breakdowns or disengagement. This work proposes a multimodal model to automatically detect repair initiation in Dutch dialogues by integrating linguistic and prosodic features grounded in Conversation Analysis. The results show that prosodic cues complement linguistic features and significantly improve the results of pretrained text and audio embeddings, offering insights into how different features interact. Future directions include incorporating visual cues, exploring multilingual and cross-context corpora to assess the robustness and generalizability.
comment: 9 pages
☆ Relative Scaling Laws for LLMs
Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from $10^{18}$--$10^{20}$ FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work to enable practitioners to measure relative alongside traditional scaling laws, in order to better prioritize robustness challenges in light of the bitter lesson.
☆ Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation
With the release of new large language models (LLMs) like Llama and Mistral, zero-shot cross-lingual transfer has become increasingly feasible due to their multilingual pretraining and strong generalization capabilities. However, adapting these decoder-only LLMs to new tasks across languages remains challenging. While parameter-efficient fine-tuning (PeFT) techniques like Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as soft prompt tuning, prefix tuning, and Llama Adapter are less explored, especially for zero-shot transfer in decoder-only models. We present a comprehensive study of three prefix-based methods for zero-shot cross-lingual transfer from English to 35+ high- and low-resource languages. Our analysis further explores transfer across linguistic families and scripts, as well as the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix methods outperform LoRA-baselines by up to 6% on the Belebele benchmark. Similar improvements were observed with Mistral v0.3 7B as well. Despite using only 1.23M learning parameters with prefix tuning, we achieve consistent improvements across diverse benchmarks. These findings highlight the potential of prefix-based techniques as an effective and scalable alternative to LoRA, particularly in low-resource multilingual settings.
comment: 12 Pages
☆ Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs NeurIPS 2025
The quadratic cost of attention hinders the scalability of long-context LLMs, especially in resource-constrained settings. Existing static sparse methods such as sliding windows or global tokens utilizes the sparsity of attention to reduce the cost of attention, but poorly adapts to the content-dependent variations in attention due to their staticity. While previous work has proposed several dynamic approaches to improve flexibility, they still depend on predefined templates or heuristic mechanisms. Such strategies reduce generality and prune tokens that remain contextually important, limiting their accuracy across diverse tasks. To tackle these bottlenecks of existing methods for long-context modeling, we introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that dynamically predicts attention sparsity online without retraining. Our proposed DHSA adaptively segments sequences into variable-length chunks, then computes chunk representations by aggregating the token embeddings within each chunk. To avoid the bias introduced by varying chunk lengths, we apply length-normalized aggregation that scales the averaged embeddings by the square root of the chunk size. Finally, DHSA upsamples the chunk-level similarity scores to token level similarities to calculate importance scores that determine which token-level interactions should be preserved. Our experiments on Gemma2 with Needle-in-a-Haystack Test and LongBench show that DHSA matches dense attention in accuracy, while reducing prefill latency by 20-60% and peak memory usage by 35%. Compared to other representative baselines such as block sparse attention, DHSA achieves consistently higher accuracy (6-18% relative gains) with comparable or lower cost, offering an efficient and adaptable solution for long-context on-device LLMs.
comment: Accepted to NeurIPS 2025 Workshop on Efficient Reasoning
☆ Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way
Diffusion-based large language models (dLLMs) have exhibited substantial potential for parallel text generation, which may enable more efficient generation compared to autoregressive models. However, current dLLMs suffer from fixed generation lengths, which indicates the generation lengths of dLLMs have to be determined before decoding as a hyper-parameter, leading to issues in efficiency and flexibility. To solve these problems, in this work, we propose to train a diffusion LLM with native variable generation lengths, abbreviated as dLLM-Var. Concretely, we aim to train a model to accurately predict the [EOS] token in the generated text, which makes a dLLM be able to natively infer in a block diffusion manner, while still maintaining the ability of global bi-directional (full) attention and high parallelism. Experiments on standard benchmarks demonstrate that our method achieves a 30.1x speedup over traditional dLLM inference paradigms and a 2.4x speedup relative to autoregressive models such as Qwen and Llama. Our method achieves higher accuracy and faster inference, elevating dLLMs beyond mere academic novelty and supporting their practical use in real-world applications. Codes and models have been released.
☆ ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem's semantic intent. This limitation arises from the LLM approaches' treating autoformalization as a simplistic translation task which lacks mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validations, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 17.2 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.
comment: Ongoing Work
☆ ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?
Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks which require agents to replicate the paper's core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the original paper authors and targets a key scientific result, enabling objective evaluation of both faithfulness (adherence to original methods) and correctness (technical accuracy of results). ReplicationBench is extremely challenging for current frontier language models: even the best-performing language models score under 20%. We analyze ReplicationBench trajectories in collaboration with domain experts and find a rich, diverse set of failure modes for agents in scientific research. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks, reveals insights about agent performance generalizable to other domains of data-driven science, and provides a scalable framework for measuring AI agents' reliability in scientific research.
☆ BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation ICASSP 2026
Automatic Speech Recognition (ASR) systems, despite large multilingual training, struggle in out-of-domain and low-resource scenarios where labeled data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper's encoder using unlabeled data. Unlike traditional self-supervised learning methods, BEARD uniquely combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder's complementarity with the pre-trained decoder. Our experiments focus on the ATCO2 corpus from the challenging Air Traffic Control (ATC) communications domain, characterized by non-native speech, noise, and specialized phraseology. Using about 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, the proposed approach significantly outperforms previous baseline and fine-tuned model, achieving a relative improvement of 12% compared to the fine-tuned model. To the best of our knowledge, this is the first work to use a self-supervised learning objective for domain adaptation of Whisper.
comment: Submitted to ICASSP 2026
☆ Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts
The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea's lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.
comment: Dataset and code available at https://github.com/seyoungsong/OKHC
☆ Dark & Stormy: Modeling Humor in the Worst Sentences Ever Written
Textual humor is enormously diverse and computational studies need to account for this range, including intentionally bad humor. In this paper, we curate and analyze a novel corpus of sentences from the Bulwer-Lytton Fiction Contest to better understand "bad" humor in English. Standard humor detection models perform poorly on our corpus, and an analysis of literary devices finds that these sentences combine features common in existing humor datasets (e.g., puns, irony) with metaphor, metafiction and simile. LLMs prompted to synthesize contest-style sentences imitate the form but exaggerate the effect by over-using certain literary devices, and including far more novel adjective-noun bigrams than human writers. Data, code and analysis are available at https://github.com/venkatasg/bulwer-lytton
☆ Levée d'ambiguïtés par grammaires locales
Many words are ambiguous in terms of their part of speech (POS). However, when a word appears in a text, this ambiguity is generally much reduced. Disambiguating POS involves using context to reduce the number of POS associated with words, and is one of the main challenges of lexical tagging. The problem of labeling words by POS frequently arises in natural language processing, for example for spelling correction, grammar or style checking, expression recognition, text-to-speech conversion, text corpus analysis, etc. Lexical tagging systems are thus useful as an initial component of many natural language processing systems. A number of recent lexical tagging systems produce multiple solutions when the text is lexically ambiguous or the uniquely correct solution cannot be found. These contributions aim to guarantee a zero silence rate: the correct tag(s) for a word must never be discarded. This objective is unrealistic for systems that tag each word uniquely. This article concerns a lexical disambiguation method adapted to the objective of a zero silence rate and implemented in Silberztein's INTEX system (1993). We present here a formal description of this method. We show that to verify a local disambiguation grammar in this framework, it is not sufficient to consider the transducer paths separately: one needs to verify their interactions. Similarly, if a combination of multiple transducers is used, the result cannot be predicted by considering them in isolation. Furthermore, when examining the initial labeling of a text as produced by INTEX, ideas for disambiguation rules come spontaneously, but grammatical intuitions may turn out to be inaccurate, often due to an unforeseen construction or ambiguity. If a zero silence rate is targeted, local grammars must be carefully tested. This is where a detailed specification of what a grammar will do once applied to texts would be necessary.
comment: in French language
☆ Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
While Multimodal Large Language Models (MLLMs) excel at visual understanding, they often struggle in complex scenarios that require visual planning and imagination. Inspired by how humans use sketching as a form of visual thinking to develop and communicate ideas, we introduce Latent Sketchpad, a framework that equips MLLMs with an internal visual scratchpad. The internal visual representations of MLLMs have traditionally been confined to perceptual understanding. We repurpose them to support generative visual thought without compromising reasoning ability. Building on frontier MLLMs, our approach integrates visual generation directly into their native autoregressive reasoning process. It allows the model to interleave textual reasoning with the generation of visual latents. These latents guide the internal thought process and can be translated into sketch images for interpretability. To realize this, we introduce two components: a Context-Aware Vision Head autoregressively produces visual representations, and a pretrained Sketch Decoder renders these into human-interpretable images. We evaluate the framework on our new dataset MazePlanning. Experiments across various MLLMs show that Latent Sketchpad delivers comparable or even superior reasoning performance to their backbone. It further generalizes across distinct frontier MLLMs, including Gemma3 and Qwen2.5-VL. By extending model's textual reasoning to visual thinking, our framework opens new opportunities for richer human-computer interaction and broader applications. More details and resources are available on our project page: https://latent-sketchpad.github.io/.
☆ CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?
Accurate confidence calibration in Large Language Models (LLMs) is critical for safe use in high-stakes domains, where clear verbalized confidence enhances user trust. Traditional methods that mimic reference confidence expressions often fail to capture the reasoning needed for accurate confidence assessment. We propose natural language critiques as a solution, ideally suited for confidence calibration, as precise gold confidence labels are hard to obtain and often require multiple generations. This paper studies how natural language critiques can enhance verbalized confidence, addressing: (1) What to critique: uncertainty (question-focused) or confidence (answer-specific)? Analysis shows confidence suits multiple-choice tasks, while uncertainty excels in open-ended scenarios. (2) How to critique: self-critique or critique calibration training? We propose Self-Critique, enabling LLMs to critique and optimize their confidence beyond mere accuracy, and CritiCal, a novel Critique Calibration training method that leverages natural language critiques to improve confidence calibration, moving beyond direct numerical optimization. Experiments show that CritiCal significantly outperforms Self-Critique and other competitive baselines, even surpassing its teacher model, GPT-4o, in complex reasoning tasks. CritiCal also shows robust generalization in out-of-distribution settings, advancing LLM's reliability.
☆ A word association network methodology for evaluating implicit biases in LLMs compared to humans
As Large language models (LLMs) become increasingly integrated into our lives, their inherent social biases remain a pressing concern. Detecting and evaluating these biases can be challenging because they are often implicit rather than explicit in nature, so developing evaluation methods that assess the implicit knowledge representations of LLMs is essential. We present a novel word association network methodology for evaluating implicit biases in LLMs based on simulating semantic priming within LLM-generated word association networks. Our prompt-based approach taps into the implicit relational structures encoded in LLMs, providing both quantitative and qualitative assessments of bias. Unlike most prompt-based evaluation methods, our method enables direct comparisons between various LLMs and humans, providing a valuable point of reference and offering new insights into the alignment of LLMs with human cognition. To demonstrate the utility of our methodology, we apply it to both humans and several widely used LLMs to investigate social biases related to gender, religion, ethnicity, sexual orientation, and political party. Our results reveal both convergences and divergences between LLM and human biases, providing new perspectives on the potential risks of using LLMs. Our methodology contributes to a systematic, scalable, and generalizable framework for evaluating and comparing biases across multiple LLMs and humans, advancing the goal of transparent and socially responsible language technologies.
comment: 24 pages, 13 figures, 3 tables
☆ Talk2Ref: A Dataset for Reference Prediction from Scientific Talks
Scientific talks are a growing medium for disseminating research, and automatically identifying relevant literature that grounds or enriches a talk would be highly valuable for researchers and students alike. We introduce Reference Prediction from Talks (RPT), a new task that maps long, and unstructured scientific presentations to relevant papers. To support research on RPT, we present Talk2Ref, the first large-scale dataset of its kind, containing 6,279 talks and 43,429 cited papers (26 per talk on average), where relevance is approximated by the papers cited in the talk's corresponding source publication. We establish strong baselines by evaluating state-of-the-art text embedding models in zero-shot retrieval scenarios, and propose a dual-encoder architecture trained on Talk2Ref. We further explore strategies for handling long transcripts, as well as training for domain adaptation. Our results show that fine-tuning on Talk2Ref significantly improves citation prediction performance, demonstrating both the challenges of the task and the effectiveness of our dataset for learning semantic representations from spoken scientific content. The dataset and trained models are released under an open license to foster future research on integrating spoken scientific communication into citation recommendation systems.
☆ Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems
Hallucination remains one of the key obstacles to the reliable deployment of large language models (LLMs), particularly in real-world applications. Among various mitigation strategies, Retrieval-Augmented Generation (RAG) and reasoning enhancement have emerged as two of the most effective and widely adopted approaches, marking a shift from merely suppressing hallucinations to balancing creativity and reliability. However, their synergistic potential and underlying mechanisms for hallucination mitigation have not yet been systematically examined. This survey adopts an application-oriented perspective of capability enhancement to analyze how RAG, reasoning enhancement, and their integration in Agentic Systems mitigate hallucinations. We propose a taxonomy distinguishing knowledge-based and logic-based hallucinations, systematically examine how RAG and reasoning address each, and present a unified framework supported by real-world applications, evaluations, and benchmarks.
comment: 25 pages, 7 figures, 3 tables
☆ Iterative Critique-Refine Framework for Enhancing LLM Personalization
Personalized text generation requires models not only to produce coherent text but also to align with a target user's style, tone, and topical focus. Existing retrieval-augmented approaches such as LaMP and PGraphRAG enrich profiles with user and neighbor histories, but they stop at generation and often yield outputs that drift in tone, topic, or style. We present PerFine, a unified, training-free critique-refine framework that enhances personalization through iterative, profile-grounded feedback. In each iteration, an LLM generator produces a draft conditioned on the retrieved profile, and a critic LLM - also conditioned on the same profile - provides structured feedback on tone, vocabulary, sentence structure, and topicality. The generator then revises, while a novel knockout strategy retains the stronger draft across iterations. We further study additional inference-time strategies such as Best-of-N and Topic Extraction to balance quality and efficiency. Across Yelp, Goodreads, and Amazon datasets, PerFine consistently improves personalization over PGraphRAG, with GEval gains of +7-13%, steady improvements over 3-5 refinement iterations, and scalability with increasing critic size. These results highlight that post-hoc, profile-aware feedback offers a powerful paradigm for personalized LLM generation that is both training-free and model-agnostic.
☆ Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices LREC 2026
While new benchmarks for large language models (LLMs) are being developed continuously to catch up with the growing capabilities of new models and AI in general, using and evaluating LLMs in non-English languages remains a little-charted landscape. We give a concise overview of recent developments in LLM benchmarking, and then propose a new taxonomy for the categorization of benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a set of best practices and quality standards that could lead to a more coordinated development of benchmarks for European languages. Among other recommendations, we advocate for a higher language and culture sensitivity of evaluation methods.
comment: 12 pages, 1 figure. Submitted to the LREC 2026 conference
☆ SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space
Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases-crucial in real-world applications where users express the same intent in varied ways-remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA-a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing-even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.
☆ Law in Silico: Simulating Legal Society with LLM-Based Agents
Since real-world legal experiments are often costly or infeasible, simulating legal societies with Artificial Intelligence (AI) systems provides an effective alternative for verifying and developing legal theory, as well as supporting legal administration. Large Language Models (LLMs), with their world knowledge and role-playing capabilities, are strong candidates to serve as the foundation for legal society simulation. However, the application of LLMs to simulate legal systems remains underexplored. In this work, we introduce Law in Silico, an LLM-based agent framework for simulating legal scenarios with individual decision-making and institutional mechanisms of legislation, adjudication, and enforcement. Our experiments, which compare simulated crime rates with real-world data, demonstrate that LLM-based agents can largely reproduce macro-level crime trends and provide insights that align with real-world observations. At the same time, micro-level simulations reveal that a well-functioning, transparent, and adaptive legal system offers better protection of the rights of vulnerable individuals.
☆ Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content NeurIPS 2025
Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. Our dual-agent framework uses a quantitative agent for citation verification and six-dimensional scoring (e.g., Structure, Islamic Consistency, Citations) and a qualitative agent for five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality). GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong performance, models still fall short in reliably producing accurate Islamic content and citations -- a paramount requirement in faith-sensitive writing. GPT-4o had the highest mean quantitative score (3.90/5), while Ansari AI led qualitative pairwise wins (116/200). Fanar, though trailing, introduces innovations for Islamic and Arabic contexts. This study underscores the need for community-driven benchmarks centering Muslim perspectives, offering an early step toward more reliable AI in Islamic knowledge and other high-stakes domains such as medicine, law, and journalism.
comment: Accepted at 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: 5th Muslims in Machine Learning (MusIML) Workshop
☆ LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data
The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its shown proficiency in Luxembourgish. Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach. To investigate the practical utility of the dataset, we fine-tune several smaller-scale LLMs on LuxIT. Subsequent benchmarking against their base models on Luxembourgish language proficiency examinations, however, yields mixed results, with performance varying significantly across different models. LuxIT represents a critical contribution to Luxembourgish natural language processing and offers a replicable monolingual methodology, though our findings highlight the need for further research to optimize its application.
☆ SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models
Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.
☆ Comprehensive and Efficient Distillation for Lightweight Sentiment Analysis Models EMNLP 2025
Recent efforts leverage knowledge distillation techniques to develop lightweight and practical sentiment analysis models. These methods are grounded in human-written instructions and large-scale user texts. Despite the promising results, two key challenges remain: (1) manually written instructions are limited in diversity and quantity, making them insufficient to ensure comprehensive coverage of distilled knowledge; (2) large-scale user texts incur high computational cost, hindering the practicality of these methods. To this end, we introduce COMPEFFDIST, a comprehensive and efficient distillation framework for sentiment analysis. Our framework consists of two key modules: attribute-based automatic instruction construction and difficulty-based data filtering, which correspondingly tackle the aforementioned challenges. Applying our method across multiple model series (Llama-3, Qwen-3, and Gemma-3), we enable 3B student models to match the performance of 20x larger teacher models on most tasks. In addition, our approach greatly outperforms baseline methods in data efficiency, attaining the same performance level with only 10% of the data.
comment: Accepted by EMNLP 2025. 22 pages, 9 figures. The first two authors contribute equally
☆ OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents.
comment: work in progress
☆ Text Simplification with Sentence Embeddings
Sentence embeddings can be decoded to give approximations of the original texts used to create them. We explore this effect in the context of text simplification, demonstrating that reconstructed text embeddings preserve complexity levels. We experiment with a small feed forward neural network to effectively learn a transformation between sentence embeddings representing high-complexity and low-complexity texts. We provide comparison to a Seq2Seq and LLM-based approach, showing encouraging results in our much smaller learning setting. Finally, we demonstrate the applicability of our transformation to an unseen simplification dataset (MedEASI), as well as datasets from languages outside the training data (ES,DE). We conclude that learning transformations in sentence embedding space is a promising direction for future research and has potential to unlock the ability to develop small, but powerful models for text simplification and other natural language generation tasks.
☆ Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
Recent advances in code agents have enabled automated software development at the project level, supported by large language models (LLMs) and widely adopted tools. However, existing benchmarks for code agent evaluation face two major limitations: high annotation cost and expertise requirements, and rigid evaluation metrics that rely primarily on unit tests. To address these challenges, we propose an agent-driven benchmark construction pipeline that leverages human supervision to efficiently generate diverse and challenging project-level tasks. Based on this approach, we introduce PRDBench, a novel benchmark comprising 50 real-world Python projects across 20 domains, each with structured Product Requirement Document (PRD) requirements, comprehensive evaluation criteria, and reference implementations. PRDBench features rich data sources, high task complexity, and flexible metrics. We further employ an Agent-as-a-Judge paradigm to score agent outputs, enabling the evaluation of various test types beyond unit tests. Extensive experiments on PRDBench demonstrate its effectiveness in assessing the capabilities of both code agents and evaluation agents, providing a scalable and robust framework for annotation and evaluation.
☆ LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability EMNLP
Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce \textbf{LongWeave}, which balances real-world and verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and constraints based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase.
comment: EMNLP Findings 2025
☆ Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants
Large Language Models (LLMs) are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. Using this method, we extend an existing dataset in which QAs are parallelly aligned across multiple language varieties, making it, to our knowledge, the first of its kind. We conduct extensive experiments with both open and closed models. Our findings show that (i) models underperform on Arabic dialects, revealing persistent gaps in culturally grounded and dialect-specific knowledge; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) CoT improves judged correctness while yielding mixed n-gram-based metrics. The developed dataset will be publicly released to support further research on culturally and linguistically inclusive evaluation.
comment: Cultural Knowledge, Everyday Knowledge, Open-Ended Question, Chain-of-Thought, Large Language Models, Native, Multilingual, Language Diversity
☆ Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning
Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
comment: Preprint, 25 pages, 9 figures. Code: https://github.com/WooooDyy/Critique-RL
☆ Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards
Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards would diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promotes trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibits prolonged similarity during simulation. Compared with stochastic Sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at https://github.com/starreeze/latr.
☆ MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference
In recent years, many generalization benchmarks have shown language models' lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we propose a methodology for automatically generating high-quality variants of original NLI problems by replacing open-class words, while crucially preserving their underlying reasoning. We dub our generalization test as MERGE (Minimal Expression-Replacements GEneralization), which evaluates the correctness of models' predictions across reasoning-preserving variants of the original problem. Our results show that NLI models' perform 4-20% worse on variants, suggesting low generalizability even on such minimally altered problems. We also analyse how word class of the replacements, word probability, and plausibility influence NLI models' performance.
comment: Pre-print
☆ ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7% on seven comprehensive benchmarks spanning various tasks and up to 6.0% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough to developing more autonomous and capable VLMs.
☆ Can LLMs Translate Human Instructions into a Reinforcement Learning Agent's Internal Emergent Symbolic Representation?
Emergent symbolic representations are critical for enabling developmental learning agents to plan and generalize across tasks. In this work, we investigate whether large language models (LLMs) can translate human natural language instructions into the internal symbolic representations that emerge during hierarchical reinforcement learning. We apply a structured evaluation framework to measure the translation performance of commonly seen LLMs -- GPT, Claude, Deepseek and Grok -- across different internal symbolic partitions generated by a hierarchical reinforcement learning algorithm in the Ant Maze and Ant Fall environments. Our findings reveal that although LLMs demonstrate some ability to translate natural language into a symbolic representation of the environment dynamics, their performance is highly sensitive to partition granularity and task complexity. The results expose limitations in current LLMs capacity for representation alignment, highlighting the need for further research on robust alignment between language and internal agent representations.
☆ From Memorization to Reasoning in the Spectrum of Loss Curvature
We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on the loss landscape curvature. This insight is based on prior theoretical and empirical work showing that the curvature for memorized training points is much sharper than non memorized, meaning ordering weight components from high to low curvature can reveal a distinction without explicit labels. This motivates a weight editing procedure that suppresses far more recitation of untargeted memorized data more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the basis of curvature has a natural interpretation for shared structure in model weights, we analyze the editing procedure extensively on its effect on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently negatively affected, even though open book fact retrieval and general logical reasoning is conserved. We posit these tasks rely heavily on specialized directions in weight space rather than general purpose mechanisms, regardless of whether those individual datapoints are memorized. We support this by showing a correspondence between task data's activation strength with low curvature components that we edit out, and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly-used structures involved in solving tasks like math and fact retrieval.
☆ Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations
Large Language Models (LLMs), predominantly trained on adult conversational data, face significant challenges when generating authentic, child-like dialogue for specialized applications. We present a comparative study evaluating five different LLMs (GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b, and NorBloom-7b) to generate age-appropriate Norwegian conversations for children aged 5 and 9 years. Through a blind evaluation by eleven education professionals using both real child interview data and LLM-generated text samples, we assessed authenticity and developmental appropriateness. Our results show that evaluators achieved strong inter-rater reliability (ICC=0.75) and demonstrated higher accuracy in age prediction for younger children (5-year-olds) compared to older children (9-year-olds). While GPT-4 and NorBloom-7b performed relatively well, most models generated language perceived as more linguistically advanced than the target age groups. These findings highlight critical data-related challenges in developing LLM systems for specialized applications involving children, particularly in low-resource languages where comprehensive age-appropriate lexical resources are scarce.
comment: 11 pages excluding references and appendix. 3 figures and 6 tables
☆ Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations
In this work, we tackle the Diacritic Restoration (DR) task for Arabic dialectal sentences using a multimodal approach that combines both textual and speech information. We propose a model that represents the text modality using an encoder extracted from our own pre-trained model named CATT. The speech component is handled by the encoder module of the OpenAI Whisper base model. Our solution is designed following two integration strategies. The former consists of fusing the speech tokens with the input at an early stage, where the 1500 frames of the audio segment are averaged over 10 consecutive frames, resulting in 150 speech tokens. To ensure embedding compatibility, these averaged tokens are processed through a linear projection layer prior to merging them with the text tokens. Contextual encoding is guaranteed by the CATT encoder module. The latter strategy relies on cross-attention, where text and speech embeddings are fused. The cross-attention output is then fed to the CATT classification head for token-level diacritic prediction. To further improve model robustness, we randomly deactivate the speech input during training, allowing the model to perform well with or without speech. Our experiments show that the proposed approach achieves a word error rate (WER) of 0.25 and a character error rate (CER) of 0.9 on the development set. On the test set, our model achieved WER and CER scores of 0.55 and 0.13, respectively.
☆ Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models? NeurIPS 2025
Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets-BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
☆ HACK: Hallucinations Along Certainty and Knowledge Axes
Hallucinations in LLMs present a critical barrier to their reliable usage. Existing research usually categorizes hallucination by their external properties rather than by the LLMs' underlying internal properties. This external focus overlooks that hallucinations may require tailored mitigation strategies based on their underlying mechanism. We propose a framework for categorizing hallucinations along two axes: knowledge and certainty. Since parametric knowledge and certainty may vary across models, our categorization method involves a model-specific dataset construction process that differentiates between those types of hallucinations. Along the knowledge axis, we distinguish between hallucinations caused by a lack of knowledge and those occurring despite the model having the knowledge of the correct response. To validate our framework along the knowledge axis, we apply steering mitigation, which relies on the existence of parametric knowledge to manipulate model activations. This addresses the lack of existing methods to validate knowledge categorization by showing a significant difference between the two hallucination types. We further analyze the distinct knowledge and hallucination patterns between models, showing that different hallucinations do occur despite shared parametric knowledge. Turning to the certainty axis, we identify a particularly concerning subset of hallucinations where models hallucinate with certainty despite having the correct knowledge internally. We introduce a new evaluation metric to measure the effectiveness of mitigation methods on this subset, revealing that while some methods perform well on average, they fail disproportionately on these critical cases. Our findings highlight the importance of considering both knowledge and certainty in hallucination analysis and call for targeted mitigation approaches that consider the hallucination underlying factors.
comment: The code is available at https://github.com/technion-cs-nlp/HACK_Hallucinations_Along_Certainty_and_Knowledge_axes
☆ Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment
Large Language Models (LLMs) encode vast amounts of knowledge in their massive parameters, which is accessible to locate, trace, and analyze. Despite advances in neural interpretability, it is still not clear how to transfer knowledge in a fine-grained manner, namely parametric knowledge transfer (PKT). A key problem is enabling effective and efficient knowledge transfer across LLMs of different scales, which is essential for achieving greater flexibility and broader applicability in transferring knowledge between LLMs. Due to neural incompatibility, referring to the architectural and parametric differences between LLMs of varying scales, existing methods that directly reuse layer parameters are severely limited. In this paper, we identify the semantic alignment in latent space as the fundamental prerequisite for LLM cross-scale knowledge transfer. Instead of directly using the layer parameters, our approach takes activations as the medium of layer-wise knowledge transfer. Leveraging the semantics in latent space, our approach is simple and outperforms prior work, better aligning model behaviors across varying scales. Evaluations on four benchmarks demonstrate the efficacy of our method. Further analysis reveals the key factors easing cross-scale knowledge transfer and provides insights into the nature of latent semantic alignment.
comment: an early-stage version
☆ Exploring the Influence of Relevant Knowledge for Natural Language Generation Interpretability
This paper explores the influence of external knowledge integration in Natural Language Generation (NLG), focusing on a commonsense generation task. We extend the CommonGen dataset by creating KITGI, a benchmark that pairs input concept sets with retrieved semantic relations from ConceptNet and includes manually annotated outputs. Using the T5-Large model, we compare sentence generation under two conditions: with full external knowledge and with filtered knowledge where highly relevant relations were deliberately removed. Our interpretability benchmark follows a three-stage method: (1) identifying and removing key knowledge, (2) regenerating sentences, and (3) manually assessing outputs for commonsense plausibility and concept coverage. Results show that sentences generated with full knowledge achieved 91\% correctness across both criteria, while filtering reduced performance drastically to 6\%. These findings demonstrate that relevant external knowledge is critical for maintaining both coherence and concept coverage in NLG. This work highlights the importance of designing interpretable, knowledge-enhanced NLG systems and calls for evaluation frameworks that capture the underlying reasoning beyond surface-level metrics.
☆ MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations
Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. Its prevalence in social media and popular culture poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection extends beyond text and requires integrating cues from audio and vision. We present MuSaG, the first German multimodal sarcasm detection dataset, consisting of 33 minutes of manually selected and human-annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, annotated separately by humans, enabling evaluation in unimodal and multimodal settings. We benchmark nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, and compare their performance to human annotations. Our results show that while humans rely heavily on audio in conversational settings, models perform best on text. This highlights a gap in current multimodal models and motivates the use of MuSaG for developing models better suited to realistic scenarios. We release MuSaG publicly to support future research on multimodal sarcasm detection and human-model alignment.
☆ Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean ACL
We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models -- two multilingual and two Korean-specialized -- show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.
comment: submitted to ACL ARR Rolling Review
☆ Beyond Line-Level Filtering for the Pretraining Corpora of LLMs ACL
While traditional line-level filtering techniques, such as line-level deduplication and trailing-punctuation filters, are commonly used, these basic methods can sometimes discard valuable content, negatively affecting downstream performance. In this paper, we introduce two methods-pattern-aware line-level deduplication (PLD) and pattern-aware trailing punctuation filtering (PTF)-by enhancing the conventional filtering techniques. Our approach not only considers line-level signals but also takes into account their sequential distribution across documents, enabling us to retain structurally important content that might otherwise be removed. We evaluate these proposed methods by training small language models (1 B parameters) in both English and Korean. The results demonstrate that our methods consistently improve performance on multiple-choice benchmarks and significantly enhance generative question-answering accuracy on both SQuAD v1 and KorQuAD v1.
comment: submitted to ACL ARR Rolling Review
☆ VC4VG: Optimizing Video Captions for Text-to-Video Generation EMNLP 2025
Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models.We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements.Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/qyr0403/VC4VG to support further research.
comment: Accepted by EMNLP 2025
☆ Reinforcement Learning for Long-Horizon Multi-Turn Search Agents NeurIPS 2025
Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt-based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL-trained 14 Billion parameter model outperforms frontier class models (85% vs 78% accuracy). In addition, we explore turn-restricted regimes, during training and at test-time, that show these agents achieve better results if allowed to operate over longer multi-turn horizons.
comment: 4 pages plus references and appendices. Accepted into the First Workshop on Multi-Turn Interactions in Large Language Models at NeurIPS 2025
☆ Squrve: A Unified and Modular Framework for Complex Real-World Text-to-SQL Tasks
Text-to-SQL technology has evolved rapidly, with diverse academic methods achieving impressive results. However, deploying these techniques in real-world systems remains challenging due to limited integration tools. Despite these advances, we introduce Squrve, a unified, modular, and extensive Text-to-SQL framework designed to bring together research advances and real-world applications. Squrve first establishes a universal execution paradigm that standardizes invocation interfaces, then proposes a multi-actor collaboration mechanism based on seven abstracted effective atomic actor components. Experiments on widely adopted benchmarks demonstrate that the collaborative workflows consistently outperform the original individual methods, thereby opening up a new effective avenue for tackling complex real-world queries. The codes are available at https://github.com/Satissss/Squrve.
☆ RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects
The Bengali language, spoken extensively across South Asia and among diasporic communities, exhibits considerable dialectal diversity shaped by geography, culture, and history. Phonological and pronunciation-based classifications broadly identify five principal dialect groups: Eastern Bengali, Manbhumi, Rangpuri, Varendri, and Rarhi. Within Bangladesh, further distinctions emerge through variation in vocabulary, syntax, and morphology, as observed in regions such as Chittagong, Sylhet, Rangpur, Rajshahi, Noakhali, and Barishal. Despite this linguistic richness, systematic research on the computational processing of Bengali dialects remains limited. This study seeks to document and analyze the phonetic and morphological properties of these dialects while exploring the feasibility of building computational models particularly Automatic Speech Recognition (ASR) systems tailored to regional varieties. Such efforts hold potential for applications in virtual assistants and broader language technologies, contributing to both the preservation of dialectal diversity and the advancement of inclusive digital tools for Bengali-speaking communities. The dataset created for this study is released for public use.
comment: 26 pages
☆ Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures
To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.
comment: Preprint
☆ Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation
Large Language Models (LLMs) have advanced machine translation but remain vulnerable to hallucinations. Unfortunately, existing MT benchmarks are not capable of exposing failures in multilingual LLMs. To disclose hallucination in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark across 11 English-to-X directions. We employed 4 frontier LLMs to generate candidates and scrutinize these candidates with an ensemble of LLM judges, and expert validation. In this way, we curate 5,435 high-quality instances. We have evaluated 17 LLMs on HalloMTBench. Results reveal distinct ``hallucination triggers'' -- unique failure patterns reflecting model scale, source length sensitivity, linguistic biases, and Reinforcement-Learning (RL) amplified language mixing. HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures. HalloMTBench is available in https://huggingface.co/collections/AIDC-AI/marco-mt.
☆ Pie: A Programmable Serving System for Emerging LLM Applications SOSP 2025
Emerging large language model (LLM) applications involve diverse reasoning strategies and agentic workflows, straining the capabilities of existing serving systems built on a monolithic token generation loop. This paper introduces Pie, a programmable LLM serving system designed for flexibility and efficiency. Pie decomposes the traditional generation loop into fine-grained service handlers exposed via an API and delegates control of the generation process to user-provided programs, called inferlets. This enables applications to implement new KV cache strategies, bespoke generation logic, and seamlessly integrate computation and I/O-entirely within the application, without requiring modifications to the serving system. Pie executes inferlets using WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows Pie matches state-of-the-art performance on standard tasks (3-12% latency overhead) while significantly improving latency and throughput (1.3x-3.4x higher) on agentic workflows by enabling application-specific optimizations.
comment: SOSP 2025. Source code available at https://github.com/pie-project/pie
☆ GraphNet: A Large-Scale Computational Graph Dataset for Tensor Compiler Research
We introduce GraphNet, a dataset of 2.7K real-world deep learning computational graphs with rich metadata, spanning six major task categories across multiple deep learning frameworks. To evaluate tensor compiler performance on these samples, we propose the benchmark metric Speedup Score S(t), which jointly considers runtime speedup and execution correctness under tunable tolerance levels, offering a reliable measure of general optimization capability. Furthermore, we extend S(t) to the Error-aware Speedup Score ES(t), which incorporates error information and helps compiler developers identify key performance bottlenecks. In this report, we benchmark the default tensor compilers, CINN for PaddlePaddle and TorchInductor for PyTorch, on computer vision (CV) and natural language processing (NLP) samples to demonstrate the practicality of GraphNet. The full construction pipeline with graph extraction and compiler evaluation tools is available at https://github.com/PaddlePaddle/GraphNet .
☆ Success and Cost Elicit Convention Formation for Efficient Communication
Humans leverage shared conversational context to become increasingly successful and efficient at communicating over time. One manifestation of this is the formation of ad hoc linguistic conventions, which allow people to coordinate on short, less costly utterances that are understood using shared conversational context. We present a method to train large multimodal models to form conventions, enabling efficient communication. Our approach uses simulated reference games between models, and requires no additional human-produced data. In repeated reference games involving photographs and tangram images, our method enables models to communicate efficiently with people: reducing the message length by up to 41% while increasing success by 15% over the course of the interaction. Human listeners respond faster when interacting with our model that forms conventions. We also show that training based on success or cost alone is insufficient - both are necessary to elicit convention formation.
☆ SpecKD: Speculative Decoding for Effective Knowledge Distillation of LLMs
Knowledge Distillation (KD) has become a cornerstone technique for compressing Large Language Models (LLMs) into smaller, more efficient student models. However, conventional KD approaches typically apply the distillation loss uniformly across all tokens, regardless of the teacher's confidence. This indiscriminate mimicry can introduce noise, as the student is forced to learn from the teacher's uncertain or high-entropy predictions, which may ultimately harm student performance-especially when the teacher is much larger and more powerful. To address this, we propose Speculative Knowledge Distillation (SpecKD), a novel, plug-and-play framework that introduces a dynamic, token-level gating mechanism inspired by the "propose-and-verify" paradigm of speculative decoding. At each step, the student's token proposal is verified against the teacher's distribution; the distillation loss is selectively applied only to "accepted" tokens, while "rejected" tokens are masked out. Extensive experiments on diverse text generation tasks show that SpecKD consistently and significantly outperforms strong KD baselines, leading to more stable training and more capable student models, and achieving state-of-the-art results.
☆ Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward
Mitigating hallucinations in Large Language Models (LLMs) is critical for their reliable deployment. Existing methods typically fine-tune LLMs to abstain from answering questions beyond their knowledge scope. However, these methods often rely on coarse-grained signals to guide LLMs to abstain, such as overall confidence or uncertainty scores on multiple sampled answers, which may result in an imprecise awareness of the model's own knowledge boundaries. To this end, we propose a novel reinforcement learning framework built on $\textbf{\underline{Fi}ne-grained \underline{S}emantic \underline{Co}nfidence \underline{Re}ward (\Ours)}$, which guides LLMs to abstain via sample-specific confidence. Specifically, our method operates by sampling multiple candidate answers and conducting semantic clustering, then training the LLM to retain answers within high-confidence clusters and discard those within low-confidence ones, thereby promoting accurate post-hoc abstention. Additionally, we propose a new metric for evaluating the reliability of abstention fine-tuning tasks more comprehensively. Our method significantly enhances reliability in both in-domain and out-of-distribution benchmarks.
comment: 23pages, 4figures
☆ TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents ACL 2025
The task of information extraction (IE) is to extract structured knowledge from text. However, it is often not straightforward to utilize IE output due to the mismatch between the IE ontology and the downstream application needs. We propose a new formulation of IE TEXT2DB that emphasizes the integration of IE output and the target database (or knowledge base). Given a user instruction, a document set, and a database, our task requires the model to update the database with values from the document set to satisfy the user instruction. This task requires understanding user instructions for what to extract and adapting to the given DB/KB schema for how to extract on the fly. To evaluate this new task, we introduce a new benchmark featuring common demands such as data infilling, row population, and column addition. In addition, we propose an LLM agent framework OPAL (Observe-PlanAnalyze LLM) which includes an Observer component that interacts with the database, the Planner component that generates a code-based plan with calls to IE models, and the Analyzer component that provides feedback regarding code quality before execution. Experiments show that OPAL can successfully adapt to diverse database schemas by generating different code plans and calling the required IE models. We also highlight difficult cases such as dealing with large databases with complex dependencies and extraction hallucination, which we believe deserve further investigation. Source code: https://github.com/yzjiao/Text2DB
comment: ACL 2025. Source code: https://github.com/yzjiao/Text2DB
☆ META-RAG: Meta-Analysis-Inspired Evidence-Re-Ranking Method for Retrieval-Augmented Generation in Evidence-Based Medicine
Evidence-based medicine (EBM) holds a crucial role in clinical application. Given suitable medical articles, doctors effectively reduce the incidence of misdiagnoses. Researchers find it efficient to use large language models (LLMs) techniques like RAG for EBM tasks. However, the EBM maintains stringent requirements for evidence, and RAG applications in EBM struggle to efficiently distinguish high-quality evidence. Therefore, inspired by the meta-analysis used in EBM, we provide a new method to re-rank and filter the medical evidence. This method presents multiple principles to filter the best evidence for LLMs to diagnose. We employ a combination of several EBM methods to emulate the meta-analysis, which includes reliability analysis, heterogeneity analysis, and extrapolation analysis. These processes allow the users to retrieve the best medical evidence for the LLMs. Ultimately, we evaluate these high-quality articles and show an accuracy improvement of up to 11.4% in our experiments and results. Our method successfully enables RAG to extract higher-quality and more reliable evidence from the PubMed dataset. This work can reduce the infusion of incorrect knowledge into responses and help users receive more effective replies.
☆ PICOs-RAG: PICO-supported Query Rewriting for Retrieval-Augmented Generation in Evidence-Based Medicine
Evidence-based medicine (EBM) research has always been of paramount importance. It is important to find appropriate medical theoretical support for the needs from physicians or patients to reduce the occurrence of medical accidents. This process is often carried out by human querying relevant literature databases, which lacks objectivity and efficiency. Therefore, researchers utilize retrieval-augmented generation (RAG) to search for evidence and generate responses automatically. However, current RAG methods struggle to handle complex queries in real-world clinical scenarios. For example, when queries lack certain information or use imprecise language, the model may retrieve irrelevant evidence and generate unhelpful answers. To address this issue, we present the PICOs-RAG to expand the user queries into a better format. Our method can expand and normalize the queries into professional ones and use the PICO format, a search strategy tool present in EBM, to extract the most important information used for retrieval. This approach significantly enhances retrieval efficiency and relevance, resulting in up to an 8.8\% improvement compared to the baseline evaluated by our method. Thereby the PICOs-RAG improves the performance of the large language models into a helpful and reliable medical assistant in EBM.
☆ M-Eval: A Heterogeneity-Based Framework for Multi-evidence Validation in Medical RAG Systems
Retrieval-augmented Generation (RAG) has demonstrated potential in enhancing medical question-answering systems through the integration of large language models (LLMs) with external medical literature. LLMs can retrieve relevant medical articles to generate more professional responses efficiently. However, current RAG applications still face problems. They generate incorrect information, such as hallucinations, and they fail to use external knowledge correctly. To solve these issues, we propose a new method named M-Eval. This method is inspired by the heterogeneity analysis approach used in Evidence-Based Medicine (EBM). Our approach can check for factual errors in RAG responses using evidence from multiple sources. First, we extract additional medical literature from external knowledge bases. Then, we retrieve the evidence documents generated by the RAG system. We use heterogeneity analysis to check whether the evidence supports different viewpoints in the response. In addition to verifying the accuracy of the response, we also assess the reliability of the evidence provided by the RAG system. Our method shows an improvement of up to 23.31% accuracy across various LLMs. This work can help detect errors in current RAG-based medical systems. It also makes the applications of LLMs more reliable and reduces diagnostic errors.
☆ emg2speech: synthesizing speech from electromyography using self-supervised speech models
We present a neuromuscular speech interface that translates electromyographic (EMG) signals collected from orofacial muscles during speech articulation directly into audio. We show that self-supervised speech (SS) representations exhibit a strong linear relationship with the electrical power of muscle action potentials: SS features can be linearly mapped to EMG power with a correlation of $r = 0.85$. Moreover, EMG power vectors corresponding to different articulatory gestures form structured and separable clusters in feature space. This relationship: $\text{SS features}$ $\xrightarrow{\texttt{linear mapping}}$ $\text{EMG power}$ $\xrightarrow{\texttt{gesture-specific clustering}}$ $\text{articulatory movements}$, highlights that SS models implicitly encode articulatory mechanisms. Leveraging this property, we directly map EMG signals to SS feature space and synthesize speech, enabling end-to-end EMG-to-speech generation without explicit articulatory models and vocoder training.
☆ Uncovering the Potential Risks in Unlearning: Danger of English-only Unlearning in Multilingual LLMs
There have been a couple of studies showing that attempting to erase multilingual knowledge using only English data is insufficient for multilingual LLMs. However, their analyses remain highly performance-oriented. In this paper, we switch the point of view to evaluation, and address an additional blind spot which reveals itself when the multilingual LLM is fully finetuned with parallel multilingual dataset before unlearning. Here, language confusion occurs whereby a model responds in language different from that of the input prompt. Language confusion is a problematic phenomenon in unlearning, causing the standard reference-based metrics to fail. We tackle this phenomenon in three steps: (1) introduce N-gram-based Language-Mix (N-Mix) score to quantitatively show the language confusion is pervasive and consistent in multilingual LLMs, (2) demonstrate that reference-based metrics result in false negatives when N-Mix score is high, and(3) suggest the need of new type of unlearning evaluation that can directly assess the content of the generated sentences. We call this type of metrics as semantic-based metric.
☆ StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems
Automatically configuring storage systems is hard: parameter spaces are large and conditions vary across workloads, deployments, and versions. Heuristic and ML tuners are often system specific, require manual glue, and degrade under changes. Recent LLM-based approaches help but usually treat tuning as a single-shot, system-specific task, which limits cross-system reuse, constrains exploration, and weakens validation. We present StorageXTuner, an LLM agent-driven auto-tuning framework for heterogeneous storage engines. StorageXTuner separates concerns across four agents - Executor (sandboxed benchmarking), Extractor (performance digest), Searcher (insight-guided configuration exploration), and Reflector (insight generation and management). The design couples an insight-driven tree search with layered memory that promotes empirically validated insights and employs lightweight checkers to guard against unsafe actions. We implement a prototype and evaluate it on RocksDB, LevelDB, CacheLib, and MySQL InnoDB with YCSB, MixGraph, and TPC-H/C. Relative to out-of-the-box settings and to ELMo-Tune, StorageXTuner reaches up to 575% and 111% higher throughput, reduces p99 latency by as much as 88% and 56%, and converges with fewer trials.
comment: ArXiv version; Affiliations: Arizona State University (Lin, Zhang, Thakkar, Sun, Cao) and Iowa State University (Zheng)
☆ Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers
Mechanistic interpretability aims to reverse-engineer large language models (LLMs) into human-understandable computational circuits. However, the complexity of pretrained models often obscures the minimal mechanisms required for specific reasoning tasks. In this work, we train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification (IOI) task -- a benchmark for studying coreference -- like reasoning in transformers. Surprisingly, a single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking MLPs and normalization layers. Through residual stream decomposition, spectral analysis, and embedding interventions, we find that the two heads specialize into additive and contrastive subcircuits that jointly implement IOI resolution. Furthermore, we show that a two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions. These results demonstrate that task-specific training induces highly interpretable, minimal circuits, offering a controlled testbed for probing the computational foundations of transformer reasoning.
comment: 9 pages, 10 figures
☆ POWSM: A Phonetic Open Whisper-Style Speech Foundation Model
Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, code and models are released to foster open science.
comment: 14 pages, under review
☆ Sequences of Logits Reveal the Low Rank Structure of Language Models
A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models at a model-agnostic level: as sequential probabilistic models. We first empirically demonstrate that a wide range of modern language models exhibit low-rank structure: in particular, matrices built from the model's logits for varying sets of prompts and responses have low approximate rank. We then show that this low-rank structure can be leveraged for generation -- in particular, we can generate a response to a target prompt using a linear combination of the model's outputs on unrelated, or even nonsensical prompts. On the theoretical front, we observe that studying the approximate rank of language models in the sense discussed above yields a simple universal abstraction whose theoretical predictions parallel our experiments. We then analyze the representation power of the abstraction and give provable learning guarantees.
☆ Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale NeurIPS 2025
We show that across architecture (Transformer vs. Mamba vs. RWKV), training dataset (OpenWebText vs. The Pile), and scale (14 million parameters to 12 billion parameters), autoregressive language models exhibit highly consistent patterns of change in their behavior over the course of pretraining. Based on our analysis of over 1,400 language model checkpoints on over 110,000 tokens of English, we find that up to 98% of the variance in language model behavior at the word level can be explained by three simple heuristics: the unigram probability (frequency) of a given word, the $n$-gram probability of the word, and the semantic similarity between the word and its context. Furthermore, we see consistent behavioral phases in all language models, with their predicted probabilities for words overfitting to those words' $n$-gram probabilities for increasing $n$ over the course of training. Taken together, these results suggest that learning in neural language models may follow a similar trajectory irrespective of model details.
comment: To be presented at NeurIPS 2025
☆ Finding Culture-Sensitive Neurons in Vision-Language Models
Despite their impressive performance, vision-language models (VLMs) still struggle on culturally situated inputs. To understand how VLMs process culturally grounded information, we study the presence of culture-sensitive neurons, i.e. neurons whose activations show preferential sensitivity to inputs associated with particular cultural contexts. We examine whether such neurons are important for culturally diverse visual question answering and where they are located. Using the CVQA benchmark, we identify neurons of culture selectivity and perform causal tests by deactivating the neurons flagged by different identification methods. Experiments on three VLMs across 25 cultural groups demonstrate the existence of neurons whose ablation disproportionately harms performance on questions about the corresponding cultures, while having minimal effects on others. Moreover, we propose a new margin-based selector - Contrastive Activation Selection (CAS), and show that it outperforms existing probability- and entropy-based methods in identifying culture-sensitive neurons. Finally, our layer-wise analyses reveals that such neurons tend to cluster in certain decoder layers. Overall, our findings shed new light on the internal organization of multimodal representations.
comment: 22 pages, 13 figures
☆ SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens
The verbosity of Chain-of-Thought (CoT) reasoning hinders its mass deployment in efficiency-critical applications. Recently, implicit CoT approaches have emerged, which encode reasoning steps within LLM's hidden embeddings (termed ``implicit reasoning'') rather than explicit tokens. This approach accelerates CoT by reducing the reasoning length and bypassing some LLM components. However, existing implicit CoT methods face two significant challenges: (1) they fail to preserve the semantic alignment between the implicit reasoning (when transformed to natural language) and the ground-truth reasoning, resulting in a significant CoT performance degradation, and (2) they focus on reducing the length of the implicit reasoning; however, they neglect the considerable time cost for an LLM to generate one individual implicit reasoning token. To tackle these challenges, we propose a novel semantically-aligned implicit CoT framework termed SemCoT. In particular, for the first challenge, we design a contrastively trained sentence transformer that evaluates semantic alignment between implicit and explicit reasoning, which is used to enforce semantic preservation during implicit reasoning optimization. To address the second challenge, we introduce an efficient implicit reasoning generator by finetuning a lightweight language model using knowledge distillation. This generator is guided by our sentence transformer to distill ground-truth reasoning into semantically aligned implicit reasoning, while also optimizing for accuracy. SemCoT is the first approach that enhances CoT efficiency by jointly optimizing token-level generation speed and preserving semantic alignment with ground-truth reasoning. Extensive experiments demonstrate the superior performance of SemCoT compared to state-of-the-art methods in both efficiency and effectiveness. Our code can be found at https://github.com/YinhanHe123/SemCoT/.
☆ Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction NeurIPS 2025
Language models generally produce grammatical text, but they are more likely to make errors in certain contexts. Drawing on paradigms from psycholinguistics, we carry out a fine-grained analysis of those errors in different syntactic contexts. We demonstrate that by disaggregating over the conditions of carefully constructed datasets and comparing model performance on each over the course of training, it is possible to better understand the intermediate stages of grammatical learning in language models. Specifically, we identify distinct phases of training where language model behavior aligns with specific heuristics such as word frequency and local context rather than generalized grammatical rules. We argue that taking this approach to analyzing language model behavior more generally can serve as a powerful tool for understanding the intermediate learning phases, overall training dynamics, and the specific generalizations learned by language models.
comment: Accepted to the First Workshop on Interpreting Cognition in Deep Learning Models (CogInterp @ NeurIPS 2025)
☆ RiddleBench: A New Generative Reasoning Benchmark for LLMs
Large Language Models have demonstrated strong performance on many established reasoning benchmarks. However, these benchmarks primarily evaluate structured skills like quantitative problem-solving, leaving a gap in assessing flexible, multifaceted reasoning abilities that are central to human intelligence. These abilities require integrating logical deduction with spatial awareness and constraint satisfaction, which current evaluations do not measure well. To address this, we introduce RiddleBench, a benchmark of 1,737 challenging puzzles in English designed to probe these core reasoning capabilities. Evaluation of state-of-the-art models on RiddleBench shows fundamental weaknesses. Even top proprietary models like Gemini 2.5 Pro, o3, and Claude 4 Sonnet achieve accuracy just above 60% (60.30%, 63.37%, and 63.16%). Analysis further reveals deep failures, including hallucination cascades (accepting flawed reasoning from other models) and poor self-correction due to a strong self-confirmation bias. Their reasoning is also fragile, with performance degrading significantly when constraints are reordered or irrelevant information is introduced. RiddleBench functions as a diagnostic tool for these issues and as a resource for guiding the development of more robust and reliable language models.
☆ Idea2Plan: Exploring AI-Powered Research Planning
Large language models (LLMs) have demonstrated significant potential to accelerate scientific discovery as valuable tools for analyzing data, generating hypotheses, and supporting innovative approaches in various scientific fields. In this work, we investigate how LLMs can handle the transition from conceptual research ideas to well-structured research plans. Effective research planning not only supports scientists in advancing their research but also represents a crucial capability for the development of autonomous research agents. Despite its importance, the field lacks a systematic understanding of LLMs' research planning capability. To rigorously measure this capability, we introduce the Idea2Plan task and Idea2Plan Bench, a benchmark built from 200 ICML 2025 Spotlight and Oral papers released after major LLM training cutoffs. Each benchmark instance includes a research idea and a grading rubric capturing the key components of valid plans. We further propose Idea2Plan JudgeEval, a complementary benchmark to assess the reliability of LLM-based judges against expert annotations. Experimental results show that GPT-5 and GPT-5-mini achieve the strongest performance on the benchmark, though substantial headroom remains for future improvement. Our study provides new insights into LLMs' capability for research planning and lay the groundwork for future progress.
☆ Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation
We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning intensive settings because they don't verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics -- ACLE, ARGUE, and RAGAS -- demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.
comment: https://github.com/alexmartin1722/mirage
☆ Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish
Grammar refers to the system of rules that governs the structural organization and the semantic relations among linguistic units such as sentences, phrases, and words within a given language. In natural language processing, there remains a notable scarcity of grammar focused evaluation protocols, a gap that is even more pronounced for low-resource languages. Moreover, the extent to which large language models genuinely comprehend grammatical structure, especially the mapping between syntactic structures and meanings, remains under debate. To investigate this issue, we propose a Grammar Book Guided evaluation pipeline intended to provide a systematic and generalizable framework for grammar evaluation consisting of four key stages, and in this work we take Luxembourgish as a case study. The results show a weak positive correlation between translation performance and grammatical understanding, indicating that strong translations do not necessarily imply deep grammatical competence. Larger models perform well overall due to their semantic strength but remain weak in morphology and syntax, struggling particularly with Minimal Pair tasks, while strong reasoning ability offers a promising way to enhance their grammatical understanding.
♻ ☆ Retrieval-Augmented Generation-based Relation Extraction
Information Extraction (IE) is a transformative process that converts unstructured text data into a structured format by employing entity and relation extraction (RE) methodologies. The identification of the relation between a pair of entities plays a crucial role within this framework. Despite the existence of various techniques for relation extraction, their efficacy heavily relies on access to labeled data and substantial computational resources. In addressing these challenges, Large Language Models (LLMs) emerge as promising solutions; however, they might return hallucinating responses due to their own training data. To overcome these limitations, Retrieved-Augmented Generation-based Relation Extraction (RAG4RE) in this work is proposed, offering a pathway to enhance the performance of relation extraction tasks. This work evaluated the effectiveness of our RAG4RE approach utilizing different LLMs. Through the utilization of established benchmarks, such as TACRED, TACREV, Re-TACRED, and SemEval RE datasets, our aim is to comprehensively evaluate the efficacy of our RAG4RE approach. In particularly, we leverage prominent LLMs including Flan T5, Llama2, and Mistral in our investigation. The results of our study demonstrate that our RAG4RE approach surpasses performance of traditional RE approaches based solely on LLMs, particularly evident in the TACRED dataset and its variations. Furthermore, our approach exhibits remarkable performance compared to previous RE methodologies across both TACRED and TACREV datasets, underscoring its efficacy and potential for advancing RE tasks in natural language processing.
comment: published at the Semantic Web journal. The last version is available: https://doi.org/10.1177/22104968251385519
♻ ☆ Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons
As Large Language Models (LLMs) expand across domains, LLM judges have become essential for systems evaluation. Current benchmarks typically compare system outputs against baselines. This baseline-mediated approach, though convenient, yields lower reliability than direct comparison between systems. We propose Arena-Lite which integrates tournament structure on top of head-to-head comparison. The application of a tournament structure and direct comparison eliminates the need for baseline outputs, reduces the number of required comparisons, and allows higher reliability in system rankings. We conducted two experiments: (1) controlled stochastic modeling and (2) empirical validation with a real LLM judge. Those experiments collectively demonstrate that Arena-Lite consistently achieves higher reliability with fewer comparisons, even with smaller datasets or weaker judges. We release an easy-to-use web demonstration and code to foster adoption of Arena-Lite, streamlining model selection across research and industry communities. Arena-Lite demo and code are available on \href{https://huggingface.co/spaces/NCSOFT/ArenaLite}{https://huggingface.co/spaces/NCSOFT/ArenaLite}
comment: 8 pages for main body, 19 pages in total
♻ ☆ Says Who? Effective Zero-Shot Annotation of Focalization
Focalization describes the way in which access to narrative information is restricted or controlled based on the knowledge available to knowledge of the narrator. It is encoded via a wide range of lexico-grammatical features and is subject to reader interpretation. Even trained annotators frequently disagree on correct labels, suggesting this task is both qualitatively and computationally challenging. In this work, we test how well five contemporary large language model (LLM) families and two baselines perform when annotating short literary excerpts for focalization. Despite the challenging nature of the task, we find that LLMs show comparable performance to trained human annotators, with GPT-4o achieving an average F1 of 84.79%. Further, we demonstrate that the log probabilities output by GPT-family models frequently reflect the difficulty of annotating particular excerpts. Finally, we provide a case study analyzing sixteen Stephen King novels, demonstrating the usefulness of this approach for computational literary studies and the insights gleaned from examining focalization at scale.
comment: Accepted at CHR 2025
♻ ☆ TableTime: Reformulating Time Series Classification as Training-Free Table Understanding with Large Language Models
Large language models (LLMs) have demonstrated their effectiveness in multivariate time series classification (MTSC). Effective adaptation of LLMs for MTSC necessitates informative data representations. Existing LLM-based methods directly encode embeddings for time series within the latent space of LLMs from scratch to align with semantic space of LLMs. Despite their effectiveness, we reveal that these methods conceal three inherent bottlenecks: (1) they struggle to encode temporal and channel-specific information in a lossless manner, both of which are critical components of multivariate time series; (2) it is much difficult to align the learned representation space with the semantic space of the LLMs; (3) they require task-specific retraining, which is both computationally expensive and labor-intensive. To bridge these gaps, we propose TableTime, which reformulates MTSC as a table understanding task. Specifically, TableTime introduces the following strategies: (1) convert multivariate time series into a tabular form, thus minimizing information loss to the greatest extent; (2) represent tabular time series in text format to achieve natural alignment with the semantic space of LLMs; (3) design a reasoning framework that integrates contextual text information, neighborhood assistance, multi-path inference and problem decomposition to enhance the reasoning ability of LLMs and realize zero-shot classification. Extensive experiments performed on 10 publicly representative datasets from UEA archive verify the superiorities of the TableTime.
♻ ☆ BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents
Confidence in LLMs is a useful indicator of model uncertainty and answer reliability. Existing work mainly focused on single-turn scenarios, while research on confidence in complex multi-turn interactions is limited. In this paper, we investigate whether LLM-based search agents have the ability to communicate their own confidence through verbalized confidence scores after long sequences of actions, a significantly more challenging task compared to outputting confidence in a single interaction. Experimenting on open-source agentic models, we first find that models exhibit much higher task accuracy at high confidence while having near-zero accuracy when confidence is low. Based on this observation, we propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality, encourage the model to try again until reaching a satisfactory confidence level. Results show that our proposed methods significantly reduce token consumption while demonstrating competitive performance compared to baseline fixed budget TTS methods.
comment: 25 pages
♻ ☆ The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness NeurIPS 2025
Reasoning-focused LLMs sometimes alter their behavior when they detect that they are being evaluated, which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such "test awareness" impacts model behavior, particularly its performance on safety-related tasks. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art open-weight reasoning LLMs across both realistic and hypothetical tasks (denoting tests or simulations). Our results demonstrate that test awareness significantly impacts safety alignment (such as compliance with harmful requests and conforming to stereotypes) with effects varying in both magnitude and direction across models. By providing control over this latent effect, our work aims to provide a stress-test mechanism and increase trust in how we perform safety evaluations.
comment: NeurIPS 2025 (Spotlight). Code is available at: https://github.com/microsoft/Test_Awareness_Steering
♻ ☆ TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, thus limiting the herd of available draft models and often necessitating the training of a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose the algorithm TokenTiming for universal speculative decoding. It operates by re-encoding the draft token sequence to get a new target token sequence, and then uses DTW to build a mapping to transfer the probability distributions for speculative sampling. Benefiting from this, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining and modification. We conduct comprehensive experiments on various tasks, demonstrating 1.57x speedup. This work enables a universal approach for draft model selection, making SD a more versatile and practical tool for LLM acceleration.
♻ ☆ Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
BERT and its variants are extensively explored for automated scoring. However, a limit of 512 tokens for these encoder-based models showed the deficiency in automated scoring of long essays. Thus, this research explores generative language models for automated scoring of long essays via summarization and prompting. The results revealed great improvement of scoring accuracy with QWK increased from 0.822 to 0.8878 for the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
comment: 19 pages, 5 Tables 7 Figures, Presentation at Artificial Intelligence in Measurement and Education Conference (AIME-Con)
♻ ☆ AutoJudge: Judge Decoding Without Manual Annotation
We introduce AutoJudge, a method that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify which of the generated tokens affect the downstream quality of the response, relaxing the distribution match guarantee so that the "unimportant" tokens can be generated faster. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft models should be corrected to preserve quality and which ones may be skipped. We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without compromising the final answer quality. We evaluate the effectiveness of AutoJudge with multiple draft/target model pairs on mathematical reasoning and programming benchmarks, achieving significant speedups at the cost of a minor accuracy reduction. Notably, on GSM8k with the Llama 3.1 70B target model, our approach achieves up to $\approx2\times$ speedup over speculative decoding at the cost of $\le 1\%$ drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge automatically detects programming-specific important tokens, accepting $\ge 25$ tokens per speculation cycle at $2\%$ drop in Pass@1. Our approach requires no human annotation and is easy to integrate with modern LLM inference frameworks.
♻ ☆ Mano Technical Report
Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decisionmaking capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
♻ ☆ Are you sure? Measuring models bias in content moderation through uncertainty ACL
Automatic content moderation is crucial to ensuring safety in social media. Language Model-based classifiers are being increasingly adopted for this task, but it has been shown that they perpetuate racial and social biases. Even if several resources and benchmark corpora have been developed to challenge this issue, measuring the fairness of models in content moderation remains an open issue. In this work, we present an unsupervised approach that benchmarks models on the basis of their uncertainty in classifying messages annotated by people belonging to vulnerable groups. We use uncertainty, computed by means of the conformal prediction technique, as a proxy to analyze the bias of 11 models against women and non-white annotators and observe to what extent it diverges from metrics based on performance, such as the $F_1$ score. The results show that some pre-trained models predict with high accuracy the labels coming from minority groups, even if the confidence in their prediction is low. Therefore, by measuring the confidence of models, we are able to see which groups of annotators are better represented in pre-trained models and lead the debiasing process of these models before their effective use.
comment: accepted at Findings of ACL: EMNLP 2025
♻ ☆ From Language to Action: A Review of Large Language Models as Autonomous Agents and Tool Users
The pursuit of human-level artificial intelligence (AI) has significantly advanced the development of autonomous agents and Large Language Models (LLMs). LLMs are now widely utilized as decision-making agents for their ability to interpret instructions, manage sequential tasks, and adapt through feedback. This review examines recent developments in employing LLMs as autonomous agents and tool users and comprises seven research questions. We only used the papers published between 2023 and 2025 in conferences of the A* and A rank and Q1 journals. A structured analysis of the LLM agents' architectural design principles, dividing their applications into single-agent and multi-agent systems, and strategies for integrating external tools is presented. In addition, the cognitive mechanisms of LLM, including reasoning, planning, and memory, and the impact of prompting methods and fine-tuning procedures on agent performance are also investigated. Furthermore, we evaluated current benchmarks and assessment protocols and have provided an analysis of 68 publicly available datasets to assess the performance of LLM-based agents in various tasks. In conducting this review, we have identified critical findings on verifiable reasoning of LLMs, the capacity for self-improvement, and the personalization of LLM-based agents. Finally, we have discussed ten future research directions to overcome these gaps.
comment: Submitted to Artificial Intelligence Review for peer review
♻ ☆ Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
This paper presents a novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary models is achievable. Our study provides the first comprehensive analysis of these models' capabilities and limits for this task.
comment: Under review. Both the dataset and code will be published
♻ ☆ Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs NeurIPS 2025
The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.
comment: Accepted by NeurIPS 2025 Dataset and Benchmark Track, Project page: https://liuxuannan.github.io/Video-SafetyBench.github.io/
♻ ☆ Zero-Shot Tokenizer Transfer NeurIPS 2024
Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programming languages, but have vastly decreased efficiency due to their English-centric tokenizer. To mitigate this, we should be able to swap the original LM tokenizer with an arbitrary one, on the fly, without degrading performance. Hence, in this work we define a new problem: Zero-Shot Tokenizer Transfer (ZeTT). The challenge at the core of ZeTT is finding embeddings for the tokens in the vocabulary of the new tokenizer. Since prior heuristics for initializing embeddings often perform at chance level in a ZeTT setting, we propose a new solution: we train a hypernetwork taking a tokenizer as input and predicting the corresponding embeddings. We empirically demonstrate that the hypernetwork generalizes to new tokenizers both with encoder (e.g., XLM-R) and decoder LLMs (e.g., Mistral-7B). Our method comes close to the original models' performance in cross-lingual and coding tasks while markedly reducing the length of the tokenized sequence. We also find that the remaining gap can be quickly closed by continued training on less than 1B tokens. Finally, we show that a ZeTT hypernetwork trained for a base (L)LM can also be applied to fine-tuned variants without extra training. Overall, our results make substantial strides toward detaching LMs from their tokenizer.
comment: NeurIPS 2024
♻ ☆ Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings
Natural Language Processing and Generation systems have recently shown the potential to complement and streamline the costly and time-consuming job of professional fact-checkers. In this work, we lift several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, under more realistic scenarios, RAG-based methods for the generation of verdicts - i.e., short texts discussing the veracity of a claim - evaluating them on stylistically complex claims and heterogeneous, yet reliable, knowledge bases. Our findings show a complex landscape, where, for example, LLM-based retrievers outperform other retrieval techniques, though they still struggle with heterogeneous knowledge bases; larger models excel in verdict faithfulness, while smaller models provide better context adherence, with human evaluations favouring zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment.
comment: Code and data at https://github.com/drusso98/face-the-facts - Accepted for publication at INLG 2025
♻ ☆ Provable Scaling Laws for the Test-Time Compute of Large Language Models NeurIPS 2025
We propose two simple, principled and practical algorithms that enjoy provable scaling laws for the test-time compute of large language models (LLMs). The first one is a two-stage knockout-style algorithm: given an input problem, it first generates multiple candidate solutions, and then aggregate them via a knockout tournament for the final output. Assuming that the LLM can generate a correct solution with non-zero probability and do better than a random guess in comparing a pair of correct and incorrect solutions, we prove theoretically that the failure probability of this algorithm decays to zero exponentially or by a power law (depending on the specific way of scaling) as its test-time compute grows. The second one is a two-stage league-style algorithm, where each candidate is evaluated by its average win rate against multiple opponents, rather than eliminated upon loss to a single opponent. Under analogous but more robust assumptions, we prove that its failure probability also decays to zero exponentially with more test-time compute. Both algorithms require a black-box LLM and nothing else (e.g., no verifier or reward model) for a minimalistic implementation, which makes them appealing for practical applications and easy to adapt for different tasks. Through extensive experiments with diverse models and datasets, we validate the proposed theories and demonstrate the outstanding scaling properties of both algorithms.
comment: NeurIPS 2025 camera-ready version
♻ ☆ Offline Learning and Forgetting for Reasoning with Large Language Models
Leveraging inference-time search in large language models has proven effective in further enhancing a trained model's capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it on unpaired successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. A key challenge we identify is that naive fine-tuning can degrade the model's search capability; we show this can be mitigated with a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown arithmetic puzzles show that, replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180$\times$. On top of this, our learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.
comment: Published in Transactions on Machine Learning Research (TMLR), 2025. Code: https://github.com/twni2016/llm-reasoning-uft
♻ ☆ LittleBit: Ultra Low-Bit Quantization via Latent Factorization NeurIPS 2025
Deploying large language models (LLMs) often faces challenges from substantial memory and computational costs. Quantization offers a solution, yet performance degradation in the sub-1-bit regime remains particularly difficult. This paper introduces LittleBit, a novel method for extreme LLM compression. It targets levels like 0.1 bits per weight (BPW), achieving nearly 31$\times$ memory reduction, e.g., Llama2-13B to under 0.9 GB. LittleBit represents weights in a low-rank form using latent matrix factorization, subsequently binarizing these factors. To counteract information loss from this extreme precision, it integrates a multi-scale compensation mechanism. This includes row, column, and an additional latent dimension that learns per-rank importance. Two key contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and integrated Residual Compensation to mitigate errors. Extensive experiments confirm LittleBit's superiority in sub-1-bit quantization: e.g., its 0.1 BPW performance on Llama2-7B surpasses the leading method's 0.7 BPW. LittleBit establishes a new, viable size-performance trade-off--unlocking a potential 11.6$\times$ speedup over FP16 at the kernel level--and makes powerful LLMs practical for resource-constrained environments.
comment: Accepted to NeurIPS 2025. Banseok Lee and Dongkyu Kim contributed equally
♻ ☆ NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables NeurIPS 2025
Processing structured tabular data, particularly large and lengthy tables, constitutes a fundamental yet challenging task for large language models (LLMs). However, existing long-context benchmarks like Needle-in-a-Haystack primarily focus on unstructured text, neglecting the challenge of diverse structured tables. Meanwhile, previous tabular benchmarks mainly consider downstream tasks that require high-level reasoning abilities, and overlook models' underlying fine-grained perception of individual table cells, which is crucial for practical and robust LLM-based table applications. To address this gap, we introduce \textsc{NeedleInATable} (NIAT), a new long-context tabular benchmark that treats each table cell as a ``needle'' and requires models to extract the target cell based on cell locations or lookup questions. Our comprehensive evaluation of various LLMs and multimodal LLMs reveals a substantial performance gap between popular downstream tabular tasks and the simpler NIAT task, suggesting that they may rely on dataset-specific correlations or shortcuts to obtain better benchmark results but lack truly robust long-context understanding towards structured tables. Furthermore, we demonstrate that using synthesized NIAT training data can effectively improve performance on both NIAT task and downstream tabular tasks, which validates the importance of NIAT capability for LLMs' genuine table understanding ability.
comment: Accepted by NeurIPS 2025
♻ ☆ DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations
Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches primarily fall into two categories: (1) Methods that generate discrete speech tokens independently without incorporating them into the LLM's autoregressive process, resulting in text generation being unaware of concurrent speech synthesis. (2) Models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Notably, while current methods utilize mainly 12.5Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5Hz, significantly reducing computational cost and alleviating the frequency discrepancy between speech and text tokens and in turn better exploiting LLMs' capabilities. Experimental results demonstrate that DRVOICE-7B establishes new state-of-the-art (SOTA) on OpenAudioBench and Big Bench Audio benchmarks, while achieving performance comparable to the SOTA on VoiceBench and UltraEval-Audio benchmarks, making it a leading open-source speech foundation model in ~7B models.
comment: Work in progress
♻ ☆ Evaluation of Geographical Distortions in Language Models
Language models now constitute essential tools for improving efficiency for many professional tasks such as writing, coding, or learning. For this reason, it is imperative to identify inherent biases. In the field of Natural Language Processing, five sources of bias are well-identified: data, annotation, representation, models, and research design. This study focuses on biases related to geographical knowledge. We explore the connection between geography and language models by highlighting their tendency to misrepresent spatial information, thus leading to distortions in the representation of geographical distances. This study introduces four indicators to assess these distortions, by comparing geographical and semantic distances. Experiments are conducted from these four indicators with ten widely used language models. Results underscore the critical necessity of inspecting and rectifying spatial biases in language models to ensure accurate and equitable representations.
comment: Accepted version. Published in Machine Learning (Springer) 114:263 (2025). Open access under a CC BY-NC-ND 4.0 license. DOI: 10.1007/s10994-025-06916-9
♻ ☆ Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views NeurIPS 2025
We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
comment: 10 pages, 6 figures, 2 tables. Accepted to the NeurIPS 2025 Workshop on SPACE in Vision, Language, and Embodied AI (SpaVLE). Dataset: https://huggingface.co/datasets/annadeichler/KTH-ARIA-referential
♻ ☆ LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora
Retrieval-Augmented Generation (RAG) is widely used to mitigate hallucinations of Large Language Models (LLMs) by leveraging external knowledge. While effective for simple queries, traditional RAG systems struggle with large-scale, unstructured corpora where information is fragmented. Recent advances incorporate knowledge graphs to capture relational structures, enabling more comprehensive retrieval for complex, multi-hop reasoning tasks. However, existing graph-based RAG (GraphRAG) methods rely on unstable and costly relation extraction for graph construction, often producing noisy graphs with incorrect or inconsistent relations that degrade retrieval quality. In this paper, we revisit the pipeline of existing GraphRAG systems and propose LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient framework that enables reliable graph construction and precise passage retrieval. Specifically, LinearRAG constructs a relation-free hierarchical graph, termed Tri-Graph, using only lightweight entity extraction and semantic linking, avoiding unstable relation modeling. This new paradigm of graph construction scales linearly with corpus size and incurs no extra token consumption, providing an economical and reliable indexing of the original passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant entity activation via local semantic bridging, followed by (ii) passage retrieval through global importance aggregation. Extensive experiments on four datasets demonstrate that LinearRAG significantly outperforms baseline models. Our code and datasets are available at https://github.com/DEEP-PolyU/LinearRAG.
♻ ☆ MATCH: Task-Driven Code Evaluation through Contrastive Learning
AI-based code generation is increasingly prevalent, with GitHub Copilot estimated to generate 46% of the code on GitHub. Accurately evaluating how well generated code aligns with developer intent remains a critical challenge. Traditional evaluation methods, such as unit tests, are often unscalable and costly. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code functionality, and metrics like CodeBERTScore require reference code, which is not always available. To address the gap in reference-free evaluation, with few alternatives such as ICE-Score, this paper introduces MATCH, a novel reference-free metric. MATCH uses Contrastive Learning to generate meaningful embeddings for code and natural language task descriptions, enabling similarity scoring that reflects how well generated code implements the task. We show that MATCH achieves stronger correlations with functional correctness and human preference than existing metrics across multiple programming languages.
♻ ☆ GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training
Recent progress, such as DeepSeek-R1, has shown that the GRPO algorithm, a Reinforcement Learning (RL) approach, can effectively train Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) and Vision-Language Models (VLMs). In this paper, we analyze three challenges of GRPO: gradient coupling between thoughts and answers, sparse reward signals caused by limited parallel sampling, and unstable advantage estimation. To mitigate these challenges, we propose GRPO-MA, a simple yet theoretically grounded method that leverages multi-answer generation from each thought process, enabling more robust and efficient optimization. Theoretically, we show that the variance of thought advantage decreases as the number of answers per thought increases. Empirically, our gradient analysis confirms this effect, showing that GRPO-MA reduces gradient spikes compared to GRPO. Experiments on math, code, and diverse multimodal tasks demonstrate that GRPO-MA substantially improves performance and training efficiency. Our ablation studies further reveal that increasing the number of answers per thought consistently enhances model performance.
comment: Under review
♻ ☆ Context-level Language Modeling by Learning Predictive Context Embeddings
Next-token prediction (NTP) is the cornerstone of modern large language models (LLMs) pretraining, driving their unprecedented capabilities in text generation, reasoning, and instruction following. However, the token-level prediction limits the model's capacity to capture higher-level semantic structures and long-range contextual relationships. To overcome this limitation, we introduce \textbf{ContextLM}, a framework that augments standard pretraining with an inherent \textbf{next-context prediction} objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Crucially, ContextLM achieves this enhancement while remaining fully compatible with the standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity). Extensive experiments on the GPT2 and Pythia model families, scaled up to $1.5$B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance. Our analysis indicates that next-context prediction provides a scalable and efficient pathway to stronger language modeling, yielding better long-range coherence and more effective attention allocation with minimal computational overhead.
comment: 16pages,6 figures
♻ ☆ Surface Reading LLMs: Synthetic Text and its Styles
Despite a potential plateau in ML advancement, the societal impact of large language models lies not in approaching superintelligence but in generating text surfaces indistinguishable from human writing. While Critical AI Studies provides essential material and socio-technical critique, it risks overlooking how LLMs phenomenologically reshape meaning-making. This paper proposes a semiotics of "surface integrity" as attending to the immediate plane where LLMs inscribe themselves into human communication. I distinguish three knowledge interests in ML research (epistemology, epist\=em\=e, and epistemics) and argue for integrating surface-level stylistic analysis alongside depth-oriented critique. Through two case studies examining stylistic markers of synthetic text, I argue how attending to style as a semiotic phenomenon reveals LLMs as cultural actors that transform the conditions of meaning emergence and circulation in contemporary discourse, independent of questions about machine consciousness.
comment: 12 pages, 1 figure
♻ ☆ SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models' Knowledge of Indian Culture ACL 2025
Language Models (LMs) are indispensable tools shaping modern workflows, but their global effectiveness depends on understanding local socio-cultural contexts. To address this, we introduce SANSKRITI, a benchmark designed to evaluate language models' comprehension of India's rich cultural diversity. Comprising 21,853 meticulously curated question-answer pairs spanning 28 states and 8 union territories, SANSKRITI is the largest dataset for testing Indian cultural knowledge. It covers sixteen key attributes of Indian culture: rituals and ceremonies, history, tourism, cuisine, dance and music, costume, language, art, festivals, religion, medicine, transport, sports, nightlife, and personalities, providing a comprehensive representation of India's cultural tapestry. We evaluate SANSKRITI on leading Large Language Models (LLMs), Indic Language Models (ILMs), and Small Language Models (SLMs), revealing significant disparities in their ability to handle culturally nuanced queries, with many models struggling in region-specific contexts. By offering an extensive, culturally rich, and diverse dataset, SANSKRITI sets a new standard for assessing and improving the cultural understanding of LMs.
comment: ACL 2025 Findings
♻ ☆ MENTOR: A Reinforcement Learning Framework for Enabling Tool Use in Small Models via Teacher-Optimized Rewards
Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor generalization as it trains models to imitate a static set of teacher trajectories rather than learn a robust methodology. While reinforcement learning (RL) offers an alternative, the standard RL using sparse rewards fails to effectively guide SLMs, causing them to struggle with inefficient exploration and adopt suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to solve the problem of reward sparsity, it uses a teacher's reference trajectory to construct a dense, composite teacher-guided reward that provides fine-grained guidance. Extensive experiments demonstrate that MENTOR significantly improves the cross-domain generalization and strategic competence of SLMs compared to both SFT and standard sparse-reward RL baselines.
♻ ☆ TrajAgent: An LLM-Agent Framework for Trajectory Modeling via Large-and-Small Model Collaboration NeurIPS 2025
Trajectory modeling, which includes research on trajectory data pattern mining and future prediction, has widespread applications in areas such as life services, urban transportation, and public administration. Numerous methods have been proposed to address specific problems within trajectory modeling. However, the heterogeneity of data and the diversity of trajectory tasks make effective and reliable trajectory modeling an important yet highly challenging endeavor, even for domain experts. In this paper, we propose TrajAgent, an agent framework powered by large language models, designed to facilitate robust and efficient trajectory modeling through automation modeling. This framework leverages and optimizes diverse specialized models to address various trajectory modeling tasks across different datasets effectively. In TrajAgent, we first develop UniEnv, an execution environment with a unified data and model interface, to support the execution and training of various models. Building on UniEnv, we introduce an agentic workflow designed for automatic trajectory modeling across various trajectory tasks and data. Furthermore, we introduce collaborative learning schema between LLM-based agents and small speciallized models, to enhance the performance of the whole framework effectively. Extensive experiments on five tasks using four real-world datasets demonstrate the effectiveness of TrajAgent in automated trajectory modeling, achieving a performance improvement of 2.38%-69.91% over baseline methods. The codes and data can be accessed via https://github.com/tsinghua-fib-lab/TrajAgent.
comment: Accepted by NeurIPS 2025, https://github.com/tsinghua-fib-lab/TrajAgent
♻ ☆ The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents' Inquiry Capability
An effective physician should possess a combination of empathy, expertise, patience, and clear communication when treating a patient. Recent advances have successfully endowed AI doctors with expert diagnostic skills, particularly the ability to actively seek information through inquiry. However, other essential qualities of a good doctor remain overlooked. To bridge this gap, we present MAQuE(Medical Agent Questioning Evaluation), the largest-ever benchmark for the automatic and comprehensive evaluation of medical multi-turn questioning. It features 3,000 realistically simulated patient agents that exhibit diverse linguistic patterns, cognitive limitations, emotional responses, and tendencies for passive disclosure. We also introduce a multi-faceted evaluation framework, covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience. Experiments on different LLMs reveal substantial challenges across the evaluation aspects. Even state-of-the-art models show significant room for improvement in their inquiry capabilities. These models are highly sensitive to variations in realistic patient behavior, which considerably impacts diagnostic accuracy. Furthermore, our fine-grained metrics expose trade-offs between different evaluation perspectives, highlighting the challenge of balancing performance and practicality in real-world clinical settings.
♻ ☆ FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
While large language models (LLMs) excel at handling long-context sequences, they require substantial prefill computation and key-value (KV) cache, which can heavily burden computational efficiency and memory usage in both prefill and decoding stages. Recent works that compress KV caches with prefill acceleration reduce this cost but inadvertently tie the prefill compute reduction to the decoding KV budget. This coupling arises from overlooking the layer-dependent variation of critical context, often leading to accuracy degradation. To address this issue, we introduce FastKV, a KV cache compression framework designed to reduce latency in both prefill and decoding by leveraging the stabilization of token importance in later layers. FastKV performs full-context computation until a Token-Selective Propagation (TSP) layer, which forwards only the most informative tokens to subsequent layers. From these propagated tokens, FastKV independently selects salient KV entries for caching, thereby decoupling KV budget from the prefill compute reduction based on the TSP decision. This independent control of the TSP rate and KV retention rate enables flexible optimization of efficiency and accuracy. Experimental results show that FastKV achieves speedups of up to 1.82$\times$ in prefill and 2.87$\times$ in decoding compared to the full-context baseline, while matching the accuracy of the baselines that only accelerate the decoding stage. Our code is available at https://github.com/dongwonjo/FastKV.
♻ ☆ Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay NeurIPS 2025
Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism inspired by experience replay in traditional RL. This technique reuses recent rollouts, lowering per-step computation while maintaining stable updates. Experiments across 6 LLM-dataset combinations show that our method reduces RL fine-tuning time by 23% to 62% while reaching the same level of performance as the original GRPO algorithm. Our code is available at https://github.com/ASTRAL-Group/data-efficient-llm-rl.
comment: Accepted at NeurIPS 2025
♻ ☆ ReCode: Unify Plan and Action for Universal Granularity Control
Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-level form of action. However, current Large Language Model (LLM)-based agents lack this crucial capability to operate fluidly across decision granularities. This limitation stems from existing paradigms that enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization. We propose ReCode (Recursive Code Generation), a novel paradigm that addresses this limitation by unifying planning and action within a single code representation. In this representation, ReCode treats high-level plans as abstract placeholder functions, which the agent then recursively decomposes into finer-grained sub-functions until reaching primitive actions. This recursive approach dissolves the rigid boundary between plan and action, enabling the agent to dynamically control its decision granularity. Furthermore, the recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes. Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training, validating our core insight that unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control. The code is available at https://github.com/FoundationAgents/ReCode.
♻ ☆ AdaRewriter: Unleashing the Power of Prompting-based Conversational Query Reformulation via Test-Time Adaptation EMNLP 2025
Prompting-based conversational query reformulation has emerged as a powerful approach for conversational search, refining ambiguous user queries into standalone search queries. Best-of-N reformulation over the generated candidates via prompting shows impressive potential scaling capability. However, both the previous tuning methods (training time) and adaptation approaches (test time) can not fully unleash their benefits. In this paper, we propose AdaRewriter, a novel framework for query reformulation using an outcome-supervised reward model via test-time adaptation. By training a lightweight reward model with contrastive ranking loss, AdaRewriter selects the most promising reformulation during inference. Notably, it can operate effectively in black-box systems, including commercial LLM APIs. Experiments on five conversational search datasets show that AdaRewriter significantly outperforms the existing methods across most settings, demonstrating the potential of test-time adaptation for conversational query reformulation.
comment: Accepted by EMNLP 2025
♻ ☆ MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models
Large Multimodal Models (LMMs) encode rich factual knowledge via cross-modal pre-training, yet their static representations struggle to maintain an accurate understanding of time-sensitive factual knowledge. Existing benchmarks remain constrained by static designs, inadequately evaluating LMMs' ability to understand time-sensitive knowledge. To address this gap, we propose MINED, a comprehensive benchmark that evaluates temporal awareness along 6 key dimensions and 11 challenging tasks: cognition, awareness, trustworthiness, understanding, reasoning, and robustness. MINED is constructed from Wikipedia by two professional annotators, containing 2,104 time-sensitive knowledge samples spanning six knowledge types. Evaluating 15 widely used LMMs on MINED shows that Gemini-2.5-Pro achieves the highest average CEM score of 63.07, while most open-source LMMs still lack time understanding ability. Meanwhile, LMMs perform best on organization knowledge, whereas their performance is weakest on sport. To address these challenges, we investigate the feasibility of updating time-sensitive knowledge in LMMs through knowledge editing methods and observe that LMMs can effectively update knowledge via knowledge editing methods in single editing scenarios.
comment: project page:https://mined-lmm.github.io/
♻ ☆ Navigation with VLM framework: Towards Going to Any Language
Navigating towards fully open language goals and exploring open scenes in an intelligent way have always raised significant challenges. Recently, Vision Language Models (VLMs) have demonstrated remarkable capabilities to reason with both language and visual data. Although many works have focused on leveraging VLMs for navigation in open scenes, they often require high computational cost, rely on object-centric approaches, or depend on environmental priors in detailed human instructions. We introduce Navigation with VLM (NavVLM), a training-free framework that harnesses open-source VLMs to enable robots to navigate effectively, even for human-friendly language goal such as abstract places, actions, or specific objects in open scenes. NavVLM leverages the VLM as its cognitive core to perceive environmental information and constantly provides exploration guidance achieving intelligent navigation with only a neat target rather than a detailed instruction with environment prior. We evaluated and validated NavVLM in both simulation and real-world experiments. In simulation, our framework achieves state-of-the-art performance in Success weighted by Path Length (SPL) on object-specifc tasks in richly detailed environments from Matterport 3D (MP3D), Habitat Matterport 3D (HM3D) and Gibson. With navigation episode reported, NavVLM demonstrates the capabilities to navigate towards any open-set languages. In real-world validation, we validated our framework's effectiveness in real-world robot at indoor scene.
comment: under review
♻ ☆ Discourse Features Enhance Detection of Document-Level Machine-Generated Content IJCNN 2025
The availability of high-quality APIs for Large Language Models (LLMs) has facilitated the widespread creation of Machine-Generated Content (MGC), posing challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. This makes them susceptible to deception by surface-level sentence patterns, particularly for longer texts and in texts that have been subsequently paraphrased. To overcome these challenges, we introduce novel methodologies and datasets. Besides the publicly available dataset Plagbench, we developed the paraphrased Long-Form Question and Answer (paraLFQA) and paraphrased Writing Prompts (paraWP) datasets using GPT and DIPPER, a discourse paraphrasing tool, by extending artifacts from their original versions. To better capture the structure of longer texts at document level, we propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features. It results in substantial performance gains across both datasets - 15.5% absolute improvement on paraLFQA, 4% absolute improvement on paraWP, and 1.5% absolute improvemene on M4 compared to SOTA approaches. The data and code are available at: https://github.com/myxp-lyp/Discourse-Features-Enhance-Detection-of-Document-Level-Machine-Generated-Content.git.
comment: Accepted by IJCNN 2025
♻ ☆ BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text
Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, benchmarking on large-scale real-world data such as electronic health records (EHRs) is critical, as clinical decisions are directly informed by these sources, yet current evaluations remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world clinical data. Others focus narrowly on specific application scenarios, limiting their generalizability across broader clinical use. To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. It covers eight major task types spanning the entire continuum of patient care across six clinical stages and 20 representative applications, including triage and referral, consultation, information extraction, diagnosis, prognosis, and billing coding, and involves 14 clinical specialties. We systematically evaluated 95 LLMs (including DeepSeek-R1, GPT-4o, Gemini series, and Qwen3 series) under various inference strategies. Our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties. Notably, we demonstrate that open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs based on older architectures often underperform versus updated general-purpose models. The BRIDGE and its corresponding leaderboard serve as a foundational resource and a unique reference for the development and evaluation of new LLMs in real-world clinical text understanding. The BRIDGE leaderboard: https://huggingface.co/spaces/YLab-Open/BRIDGE-Medical-Leaderboard
♻ ☆ SEER: The Span-based Emotion Evidence Retrieval Benchmark
We introduce the SEER (Span-based Emotion Evidence Retrieval) Benchmark to test Large Language Models' (LLMs) ability to identify the specific spans of text that express emotion. Unlike traditional emotion recognition tasks that assign a single label to an entire sentence, SEER targets the underexplored task of emotion evidence detection: pinpointing which exact phrases convey emotion. This span-level approach is crucial for applications like empathetic dialogue and clinical support, which need to know how emotion is expressed, not just what the emotion is. SEER includes two tasks: identifying emotion evidence within a single sentence, and identifying evidence across a short passage of five consecutive sentences. It contains new annotations for both emotion and emotion evidence on 1200 real-world sentences. We evaluate 14 open-source LLMs and find that, while some models approach average human performance on single-sentence inputs, their accuracy degrades in longer passages. Our error analysis reveals key failure modes, including overreliance on emotion keywords and false positives in neutral text.
♻ ☆ PVP: An Image Dataset for Personalized Visual Persuasion with Persuasion Strategies, Viewer Characteristics, and Persuasiveness Ratings ACL 2025
Visual persuasion, which uses visual elements to influence cognition and behaviors, is crucial in fields such as advertising and political communication. With recent advancements in artificial intelligence, there is growing potential to develop persuasive systems that automatically generate persuasive images tailored to individuals. However, a significant bottleneck in this area is the lack of comprehensive datasets that connect the persuasiveness of images with the personal information about those who evaluated the images. To address this gap and facilitate technological advancements in personalized visual persuasion, we release the Personalized Visual Persuasion (PVP) dataset, comprising 28,454 persuasive images across 596 messages and 9 persuasion strategies. Importantly, the PVP dataset provides persuasiveness scores of images evaluated by 2,521 human annotators, along with their demographic and psychological characteristics (personality traits and values). We demonstrate the utility of our dataset by developing a persuasive image generator and an automated evaluator, and establish benchmark baselines. Our experiments reveal that incorporating psychological characteristics enhances the generation and evaluation of persuasive images, providing valuable insights for personalized visual persuasion.
comment: ACL 2025 Main. Code and dataset are released at: https://github.com/holi-lab/PVP_Personalized_Visual_Persuasion
♻ ☆ DBLPLink 2.0 -- An Entity Linker for the DBLP Scholarly Knowledge Graph
In this work we present an entity linker for DBLP's 2025 version of RDF-based Knowledge Graph. Compared to the 2022 version, DBLP now considers publication venues as a new entity type called dblp:Stream. In the earlier version of DBLPLink, we trained KG-embeddings and re-rankers on a dataset to produce entity linkings. In contrast, in this work, we develop a zero-shot entity linker using LLMs using a novel method, where we re-rank candidate entities based on the log-probabilities of the "yes" token output at the penultimate layer of the LLM.
♻ ☆ Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models NeurIPS 2025
Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the types of reasoning strategies that models use. Brainteasers are well-suited for this goal because they can be solved with multiple approaches, such as a few-step solution that uses a creative insight or a longer solution that uses more brute force. We investigate large language models (LLMs) across multiple layers of reasoning, focusing not only on correctness but also on the quality and creativity of their solutions. We investigate many aspects of the reasoning process: (1) semantic parsing of the brainteasers into precise mathematical competition style formats; (2) generating solutions from these mathematical forms; (3) self-correcting solutions based on gold solutions; (4) producing step-by-step sketches of solutions; and (5) making use of hints. We find that LLMs are in many cases able to find creative, insightful solutions to brainteasers, suggesting that they capture some of the capacities needed to solve novel problems in creative ways. Nonetheless, there also remain situations where they rely on brute force despite the availability of more efficient, creative solutions, highlighting a potential direction for improvement in the reasoning abilities of LLMs.
comment: NeurIPS 2025
♻ ☆ OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs
The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. Difficulties lie in assessing the factuality of free-form responses in open domains. Also, different papers use disparate evaluation benchmarks and measurements, which renders them hard to compare and hampers future progress. To mitigate these issues, we propose OpenFactCheck, a unified framework for building customized automatic fact-checking systems, benchmarking their accuracy, evaluating factuality of LLMs, and verifying claims in a document. OpenFactCheck consists of three modules: (i) CUSTCHECKER allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims, (ii) LLMEVAL, a unified evaluation framework assesses LLM's factuality ability from various perspectives fairly, and (iii) CHECKEREVAL is an extensible solution for gauging the reliability of automatic fact-checkers' verification results using human-annotated datasets. Data and code are publicly available at https://github.com/yuxiaw/openfactcheck.
comment: 23 pages, 8 tables, 11 figures, Published In Proceedings of the 31st International Conference on Computational Linguistics 2025
♻ ☆ OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs
The increased use of large language models (LLMs) across a variety of real-world applications calls for automatic tools to check the factual accuracy of their outputs, as LLMs often hallucinate. This is difficult as it requires assessing the factuality of free-form open-domain responses. While there has been a lot of research on this topic, different papers use different evaluation benchmarks and measures, which makes them hard to compare and hampers future progress. To mitigate these issues, we developed OpenFactCheck, a unified framework, with three modules: (i) RESPONSEEVAL, which allows users to easily customize an automatic fact-checking system and to assess the factuality of all claims in an input document using that system, (ii) LLMEVAL, which assesses the overall factuality of an LLM, and (iii) CHECKEREVAL, a module to evaluate automatic fact-checking systems. OpenFactCheck is open-sourced (https://github.com/mbzuai-nlp/openfactcheck) and publicly released as a Python library (https://pypi.org/project/openfactcheck/) and also as a web service (http://app.openfactcheck.com). A video describing the system is available at https://youtu.be/-i9VKL0HleI.
comment: 11 pages, 4 Figures, 3 Tables, Published In Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing
♻ ☆ Quantifying Phonosemantic Iconicity Distributionally in 6 Languages AACL 2025
Language is, as commonly theorized, largely arbitrary. Yet, systematic relationships between phonetics and semantics have been observed in many specific cases. To what degree could those systematic relationships manifest themselves in large scale, quantitative investigations--both in previously identified and unidentified phenomena? This work undertakes a distributional approach to quantifying phonosemantic iconicity at scale across 6 diverse languages (English, Spanish, Hindi, Finnish, Turkish, and Tamil). In each language, we analyze the alignment of morphemes' phonetic and semantic similarity spaces with a suite of statistical measures, and discover an array of interpretable phonosemantic alignments not previously identified in the literature, along with crosslinguistic patterns. We also analyze 5 previously hypothesized phonosemantic alignments, finding support for some such alignments and mixed results for others.
comment: IJCNLP-AACL 2025 Main Conference Proceedings
♻ ☆ UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking
The rapid adoption of Large Language Models (LLMs) has raised important concerns about the factual reliability of their outputs, particularly in low-resource languages such as Urdu. Existing automated fact-checking systems are predominantly developed for English, leaving a significant gap for the more than 200 million Urdu speakers worldwide. In this work, we present UrduFactBench and UrduFactQA, two novel hand-annotated benchmarks designed to enable fact-checking and factual consistency evaluation in Urdu. While UrduFactBench focuses on claim verification, UrduFactQA targets the factuality of LLMs in question answering. These resources, the first of their kind for Urdu, were developed through a multi-stage annotation process involving native Urdu speakers. To complement these benchmarks, we introduce UrduFactCheck, a modular fact-checking framework that incorporates both monolingual and translation-based evidence retrieval strategies to mitigate the scarcity of high-quality Urdu evidence. Leveraging these resources, we conduct an extensive evaluation of twelve LLMs and demonstrate that translation-augmented pipelines consistently enhance performance compared to monolingual ones. Our findings reveal persistent challenges for open-source LLMs in Urdu and underscore the importance of developing targeted resources. All code and data are publicly available at https://github.com/mbzuai-nlp/UrduFactCheck.
comment: 15 pages, 4 figures, 5 tables, 6 Listings, Published in Proceeding of The 2025 Conference on Empirical Methods in Natural Language Processing
♻ ☆ p-less Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding
Obtaining high-quality outputs from Large Language Models (LLMs) often depends upon the choice of a sampling-based decoding strategy to probabilistically choose the next token at each generation step. While a variety of such sampling methods have been proposed, their performance can be sensitive to the selection of hyperparameters which may require different settings depending upon the generation task and temperature configuration. In this work, we introduce $p$-less sampling: an information-theoretic approach to sampling which dynamically sets a truncation threshold at each decoding step based on the entire token probability distribution. Unlike existing methods, $p$-less sampling has no hyperparameters and consistently produces high-quality outputs as temperature increases. We provide theoretical perspectives on $p$-less sampling to ground our proposed method and conduct experiments to empirically validate its effectiveness across a range of math, logical reasoning, and creative writing tasks. Our results demonstrate how $p$-less sampling consistently outperforms existing sampling approaches while exhibiting much less degradation in text quality at higher temperature values. We further show how $p$-less achieves greater inference-time efficiency than alternative methods through lower average token sampling times and shorter generation lengths, without sacrificing accuracy. Finally, we provide analyses to highlight the benefits of $p$-less through qualitative examples, case studies, and diversity assessments. The code is available at https://github.com/ryttry/p-less .
♻ ☆ CURATRON: Complete and Robust Preference Data for Rigorous Alignment of Large Language Models
This paper addresses the challenges of aligning large language models (LLMs) with human values via preference learning (PL), focusing on incomplete and corrupted data in preference datasets. We propose a novel method for robustly and completely recalibrating values within these datasets to enhance LLMs' resilience against the issues. In particular, we devise a guaranteed polynomial time ranking algorithm that robustifies several existing models, such as the classic Bradley-Terry-Luce (BTL) (Bradley and Terry, 1952) model and certain generalizations of it. To the best of our knowledge, our present work is the first to propose an algorithm that provably recovers an $\epsilon$-optimal ranking with high probability while allowing as large as $O(n)$ perturbed pairwise comparison results per model response. Furthermore, we show robust recovery results in the partially observed setting. Our experiments confirm that our algorithms handle adversarial noise and unobserved comparisons well in both general and LLM preference dataset settings. This work contributes to the development and scaling of more reliable and ethically aligned AI models by equipping the dataset curation pipeline with the ability to handle missing and maliciously manipulated inputs.
♻ ☆ Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models
We document the capability of large language models (LLMs) like ChatGPT to predict stock market reactions from news headlines without direct financial training. Using post-knowledge-cutoff headlines, GPT-4 captures initial market responses, achieving approximately 90% portfolio-day hit rates for the non-tradable initial reaction. GPT-4 scores also significantly predict the subsequent drift, especially for small stocks and negative news. Forecasting ability generally increases with model size, suggesting that financial reasoning is an emerging capacity of complex LLMs. Strategy returns decline as LLM adoption rises, consistent with improved price efficiency. To rationalize these findings, we develop a theoretical model that incorporates LLM technology, information-processing capacity constraints, underreaction, and limits to arbitrage.
comment: Previously posted in SSRN https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4412788
♻ ☆ MMD-Flagger: Leveraging Maximum Mean Discrepancy to Detect Hallucinations
Large language models (LLMs) have become pervasive in our everyday life. Yet, a fundamental obstacle prevents their use in many critical applications: their propensity to generate fluent, human-quality content that is not grounded in reality. The detection of such hallucinations is thus of the highest importance. In this work, we propose a new method to flag hallucinated content: MMD-Flagger. It relies on Maximum Mean Discrepancy (MMD), a non-parametric distance between distributions. On a high-level perspective, MMD-Flagger tracks the MMD between the output to inspect and counterparts generated with various temperature parameters. We show empirically that inspecting the shape of this trajectory is sufficient to detect most hallucinations. This novel method is benchmarked on machine translation and summarization datasets, on which it exhibits competitive performance relative to natural competitors.
♻ ☆ Do predictability factors towards signing avatars hold across cultures? ICASSP 2023
Avatar technology can offer accessibility possibilities and improve the Deaf-and-Hard of Hearing sign language users access to communication, education and services, such as the healthcare system. However, sign language users acceptance of signing avatars as well as their attitudes towards them vary and depend on many factors. Furthermore, research on avatar technology is mostly done by researchers who are not Deaf. The study examines the extent to which intrinsic or extrinsic factors contribute to predict the attitude towards avatars across cultures. Intrinsic factors include the characteristics of the avatar, such as appearance, movements and facial expressions. Extrinsic factors include users technology experience, their hearing status, age and their sign language fluency. This work attempts to answer questions such as, if lower attitude ratings are related to poor technology experience with ASL users, for example, is that also true for Moroccan Sign Language (MSL) users? For the purposes of the study, we designed a questionnaire to understand MSL users attitude towards avatars. Three groups of participants were surveyed: Deaf (57), Hearing (20) and Hard-of-Hearing (3). The results of our study were then compared with those reported in other relevant studies.
comment: In Proceedings of SLTAT 2023: Eighth International Workshop on Sign Language Translation and Avatar Technology, held in conjunction with ICASSP 2023: IEEE International Conference on Acoustics, Speech, and Signal Processing, Rhodes, Greece, June 4-10, 2023
♻ ☆ WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models ACL 2025
Climate change adaptation requires the understanding of disruptive weather impacts on society, where large language models (LLMs) might be applicable. However, their effectiveness is under-explored due to the difficulty of high-quality corpus collection and the lack of available benchmarks. The climate-related events stored in regional newspapers record how communities adapted and recovered from disasters. However, the processing of the original corpus is non-trivial. In this study, we first develop a disruptive weather impact dataset with a four-stage well-crafted construction pipeline. Then, we propose WXImpactBench, the first benchmark for evaluating the capacity of LLMs on disruptive weather impacts. The benchmark involves two evaluation tasks, multi-label classification and ranking-based question answering. Extensive experiments on evaluating a set of LLMs provide first-hand analysis of the challenges in developing disruptive weather impact understanding and climate change adaptation systems. The constructed dataset and the code for the evaluation framework are available to help society protect against vulnerabilities from disasters.
comment: Accepted by ACL 2025
♻ ☆ Non-Markovian Discrete Diffusion with Causal Language Models NeurIPS 2025
Discrete diffusion models offer a flexible, controllable approach to structured sequence generation, yet they still lag behind causal language models in expressive power. A key limitation lies in their reliance on the Markovian assumption, which restricts each step to condition only on the current state, leading to potential uncorrectable error accumulation. In this paper, we introduce CaDDi (Causal Discrete Diffusion Model), a discrete diffusion model that conditions on the entire generative trajectory, thereby lifting the Markov constraint and allowing the model to revisit and improve past states. By unifying sequential (causal) and temporal (diffusion) reasoning in a single non-Markovian transformer, CaDDi also treats standard causal language models as a special case and permits the direct reuse of pretrained LLM weights with no architectural changes. Empirically, CaDDi outperforms state-of-the-art discrete diffusion baselines on natural-language benchmarks, substantially narrowing the remaining gap to large autoregressive transformers.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills
High quality bugs are key to training the next generation of language model based software engineering (SWE) agents. We introduce a novel method for synthetic generation of difficult and diverse bugs. Our method instructs SWE Agents to introduce a feature into the codebase whereby they may unintentionally break tests, resulting in bugs. Prior approaches often induce an out-of-distribution effect by generating bugs intentionally (e.g. by introducing local perturbation to existing code), which does not reflect realistic development processes. We perform qualitative analysis to demonstrate that our approach for generating bugs more closely reflects the patterns found in human-authored edits. Through extensive experiments, we demonstrate that our bugs provide more efficient training data for supervised fine-tuning, outperforming other bug datasets by 2% with half the training data (1.2k vs. 3k bugs). We train on our newly generated bugs in addition to existing bug datasets to get FrogBoss a state-of-the-art 32B parameter model on SWE-bench Verified with a pass@1 of 54.6% and FrogMini a state-of-the-art 14B model on SWE-bench Verified with a pass@1 of 45.3% on SWE-bench Verified averaged over three seeds.
♻ ☆ InsurTech innovation using natural language processing
With the rapid rise of InsurTech, traditional insurance companies are increasingly exploring alternative data sources and advanced technologies to sustain their competitive edge. This paper provides both a conceptual overview and practical case studies of natural language processing (NLP) and its emerging applications within insurance operations, focusing on transforming raw, unstructured text into structured data suitable for actuarial analysis and decision-making. Leveraging real-world alternative data provided by an InsurTech industry partner that enriches traditional insurance data sources, we apply various NLP techniques to demonstrate feature de-biasing, feature compression, and industry classification in the commercial insurance context. These enriched, text-derived insights not only add to and refine traditional rating factors for commercial insurance pricing but also offer novel perspectives for assessing underlying risk by introducing novel industry classification techniques. Through these demonstrations, we show that NLP is not merely a supplementary tool but a foundational element of modern, data-driven insurance analytics.
♻ ☆ Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models
Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during continual pretraining without test-time retrieval, by revising the training process. To study this, we construct CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents and probes both short-form (single-fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to index factual knowledge by binding it to persistent document identifiers; and (2) instruction tuning to elicit citation behavior. We introduce Active Indexing for the first stage, which creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restate each fact in diverse, compositional forms and (ii) enforce bidirectional training (source-to-fact and fact-to-source). This equips the model to both generate content from a cited source and attribute its own answers, improving robustness to paraphrase and composition. Experiments with Qwen-2.5-7B&3B show that Active Indexing consistently outperforms a Passive Indexing baseline, which simply appends an identifier to each document, achieving citation precision gains of up to 30.2% across all tasks and models. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16x the original token count. Finally, we show that internal citations complement external ones by making the model more robust to retrieval noise.
Information Retrieval 17
☆ Tongyi DeepResearch Technical Report
We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
comment: https://tongyi-agent.github.io/blog
☆ Optimizing Retrieval for RAG via Reinforced Contrastive Learning
As retrieval-augmented generation (RAG) becomes increasingly widespread, the role of information retrieval (IR) is shifting from retrieving information for human users to retrieving contextual knowledge for artificial intelligence (AI) systems, where relevance becomes difficult to define or annotate beforehand. To address this challenge, we propose R3, a Retrieval framework optimized for RAG through trialand-feedback Reinforced contrastive learning. Unlike prior approaches that rely on annotated or synthetic data for supervised fine-tuning, R3 enables the retriever to dynamically explore and optimize relevance within the RAG environment. During training, the retrieved results interact with the environment to produce contrastive signals that automatically guide the retriever's self-improvement. Extensive experiments across diverse tasks demonstrate that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving comparable results to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. It is both efficient and practical, requiring only 4 GPUs and completing training within a single day.
☆ Iterative Critique-Refine Framework for Enhancing LLM Personalization
Personalized text generation requires models not only to produce coherent text but also to align with a target user's style, tone, and topical focus. Existing retrieval-augmented approaches such as LaMP and PGraphRAG enrich profiles with user and neighbor histories, but they stop at generation and often yield outputs that drift in tone, topic, or style. We present PerFine, a unified, training-free critique-refine framework that enhances personalization through iterative, profile-grounded feedback. In each iteration, an LLM generator produces a draft conditioned on the retrieved profile, and a critic LLM - also conditioned on the same profile - provides structured feedback on tone, vocabulary, sentence structure, and topicality. The generator then revises, while a novel knockout strategy retains the stronger draft across iterations. We further study additional inference-time strategies such as Best-of-N and Topic Extraction to balance quality and efficiency. Across Yelp, Goodreads, and Amazon datasets, PerFine consistently improves personalization over PGraphRAG, with GEval gains of +7-13%, steady improvements over 3-5 refinement iterations, and scalability with increasing critic size. These results highlight that post-hoc, profile-aware feedback offers a powerful paradigm for personalized LLM generation that is both training-free and model-agnostic.
☆ MiniOneRec: An Open-Source Framework for Scaling Generative Recommendation
The recent success of large language models (LLMs) has renewed interest in whether recommender systems can achieve similar scaling benefits. Conventional recommenders, dominated by massive embedding tables, tend to plateau as embedding dimensions grow. In contrast, the emerging generative paradigm replaces embeddings with compact Semantic ID (SID) sequences produced by autoregressive Transformers. Yet most industrial deployments remain proprietary, leaving two fundamental questions open: (1) Do the expected scaling laws hold on public benchmarks? (2) What is the minimal post-training recipe that enables competitive performance? We present MiniOneRec, to the best of our knowledge, the first fully open-source generative recommendation framework, which provides an end-to-end workflow spanning SID construction, supervised fine-tuning, and recommendation-oriented reinforcement learning. We generate SIDs via a Residual Quantized VAE and post-train Qwen backbones ranging from 0.5B to 7B parameters on the Amazon Review dataset. Our experiments reveal a consistent downward trend in both training and evaluation losses with increasing model size, validating the parameter efficiency of the generative approach. To further enhance performance, we propose a lightweight yet effective post-training pipeline that (1) enforces full-process SID alignment and (2) applies reinforcement learning with constrained decoding and hybrid rewards. Together, these techniques yield significant improvements in both ranking accuracy and candidate diversity.
comment: Technical Report
☆ From Time and Place to Preference: LLM-Driven Geo-Temporal Context in Recommendations
Most recommender systems treat timestamps as numeric or cyclical values, overlooking real-world context such as holidays, events, and seasonal patterns. We propose a scalable framework that uses large language models (LLMs) to generate geo-temporal embeddings from only a timestamp and coarse location, capturing holidays, seasonal trends, and local/global events. We then introduce a geo-temporal embedding informativeness test as a lightweight diagnostic, demonstrating on MovieLens, LastFM, and a production dataset that these embeddings provide predictive signal consistent with the outcomes of full model integrations. Geo-temporal embeddings are incorporated into sequential models through (1) direct feature fusion with metadata embeddings or (2) an auxiliary loss that enforces semantic and geo-temporal alignment. Our findings highlight the need for adaptive or hybrid recommendation strategies, and we release a context-enriched MovieLens dataset to support future research.
☆ Metadata-Driven Retrieval-Augmented Generation for Financial Question Answering
Retrieval-Augmented Generation (RAG) struggles on long, structured financial filings where relevant evidence is sparse and cross-referenced. This paper presents a systematic investigation of advanced metadata-driven Retrieval-Augmented Generation (RAG) techniques, proposing and evaluating a novel, multi-stage RAG architecture that leverages LLM-generated metadata. We introduce a sophisticated indexing pipeline to create contextually rich document chunks and benchmark a spectrum of enhancements, including pre-retrieval filtering, post-retrieval reranking, and enriched embeddings, benchmarked on the FinanceBench dataset. Our results reveal that while a powerful reranker is essential for precision, the most significant performance gains come from embedding chunk metadata directly with text ("contextual chunks"). Our proposed optimal architecture combines LLM-driven pre-retrieval optimizations with these contextual embeddings to achieve superior performance. Additionally, we present a custom metadata reranker that offers a compelling, cost-effective alternative to commercial solutions, highlighting a practical trade-off between peak performance and operational efficiency. This study provides a blueprint for building robust, metadata-aware RAG systems for financial document analysis.
comment: Preprint version submitted to the International Journal of Accounting Information Systems; currently under major revision. 20 pages, 1 figure, 1 table
☆ DUET: Dual Model Co-Training for Entire Space CTR Prediction
The pre-ranking stage plays a pivotal role in large-scale recommender systems but faces an intrinsic trade-off between model expressiveness and computational efficiency. Owing to the massive candidate pool and strict latency constraints, industry systems often rely on lightweight two-tower architectures, which are computationally efficient yet limited in estimation capability. As a result, they struggle to capture the complex synergistic and suppressive relationships among candidate items, which are essential for producing contextually coherent and diverse recommendation lists. Moreover, this simplicity further amplifies the Sample Selection Bias (SSB) problem, as coarse-grained models trained on biased exposure data must generalize to a much larger candidate space with distinct distributions. To address these issues, we propose \textbf{DUET} (\textbf{DU}al Model Co-Training for \textbf{E}ntire Space C\textbf{T}R Prediction), a set-wise pre-ranking framework that achieves expressive modeling under tight computational budgets. Instead of scoring items independently, DUET performs set-level prediction over the entire candidate subset in a single forward pass, enabling information-aware interactions among candidates while amortizing the computational cost across the set. Moreover, a dual model co-training mechanism extends supervision to unexposed items via mutual pseudo-label refinement, effectively mitigating SSB. Validated through extensive offline experiments and online A/B testing, DUET consistently outperforms state-of-the-art baselines and achieves improvements across multiple core business metrics. At present, DUET has been fully deployed in Kuaishou and Kuaishou Lite Apps, serving the main traffic for hundreds of millions of users.
☆ Resource-Efficient LLM Application for Structured Transformation of Unstructured Financial Contracts
The transformation of unstructured legal contracts into standardized, machine-readable formats is essential for automating financial workflows. The Common Domain Model (CDM) provides a standardized framework for this purpose, but converting complex legal documents like Credit Support Annexes (CSAs) into CDM representations remains a significant challenge. In this paper, we present an extension of the CDMizer framework, a template-driven solution that ensures syntactic correctness and adherence to the CDM schema during contract-to-CDM conversion. We apply this extended framework to a real-world task, comparing its performance with a benchmark developed by the International Swaps and Derivatives Association (ISDA) for CSA clause extraction. Our results show that CDMizer, when integrated with a significantly smaller, open-source Large Language Model (LLM), achieves competitive performance in terms of accuracy and efficiency against larger, proprietary models. This work underscores the potential of resource-efficient solutions to automate legal contract transformation, offering a cost-effective and scalable approach that can meet the needs of financial institutions with constrained resources or strict data privacy requirements.
comment: 5 pages, 1 figure, 2 tables
☆ Secure Retrieval-Augmented Generation against Poisoning Attacks
Large language models (LLMs) have transformed natural language processing (NLP), enabling applications from content generation to decision support. Retrieval-Augmented Generation (RAG) improves LLMs by incorporating external knowledge but also introduces security risks, particularly from data poisoning, where the attacker injects poisoned texts into the knowledge database to manipulate system outputs. While various defenses have been proposed, they often struggle against advanced attacks. To address this, we introduce RAGuard, a detection framework designed to identify poisoned texts. RAGuard first expands the retrieval scope to increase the proportion of clean texts, reducing the likelihood of retrieving poisoned content. It then applies chunk-wise perplexity filtering to detect abnormal variations and text similarity filtering to flag highly similar texts. This non-parametric approach enhances RAG security, and experiments on large-scale datasets demonstrate its effectiveness in detecting and mitigating poisoning attacks, including strong adaptive attacks.
comment: To appear in IEEE BigData 2025
☆ Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation
We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning intensive settings because they don't verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics -- ACLE, ARGUE, and RAGAS -- demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.
comment: https://github.com/alexmartin1722/mirage
☆ LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature
The development of synthesis procedures remains a fundamental challenge in materials discovery, with procedural knowledge scattered across decades of scientific literature in unstructured formats that are challenging for systematic analysis. In this paper, we propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data from materials science publications, covering text and figures. We curated 81k open-access papers, yielding LeMat-Synth (v 1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes, structured according to an ontology specific to materials science. The extraction quality is rigorously evaluated on a subset of 2.5k synthesis procedures through a combination of expert annotations and a scalable LLM-as-a-judge framework. Beyond the dataset, we release a modular, open-source software library designed to support community-driven extension to new corpora and synthesis domains. Altogether, this work provides an extensible infrastructure to transform unstructured literature into machine-readable information. This lays the groundwork for predictive modeling of synthesis procedures as well as modeling synthesis--structure--property relationships.
comment: 29 pages, 13 figures, 6 tables
♻ ☆ Comparing Retrieval Strategies to Capture Interdisciplinary Scientific Research: A Bibliometric Evaluation of the Integration of Neuroscience and Computer Science
Interdisciplinary scientific research is increasingly important in knowledge production, funding policies, and academic discussions on scholarly communication. While many studies focus on interdisciplinary corpora defined a priori -- usually through keyword-based searches within assumed interdisciplinary domains -- few explore interdisciplinarity as an emergent intersection between two distinct fields. Thus, methodological proposals for building databases at the intersection of two fields of knowledge are scarce. The goal of this article is to develop and compare different strategies for defining an interdisciplinary corpus between two bodies of knowledge. As a case study, we focus on the intersection between neuroscience and computer science. To this end, we develop and compare four retrieval strategies, two of them based on keywords and two based on citation and reference patterns. Our results show that the reference-based strategy provides better retrieval, pseudorecall, and F1. While we focus on comparing strategies for the study of the intersection between the fields of neuroscience and computer science, this methodological reflection is applicable to a wide range of interdisciplinary domains.
♻ ☆ CustomIR: Unsupervised Fine-Tuning of Dense Embeddings for Known Document Corpora
Dense embedding models have become critical for modern information retrieval, particularly in RAG pipelines, but their performance often degrades when applied to specialized corpora outside their pre-training distribution. To address thi we introduce CustomIR, a framework for unsupervised adaptation of pre-trained language embedding models to domain-specific corpora using synthetically generated query-document pairs. CustomIR leverages large language models (LLMs) to create diverse queries grounded in a known target corpus, paired with LLM-verified hard negatives, eliminating the need for costly human annotation. Experiments on enterprise email and messaging datasets show that CustomIR consistently improves retrieval effectiveness with small models gaining up to 2.3 points in Recall@10. This performance increase allows these small models to rival the performance of much larger alternatives, allowing for cheaper RAG deployments. These results highlight that targeted synthetic fine-tuning offers a scalable and cost-efficient strategy for increasing domain-specific performance.
♻ ☆ Cross-Scenario Unified Modeling of User Interests at Billion Scale
User interests on content platforms are inherently diverse, manifesting through complex behavioral patterns across heterogeneous scenarios such as search, feed browsing, and content discovery. Traditional recommendation systems typically prioritize business metric optimization within isolated specific scenarios, neglecting cross-scenario behavioral signals and struggling to integrate advanced techniques like LLMs at billion-scale deployments, which finally limits their ability to capture holistic user interests across platform touchpoints. We propose RED-Rec, an LLM-enhanced hierarchical Recommender Engine for Diversified scenarios, tailored for industry-level content recommendation systems. RED-Rec unifies user interest representations across multiple behavioral contexts by aggregating and synthesizing actions from varied scenarios, resulting in comprehensive item and user modeling. At its core, a two-tower LLM-powered framework enables nuanced, multifaceted representations with deployment efficiency, and a scenario-aware dense mixing and querying policy effectively fuses diverse behavioral signals to capture cross-scenario user intent patterns and express fine-grained, context-specific intents during serving. We validate RED-Rec through online A/B testing on hundreds of millions of users in RedNote through online A/B testing, showing substantial performance gains in both content recommendation and advertisement targeting tasks. We further introduce a million-scale sequential recommendation dataset, RED-MMU, for comprehensive offline training and evaluation. Our work advances unified user modeling, unlocking deeper personalization and fostering more meaningful user engagement in large-scale UGC platforms.
comment: https://github.com/ariesssxu/RedSeqRec
♻ ☆ OneRec-V2 Technical Report
Recent breakthroughs in generative AI have transformed recommender systems through end-to-end generation. OneRec reformulates recommendation as an autoregressive generation task, achieving high Model FLOPs Utilization. While OneRec-V1 has shown significant empirical success in real-world deployment, two critical challenges hinder its scalability and performance: (1) inefficient computational allocation where 97.66% of resources are consumed by sequence encoding rather than generation, and (2) limitations in reinforcement learning relying solely on reward models. To address these challenges, we propose OneRec-V2, featuring: (1) Lazy Decoder-Only Architecture: Eliminates encoder bottlenecks, reducing total computation by 94% and training resources by 90%, enabling successful scaling to 8B parameters. (2) Preference Alignment with Real-World User Interactions: Incorporates Duration-Aware Reward Shaping and Adaptive Ratio Clipping to better align with user preferences using real-world feedback. Extensive A/B tests on Kuaishou demonstrate OneRec-V2's effectiveness, improving App Stay Time by 0.467%/0.741% while balancing multi-objective recommendations. This work advances generative recommendation scalability and alignment with real-world feedback, representing a step forward in the development of end-to-end recommender systems.
♻ ☆ MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems
Scaling up data, parameters, and test-time computation has been the mainstream methods to improve LLM systems (LLMsys), but their upper bounds are almost reached due to the gradual depletion of high-quality data and marginal gains obtained from larger computational resource consumption. Inspired by the abilities of human and traditional AI systems in learning from practice, constructing memory and continual learning frameworks for LLMsys has become an important and popular research direction in recent literature. Yet, existing benchmarks for LLM memory often focus on evaluating the system on homogeneous reading comprehension tasks with long-form inputs rather than testing their abilities to learn from accumulated user feedback in service time. Therefore, we propose a user feedback simulation framework and a comprehensive benchmark covering multiple domains, languages, and types of tasks to evaluate the continual learning abilities of LLMsys. Experiments show that the effectiveness and efficiency of state-of-the-art baselines are far from satisfying, and we hope this benchmark could pave the way for future studies on LLM memory and optimization algorithms.
♻ ☆ Your Dense Retriever is Secretly an Expeditious Reasoner
Dense retrievers enhance retrieval by encoding queries and documents into continuous vectors, but they often struggle with reasoning-intensive queries. Although Large Language Models (LLMs) can reformulate queries to capture complex reasoning, applying them universally incurs significant computational cost. In this work, we propose Adaptive Query Reasoning (AdaQR), a hybrid query rewriting framework. Within this framework, a Reasoner Router dynamically directs each query to either fast dense reasoning or deep LLM reasoning. The dense reasoning is achieved by the Dense Reasoner, which performs LLM-style reasoning directly in the embedding space, enabling a controllable trade-off between efficiency and accuracy. Experiments on large-scale retrieval benchmarks BRIGHT show that AdaQR reduces reasoning cost by 28% while preserving-or even improving-retrieval performance by 7%.
comment: 16 pages, 11 figures
Multimedia 7
☆ BLM$_1$: A Boundless Large Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning
Multimodal large language models (MLLMs) have advanced vision-language reasoning and are increasingly deployed in embodied agents. However, significant limitations remain: MLLMs generalize poorly across digital-physical spaces and embodiments; vision-language-action models (VLAs) produce low-level actions yet lack robust high-level embodied reasoning; and most embodied large language models (ELLMs) are constrained to digital-space with poor generalization to the physical world. Thus, unified models that operate seamlessly across digital and physical spaces while generalizing across embodiments and tasks remain absent. We introduce the \textbf{Boundless Large Model (BLM$_1$)}, a multimodal spatial foundation model that preserves instruction following and reasoning, incorporates embodied knowledge, and supports robust cross-embodiment control. BLM$_1$ integrates three key capabilities -- \textit{cross-space transfer, cross-task learning, and cross-embodiment generalization} -- via a two-stage training paradigm. Stage I injects embodied knowledge into the MLLM through curated digital corpora while maintaining language competence. Stage II trains a policy module through an intent-bridging interface that extracts high-level semantics from the MLLM to guide control, without fine-tuning the MLLM backbone. This process is supported by a self-collected cross-embodiment demonstration suite spanning four robot embodiments and six progressively challenging tasks. Evaluations across digital and physical benchmarks show that a single BLM$_1$ instance outperforms four model families -- MLLMs, ELLMs, VLAs, and GMLMs -- achieving $\sim\!\textbf{6%}$ gains in digital tasks and $\sim\!\textbf{3%}$ in physical tasks.
☆ Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation NeurIPS 2025
We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual-role alignment mechanism where the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. These results highlight model-guided dual-role alignment as a powerful and scalable paradigm for conditional video-to-audio generation. Code is available at: https://github.com/pantheon5100/mgaudio
comment: accepted by NeurIPS 2025
☆ Resi-VidTok: An Efficient and Decomposed Progressive Tokenization Framework for Ultra-Low-Rate and Lightweight Video Transmission
Real-time transmission of video over wireless networks remains highly challenging, even with advanced deep models, particularly under severe channel conditions such as limited bandwidth and weak connectivity. In this paper, we propose Resi-VidTok, a Resilient Tokenization-Enabled framework designed for ultra-low-rate and lightweight video transmission that delivers strong robustness while preserving perceptual and semantic fidelity on commodity digital hardware. By reorganizing spatio--temporal content into a discrete, importance-ordered token stream composed of key tokens and refinement tokens, Resi-VidTok enables progressive encoding, prefix-decodable reconstruction, and graceful quality degradation under constrained channels. A key contribution is a resilient 1D tokenization pipeline for video that integrates differential temporal token coding, explicitly supporting reliable recovery from incomplete token sets using a single shared framewise decoder--without auxiliary temporal extractors or heavy generative models. Furthermore, stride-controlled frame sparsification combined with a lightweight decoder-side interpolator reduces transmission load while maintaining motion continuity. Finally, a channel-adaptive source--channel coding and modulation scheme dynamically allocates rate and protection according to token importance and channel condition, yielding stable quality across adverse SNRs. Evaluation results indicate robust visual and semantic consistency at channel bandwidth ratios (CBR) as low as 0.0004 and real-time reconstruction at over 30 fps, demonstrating the practicality of Resi-VidTok for energy-efficient, latency-sensitive, and reliability-critical wireless applications.
☆ MCIHN: A Hybrid Network Model Based on Multi-path Cross-modal Interaction for Multimodal Emotion Recognition
Multimodal emotion recognition is crucial for future human-computer interaction. However, accurate emotion recognition still faces significant challenges due to differences between different modalities and the difficulty of characterizing unimodal emotional information. To solve these problems, a hybrid network model based on multipath cross-modal interaction (MCIHN) is proposed. First, adversarial autoencoders (AAE) are constructed separately for each modality. The AAE learns discriminative emotion features and reconstructs the features through a decoder to obtain more discriminative information about the emotion classes. Then, the latent codes from the AAE of different modalities are fed into a predefined Cross-modal Gate Mechanism model (CGMM) to reduce the discrepancy between modalities, establish the emotional relationship between interacting modalities, and generate the interaction features between different modalities. Multimodal fusion using the Feature Fusion module (FFM) for better emotion recognition. Experiments were conducted on publicly available SIMS and MOSI datasets, demonstrating that MCIHN achieves superior performance.
comment: The paper will be published in the MMAsia2025 conference proceedings
☆ GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment ICASSP 2026
Dance-to-music (D2M) generation aims to automatically compose music that is rhythmically and temporally aligned with dance movements. Existing methods typically rely on coarse rhythm embeddings, such as global motion features or binarized joint-based rhythm values, which discard fine-grained motion cues and result in weak rhythmic alignment. Moreover, temporal mismatches introduced by feature downsampling further hinder precise synchronization between dance and music. To address these problems, we propose \textbf{GACA-DiT}, a diffusion transformer-based framework with two novel modules for rhythmically consistent and temporally aligned music generation. First, a \textbf{genre-adaptive rhythm extraction} module combines multi-scale temporal wavelet analysis and spatial phase histograms with adaptive joint weighting to capture fine-grained, genre-specific rhythm patterns. Second, a \textbf{context-aware temporal alignment} module resolves temporal mismatches using learnable context queries to align music latents with relevant dance rhythm features. Extensive experiments on the AIST++ and TikTok datasets demonstrate that GACA-DiT outperforms state-of-the-art methods in both objective metrics and human evaluation. Project page: https://beria-moon.github.io/GACA-DiT/.
comment: 5 pages, 3 figures, submitted to ICASSP 2026
♻ ☆ Mano Technical Report
Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decisionmaking capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
♻ ☆ Does CLIP perceive art the same way we do?
CLIP has emerged as a powerful multimodal model capable of connecting images and text through joint embeddings, but to what extent does it 'see' the same way humans do - especially when interpreting artworks? In this paper, we investigate CLIP's ability to extract high-level semantic and stylistic information from paintings, including both human-created and AI-generated imagery. We evaluate its perception across multiple dimensions: content, scene understanding, artistic style, historical period, and the presence of visual deformations or artifacts. By designing targeted probing tasks and comparing CLIP's responses to human annotations and expert benchmarks, we explore its alignment with human perceptual and contextual understanding. Our findings reveal both strengths and limitations in CLIP's visual representations, particularly in relation to aesthetic cues and artistic intent. We further discuss the implications of these insights for using CLIP as a guidance mechanism during generative processes, such as style transfer or prompt-based image synthesis. Our work highlights the need for deeper interpretability in multimodal systems, especially when applied to creative domains where nuance and subjectivity play a central role.
Image and Video Processing 11
☆ Dipole-lets: a new multiscale decomposition for MR phase and quantitative susceptibility mapping
Identifying and suppressing streaking artifacts is one of the most challenging problems in quantitative susceptibility mapping. The measured phase from tissue magnetization is assumed to be the convolution by the magnetic dipole kernel; direct inversion or standard regularization methods tend to create streaking artifacts in the estimated susceptibility. This is caused by extreme noise and by the presence of non-dipolar phase contributions, which are amplified by the dipole kernel following the streaking pattern. In this work, we introduce a multiscale transform, called Dipole-lets, as an optimal decomposition method for identifying dipole incompatibilities in measured field data by extracting features of different characteristic size and orientation with respect to the dipole kernel's zero-valued double-cone surface (the magic cone). We provide experiments that showcase that non-dipolar content can be extracted by Dipole-lets from phase data through artifact localization. We also present implementations of Dipole-lets as a optimization functional regularizator, through simple Tikhonov and infinity norm.
comment: This preprint is a work in progress and is not the final manuscript for submission
☆ Fast algorithms enabling optimization and deep learning for photoacoustic tomography in a circular detection geometry
The inverse source problem arising in photoacoustic tomography and in several other coupled-physics modalities is frequently solved by iterative algorithms. Such algorithms are based on the minimization of a certain cost functional. In addition, novel deep learning techniques are currently being investigated to further improve such optimization approaches. All such methods require multiple applications of the operator defining the forward problem, and of its adjoint. In this paper, we present new asymptotically fast algorithms for numerical evaluation of the forward and adjoint operators, applicable in the circular acquisition geometry. For an $(n \times n)$ image, our algorithms compute these operators in $\mathcal{O}(n^2 \log n)$ floating point operations. We demonstrate the performance of our algorithms in numerical simulations, where they are used as an integral part of several iterative image reconstruction techniques: classic variational methods, such as non-negative least squares and total variation regularized least squares, as well as deep learning methods, such as learned primal dual. A Python implementation of our algorithms and computational examples is available to the general public.
☆ High-Quality and Large-Scale Image Downscaling for Modern Display Devices
In modern display technology and visualization tools, downscaling images is one of the most important activities. This procedure aims to maintain both visual authenticity and structural integrity while reducing the dimensions of an image at a large scale to fit the dimension of the display devices. In this study, we proposed a new technique for downscaling images that uses co-occurrence learning to maintain structural and perceptual information while reducing resolution. The technique uses the input image to create a data-driven co-occurrence profile that captures the frequency of intensity correlations in nearby neighborhoods. A refined filtering process is guided by this profile, which acts as a content-adaptive range kernel. The contribution of each input pixel is based on how closely it resembles pair-wise intensity values with it's neighbors. We validate our proposed technique on four datasets: DIV2K, BSD100, Urban100, and RealSR to show its effective downscaling capacity. Our technique could obtain up to 39.22 dB PSNR on the DIV2K dataset and PIQE up to 26.35 on the same dataset when downscaling by 8x and 16x, respectively. Numerous experimental findings attest to the ability of the suggested picture downscaling method to outperform more contemporary approaches in terms of both visual quality and performance measures. Unlike most existing methods, which did not focus on the large-scale image resizing scenario, we achieve high-quality downscaled images without texture loss or edge blurring. Our method, LSID (large scale image downscaling), successfully preserves high-frequency structures like edges, textures, and repeating patterns by focusing on statistically consistent pixels while reducing aliasing and blurring artifacts that are typical of traditional downscaling techniques.
comment: 10 pages, 3 tables, and 6 figures
☆ Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes
Purpose: Surgical scene understanding is key to advancing computer-aided and intelligent surgical systems. Current approaches predominantly rely on visual data or end-to-end learning, which limits fine-grained contextual modeling. This work aims to enhance surgical scene representations by integrating 3D acoustic information, enabling temporally and spatially aware multimodal understanding of surgical environments. Methods: We propose a novel framework for generating 4D audio-visual representations of surgical scenes by projecting acoustic localization information from a phased microphone array onto dynamic point clouds from an RGB-D camera. A transformer-based acoustic event detection module identifies relevant temporal segments containing tool-tissue interactions which are spatially localized in the audio-visual scene representation. The system was experimentally evaluated in a realistic operating room setup during simulated surgical procedures performed by experts. Results: The proposed method successfully localizes surgical acoustic events in 3D space and associates them with visual scene elements. Experimental evaluation demonstrates accurate spatial sound localization and robust fusion of multimodal data, providing a comprehensive, dynamic representation of surgical activity. Conclusion: This work introduces the first approach for spatial sound localization in dynamic surgical scenes, marking a significant advancement toward multimodal surgical scene representations. By integrating acoustic and visual data, the proposed framework enables richer contextual understanding and provides a foundation for future intelligent and autonomous surgical systems.
☆ Listening without Looking: Modality Bias in Audio-Visual Captioning
Audio-visual captioning aims to generate holistic scene descriptions by jointly modeling sound and vision. While recent methods have improved performance through sophisticated modality fusion, it remains unclear to what extent the two modalities are complementary in current audio-visual captioning models and how robust these models are when one modality is degraded. We address these questions by conducting systematic modality robustness tests on LAVCap, a state-of-the-art audio-visual captioning model, in which we selectively suppress or corrupt the audio or visual streams to quantify sensitivity and complementarity. The analysis reveals a pronounced bias toward the audio stream in LAVCap. To evaluate how balanced audio-visual captioning models are in their use of both modalities, we augment AudioCaps with textual annotations that jointly describe the audio and visual streams, yielding the AudioVisualCaps dataset. In our experiments, we report LAVCap baseline results on AudioVisualCaps. We also evaluate the model under modality robustness tests on AudioVisualCaps and the results indicate that LAVCap trained on AudioVisualCaps exhibits less modality bias than when trained on AudioCaps.
comment: under review
☆ Resi-VidTok: An Efficient and Decomposed Progressive Tokenization Framework for Ultra-Low-Rate and Lightweight Video Transmission
Real-time transmission of video over wireless networks remains highly challenging, even with advanced deep models, particularly under severe channel conditions such as limited bandwidth and weak connectivity. In this paper, we propose Resi-VidTok, a Resilient Tokenization-Enabled framework designed for ultra-low-rate and lightweight video transmission that delivers strong robustness while preserving perceptual and semantic fidelity on commodity digital hardware. By reorganizing spatio--temporal content into a discrete, importance-ordered token stream composed of key tokens and refinement tokens, Resi-VidTok enables progressive encoding, prefix-decodable reconstruction, and graceful quality degradation under constrained channels. A key contribution is a resilient 1D tokenization pipeline for video that integrates differential temporal token coding, explicitly supporting reliable recovery from incomplete token sets using a single shared framewise decoder--without auxiliary temporal extractors or heavy generative models. Furthermore, stride-controlled frame sparsification combined with a lightweight decoder-side interpolator reduces transmission load while maintaining motion continuity. Finally, a channel-adaptive source--channel coding and modulation scheme dynamically allocates rate and protection according to token importance and channel condition, yielding stable quality across adverse SNRs. Evaluation results indicate robust visual and semantic consistency at channel bandwidth ratios (CBR) as low as 0.0004 and real-time reconstruction at over 30 fps, demonstrating the practicality of Resi-VidTok for energy-efficient, latency-sensitive, and reliability-critical wireless applications.
☆ MSRANetV2: An Explainable Deep Learning Architecture for Multi-class Classification of Colorectal Histopathological Images
Colorectal cancer (CRC) is a leading worldwide cause of cancer-related mortality, and the role of prompt precise detection is of paramount interest in improving patient outcomes. Conventional diagnostic methods such as colonoscopy and histological examination routinely exhibit subjectivity, are extremely time-consuming, and are susceptible to variation. Through the development of digital pathology, deep learning algorithms have become a powerful approach in enhancing diagnostic precision and efficiency. In our work, we proposed a convolutional neural network architecture named MSRANetV2, specially optimized for the classification of colorectal tissue images. The model employs a ResNet50V2 backbone, extended with residual attention mechanisms and squeeze-and-excitation (SE) blocks, to extract deep semantic and fine-grained spatial features. With channel alignment and upsampling operations, MSRANetV2 effectively fuses multi-scale representations, thereby enhancing the robustness of the classification. We evaluated our model on a five-fold stratified cross-validation strategy on two publicly available datasets: CRC-VAL-HE-7K and NCT-CRC-HE-100K. The proposed model achieved remarkable average Precision, recall, F1-score, AUC, and test accuracy were 0.9884 plus-minus 0.0151, 0.9900 plus-minus 0.0151, 0.9900 plus-minus 0.0145, 0.9999 plus-minus 0.00006, and 0.9905 plus-minus 0.0025 on the 7K dataset. On the 100K dataset, they were 0.9904 plus-minus 0.0091, 0.9900 plus-minus 0.0071, 0.9900 plus-minus 0.0071, 0.9997 plus-minus 0.00016, and 0.9902 plus-minus 0.0006. Additionally, Grad-CAM visualizations were incorporated to enhance model interpretability by highlighting tissue areas that are medically relevant. These findings validate that MSRANetV2 is a reliable, interpretable, and high-performing architectural model for classifying CRC tissues.
☆ Which LiDAR scanning pattern is better for roadside perception: Repetitive or Non-repetitive?
LiDAR-based roadside perception is a cornerstone of advanced Intelligent Transportation Systems (ITS). While considerable research has addressed optimal LiDAR placement for infrastructure, the profound impact of differing LiDAR scanning patterns on perceptual performance remains comparatively under-investigated. The inherent nature of various scanning modes - such as traditional repetitive (mechanical/solid-state) versus emerging non-repetitive (e.g. prism-based) systems - leads to distinct point cloud distributions at varying distances, critically dictating the efficacy of object detection and overall environmental understanding. To systematically investigate these differences in infrastructure-based contexts, we introduce the "InfraLiDARs' Benchmark," a novel dataset meticulously collected in the CARLA simulation environment using concurrently operating infrastructure-based LiDARs exhibiting both scanning paradigms. Leveraging this benchmark, we conduct a comprehensive statistical analysis of the respective LiDAR scanning abilities and evaluate the impact of these distinct patterns on the performance of various leading 3D object detection algorithms. Our findings reveal that non-repetitive scanning LiDAR and the 128-line repetitive LiDAR were found to exhibit comparable detection performance across various scenarios. Despite non-repetitive LiDAR's limited perception range, it's a cost-effective option considering its low price. Ultimately, this study provides insights for setting up roadside perception system with optimal LiDAR scanning patterns and compatible algorithms for diverse roadside applications, and publicly releases the "InfraLiDARs' Benchmark" dataset to foster further research.
♻ ☆ Caption-Driven Explainability: Probing CNNs for Bias via CLIP ICIP 2025
Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the models prediction. This explanation minimizes the risk of the standalone model falling for a covariate shift and contributes significantly towards developing robust ML models. Our code is available at .
comment: Accepted and presented at the IEEE ICIP 2025 Satellite Workshop "Generative AI for World Simulations and Communications & Celebrating 40 Years of Excellence in Education: Honoring Professor Aggelos Katsaggelos", Anchorage, Alaska, USA, September 14, 2025. Camera-ready preprint; the official IEEE Xplore publication will follow. Code is available at
♻ ☆ TraceTrans: Translation and Spatial Tracing for Surgical Prediction
Image-to-image translation models have achieved notable success in converting images across visual domains and are increasingly used for medical tasks such as predicting post-operative outcomes and modeling disease progression. However, most existing methods primarily aim to match the target distribution and often neglect spatial correspondences between the source and translated images. This limitation can lead to structural inconsistencies and hallucinations, undermining the reliability and interpretability of the predictions. These challenges are accentuated in clinical applications by the stringent requirement for anatomical accuracy. In this work, we present TraceTrans, a novel deformable image translation model designed for post-operative prediction that generates images aligned with the target distribution while explicitly revealing spatial correspondences with the pre-operative input. The framework employs an encoder for feature extraction and dual decoders for predicting spatial deformations and synthesizing the translated image. The predicted deformation field imposes spatial constraints on the generated output, ensuring anatomical consistency with the source. Extensive experiments on medical cosmetology and brain MRI datasets demonstrate that TraceTrans delivers accurate and interpretable post-operative predictions, highlighting its potential for reliable clinical deployment.
♻ ☆ Cyst-X: A Federated AI System Outperforms Clinical Guidelines to Detect Pancreatic Cancer Precursors and Reduce Unnecessary Surgery
Pancreatic cancer is projected to be the second-deadliest cancer by 2030, making early detection critical. Intraductal papillary mucinous neoplasms (IPMNs), key cancer precursors, present a clinical dilemma, as current guidelines struggle to stratify malignancy risk, leading to unnecessary surgeries or missed diagnoses. Here, we developed Cyst-X, an AI framework for IPMN risk prediction trained on a unique, multi-center dataset of 1,461 MRI scans from 764 patients. Cyst-X achieves significantly higher accuracy (AUC = 0.82) than both the established Kyoto guidelines (AUC = 0.75) and expert radiologists, particularly in correct identification of high-risk lesions. Clinically, this translates to a 20% increase in cancer detection sensitivity (87.8% vs. 64.1%) for high-risk lesions. We demonstrate that this performance is maintained in a federated learning setting, allowing for collaborative model training without compromising patient privacy. To accelerate research in early pancreatic cancer detection, we publicly release the Cyst-X dataset and models, providing the first large-scale, multi-center MRI resource for pancreatic cyst analysis.
Computer Vision and Pattern Recognition 13
☆ Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations NeurIPS 2025
Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.
comment: NeurIPS 2025, produced by Pointcept, project page: https://pointcept.github.io/Concerto
☆ Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling NeurIPS 2025
Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or few images of a specific subject (also known as Personalization or Subject-driven generation) allows generating visual content that align with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. Then, we adopt a subject-driven 2D inpainting model for progressively infilling the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while still maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at https://zsh2000.github.io/track-inpaint-resplat.github.io/.
comment: NeurIPS 2025, 38 pages, 22 figures
☆ PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.
comment: 22 pages, 13 figures
☆ PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection
We introduce \textbf{PRISM-Bench}, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
☆ InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras NeurIPS 2025
Accurately tracking camera intrinsics is crucial for achieving 3D understanding from 2D video. However, most 3D algorithms assume that camera intrinsics stay constant throughout a video, which is often not true for many real-world in-the-wild videos. A major obstacle in this field is a lack of dynamic camera intrinsics benchmarks--existing benchmarks typically offer limited diversity in scene content and intrinsics variation, and none provide per-frame intrinsic changes for consecutive video frames. In this paper, we present Intrinsics in Flux (InFlux), a real-world benchmark that provides per-frame ground truth intrinsics annotations for videos with dynamic intrinsics. Compared to prior benchmarks, InFlux captures a wider range of intrinsic variations and scene diversity, featuring 143K+ annotated frames from 386 high-resolution indoor and outdoor videos with dynamic camera intrinsics. To ensure accurate per-frame intrinsics, we build a comprehensive lookup table of calibration experiments and extend the Kalibr toolbox to improve its accuracy and robustness. Using our benchmark, we evaluate existing baseline methods for predicting camera intrinsics and find that most struggle to achieve accurate predictions on videos with dynamic intrinsics. For the dataset, code, videos, and submission, please visit https://influx.cs.princeton.edu/.
comment: Accepted at NeurIPS 2025 DB Track, Camera Ready Version. Supplementary material included
☆ FARMER: Flow AutoRegressive Transformer over Pixels
Directly modeling the explicit likelihood of the raw data distribution is key topic in the machine learning area, which achieves the scaling successes in Large Language Models by autoregressive modeling. However, continuous AR modeling over visual pixel data suffer from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference speed and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.
comment: Bytedance Seed Technical Report
☆ Adaptive Training of INRs via Pruning and Densification
Encoding input coordinates with sinusoidal functions into multilayer perceptrons (MLPs) has proven effective for implicit neural representations (INRs) of low-dimensional signals, enabling the modeling of high-frequency details. However, selecting appropriate input frequencies and architectures while managing parameter redundancy remains an open challenge, often addressed through heuristics and heavy hyperparameter optimization schemes. In this paper, we introduce AIRe ($\textbf{A}$daptive $\textbf{I}$mplicit neural $\textbf{Re}$presentation), an adaptive training scheme that refines the INR architecture over the course of optimization. Our method uses a neuron pruning mechanism to avoid redundancy and input frequency densification to improve representation capacity, leading to an improved trade-off between network size and reconstruction quality. For pruning, we first identify less-contributory neurons and apply a targeted weight decay to transfer their information to the remaining neurons, followed by structured pruning. Next, the densification stage adds input frequencies to spectrum regions where the signal underfits, expanding the representational basis. Through experiments on images and SDFs, we show that AIRe reduces model size while preserving, or even improving, reconstruction quality. Code and pretrained models will be released for public use.
☆ PlanarGS: High-Fidelity Indoor 3D Gaussian Splatting Guided by Vision-Language Planar Priors NeurIPS 2025
Three-dimensional Gaussian Splatting (3DGS) has recently emerged as an efficient representation for novel-view synthesis, achieving impressive visual quality. However, in scenes dominated by large and low-texture regions, common in indoor environments, the photometric loss used to optimize 3DGS yields ambiguous geometry and fails to recover high-fidelity 3D surfaces. To overcome this limitation, we introduce PlanarGS, a 3DGS-based framework tailored for indoor scene reconstruction. Specifically, we design a pipeline for Language-Prompted Planar Priors (LP3) that employs a pretrained vision-language segmentation model and refines its region proposals via cross-view fusion and inspection with geometric priors. 3D Gaussians in our framework are optimized with two additional terms: a planar prior supervision term that enforces planar consistency, and a geometric prior supervision term that steers the Gaussians toward the depth and normal cues. We have conducted extensive experiments on standard indoor benchmarks. The results show that PlanarGS reconstructs accurate and detailed 3D surfaces, consistently outperforming state-of-the-art methods by a large margin. Project page: https://planargs.github.io
comment: Accepted by NeurIPS 2025. Project page: https://planargs.github.io
☆ TurboPortrait3D: Single-step diffusion-based fast portrait novel-view synthesis
We introduce TurboPortrait3D: a method for low-latency novel-view synthesis of human portraits. Our approach builds on the observation that existing image-to-3D models for portrait generation, while capable of producing renderable 3D representations, are prone to visual artifacts, often lack of detail, and tend to fail at fully preserving the identity of the subject. On the other hand, image diffusion models excel at generating high-quality images, but besides being computationally expensive, are not grounded in 3D and thus are not directly capable of producing multi-view consistent outputs. In this work, we demonstrate that image-space diffusion models can be used to significantly enhance the quality of existing image-to-avatar methods, while maintaining 3D-awareness and running with low-latency. Our method takes a single frontal image of a subject as input, and applies a feedforward image-to-avatar generation pipeline to obtain an initial 3D representation and corresponding noisy renders. These noisy renders are then fed to a single-step diffusion model which is conditioned on input image(s), and is specifically trained to refine the renders in a multi-view consistent way. Moreover, we introduce a novel effective training strategy that includes pre-training on a large corpus of synthetic multi-view data, followed by fine-tuning on high-quality real images. We demonstrate that our approach both qualitatively and quantitatively outperforms current state-of-the-art for portrait novel-view synthesis, while being efficient in time.
♻ ☆ LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas
Despite their impressive visual fidelity, existing personalized generative models lack interactive control over spatial composition and scale poorly to multiple subjects. To address these limitations, we present LayerComposer, an interactive framework for personalized, multi-subject text-to-image generation. Our approach introduces two main contributions: (1) a layered canvas, a novel representation in which each subject is placed on a distinct layer, enabling occlusion-free composition; and (2) a locking mechanism that preserves selected layers with high fidelity while allowing the remaining layers to adapt flexibly to the surrounding context. Similar to professional image-editing software, the proposed layered canvas allows users to place, resize, or lock input subjects through intuitive layer manipulation. Our versatile locking mechanism requires no architectural changes, relying instead on inherent positional embeddings combined with a new complementary data sampling strategy. Extensive experiments demonstrate that LayerComposer achieves superior spatial control and identity preservation compared to the state-of-the-art methods in multi-subject personalized image generation.
comment: 9 pages, preprint. Project page: https://snap-research.github.io/layercomposer/
♻ ☆ ReXGroundingCT: A 3D Chest CT Dataset for Segmentation of Findings from Free-Text Reports
We introduce ReXGroundingCT, the first publicly available dataset linking free-text findings to pixel-level 3D segmentations in chest CT scans. The dataset includes 3,142 non-contrast chest CT scans paired with standardized radiology reports from CT-RATE. Construction followed a structured three-stage pipeline. First, GPT-4 was used to extract and standardize findings, descriptors, and metadata from reports originally written in Turkish and machine-translated into English. Second, GPT-4o-mini categorized each finding into a hierarchical ontology of lung and pleural abnormalities. Third, 3D annotations were produced for all CT volumes: the training set was quality-assured by board-certified radiologists, and the validation and test sets were fully annotated by board-certified radiologists. Additionally, a complementary chain-of-thought dataset was created to provide step-by-step hierarchical anatomical reasoning for localizing findings within the CT volume, using GPT-4o and localization coordinates derived from organ segmentation models. ReXGroundingCT contains 16,301 annotated entities across 8,028 text-to-3D-segmentation pairs, covering diverse radiological patterns from 3,142 non-contrast CT scans. About 79% of findings are focal abnormalities and 21% are non-focal. The dataset includes a public validation set of 50 cases and a private test set of 100 cases, both annotated by board-certified radiologists. The dataset establishes a foundation for enabling free-text finding segmentation and grounded radiology report generation in CT imaging. Model performance on the private test set is hosted on a public leaderboard at https://rexrank.ai/ReXGroundingCT. The dataset is available at https://huggingface.co/datasets/rajpurkarlab/ReXGroundingCT.
♻ ☆ DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning
Multimodal learning integrates complementary information from diverse modalities to enhance the decision-making process. However, the potential of multimodal collaboration remains under-exploited due to disparities in data quality and modality representation capabilities. To address this, we introduce DynCIM, a novel dynamic curriculum learning framework designed to quantify the inherent imbalances from both sample and modality perspectives. DynCIM employs a sample-level curriculum to dynamically assess each sample's difficulty according to prediction deviation, consistency, and stability, while a modality-level curriculum measures modality contributions from global and local. Furthermore, a gating-based dynamic fusion mechanism is introduced to adaptively adjust modality contributions, minimizing redundancy and optimizing fusion effectiveness. Extensive experiments on six multimodal benchmarking datasets, spanning both bimodal and trimodal scenarios, demonstrate that DynCIM consistently outperforms state-of-the-art methods. Our approach effectively mitigates modality and sample imbalances while enhancing adaptability and robustness in multimodal learning tasks. Our code is available at https://github.com/Raymond-Qiancx/DynCIM.
♻ ☆ Faces of Fairness: Examining Bias in Facial Expression Recognition Datasets and Models
Building AI systems, including Facial Expression Recognition (FER), involves two critical aspects: data and model design. Both components significantly influence bias and fairness in FER tasks. Issues related to bias and fairness in FER datasets and models remain underexplored. This study investigates bias sources in FER datasets and models. Four common FER datasets--AffectNet, ExpW, Fer2013, and RAF-DB--are analyzed. The findings demonstrate that AffectNet and ExpW exhibit high generalizability despite data imbalances. Additionally, this research evaluates the bias and fairness of six deep models, including three state-of-the-art convolutional neural network (CNN) models: MobileNet, ResNet, XceptionNet, as well as three transformer-based models: ViT, CLIP, and GPT-4o-mini. Experimental results reveal that while GPT-4o-mini and ViT achieve the highest accuracy scores, they also display the highest levels of bias. These findings underscore the urgent need for developing new methodologies to mitigate bias and ensure fairness in datasets and models, particularly in affective computing applications. See our implementation details at https://github.com/MMHosseini/bias_in_FER.
Machine Learning 25
☆ Variational Masked Diffusion Models
Masked diffusion models have recently emerged as a flexible framework for discrete generative modeling. However, a key limitation of standard masked diffusion is its inability to effectively capture dependencies among tokens that are predicted concurrently, leading to degraded generation quality when dependencies among tokens are important. To explicitly model dependencies among tokens, we propose Variational Masked Diffusion (VMD), a framework that introduces latent variables into the masked diffusion process. Through controlled experiments on synthetic datasets, we demonstrate that VMD successfully learns dependencies that conventional masked diffusion fails to capture. We further validate the effectiveness of our approach on Sudoku puzzles and text datasets, where learning of dependencies among tokens improves global consistency. Across these domains, VMD enhances both generation quality and dependency awareness, highlighting the value of integrating variational inference into masked diffusion. Our code is available at: https://riccizz.github.io/VMD.
comment: Project Page: https://riccizz.github.io/VMD
☆ Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling NeurIPS 2025
Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or few images of a specific subject (also known as Personalization or Subject-driven generation) allows generating visual content that align with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. Then, we adopt a subject-driven 2D inpainting model for progressively infilling the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while still maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at https://zsh2000.github.io/track-inpaint-resplat.github.io/.
comment: NeurIPS 2025, 38 pages, 22 figures
☆ Lightweight Robust Direct Preference Optimization
Direct Preference Optimization (DPO) has become a popular method for fine-tuning large language models (LLMs) due to its stability and simplicity. However, it is also known to be sensitive to noise in the data and prone to overfitting. Recent works have proposed using distributionally robust optimization (DRO) to address potential noise and distributional shift in the data. However, these methods often suffer from excessive conservatism and high computational cost. We propose DPO-PRO (DPO with Preference Robustness), a robust fine-tuning algorithm based on DPO which accounts for uncertainty in the preference distribution through a lightweight DRO formulation. Unlike prior DRO-based variants, DPO-PRO focuses solely on uncertainty in preferences, avoiding unnecessary conservatism and incurring negligible computational overhead. We further show that DPO-PRO is equivalent to a regularized DPO objective that penalizes model overconfidence under weak preference signals. We evaluate DPO-PRO on standard alignment benchmarks and a real-world public health task. Experimental results show that our method consistently improves robustness to noisy preference signals compared to existing DPO variants.
☆ Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation
Audio-driven human animation models often suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. One solution is to generate keyframes as intermediate temporal anchors that prevent degradation, but this requires an additional keyframe generation stage and can restrict natural motion dynamics. To address this, we propose Lookahead Anchoring, which leverages keyframes from future timesteps ahead of the current generation window, rather than within it. This transforms keyframes from fixed boundaries into directional beacons: the model continuously pursues these future anchors while responding to immediate audio cues, maintaining consistent identity through persistent guidance. This also enables self-keyframing, where the reference image serves as the lookahead target, eliminating the need for keyframe generation entirely. We find that the temporal lookahead distance naturally controls the balance between expressivity and consistency: larger distances allow for greater motion freedom, while smaller ones strengthen identity adherence. When applied to three recent human animation models, Lookahead Anchoring achieves superior lip synchronization, identity preservation, and visual quality, demonstrating improved temporal conditioning across several different architectures. Video results are available at the following link: https://lookahead-anchoring.github.io.
comment: Project page: https://lookahead-anchoring.github.io
☆ RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation
The pursuit of robot generalists - instructable agents capable of performing diverse tasks across diverse environments - demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. Existing simulation benchmarks are similarly limited, as they train and test policies within the same synthetic domains and cannot assess models trained from real-world demonstrations or alternative simulation environments. As policies expand in scope and complexity, these barriers only intensify, since defining "success" in robotics often hinges on nuanced human judgments of execution quality. In this paper, we introduce a new benchmarking framework that overcomes these challenges by shifting VLA evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into simulated counterparts. Within these digital twins, we assess VLA policies using both automated VLM-guided scoring and scalable human preference judgments collected from crowdworkers, transforming human involvement from tedious scene setup, resetting, and safety supervision into lightweight preference comparisons. To measure robustness, we systematically perturb simulated environments along multiple axes, such as textures and object placements, stress-testing policy generalization under controlled variation. The result is a continuously evolving, reproducible, and scalable benchmark for real-world trained robot manipulation policies, addressing a critical missing capability in today's robotics landscape.
comment: Website: https://robotarenainf.github.io
☆ ReCode: Unify Plan and Action for Universal Granularity Control
Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-level form of action. However, current Large Language Model (LLM)-based agents lack this crucial capability to operate fluidly across decision granularities. This limitation stems from existing paradigms that enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization. We propose ReCode (Recursive Code Generation), a novel paradigm that addresses this limitation by unifying planning and action within a single code representation. In this representation, ReCode treats high-level plans as abstract placeholder functions, which the agent then recursively decomposes into finer-grained sub-functions until reaching primitive actions. This recursive approach dissolves the rigid boundary between plan and action, enabling the agent to dynamically control its decision granularity. Furthermore, the recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes. Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training, validating our core insight that unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control. The code is available at https://github.com/FoundationAgents/ReCode.
☆ Minimizing Human Intervention in Online Classification
We introduce and study an online problem arising in question answering systems. In this problem, an agent must sequentially classify user-submitted queries represented by $d$-dimensional embeddings drawn i.i.d. from an unknown distribution. The agent may consult a costly human expert for the correct label, or guess on her own without receiving feedback. The goal is to minimize regret against an oracle with free expert access. When the time horizon $T$ is at least exponential in the embedding dimension $d$, one can learn the geometry of the class regions: in this regime, we propose the Conservative Hull-based Classifier (CHC), which maintains convex hulls of expert-labeled queries and calls the expert as soon as a query lands outside all known hulls. CHC attains $\mathcal{O}(\log^d T)$ regret in $T$ and is minimax optimal for $d=1$. Otherwise, the geometry cannot be reliably learned without additional distributional assumptions. We show that when the queries are drawn from a subgaussian mixture, for $T \le e^d$, a Center-based Classifier (CC) achieves regret proportional to $N\log{N}$ where $N$ is the number of labels. To bridge these regimes, we introduce the Generalized Hull-based Classifier (GHC), a practical extension of CHC that allows for more aggressive guessing via a tunable threshold parameter. Our approach is validated with experiments, notably on real-world question-answering datasets using embeddings derived from state-of-the-art large language models.
comment: 49 pages, 8 figures
☆ A U-Net and Transformer Pipeline for Multilingual Image Translation
This paper presents an end-to-end multilingual translation pipeline that integrates a custom U-Net for text detection, the Tesseract engine for text recognition, and a from-scratch sequence-to-sequence (Seq2Seq) Transformer for Neural Machine Translation (NMT). Our approach first utilizes a U-Net model, trained on a synthetic dataset , to accurately segment and detect text regions from an image. These detected regions are then processed by Tesseract to extract the source text. This extracted text is fed into a custom Transformer model trained from scratch on a multilingual parallel corpus spanning 5 languages. Unlike systems reliant on monolithic pre-trained models, our architecture emphasizes full customization and adaptability. The system is evaluated on its text detection accuracy, text recognition quality, and translation performance via BLEU scores. The complete pipeline demonstrates promising results, validating the viability of a custom-built system for translating text directly from images.
comment: 6 pages, 3 figures, 5 tables, and 2 algorithms. Prepared in IEEE double-column format
☆ Sequential Multi-Agent Dynamic Algorithm Configuration NeurIPS 2025
Dynamic algorithm configuration (DAC) is a recent trend in automated machine learning, which can dynamically adjust the algorithm's configuration during the execution process and relieve users from tedious trial-and-error tuning tasks. Recently, multi-agent reinforcement learning (MARL) approaches have improved the configuration of multiple heterogeneous hyperparameters, making various parameter configurations for complex algorithms possible. However, many complex algorithms have inherent inter-dependencies among multiple parameters (e.g., determining the operator type first and then the operator's parameter), which are, however, not considered in previous approaches, thus leading to sub-optimal results. In this paper, we propose the sequential multi-agent DAC (Seq-MADAC) framework to address this issue by considering the inherent inter-dependencies of multiple parameters. Specifically, we propose a sequential advantage decomposition network, which can leverage action-order information through sequential advantage decomposition. Experiments from synthetic functions to the configuration of multi-objective optimization algorithms demonstrate Seq-MADAC's superior performance over state-of-the-art MARL methods and show strong generalization across problem classes. Seq-MADAC establishes a new paradigm for the widespread dependency-aware automated algorithm configuration. Our code is available at https://github.com/lamda-bbo/seq-madac.
comment: NeurIPS 2025
☆ Direct Debiased Machine Learning via Bregman Divergence Minimization
We develop a direct debiased machine learning framework comprising Neyman targeted estimation and generalized Riesz regression. Our framework unifies Riesz regression for automatic debiased machine learning, covariate balancing, targeted maximum likelihood estimation (TMLE), and density-ratio estimation. In many problems involving causal effects or structural models, the parameters of interest depend on regression functions. Plugging regression functions estimated by machine learning methods into the identifying equations can yield poor performance because of first-stage bias. To reduce such bias, debiased machine learning employs Neyman orthogonal estimating equations. Debiased machine learning typically requires estimation of the Riesz representer and the regression function. For this problem, we develop a direct debiased machine learning framework with an end-to-end algorithm. We formulate estimation of the nuisance parameters, the regression function and the Riesz representer, as minimizing the discrepancy between Neyman orthogonal scores computed with known and unknown nuisance parameters, which we refer to as Neyman targeted estimation. Neyman targeted estimation includes Riesz representer estimation, and we measure discrepancies using the Bregman divergence. The Bregman divergence encompasses various loss functions as special cases, where the squared loss yields Riesz regression and the Kullback-Leibler divergence yields entropy balancing. We refer to this Riesz representer estimation as generalized Riesz regression. Neyman targeted estimation also yields TMLE as a special case for regression function estimation. Furthermore, for specific pairs of models and Riesz representer estimation methods, we can automatically obtain the covariate balancing property without explicitly solving the covariate balancing objective.
☆ When No Paths Lead to Rome: Benchmarking Systematic Neural Relational Reasoning NeurIPS 2025
Designing models that can learn to reason in a systematic way is an important and long-standing challenge. In recent years, a wide range of solutions have been proposed for the specific case of systematic relational reasoning, including Neuro-Symbolic approaches, variants of the Transformer architecture, and specialised Graph Neural Networks. However, existing benchmarks for systematic relational reasoning focus on an overly simplified setting, based on the assumption that reasoning can be reduced to composing relational paths. In fact, this assumption is hard-baked into the architecture of several recent models, leading to approaches that can perform well on existing benchmarks but are difficult to generalise to other settings. To support further progress in the field of systematic relational reasoning with neural networks, we introduce NoRA, a new benchmark which adds several levels of difficulty and requires models to go beyond path-based reasoning.
comment: accepted at NeurIPS 2025 D&B track
☆ Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization
Audio autoencoders learn useful, compressed audio representations, but their non-linear latent spaces prevent intuitive algebraic manipulation such as mixing or scaling. We introduce a simple training methodology to induce linearity in a high-compression Consistency Autoencoder (CAE) by using data augmentation, thereby inducing homogeneity (equivariance to scalar gain) and additivity (the decoder preserves addition) without altering the model's architecture or loss function. When trained with our method, the CAE exhibits linear behavior in both the encoder and decoder while preserving reconstruction fidelity. We test the practical utility of our learned space on music source composition and separation via simple latent arithmetic. This work presents a straightforward technique for constructing structured latent spaces, enabling more intuitive and efficient audio processing.
☆ Toward Carbon-Neutral Human AI: Rethinking Data, Computation, and Learning Paradigms for Sustainable Intelligence
The rapid advancement of Artificial Intelligence (AI) has led to unprecedented computational demands, raising significant environmental and ethical concerns. This paper critiques the prevailing reliance on large-scale, static datasets and monolithic training paradigms, advocating for a shift toward human-inspired, sustainable AI solutions. We introduce a novel framework, Human AI (HAI), which emphasizes incremental learning, carbon-aware optimization, and human-in-the-loop collaboration to enhance adaptability, efficiency, and accountability. By drawing parallels with biological cognition and leveraging dynamic architectures, HAI seeks to balance performance with ecological responsibility. We detail the theoretical foundations, system design, and operational principles that enable AI to learn continuously and contextually while minimizing carbon footprints and human annotation costs. Our approach addresses pressing challenges in active learning, continual adaptation, and energy-efficient model deployment, offering a pathway toward responsible, human-centered artificial intelligence.
comment: 9 pages, 3 figures
☆ A Deep Latent Factor Graph Clustering with Fairness-Utility Trade-off Perspective
Fair graph clustering seeks partitions that respect network structure while maintaining proportional representation across sensitive groups, with applications spanning community detection, team formation, resource allocation, and social network analysis. Many existing approaches enforce rigid constraints or rely on multi-stage pipelines (e.g., spectral embedding followed by $k$-means), limiting trade-off control, interpretability, and scalability. We introduce \emph{DFNMF}, an end-to-end deep nonnegative tri-factorization tailored to graphs that directly optimizes cluster assignments with a soft statistical-parity regularizer. A single parameter $\lambda$ tunes the fairness--utility balance, while nonnegativity yields parts-based factors and transparent soft memberships. The optimization uses sparse-friendly alternating updates and scales near-linearly with the number of edges. Across synthetic and real networks, DFNMF achieves substantially higher group balance at comparable modularity, often dominating state-of-the-art baselines on the Pareto front. The code is available at https://github.com/SiamakGhodsi/DFNMF.git.
comment: Accepted to IEEE Big-Data 2025 main research track. The paper is 10 main pages and 4 pages of Appendix
♻ ☆ Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models
Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods typically formulate forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. This often leads to unstable optimization and degraded performance on retained data, especially under aggressive forgetting. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced via a novel logit-margin flattening loss that explicitly drives the output distribution toward uniformity on a designated forget set, while retention is preserved through a hard constraint on a separate retain set. Compared to entropy-based objectives, our loss is softmax-free, numerically stable, and maintains non-vanishing gradients, enabling more efficient and robust optimization. We solve the constrained problem using a scalable primal-dual algorithm that exposes the trade-off between forgetting and retention through the dynamics of the dual variable, all without any extra computational overhead. Evaluations on the TOFU and MUSE benchmarks across diverse LLM architectures demonstrate that our approach consistently matches or exceeds state-of-the-art baselines, effectively removing targeted information while preserving downstream utility.
comment: The Thirty-Ninth Annual Conference on Neural Information Processing Systems
♻ ☆ UNDREAM: Bridging Differentiable Rendering and Photorealistic Simulation for End-to-end Adversarial Attacks
Deep learning models deployed in safety critical applications like autonomous driving use simulations to test their robustness against adversarial attacks in realistic conditions. However, these simulations are non-differentiable, forcing researchers to create attacks that do not integrate simulation environmental factors, reducing attack success. To address this limitation, we introduce UNDREAM, the first software framework that bridges the gap between photorealistic simulators and differentiable renderers to enable end-to-end optimization of adversarial perturbations on any 3D objects. UNDREAM enables manipulation of the environment by offering complete control over weather, lighting, backgrounds, camera angles, trajectories, and realistic human and object movements, thereby allowing the creation of diverse scenes. We showcase a wide array of distinct physically plausible adversarial objects that UNDREAM enables researchers to swiftly explore in different configurable environments. This combination of photorealistic simulation and differentiable optimization opens new avenues for advancing research of physical adversarial attacks.
♻ ☆ ESCA: Contextualizing Embodied Agents via Scene-Graph Generation NeurIPS 2025
Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, existing MLLMs do not reliably capture fine-grained links between low-level visual features and high-level textual semantics, leading to weak grounding and inaccurate perception. To overcome this challenge, we propose ESCA, a framework that contextualizes embodied agents by grounding their perception in spatial-temporal scene graphs. At its core is SGCLIP, a novel, open-domain, promptable foundation model for generating scene graphs that is based on CLIP. SGCLIP is trained on 87K+ open-domain videos using a neurosymbolic pipeline that aligns automatically generated captions with scene graphs produced by the model itself, eliminating the need for human-labeled annotations. We demonstrate that SGCLIP excels in both prompt-based inference and task-specific fine-tuning, achieving state-of-the-art results on scene graph generation and action localization benchmarks. ESCA with SGCLIP improves perception for embodied agents based on both open-source and commercial MLLMs, achieving state of-the-art performance across two embodied environments. Notably, ESCA significantly reduces agent perception errors and enables open-source models to surpass proprietary baselines. We release the source code for SGCLIP model training at https://github.com/video-fm/LASER and for the embodied agent at https://github.com/video-fm/ESCA.
comment: Accepted as a Spotlight Paper at NeurIPS 2025
♻ ☆ Now you see me! Attribution Distributions Reveal What is Truly Important for a Prediction
Neural networks are regularly employed in high-stakes decision-making, where understanding and transparency is key. Attribution methods have been developed to gain understanding into which input features neural networks use for a specific prediction. Although widely used in computer vision, these methods often result in unspecific saliency maps that fail to identify the relevant information that led to a decision, supported by different benchmarks results. Here, we revisit the common attribution pipeline and identify one cause for the lack of specificity in attributions as the computation of attribution of isolated logits. Instead, we suggest to combine attributions of multiple class logits in analogy to how the softmax combines the information across logits. By computing probability distributions of attributions over classes for each spatial location in the image, we unleash the true capabilities of existing attribution methods, revealing better object- and instance-specificity and uncovering discriminative as well as shared features between classes. On common benchmarks, including the grid-pointing game and randomization-based sanity checks, we show that this reconsideration of how and where we compute attributions across the network improves established attribution methods while staying agnostic to model architectures. We make the code publicly available: https://github.com/nilspwalter/var.
♻ ☆ SafeCOMM: A Study on Safety Degradation in Fine-Tuned Telecom Large Language Models
Fine-tuning large language models (LLMs) on telecom datasets is a common practice to adapt general-purpose models to the telecom domain. However, little attention has been paid to how this process may compromise model safety. Recent research has shown that even benign fine-tuning can degrade the safety alignment of LLMs, causing them to respond to harmful or unethical user queries. In this paper, we investigate this issue by fine-tuning LLMs on three representative telecom datasets and show that safety degrades even for light telecom domain adaptation. To this end, we introduce TeleHarm, the first telecom-specific red-teaming benchmark, which we use alongside established Direct-Harm and HexPhi datasets to systematically assess harmful behavior. We further extend our analysis to publicly available TeleLLMs that were continually pre-trained on large telecom corpora, revealing that safety alignment is severely lacking, primarily due to the omission of safety-focused instruction tuning. To address these issues, we evaluate three realignment defenses: SafeInstruct, SafeLoRA, SafeMERGE. We show that, across all settings, the proposed defenses can effectively restore safety without compromising telecom task performance, leading to Safe teleCOMMunication (SafeCOMM) models. Our work serves as both a diagnostic study and practical guide for safety realignment in telecom-tuned LLMs, underscoring the need for safety-aware instruction and fine-tuning in the telecom domain.
♻ ☆ Enhancing Graph Neural Networks: A Mutual Learning Approach
Knowledge distillation (KD) techniques have emerged as a powerful tool for transferring expertise from complex teacher models to lightweight student models, particularly beneficial for deploying high-performance models in resource-constrained devices. This approach has been successfully applied to graph neural networks (GNNs), harnessing their expressive capabilities to generate node embeddings that capture structural and feature-related information. In this study, we depart from the conventional KD approach by exploring the potential of collaborative learning among GNNs. In the absence of a pre-trained teacher model, we show that relatively simple and shallow GNN architectures can synergetically learn efficient models capable of performing better during inference, particularly in tackling multiple tasks. We propose a collaborative learning framework where ensembles of student GNNs mutually teach each other throughout the training process. We introduce an adaptive logit weighting unit to facilitate efficient knowledge exchange among models and an entropy enhancement technique to improve mutual learning. These components dynamically empower the models to adapt their learning strategies during training, optimizing their performance for downstream tasks. Extensive experiments conducted on three datasets each for node and graph classification demonstrate the effectiveness of our approach.
♻ ☆ On the Stability of Graph Convolutional Neural Networks: A Probabilistic Perspective
Graph convolutional neural networks (GCNNs) have emerged as powerful tools for analyzing graph-structured data, achieving remarkable success across diverse applications. However, the theoretical understanding of the stability of these models, i.e., their sensitivity to small changes in the graph structure, remains in rather limited settings, hampering the development and deployment of robust and trustworthy models in practice. To fill this gap, we study how perturbations in the graph topology affect GCNN outputs and propose a novel formulation for analyzing model stability. Unlike prior studies that focus only on worst-case perturbations, our distribution-aware formulation characterizes output perturbations across a broad range of input data. This way, our framework enables, for the first time, a probabilistic perspective on the interplay between the statistical properties of the node data and perturbations in the graph topology. We conduct extensive experiments to validate our theoretical findings and demonstrate their benefits over existing baselines, in terms of both representation stability and adversarial attacks on downstream tasks. Our results demonstrate the practical significance of the proposed formulation and highlight the importance of incorporating data distribution into stability analysis.
♻ ☆ ProSpero: Active Learning for Robust Protein Design Beyond Wild-Type Neighborhoods NeurIPS 2025
Designing protein sequences of both high fitness and novelty is a challenging task in data-efficient protein engineering. Exploration beyond wild-type neighborhoods often leads to biologically implausible sequences or relies on surrogate models that lose fidelity in novel regions. Here, we propose ProSpero, an active learning framework in which a frozen pre-trained generative model is guided by a surrogate updated from oracle feedback. By integrating fitness-relevant residue selection with biologically-constrained Sequential Monte Carlo sampling, our approach enables exploration beyond wild-type neighborhoods while preserving biological plausibility. We show that our framework remains effective even when the surrogate is misspecified. ProSpero consistently outperforms or matches existing methods across diverse protein engineering tasks, retrieving sequences of both high fitness and novelty.
comment: NeurIPS 2025
♻ ☆ On the Structure of Stationary Solutions to McKean-Vlasov Equations with Applications to Noisy Transformers
We study stationary solutions of McKean-Vlasov equations on the circle. Our main contributions stem from observing an exact equivalence between solutions of the stationary McKean-Vlasov equation and an infinite-dimensional quadratic system of equations over Fourier coefficients, which allows explicit characterization of the stationary states in a sequence space rather than a function space. This framework provides a transparent description of local bifurcations, characterizing their periodicity, and resonance structures, while accommodating singular potentials. We derive analytic expressions that characterize the emergence, form and shape (supercritical, critical, subcritical or transcritical) of bifurcations involving possibly multiple Fourier modes and connect them with discontinuous phase transitions. We also characterize, under suitable assumptions, the detailed structure of the stationary bifurcating solutions that are accurate upto an arbitrary number of Fourier modes. At the global level, we establish regularity and concavity properties of the free energy landscape, proving existence, compactness, and coexistence of globally minimizing stationary measures, further identifying discontinuous phase transitions with points of non-differentiability of the minimum free energy map. As an application, we specialize the theory to the Noisy Mean-Field Transformer model, where we show how changing the inverse temperature parameter $\beta$ affects the geometry of the infinitely many bifurcations from the uniform measure. We also explain how increasing $\beta$ can lead to a rich class of approximate multi-mode stationary solutions which can be seen as `metastable states'. Further, a sharp transition from continuous to discontinuous (first-order) phase behavior is observed as $\beta$ increases.
comment: 46 pages, 5 figures
♻ ☆ WhaleVAD-BPN: Improving Baleen Whale Call Detection with Boundary Proposal Networks and Post-processing Optimisation
While recent sound event detection (SED) systems can identify baleen whale calls in marine audio, challenges related to false positive and minority-class detection persist. We propose the boundary proposal network (BPN), which extends an existing lightweight SED system. The BPN is inspired by work in image object detection and aims to reduce the number of false positive detections. It achieves this by using intermediate latent representations computed within the backbone classification model to gate the final output. When added to an existing SED system, the BPN achieves a 16.8 % absolute increase in precision, as well as 21.3 % and 9.4 % improvements in the F1-score for minority-class d-calls and bp-calls, respectively. We further consider two approaches to the selection of post-processing hyperparameters: a forward-search and a backward-search. By separately optimising event-level and frame-level hyperparameters, these two approaches lead to considerable performance improvements over parameters selected using empirical methods. The complete WhaleVAD-BPN system achieves a cross-validated development F1-score of 0.475, which is a 9.8 % absolute improvement over the baseline.
♻ ☆ DmC: Nearest Neighbor Guidance Diffusion Model for Offline Cross-domain Reinforcement Learning ECAI 2025
Cross-domain offline reinforcement learning (RL) seeks to enhance sample efficiency in offline RL by utilizing additional offline source datasets. A key challenge is to identify and utilize source samples that are most relevant to the target domain. Existing approaches address this challenge by measuring domain gaps through domain classifiers, target transition dynamics modeling, or mutual information estimation using contrastive loss. However, these methods often require large target datasets, which is impractical in many real-world scenarios. In this work, we address cross-domain offline RL under a limited target data setting, identifying two primary challenges: (1) Dataset imbalance, which is caused by large source and small target datasets and leads to overfitting in neural network-based domain gap estimators, resulting in uninformative measurements; and (2) Partial domain overlap, where only a subset of the source data is closely aligned with the target domain. To overcome these issues, we propose DmC, a novel framework for cross-domain offline RL with limited target samples. Specifically, DmC utilizes $k$-nearest neighbor ($k$-NN) based estimation to measure domain proximity without neural network training, effectively mitigating overfitting. Then, by utilizing this domain proximity, we introduce a nearest-neighbor-guided diffusion model to generate additional source samples that are better aligned with the target domain, thus enhancing policy learning with more effective source samples. Through theoretical analysis and extensive experiments in diverse MuJoCo environments, we demonstrate that DmC significantly outperforms state-of-the-art cross-domain offline RL methods, achieving substantial performance gains.
comment: accepted at ECAI 2025; offline cross-domain reinforcement learning with a guided diffusion model;
Human-Computer Interaction 34
☆ Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier
The recent advancement of Multimodal Large Language Models (MLLMs) is transforming human-computer interaction (HCI) from surface-level exchanges into more nuanced and emotionally intelligent communication. To realize this shift, emotion understanding becomes essential allowing systems to capture subtle cues underlying user intent. Furthermore, providing faithful explanations for predicted emotions is crucial to ensure interpretability and build user trust. However, current MLLM-based methods often generate emotion explanations that diverge from the target labels and sometimes even contradict their own predicted emotions. This inconsistency poses a critical risk for misunderstanding and erodes reliability in interactive settings. To address this, we propose a novel approach: the Emotional Rationale Verifier (ERV) and an Explanation Reward. Our method guides the model to produce reasoning that is explicitly consistent with the target emotion during multimodal emotion recognition without modifying the model architecture or requiring additional paired video-description annotations. Our method significantly improves faithful explanation-prediction consistency and explanation emotion accuracy on the MAFW and DFEW datasets. Through extensive experiments and human evaluations, we show that our approach not only enhances alignment between explanation and prediction but also empowers MLLMs to deliver emotionally coherent, trustworthy interactions, marking a key step toward truly human-like HCI systems.
comment: 16 pages, 11 figures
☆ Human-AI Collaborative Uncertainty Quantification
AI predictive systems are increasingly embedded in decision making pipelines, shaping high stakes choices once made solely by humans. Yet robust decisions under uncertainty still rely on capabilities that current AI lacks: domain knowledge not captured by data, long horizon context, and reasoning grounded in the physical world. This gap has motivated growing efforts to design collaborative frameworks that combine the complementary strengths of humans and AI. This work advances this vision by identifying the fundamental principles of Human AI collaboration within uncertainty quantification, a key component of reliable decision making. We introduce Human AI Collaborative Uncertainty Quantification, a framework that formalizes how an AI model can refine a human expert's proposed prediction set with two goals: avoiding counterfactual harm, ensuring the AI does not degrade correct human judgments, and complementarity, enabling recovery of correct outcomes the human missed. At the population level, we show that the optimal collaborative prediction set follows an intuitive two threshold structure over a single score function, extending a classical result in conformal prediction. Building on this insight, we develop practical offline and online calibration algorithms with provable distribution free finite sample guarantees. The online method adapts to distribution shifts, including human behavior evolving through interaction with AI, a phenomenon we call Human to AI Adaptation. Experiments across image classification, regression, and text based medical decision making show that collaborative prediction sets consistently outperform either agent alone, achieving higher coverage and smaller set sizes across various conditions.
☆ Shareholder Democracy with AI Representatives
A large share of retail investors hold public equities through mutual funds, yet lack adequate control over these investments. Indeed, mutual funds concentrate voting power in the hands of a few asset managers. These managers vote on behalf of shareholders despite having limited insight into their individual preferences, leaving them exposed to growing political and regulatory pressures, particularly amid rising shareholder activism. Pass-through voting has been proposed as a way to empower retail investors and provide asset managers with clearer guidance, but it faces challenges such as low participation rates and the difficulty of capturing highly individualized shareholder preferences for each specific vote. Randomly selected assemblies of shareholders, or ``investor assemblies,'' have also been proposed as more representative proxies than asset managers. As a third alternative, we propose artificial intelligence (AI) enabled representatives trained on individual shareholder preferences to act as proxies and vote on their behalf. Over time, these models could not only predict how retail investors would vote at any given moment but also how they might vote if they had significantly more time, knowledge, and resources to evaluate each proposal, leading to better overall decision-making. We argue that shareholder democracy offers a compelling real-world test bed for AI-enabled representation, providing valuable insights into both the potential benefits and risks of this approach more generally.
☆ Education Paradigm Shift To Maintain Human Competitive Advantage Over AI
Discussion about the replacement of intellectual human labour by ``thinking machines'' has been present in the public and expert discourse since the creation of Artificial Intelligence (AI) as an idea and terminology since the middle of the twentieth century. Until recently, it was more of a hypothetical concern. However, in recent years, with the rise of Generative AI, especially Large Language Models (LLM), and particularly with the widespread popularity of the ChatGPT model, that concern became practical. Many domains of human intellectual labour have to adapt to the new AI tools that give humans new functionality and opportunity, but also question the viability and necessity of some human work that used to be considered intellectual yet has now become an easily automatable commodity. Education, unexpectedly, has now become burdened by an especially crucial role of charting long-range strategies for discovering viable human skills that would guarantee their place in the world of the ubiquitous use of AI in the intellectual sphere. We highlight weaknesses of the current AI and, especially, of its LLM-based core, show that root causes of LLMs' weaknesses are unfixable by the current technologies, and propose directions in the constructivist paradigm for the changes in Education that ensure long-term advantages of humans over AI tools.
☆ Reciprocity Deficits: Observing AI in the street with everyday publics
The street has emerged as a primary site where everyday publics are confronted with AI as an infrastructural phenomenon, as machine learning-based systems are now commonly deployed in this setting in the form of automated cars, facial recognition, smart billboards and the like. While these deployments of AI in the street have attracted significant media attention and public controversy in recent years, the presence of AI in the street often remains inscrutable, and many everyday publics are unaware of it. In this paper, we explore the challenges and possibilities of everyday public engagement with AI in the situated environment of city streets under these paradoxical conditions. Combining perspectives and approaches from social and cultural studies of AI, Design Research and Science and Technology Studies (STS), we explore the affordances of the street as a site for 'material participation' in AI through design-based interventions: the creation of 'everyday AI observatories.' We narrate and reflect on our participatory observations of AI in five city streets in the UK and Australia and highlight a set of tensions that emerged from them: 1) the framing of the street as a transactional environment, 2) the designed invisibility of AI and its publics in the street 3) the stratification of street environments through statistical governance. Based on this discussion and drawing on Jane Jacobs' notion of "eyes on the street," we put forward the relational notion of "reciprocity deficits" between AI infrastructures and everyday publics in the street. The conclusion reflects on the consequences of this form of social invisibility of AI for situated engagement with AI by everyday publics in the street and for public trust in urban governance.
comment: 13 pages, 4 Figures
☆ Planning Ahead with RSA: Efficient Signalling in Dynamic Environments by Projecting User Awareness across Future Timesteps
Adaptive agent design offers a way to improve human-AI collaboration on time-sensitive tasks in rapidly changing environments. In such cases, to ensure the human maintains an accurate understanding of critical task elements, an assistive agent must not only identify the highest priority information but also estimate how and when this information can be communicated most effectively, given that human attention represents a zero-sum cognitive resource where focus on one message diminishes awareness of other or upcoming information. We introduce a theoretical framework for adaptive signalling which meets these challenges by using principles of rational communication, formalised as Bayesian reference resolution using the Rational Speech Act (RSA) modelling framework, to plan a sequence of messages which optimise timely alignment between user belief and a dynamic environment. The agent adapts message specificity and timing to the particulars of a user and scenario based on projections of how prior-guided interpretation of messages will influence attention to the interface and subsequent belief update, across several timesteps out to a fixed horizon. In a comparison to baseline methods, we show that this effectiveness depends crucially on combining multi-step planning with a realistic model of user awareness. As the first application of RSA for communication in a dynamic environment, and for human-AI interaction in general, we establish theoretical foundations for pragmatic communication in human-agent teams, highlighting how insights from cognitive science can be capitalised to inform the design of assistive agents.
comment: 11 pages, 3 figures
☆ Partnering with Generative AI: Experimental Evaluation of Human-Led and Model-Led Interaction in Human-AI Co-Creation
Large language models (LLMs) show strong potential to support creative tasks, but the role of the interface design is poorly understood. In particular, the effect of different modes of collaboration between humans and LLMs on co-creation outcomes is unclear. To test this, we conducted a randomized controlled experiment ($N = 486$) comparing: (a) two variants of reflective, human-led modes in which the LLM elicits elaboration through suggestions or questions, against (b) a proactive, model-led mode in which the LLM independently rewrites ideas. By assessing the effects on idea quality, diversity, and perceived ownership, we found that the model-led mode substantially improved idea quality but reduced idea diversity and users' perceived idea ownership. The reflective, human-led mode also improved idea quality, yet while preserving diversity and ownership. Our findings highlight the importance of designing interactions with generative AI systems as reflective thought partners that complement human strengths and augment creative processes.
☆ Arabic Little STT: Arabic Children Speech Recognition Dataset
The performance of Artificial Intelligence (AI) systems fundamentally depends on high-quality training data. However, low-resource languages like Arabic suffer from severe data scarcity. Moreover, the absence of child-specific speech corpora is an essential gap that poses significant challenges. To address this gap, we present our created dataset, Arabic Little STT, a dataset of Levantine Arabic child speech recorded in classrooms, containing 355 utterances from 288 children (ages 6 - 13). We further conduct a systematic assessment of Whisper, a state-of-the-art automatic speech recognition (ASR) model, on this dataset and compare its performance with adult Arabic benchmarks. Our evaluation across eight Whisper variants reveals that even the best-performing model (Large_v3) struggles significantly, achieving a 0.66 word error rate (WER) on child speech, starkly contrasting with its sub 0.20 WER on adult datasets. These results align with other research on English speech. Results highlight the critical need for dedicated child speech benchmarks and inclusive training data in ASR development. Emphasizing that such data must be governed by strict ethical and privacy frameworks to protect sensitive child information. We hope that this study provides an initial step for future work on equitable speech technologies for Arabic-speaking children. We hope that our publicly available dataset enrich the children's demographic representation in ASR datasets.
☆ Moderating Role of Presence in EEG Responses to Visuo-haptic Prediction Error in Virtual Reality
Virtual reality (VR) can create compelling experiences that evoke presence, the sense of ``being there.'' However, problems in rendering can create sensorimotor disruptions that undermine presence and task performance. Presence is typically assessed with post-hoc questionnaires, but their coarse temporal resolution limits insight into how sensorimotor disruptions shape user experience. Here, we combined questionnaires with electroencephalography (EEG) to identify neural markers of presence-affecting prediction error in immersive VR. Twenty-five participants performed a grasp-and-place task under two levels of immersion (visual-only vs.~visuo-haptic). Occasional oddball-like sensorimotor disruptions introduced premature feedback to elicit prediction errors. Overall, higher immersion enhanced self-presence but not physical presence, while accuracy and speed improved over time irrespective of immersion. At the neural level, sensorimotor disruptions elicited robust event-related potential effects at FCz and Pz, accompanied by increases in frontal midline $\theta$ and posterior $\alpha$ suppression. Through source analyses localized to anterior-- and posterior cingulate cortex (ACC/PCC) we found that PCC $\alpha$ activity showed heightened sensitivity to disruptions exclusively in visuo-haptic immersion. Exploratory moderation analyses by presence scores revealed no consistent patterns. Together, these results suggest that higher immersion amplifies both the benefits and costs of sensorimotor coherence.
☆ Multi-Stakeholder Alignment in LLM-Powered Collaborative AI Systems: A Multi-Agent Framework for Intelligent Tutoring
The integration of Large Language Models into Intelligent Tutoring Systems pre-sents significant challenges in aligning with diverse and often conflicting values from students, parents, teachers, and institutions. Existing architectures lack for-mal mechanisms for negotiating these multi-stakeholder tensions, creating risks in accountability and bias. This paper introduces the Advisory Governance Layer (AGL), a non-intrusive, multi-agent framework designed to enable distributed stakeholder participation in AI governance. The AGL employs specialized agents representing stakeholder groups to evaluate pedagogical actions against their spe-cific policies in a privacy-preserving manner, anticipating future advances in per-sonal assistant technology that will enhance stakeholder value expression. Through a novel policy taxonomy and conflict-resolution protocols, the frame-work provides structured, auditable governance advice to the ITS without altering its core pedagogical decision-making. This work contributes a reference architec-ture and technical specifications for aligning educational AI with multi-stakeholder values, bridging the gap between high-level ethical principles and practical implementation.
☆ If They Disagree, Will You Conform? Exploring the Role of Robots' Value Awareness in a Decision-Making Task
This study investigates whether the opinions of robotic agents are more likely to influence human decision-making when the robots are perceived as value-aware (i.e., when they display an understanding of human principles). We designed an experiment in which participants interacted with two Furhat robots - one programmed to be Value-Aware and the other Non-Value-Aware - during a labeling task for images representing human values. Results indicate that participants distinguished the Value-Aware robot from the Non-Value-Aware one. Although their explicit choices did not indicate a clear preference for one robot over the other, participants directed their gaze more toward the Value-Aware robot. Additionally, the Value-Aware robot was perceived as more loyal, suggesting that value awareness in a social robot may enhance its perceived commitment to the group. Finally, when both robots disagreed with the participant, conformity occurred in about one out of four trials, and participants took longer to confirm their responses, suggesting that two robots expressing dissent may introduce hesitation in decision-making. On one hand, this highlights the potential risk that robots, if misused, could manipulate users for unethical purposes. On the other hand, it reinforces the idea that social robots might encourage reflection in ambiguous situations and help users avoid scams.
☆ Reasoning About Reasoning: Towards Informed and Reflective Use of LLM Reasoning in HCI
Reasoning is a distinctive human-like characteristic attributed to LLMs in HCI due to their ability to simulate various human-level tasks. However, this work argues that the reasoning behavior of LLMs in HCI is often decontextualized from the underlying mechanics and subjective decisions that condition the emergence and human interpretation of this behavior. Through a systematic survey of 258 CHI papers from 2020-2025 on LLMs, we discuss how HCI hardly perceives LLM reasoning as a product of sociotechnical orchestration and often references it as an object of application. We argue that such abstraction leads to oversimplification of reasoning methodologies from NLP/ML and results in a distortion of LLMs' empirically studied capabilities and (un)known limitations. Finally, drawing on literature from both NLP/ML and HCI, as a constructive step forward, we develop reflection prompts to support HCI practitioners engage with LLM reasoning in an informed and reflective way.
comment: 14 pages, 3 figures, 1 table
☆ Improving Human Verification of LLM Reasoning through Interactive Explanation Interfaces
The reasoning capabilities of Large Language Models (LLMs) have led to their increasing employment in several critical applications, particularly education, where they support problem-solving, tutoring, and personalized study. While there are a plethora of works showing the effectiveness of LLMs in generating step-by-step solutions through chain-of-thought (CoT) reasoning on reasoning benchmarks, little is understood about whether the generated CoT is helpful for end-users in improving their ability to comprehend mathematical reasoning problems and detect errors/hallucinations in LLM-generated solutions. To address this gap and contribute to understanding how reasoning can improve human-AI interaction, we present three new interactive reasoning interfaces: interactive CoT (iCoT), interactive Program-of-Thought (iPoT), and interactive Graph (iGraph), and a novel framework that generates the LLM's reasoning from traditional CoT to alternative, interactive formats. Across 125 participants, we found that interactive interfaces significantly improved performance. Specifically, the iGraph interface yielded the highest clarity and error detection rate (85.6%), followed by iPoT (82.5%), iCoT (80.6%), all outperforming standard CoT (73.5%). Interactive interfaces also led to faster response times, where participants using iGraph were fastest (57.9 secs), compared to iCoT and iPoT (60 secs), and the standard CoT baseline (64.7 secs). Furthermore, participants preferred the iGraph reasoning interface, citing its superior ability to enable users to follow the LLM's reasoning process. We discuss the implications of these results and provide recommendations for the future design of reasoning models.
comment: 19 pages, 14 figures
☆ Clinic-Oriented Feasibility of a Sensor-Fused Wearable for Upper-Limb Function
Background: Upper-limb weakness and tremor (4--12 Hz) limit activities of daily living (ADL) and reduce adherence to home rehabilitation. Objective: To assess technical feasibility and clinician-relevant signals of a sensor-fused wearable targeting the triceps brachii and extensor pollicis brevis. Methods: A lightweight node integrates surface EMG (1 kHz), IMU (100--200 Hz), and flex/force sensors with on-device INT8 inference (Tiny 1D-CNN/Transformer) and a safety-bounded assist policy (angle/torque/jerk limits; stall/time-out). Healthy adults (n = 12) performed three ADL-like tasks. Primary outcomes: Tremor Index (TI), range of motion (ROM), repetitions (Reps min$^{-1}$). Secondary: EMG median-frequency slope (fatigue trend), closed-loop latency, session completion, and device-related adverse events. Analyses used subject-level paired medians with BCa 95\% CIs; exact Wilcoxon $p$-values are reported in the Results. Results: Assistance was associated with lower tremor prominence and improved task throughput: TI decreased by $-0.092$ (95\% CI [$-0.102$, $-0.079$]), ROM increased by $+12.65\%$ (95\% CI [$+8.43$, $+13.89$]), and Reps rose by $+2.99$ min$^{-1}$ (95\% CI [$+2.61$, $+3.35$]). Median on-device latency was 8.7 ms at a 100 Hz loop rate; all sessions were completed with no device-related adverse events. Conclusions: Multimodal sensing with low-latency, safety-bounded assistance produced improved movement quality (TI $\downarrow$) and throughput (ROM, Reps $\uparrow$) in a pilot technical-feasibility setting, supporting progression to IRB-approved patient studies. Trial registration: Not applicable (pilot non-clinical).
comment: 19 pages, 7 figures, 5 Tables
☆ Towards AI as Colleagues: Multi-Agent System Improves Structured Professional Ideation
Most AI systems today are designed to manage tasks and execute predefined steps. This makes them effective for process coordination but limited in their ability to engage in joint problem-solving with humans or contribute new ideas. We introduce MultiColleagues, a multi-agent conversational system that shows how AI agents can act as colleagues by conversing with each other, sharing new ideas, and actively involving users in collaborative ideation. In a within-subjects study with 20 participants, we compared MultiColleagues to a single-agent baseline. Results show that MultiColleagues fostered stronger perceptions of social presence, produced ideas rated significantly higher in quality and novelty, and encouraged deeper elaboration. These findings demonstrate the potential of AI agents to move beyond process partners toward colleagues that share intent, strengthen group dynamics, and collaborate with humans to advance ideas.
☆ MORA: AI-Mediated Story-Based practice for Speech Sound Disorder from Clinic to Home
Speech sound disorder is among the most common communication challenges in preschool children. Home-based practice is essential for effective therapy and for acquiring generalization of target sounds, yet sustaining engaging and consistent practice remains difficult. Existing story-based activities, despite their potential for sound generalization and educational benefits, are often underutilized due to limited interactivity. Moreover, many practice tools fail to sufficiently integrate speech-language pathologists into the process, resulting in weak alignment with clinical treatment plans. To address these limitations, we present MORA, an interactive story-based practice system. MORA introduces three key innovations. First, it embeds target sounds and vocabulary into dynamic, character-driven conversational narratives, requiring children to actively produce speech to progress the story, thereby creating natural opportunities for exposure, repetition, and generalization. Second, it provides visual cues, explicit instruction, and feedback, allowing children to practice effectively either independently or with caregivers. Third, it supports an AI-in-the-loop workflow, enabling SLPs to configure target materials, review logged speech with phoneme-level scoring, and adapt therapy plans asynchronously -- bridging the gap between clinic and home practice while respecting professional expertise. A formative study with six licensed SLPs informed the system's design rationale, and an expert review with seven SLPs demonstrated strong alignment with established articulation-based treatments, as well as potential to enhance children's engagement and literacy. Furthermore, discussions highlight the design considerations for professional support and configurability, adaptive and multimodal child interaction, while highlighting MORA's broader applicability across speech and language disorders.
☆ Large Language Model Agent Personality and Response Appropriateness: Evaluation by Human Linguistic Experts, LLM-as-Judge, and Natural Language Processing Model
While Large Language Model (LLM)-based agents can be used to create highly engaging interactive applications through prompting personality traits and contextual data, effectively assessing their personalities has proven challenging. This novel interdisciplinary approach addresses this gap by combining agent development and linguistic analysis to assess the prompted personality of LLM-based agents in a poetry explanation task. We developed a novel, flexible question bank, informed by linguistic assessment criteria and human cognitive learning levels, offering a more comprehensive evaluation than current methods. By evaluating agent responses with natural language processing models, other LLMs, and human experts, our findings illustrate the limitations of purely deep learning solutions and emphasize the critical role of interdisciplinary design in agent development.
☆ Spatial Orchestra: Locomotion Music Instruments through Spatial Exploration
Spatial Orchestra demonstrates how easy it is to play musical instruments using basic input like natural locomotion, which is accessible to most. Unlike many musical instruments, our work allows individuals of all skill levels to effortlessly create music by walking into virtual bubbles. Our Augmented Reality experience involves interacting with ever-shifting sound bubbles that the user engages with by stepping into color-coded bubbles within the assigned area using a standalone AR headset. Each bubble corresponds to a cello note, and omits sound from the center of the bubble, and lets the user hear and express in spatial audio, effectively transforming participants into musicians. This interactive element enables users to explore the intersection of spatial awareness, musical rhythm that extends to bodily expression through playful movements and dance-like gestures within the bubble-filled environment. This unique experience illuminates the intricate relationship between spatial awareness and the art of musical performance.
comment: Extended Abstracts (Interactivity), 5 pages. Published at the 2024 CHI Conference on Human Factors in Computing Systems
☆ Reality Distortion Room: A Study of User Locomotion Responses to Spatial Augmented Reality Effects
Reality Distortion Room (RDR) is a proof-of-concept augmented reality system using projection mapping and unencumbered interaction with the Microsoft RoomAlive system to study a user's locomotive response to visual effects that seemingly transform the physical room the user is in. This study presents five effects that augment the appearance of a physical room to subtly encourage user motion. Our experiment demonstrates users' reactions to the different distortion and augmentation effects in a standard living room, with the distortion effects projected as wall grids, furniture holograms, and small particles in the air. The augmented living room can give the impression of becoming elongated, wrapped, shifted, elevated, and enlarged. The study results support the implementation of AR experiences in limited physical spaces by providing an initial understanding of how users can be subtly encouraged to move throughout a room.
comment: Conference Paper, 10 pages. Published at the 2023 IEEE International Symposium on Mixed and Augmented Reality (ISMAR)
☆ Moderating Role of Presence in EEG Responses to Visuo-haptic Prediction Error in Virtual Reality
Virtual reality (VR) can create compelling experiences that evoke presence, the sense of ``being there.'' However, problems in rendering can create sensorimotor disruptions that undermine presence and task performance. Presence is typically assessed with post-hoc questionnaires, but their coarse temporal resolution limits insight into how sensorimotor disruptions shape user experience. Here, we combined questionnaires with electroencephalography (EEG) to identify neural markers of presence-affecting prediction error in immersive VR. Twenty-five participants performed a grasp-and-place task under two levels of immersion (visual-only vs.~visuo-haptic). Occasional oddball-like sensorimotor disruptions introduced premature feedback to elicit prediction errors. Overall, higher immersion enhanced self-presence but not physical presence, while accuracy and speed improved over time irrespective of immersion. At the neural level, sensorimotor disruptions elicited robust event-related potential effects at FCz and Pz, accompanied by increases in frontal midline $\theta$ and posterior $\alpha$ suppression. Through source analyses localized to anterior- and posterior cingulate cortex (ACC/PCC) we found that PCC $\alpha$ activity showed heightened sensitivity to disruptions exclusively in visuo-haptic immersion. Exploratory moderation analyses by presence scores revealed no consistent patterns. Together, these results suggest that higher immersion amplifies both the benefits and costs of sensorimotor coherence.
♻ ☆ What Is Your AI Agent Buying? Evaluation, Implications and Emerging Questions for Agentic E-Commerce
Online marketplaces will be transformed by autonomous AI agents acting on behalf of consumers. Rather than humans browsing and clicking, AI agents can parse webpages or interact through APIs to evaluate products, and transact. This raises a fundamental question: what do AI agents buy-and why? We develop ACES, a sandbox environment that pairs a platform-agnostic agent with a fully programmable mock marketplace to study this. We first explore aggregate choices, revealing that modal choices can differ across models, with AI agents sometimes concentrating on a few products, raising competition questions. We then analyze the drivers of choices through rationality checks and randomized experiments on product positions and listing attributes. Models show sizeable and heterogeneous position effects: all favor the top row, yet different models prefer different columns, undermining the assumption of a universal ``top'' rank. They penalize sponsored tags, reward endorsements, and sensitivities to price, ratings, and reviews are directionally as expected, but vary sharply across models. Finally, we find that a seller-side agent that makes minor tweaks to product descriptions can deliver substantial market-share gains by targeting AI buyer preferences. Our findings reveal how AI agents behave in e-commerce, and surface concrete seller strategy, platform design, and regulatory questions.
♻ ☆ Validating LLM-as-a-Judge Systems under Rating Indeterminacy NeurIPS 2025
The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, plays a critical role in scaling and standardizing GenAI evaluations. To validate such judge systems, evaluators assess human--judge agreement by first collecting multiple human ratings for each item in a validation corpus, then aggregating the ratings into a single, per-item gold label rating. For many items, however, rating criteria may admit multiple valid interpretations, so a human or LLM rater may deem multiple ratings "reasonable" or "correct." We call this condition rating indeterminacy. Problematically, many rating tasks that contain rating indeterminacy rely on forced-choice elicitation, whereby raters are instructed to select only one rating for each item. In this paper, we introduce a framework for validating LLM-as-a-judge systems under rating indeterminacy. We draw theoretical connections between different measures of judge system performance under different human--judge agreement metrics, and different rating elicitation and aggregation schemes. We demonstrate that differences in how humans and LLMs resolve rating indeterminacy when responding to forced-choice rating instructions can heavily bias LLM-as-a-judge validation. Through extensive experiments involving 11 real-world rating tasks and 9 commercial LLMs, we show that standard validation approaches that rely upon forced-choice ratings select judge systems that are highly suboptimal, performing as much as 31% worse than judge systems selected by our approach that uses multi-label "response set" ratings to account for rating indeterminacy. We conclude with concrete recommendations for more principled approaches to LLM-as-a-judge validation.
comment: NeurIPS 2025
♻ ☆ Estimating LLM Consistency: A User Baseline vs Surrogate Metrics EMNLP 2025
Large language models (LLMs) are prone to hallucinations and sensitiveto prompt perturbations, often resulting in inconsistent or unreliablegenerated text. Different methods have been proposed to mitigate suchhallucinations and fragility, one of which is to measure theconsistency of LLM responses -- the model's confidence in the responseor likelihood of generating a similar response when resampled. Inprevious work, measuring LLM response consistency often relied oncalculating the probability of a response appearing within a pool of resampledresponses, analyzing internal states, or evaluating logits of resopnses.However, it was not clear how well theseapproaches approximated users' perceptions of consistency of LLMresponses. To find out, we performed a user study ($n=2,976$)demonstrating that current methods for measuring LLM responseconsistency typically do not align well with humans' perceptions of LLMconsistency. We propose a logit-based ensemble method for estimatingLLM consistency and show that our method matches the performance of thebest-performing existing metric in estimating human ratings of LLMconsistency. Our results suggest that methods for estimating LLMconsistency without human evaluation are sufficiently imperfect towarrant broader use of evaluation with human input; this would avoidmisjudging the adequacy of models because of the imperfections ofautomated consistency metrics.
comment: Published as a main conference paper at EMNLP 2025
♻ ☆ Can Large Language Models Unlock Novel Scientific Research Ideas? EMNLP 2025
The widespread adoption of Large Language Models (LLMs) and publicly available ChatGPT have marked a significant turning point in the integration of Artificial Intelligence (AI) into people's everyday lives. This study examines the ability of Large Language Models (LLMs) to generate future research ideas from scientific papers. Unlike tasks such as summarization or translation, idea generation lacks a clearly defined reference set or structure, making manual evaluation the default standard. However, human evaluation in this setting is extremely challenging ie: it requires substantial domain expertise, contextual understanding of the paper, and awareness of the current research landscape. This makes it time-consuming, costly, and fundamentally non-scalable, particularly as new LLMs are being released at a rapid pace. Currently, there is no automated evaluation metric specifically designed for this task. To address this gap, we propose two automated evaluation metrics: Idea Alignment Score (IAScore) and Idea Distinctness Index. We further conducted human evaluation to assess the novelty, relevance, and feasibility of the generated future research ideas. This investigation offers insights into the evolving role of LLMs in idea generation, highlighting both its capability and limitations. Our work contributes to the ongoing efforts in evaluating and utilizing language models for generating future research ideas. We make our datasets and codes publicly available
comment: EMNLP 2025 (Main)
♻ ☆ Human-AI Collaboration: Trade-offs Between Performance and Preferences
Despite the growing interest in collaborative AI, designing systems that seamlessly integrate human input remains a major challenge. In this study, we developed a task to systematically examine human preferences for collaborative agents. We created and evaluated five collaborative AI agents with strategies that differ in the manner and degree they adapt to human actions. Participants interacted with a subset of these agents, evaluated their perceived traits, and selected their preferred agent. We used a Bayesian model to understand how agents' strategies influence the Human-AI team performance, AI's perceived traits, and the factors shaping human-preferences in pairwise agent comparisons. Our results show that agents who are more considerate of human actions are preferred over purely performance-maximizing agents. Moreover, we show that such human-centric design can improve the likability of AI collaborators without reducing performance. We find evidence for inequality-aversion effects being a driver of human choices, suggesting that people prefer collaborative agents which allow them to meaningfully contribute to the team. Taken together, these findings demonstrate how collaboration with AI can benefit from development efforts which include both subjective and objective metrics.
comment: LW Mayer & S Karny are co-first authors
♻ ☆ StereoDetect: Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings
Stereotypes are known to have very harmful effects, making their detection critically important. However, current research predominantly focuses on detecting and evaluating stereotypical biases, thereby leaving the study of stereotypes in its early stages. Our study revealed that many works have failed to clearly distinguish between stereotypes and stereotypical biases, which has significantly slowed progress in advancing research in this area. Stereotype and Anti-stereotype detection is a problem that requires social knowledge; hence, it is one of the most difficult areas in Responsible AI. This work investigates this task, where we propose a five-tuple definition and provide precise terminologies disentangling stereotypes, anti-stereotypes, stereotypical bias, and general bias. We provide a conceptual framework grounded in social psychology for reliable detection. We identify key shortcomings in existing benchmarks for this task of stereotype and anti-stereotype detection. To address these gaps, we developed StereoDetect, a well curated, definition-aligned benchmark dataset designed for this task. We show that sub-10B language models and GPT-4o frequently misclassify anti-stereotypes and fail to recognize neutral overgeneralizations. We demonstrate StereoDetect's effectiveness through multiple qualitative and quantitative comparisons with existing benchmarks and models fine-tuned on them. The dataset and code is available at https://github.com/KaustubhShejole/StereoDetect.
♻ ☆ Real-Time Assistive Navigation for the Visually Impaired: A Scalable Approach for Indoor and Outdoor Mobility
Navigating unfamiliar environments remains one of the most persistent and critical challenges for people who are blind or have limited vision (BLV). Existing assistive tools often rely on online services or APIs, making them costly, internet-dependent, and less reliable in real-time use. To address these limitations, we propose PathFinder, a novel mapless mobile phone-based navigation system that operates fully offline. Our method processes monocular depth images and applies an efficient pathfinding algorithm to identify the longest, clearest obstacle-free route, ensuring optimal navigation with low computational cost. Comparative evaluations show that PathFinder reduces mean absolute error (MAE), speeds decision-making, and achieves real-time responsiveness indoors and outdoors. A usability study with 15 BLV participants confirmed its practicality, where 73% learned to operate it in under a minute, and 80% praised its accuracy, responsiveness, and convenience. Despite challenges in complex indoor layouts and low light, PathFinder offers a low-cost, scalable, reliable alternative.
♻ ☆ ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions NeurIPS 2025
Recent advances in Large Language Models (LLMs) have propelled intelligent agents from reactive responses to proactive support. While promising, existing proactive agents either rely exclusively on observations from enclosed environments (e.g., desktop UIs) with direct LLM inference or employ rule-based proactive notifications, leading to suboptimal user intent understanding and limited functionality for proactive service. In this paper, we introduce ContextAgent, the first context-aware proactive agent that incorporates extensive sensory contexts surrounding humans to enhance the proactivity of LLM agents. ContextAgent first extracts multi-dimensional contexts from massive sensory perceptions on wearables (e.g., video and audio) to understand user intentions. ContextAgent then leverages the sensory contexts and personas from historical data to predict the necessity for proactive services. When proactive assistance is needed, ContextAgent further automatically calls the necessary tools to assist users unobtrusively. To evaluate this new task, we curate ContextAgentBench, the first benchmark for evaluating context-aware proactive LLM agents, covering 1,000 samples across nine daily scenarios and twenty tools. Experiments on ContextAgentBench show that ContextAgent outperforms baselines by achieving up to 8.5% and 6.0% higher accuracy in proactive predictions and tool calling, respectively. We hope our research can inspire the development of more advanced, human-centric, proactive AI assistants. The code and dataset are publicly available at https://github.com/openaiotlab/ContextAgent.
comment: Accepted by NeurIPS 2025
♻ ☆ Cross-Reality Lifestyle: Integrating Physical and Virtual Lives through Multi-Platform Metaverse
Technological advances are redefining the relationship between physical and virtual spaces. Traditionally, when users engage in virtual reality, they are completely cutoff from the physical space. Similarly, they are unable to access virtual experiences while engaged in physical activities. However, modern multiplatform metaverse environments allow simultaneous participation through mobile devices, creating new opportunities for integrated experiences. This study introduces the concept of "cross-reality lifestyles" to examine how users actively combine their physical and virtual activities. We identify three patterns of integration: first, Amplification: one space enhances experiences in the other; second, Complementary: spaces offer different but equally valuable alternatives, and third, Emergence: simultaneous engagement creates entirely new experiences. We propose the ACE cube framework that analyzes these patterns as continuous characteristics, and by integrating this analysis with technical requirements of commercial platforms, we provide practical guidelines for platform selection, technical investment prioritization, and cross-reality application development.
comment: 9 pages, preprint for IEEE Pervasive Computing Special Issue
♻ ☆ PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming
Recent developments in AI governance and safety research have called for red-teaming methods that can effectively surface potential risks posed by AI models. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are likely to uncover. While automated red-teaming approaches promise to complement human red-teaming by enabling larger-scale exploration of model behavior, current approaches do not consider the role of identity. As an initial step towards incorporating people's background and identities in automated red-teaming, we develop and evaluate a novel method, PersonaTeaming, that introduces personas in the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. In particular, we first introduce a methodology for mutating prompts based on either "red-teaming expert" personas or "regular AI user" personas. We then develop a dynamic persona-generating algorithm that automatically generates various persona types adaptive to different seed prompts. In addition, we develop a set of new metrics to explicitly measure the "mutation distance" to complement existing diversity measurements of adversarial prompts. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. We discuss the strengths and limitations of different persona types and mutation methods, shedding light on future opportunities to explore complementarities between automated and human red-teaming approaches.
♻ ☆ Refusal as Silence: Gendered Disparities in Vision-Language Model Responses
Refusal behavior by Large Language Models is increasingly visible in content moderation, yet little is known about how refusals vary by the identity of the user making the request. This study investigates refusal as a sociotechnical outcome through a counterfactual persona design that varies gender identity--including male, female, non-binary, and transgender personas--while keeping the classification task and visual input constant. Focusing on a vision-language model (GPT-4V), we examine how identity-based language cues influence refusal in binary gender classification tasks. We find that transgender and non-binary personas experience significantly higher refusal rates, even in non-harmful contexts. Our findings also provide methodological implications for equity audits and content analysis using LLMs. Our findings underscore the importance of modeling identity-driven disparities and caution against uncritical use of AI systems for content coding. This study advances algorithmic fairness by reframing refusal as a communicative act that may unevenly regulate epistemic access and participation.
♻ ☆ Evaluating AI-Powered Learning Assistants in Engineering Higher Education: Student Engagement, Ethical Challenges, and Policy Implications
As generative AI becomes increasingly integrated into higher education, understanding how students engage with these technologies is essential for responsible adoption. This study evaluates the Educational AI Hub, an AI-powered learning framework, implemented in undergraduate civil and environmental engineering courses at a large R1 public university. Using a mixed-methods design combining pre- and post-surveys, system usage logs, and qualitative analysis of students' AI interactions, the research examines perceptions of trust, ethics, usability, and learning outcomes. Findings show that students valued the AI assistant for its accessibility and comfort, with nearly half reporting greater ease using it than seeking help from instructors or teaching assistants. The tool was most helpful for completing homework and understanding concepts, though views on its instructional quality were mixed. Ethical uncertainty, particularly around institutional policy and academic integrity, emerged as a key barrier to full engagement. Overall, students regarded AI as a supplement rather than a replacement for human instruction. The study highlights the importance of usability, ethical transparency, and faculty guidance in promoting meaningful AI engagement. A total of 71 students participated across two courses, generating over 600 AI interactions and 100 survey responses that provided both quantitative and contextual insights into learning engagement.
comment: 26 pages, 10 Figures, 6 Tables
♻ ☆ "Learning Together": AI-Mediated Support for Parental Involvement in Everyday Learning
Family learning takes place in everyday routines where children and caregivers read, practice, and develop new skills together. Although AI is increasingly present in learning environments, most systems remain child-centered and overlook the collaborative, distributed nature of family education. This paper investigates how AI can mediate family collaboration by addressing tensions of coordination, uneven workloads, and parental mediation. From a formative study with families using AI in daily learning, we identified challenges in responsibility sharing and recognition of contributions. Building on these insights, we designed FamLearn, an LLM-powered prototype that distributes tasks, visualizes contributions, and provides individualized support. A one-week field study with 11 families shows how this prototype can ease caregiving burdens, foster recognition, and enrich shared learning experiences. Our findings suggest that LLMs can move beyond the role of tutor to act as family mediators - balancing responsibilities, scaffolding intergenerational participation, and strengthening the relational fabric of family learning.
♻ ☆ Learned, Lagged, LLM-splained: LLM Responses to End User Security Questions
Answering end user security questions is challenging. While large language models (LLMs) like GPT, LLAMA, and Gemini are far from error-free, they have shown promise in answering a variety of questions outside of security. We studied LLM performance in the area of end user security by qualitatively evaluating 3 popular LLMs on 900 systematically collected end user security questions. While LLMs demonstrate broad generalist ``knowledge'' of end user security information, there are patterns of errors and limitations across LLMs consisting of stale and inaccurate answers, and indirect or unresponsive communication styles, all of which impacts the quality of information received. Based on these patterns, we suggest directions for model improvement and recommend user strategies for interacting with LLMs when seeking assistance with security.
comment: 17 pages, 7 tables
Software Engineering 30
☆ JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints will are available at https://github.com/InternLM/JanusCoder.
comment: Work in progress
☆ Tracing Distribution Shifts with Causal System Maps
Monitoring machine learning (ML) systems is hard, with standard practice focusing on detecting distribution shifts rather than their causes. Root-cause analysis often relies on manual tracing to determine whether a shift is caused by software faults, data-quality issues, or natural change. We propose ML System Maps -- causal maps that, through layered views, make explicit the propagation paths between the environment and the ML system's internals, enabling systematic attribution of distribution shifts. We outline the approach and a research agenda for its development and evaluation.
☆ Floating-Point Neural Network Verification at the Software Level
The behaviour of neural network components must be proven correct before deployment in safety-critical systems. Unfortunately, existing neural network verification techniques cannot certify the absence of faults at the software level. In this paper, we show how to specify and verify that neural networks are safe, by explicitly reasoning about their floating-point implementation. In doing so, we construct NeuroCodeBench 2.0, a benchmark comprising 912 neural network verification examples that cover activation functions, common layers, and full neural networks of up to 170K parameters. Our verification suite is written in plain C and is compatible with the format of the International Competition on Software Verification (SV-COMP). Thanks to it, we can conduct the first rigorous evaluation of eight state-of-the-art software verifiers on neural network code. The results show that existing automated verification tools can correctly solve an average of 11% of our benchmark, while producing around 3% incorrect verdicts. At the same time, a historical analysis reveals that the release of our benchmark has already had a significantly positive impact on the latter.
comment: Pre-print before submission to peer review
☆ Validating Formal Specifications with LLM-generated Test Cases
Validation is a central activity when developing formal specifications. Similarly to coding, a possible validation technique is to define upfront test cases or scenarios that a future specification should satisfy or not. Unfortunately, specifying such test cases is burdensome and error prone, which could cause users to skip this validation task. This paper reports the results of an empirical evaluation of using pre-trained large language models (LLMs) to automate the generation of test cases from natural language requirements. In particular, we focus on test cases for structural requirements of simple domain models formalized in the Alloy specification language. Our evaluation focuses on the state-of-art GPT-5 model, but results from other closed- and open-source LLMs are also reported. The results show that, in this context, GPT-5 is already quite effective at generating positive (and negative) test cases that are syntactically correct and that satisfy (or not) the given requirement, and that can detect many wrong specifications written by humans.
☆ MATCH: Task-Driven Code Evaluation through Contrastive Learning
AI-based code generation is increasingly prevalent, with GitHub Copilot estimated to generate 46% of the code on GitHub. Accurately evaluating how well generated code aligns with developer intent remains a critical challenge. Traditional evaluation methods, such as unit tests, are often unscalable and costly. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code functionality, and metrics like CodeBERTScore require reference code, which is not always available. To address the gap in reference-free evaluation, with few alternatives such as ICE-Score, this paper introduces MATCH, a novel reference-free metric. MATCH uses Contrastive Learning to generate meaningful embeddings for code and natural language task descriptions, enabling similarity scoring that reflects how well generated code implements the task. We show that MATCH achieves stronger correlations with functional correctness and human preference than existing metrics across multiple programming languages.
☆ Beyond Imprecise Distance Metrics: LLM-Predicted Target Call Stacks for Directed Greybox Fuzzing
Directed greybox fuzzing (DGF) aims to efficiently trigger bugs at specific target locations by prioritizing seeds whose execution paths are more likely to mutate into triggering target bugs. However, existing DGF approaches suffer from imprecise probability calculations due to their reliance on complex distance metrics derived from static analysis. The over-approximations inherent in static analysis cause a large number of irrelevant execution paths to be mistakenly considered to potentially mutate into triggering target bugs, significantly reducing fuzzing efficiency. We propose to replace static analysis-based distance metrics with precise call stack representations. Call stacks represent precise control flows, thereby avoiding false information in static analysis. We leverage large language models (LLMs) to predict vulnerability-triggering call stacks for guiding seed prioritization. Our approach constructs call graphs through static analysis to identify methods that can potentially reach target locations, then utilizes LLMs to predict the most likely call stack sequence that triggers the vulnerability. Seeds whose execution paths have higher overlap with the predicted call stack are prioritized for mutation. This is the first work to integrate LLMs into the core seed prioritization mechanism of DGF. We implement our approach and evaluate it against several state-of-the-art fuzzers. On a suite of real-world programs, our approach triggers vulnerabilities $1.86\times$ to $3.09\times$ faster compared to baselines. In addition, our approach identifies 10 new vulnerabilities and 2 incomplete fixes in the latest versions of programs used in our controlled experiments through directed patch testing, with 10 assigned CVE IDs.
comment: Preprint, under submission
☆ Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards NeurIPS 2025
Generating high-quality code remains a challenge for Large Language Models (LLMs). For the evolution of reasoning models on this task, reward models are a necessary intermediate step. These models judge outcomes or intermediate steps. Decoder-only transformer models can be turned into reward models by introducing a regression layer and supervised fine-tuning. While it is known that reflection capabilities generally increase with the size of a model, we want to investigate whether state-of-the-art small language models like the Phi-4 family can be turned into usable reward models blending the consideration of process rewards and outcome rewards. Targeting this goal, we construct a dataset of code samples with correctness labels derived from the APPS coding challenge benchmark. We then train a value-head model to estimate the success probability of intermediate outputs. Our evaluation shows that small LLMs are capable of serving as effective reward models or code evaluation critics, successfully identifying correct solutions among multiple candidates. Using this critic, we achieve over a 20% improvement in the search capability of the most accurate code out of multiple generations.
comment: Accepted and to be presented at NeurIPS 2025 Workshop: Foundations of Reasoning in Language Models
☆ Checkstyle+: Reducing Technical Debt Through The Use of Linters with LLMs
Good code style improves program readability, maintainability, and collaboration, and is an integral component of software quality. Developers, however, often cut corners when following style rules, leading to the wide adoption of tools such as linters in professional software development projects. Traditional linters like Checkstyle operate using rigid, rule-based mechanisms that effectively detect many surface-level violations. However, in most programming languages, there is a subset of style rules that require a more nuanced understanding of code, and fall outside the scope of such static analysis. In this paper, we propose Checkstyle+, a hybrid approach that augments Checkstyle with large language model (LLM) capabilities, to identify style violations that elude the conventional rule-based analysis. Checkstyle+ is evaluated on a sample of 380 Java code files, drawn from a broader dataset of 30,800 real-world Java programs sourced from accepted Codeforces submissions. The results show that Checkstyle+ achieves superior performance over standard Checkstyle in detecting violations of the semantically nuanced rules.
comment: 11 pages, 9 figures, tool link: https://github.com/ellacodee/CheckstylePlus
☆ From Online User Feedback to Requirements: Evaluating Large Language Models for Classification and Specification Tasks
[Context and Motivation] Online user feedback provides valuable information to support requirements engineering (RE). However, analyzing online user feedback is challenging due to its large volume and noise. Large language models (LLMs) show strong potential to automate this process and outperform previous techniques. They can also enable new tasks, such as generating requirements specifications. [Question-Problem] Despite their potential, the use of LLMs to analyze user feedback for RE remains underexplored. Existing studies offer limited empirical evidence, lack thorough evaluation, and rarely provide replication packages, undermining validity and reproducibility. [Principal Idea-Results] We evaluate five lightweight open-source LLMs on three RE tasks: user request classification, NFR classification, and requirements specification generation. Classification performance was measured on two feedback datasets, and specification quality via human evaluation. LLMs achieved moderate-to-high classification accuracy (F1 ~ 0.47-0.68) and moderately high specification quality (mean ~ 3/5). [Contributions] We newly explore lightweight LLMs for feedback-driven requirements development. Our contributions are: (i) an empirical evaluation of lightweight LLMs on three RE tasks, (ii) a replication package, and (iii) insights into their capabilities and limitations for RE.
☆ A Multi-Store Privacy Measurement of Virtual Reality App Ecosystem
Virtual Reality (VR) has gained increasing traction among various domains in recent years, with major companies such as Meta, Pico, and Microsoft launching their application stores to support third-party developers in releasing their applications (or simply apps). These apps offer rich functionality but inherently collect privacy-sensitive data, such as user biometrics, behaviors, and the surrounding environment. Nevertheless, there is still a lack of domain-specific regulations to govern the data handling of VR apps, resulting in significant variations in their privacy practices among app stores. In this work, we present the first comprehensive multi-store study of privacy practices in the current VR app ecosystem, covering a large-scale dataset involving 6,565 apps collected from five major app stores. We assess both declarative and behavioral privacy practices of VR apps, using a multi-faceted approach based on natural language processing, reverse engineering, and static analysis. Our assessment reveals significant privacy compliance issues across all stores, underscoring the premature status of privacy protection in this rapidly growing ecosystem. For instance, one third of apps fail to declare their use of sensitive data, and 21.5\% of apps neglect to provide valid privacy policies. Our work sheds light on the status quo of privacy protection within the VR app ecosystem for the first time. Our findings should raise an alert to VR app developers and users, and encourage store operators to implement stringent regulations on privacy compliance among VR apps.
comment: 16 pages
☆ TALM: Dynamic Tree-Structured Multi-Agent Framework with Long-Term Memory for Scalable Code Generation
Agentic code generation requires large language models (LLMs) capable of complex context management and multi-step reasoning. Prior multi-agent frameworks attempt to address these challenges through collaboration, yet they often suffer from rigid workflows and high reasoning recovery costs. To overcome these limitations, we propose TALM (Tree-Structured Multi-Agent Framework with Long-Term Memory), a dynamic framework that integrates structured task decomposition, localized re-reasoning, and long-term memory mechanisms. TALM employs an extensible tree-based collaboration structure. The parent-child relationships, when combined with a divide-and-conquer strategy, enhance reasoning flexibility and enable efficient error correction across diverse task scopes. Furthermore, a long-term memory module enables semantic querying and integration of prior knowledge, supporting implicit self-improvement through experience reuse. Experimental results on HumanEval, BigCodeBench, and ClassEval benchmarks demonstrate that TALM consistently delivers strong reasoning performance and high token efficiency, highlighting its robustness and practical utility in complex code generation tasks.
☆ CodeAD: Synthesize Code of Rules for Log-based Anomaly Detection with LLMs
Log-based anomaly detection (LogAD) is critical for maintaining the reliability and availability of large-scale online service systems. While machine learning, deep learning, and large language models (LLMs)-based methods have advanced the LogAD, they often suffer from limited interpretability, high inference costs, and extensive preprocessing requirements, limiting their practicality for real-time, high-volume log analysis. In contrast, rule-based systems offer efficiency and transparency, but require significant manual effort and are difficult to scale across diverse and evolving environments. In this paper, We present CodeAD, a novel framework that automatically synthesizes lightweight Python rule functions for LogAD using LLMs. CodeAD introduces a hierarchical clustering and anchor-grounded sampling strategy to construct representative contrastive log windows, enabling LLMs to discern discriminative anomaly patterns. To ensure robustness and generalizability, CodeAD employs an agentic workflow that iteratively generates, tests, repairs, and refines the rules until it meets correctness and abstraction requirements. The synthesized rules are interpretable, lightweight, and directly executable on raw logs, supporting efficient and transparent online anomaly detection. Our comprehensive experiments on three public datasets (BGL, Hadoop, Thunderbird) demonstrate that CodeAD achieves an average absolute improvement of 3.6% F1 score over the state-of-the-art baselines, while processing large datasets up to 4x faster and at a fraction of the cost (total LLM invocation cost under 4 USD per dataset). These results highlight CodeAD as a practical and scalable solution for online monitoring systems, enabling interpretable, efficient, and automated LogAD in real-world environment.
☆ Language Server CLI Empowers Language Agents with Process Rewards
Large language models routinely hallucinate APIs and mislocalize edits, while language servers compute verified, IDE-grade facts about real code. We present Lanser-CLI, a CLI-first orchestration layer that pins and mediates a Language Server Protocol (LSP) server for coding agents and CI, exposing deterministic, replayable workflows. Our position is that language servers provide not only structural information (definitions, references, types, diagnostics) but also an actionable process reward: machine-checked, step-wise signals that align an agent's planning loop with program reality. In this work, Lanser-CLI contributes: (i) a robust addressing scheme beyond brittle "file:line:col" via a Selector DSL (symbolic, AST-path, and content-anchored selectors) with a principled relocation algorithm; (ii) deterministic Analysis Bundles that normalize Language Server responses and capture environment/capability metadata with stable content hashes; (iii) a safety envelope for mutating operations (rename, code actions) with preview, workspace jails, and Git-aware, transactional apply; and (iv) a process-reward functional derived from Language Server facts (diagnostic deltas, disambiguation confidence, and safe-apply checks) that is computable online and replayable offline. We formalize determinism under frozen snapshots and establish a monotonicity property for the process reward, making it suitable for process supervision and counterfactual analysis. Project Page: https://github.com/yifanzhang-pro/lanser-cli
comment: Project Page: https://github.com/yifanzhang-pro/lanser-cli
☆ On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset
Generalization across Agentic tool-calling environments remains a key unsolved challenge in developing reliable agentic reasoning systems. While large language models (LLMs) demonstrate strong performance on isolated benchmarks, their ability to transfer reasoning strategies and co-ordinate tools across diverse domains is poorly understood. In this work, we conduct a large-scale evaluation of state-of-the-art LLMs on multiple tool-calling benchmarksBFCL v3, TauBench, Tau2Bench, and AceBenchand introduce MAVEN (Math & Physics Adversarial Verification & Evaluation Network), a new out of distribution (OOD) benchmark designed to stress-test multi-step reasoning through explicit verification and adversarial task composition. Our results show that most current models achieve below 50% accuracy on MAVEN, revealing a significant generalization gap across tool-use settings. To address this, we present the CoreThink Agentic Reasoner, a framework that augments LLMs with a lightweight symbolic reasoning layer for structured decomposition and adaptive tool orchestration. Without additional training, it generalizes across all benchmarks, achieving state-of-the-art performance with 530% improvements over existing baselines at roughly one-tenth the computational cost.
comment: Preprint
☆ Evaluating the effectiveness of LLM-based interoperability
Background: Systems of systems are becoming increasingly dynamic and heterogeneous, and this adds pressure on the long-standing challenge of interoperability. Besides its technical aspect, interoperability has also an economic side, as development time efforts are required to build the interoperability artifacts. Objectives: With the recent advances in the field of large language models (LLMs), we aim at analyzing the effectiveness of LLM-based strategies to make systems interoperate autonomously, at runtime, without human intervention. Method: We selected 13 open source LLMs and curated four versions of a dataset in the agricultural interoperability use case. We performed three runs of each model with each version of the dataset, using two different strategies. Then we compared the effectiveness of the models and the consistency of their results across multiple runs. Results: qwen2.5-coder:32b was the most effective model using both strategies DIRECT (average pass@1 >= 0.99) and CODEGEN (average pass@1 >= 0.89) in three out of four dataset versions. In the fourth dataset version, which included an unit conversion, all models using the strategy DIRECT failed, whereas using CODEGEN qwen2.5-coder:32b succeeded with an average pass@1 = 0.75. Conclusion: Some LLMs can make systems interoperate autonomously. Further evaluation in different domains is recommended, and further research on reliability strategies should be conducted.
☆ TDFlow: Agentic Workflows for Test Driven Software Engineering
We introduce TDFlow, a novel test-driven agentic workflow that frames repository-scale software engineering as a test-resolution task, specifically designed to solve human-written tests. Given a set of tests, TDFlow repeatedly proposes, revises, and debugs repository-scale patches using precisely engineered sub-agents and tightly constrained tools. The workflow decomposes software engineering program repair into four components governed by respective sub-agents. This simple, forced decoupling of patch proposing, debugging, patch revision, and optional test generation (1) reduces long-context burden on any individual sub-agent, (2) focuses each sub-agent on specific, pre-defined sub-tasks, and (3) allows for specialized performance improvement on specific sub-tasks. When provided human-written tests, TDFlow attains 88.8% pass rate on SWE-Bench Lite (an absolute improvement of 27.8% over the next best system) and 94.3% on SWE-Bench Verified. Manual inspection of the 800 TDFlow runs within SWE-Bench Lite and Verified uncover only 7 instances of test hacking, which were subsequently counted as failures. Furthermore, we show that the primary obstacle to human-level software engineering performance lies within writing successful reproduction tests. We envision a human-LLM interactive system powered by TDFlow where human developers write tests solved by LLM systems. Together, these results indicate that modern LLMs, when embedded in a narrowly engineered, test-driven workflow, already achieve human-level test resolution -- with the final frontier for fully autonomous repository repair being the accurate generation of valid reproduction tests.
☆ Beyond Prompt Engineering: Neuro-Symbolic-Causal Architecture for Robust Multi-Objective AI Agents
Large language models show promise as autonomous decision-making agents, yet their deployment in high-stakes domains remains fraught with risk. Without architectural safeguards, LLM agents exhibit catastrophic brittleness: identical capabilities produce wildly different outcomes depending solely on prompt framing. We present Chimera, a neuro-symbolic-causal architecture that integrates three complementary components - an LLM strategist, a formally verified symbolic constraint engine, and a causal inference module for counterfactual reasoning. We benchmark Chimera against baseline architectures (LLM-only, LLM with symbolic constraints) across 52-week simulations in a realistic e-commerce environment featuring price elasticity, trust dynamics, and seasonal demand. Under organizational biases toward either volume or margin optimization, LLM-only agents fail catastrophically (total loss of \$99K in volume scenarios) or destroy brand trust (-48.6% in margin scenarios). Adding symbolic constraints prevents disasters but achieves only 43-87% of Chimera's profit. Chimera consistently delivers the highest returns (\$1.52M and \$1.96M respectively, some cases +\$2.2M) while improving brand trust (+1.8% and +10.8%, some cases +20.86%), demonstrating prompt-agnostic robustness. Our TLA+ formal verification proves zero constraint violations across all scenarios. These results establish that architectural design not prompt engineering determines the reliability of autonomous agents in production environments. We provide open-source implementations and interactive demonstrations for reproducibility.
comment: 35 pages, 15 figures, 2 tables. Keywords: Large Language Models, Autonomous Agents, Neuro-Symbolic AI, Causal Inference, Formal Verification, Multi-Objective Optimization. Open-source code and interactive demo available
☆ RefleXGen:The unexamined code is not worth using
Security in code generation remains a pivotal challenge when applying large language models (LLMs). This paper introduces RefleXGen, an innovative method that significantly enhances code security by integrating Retrieval-Augmented Generation (RAG) techniques with guided self-reflection mechanisms inherent in LLMs. Unlike traditional approaches that rely on fine-tuning LLMs or developing specialized secure code datasets - processes that can be resource-intensive - RefleXGen iteratively optimizes the code generation process through self-assessment and reflection without the need for extensive resources. Within this framework, the model continuously accumulates and refines its knowledge base, thereby progressively improving the security of the generated code. Experimental results demonstrate that RefleXGen substantially enhances code security across multiple models, achieving a 13.6% improvement with GPT-3.5 Turbo, a 6.7% improvement with GPT-4o, a 4.5% improvement with CodeQwen, and a 5.8% improvement with Gemini. Our findings highlight that improving the quality of model self-reflection constitutes an effective and practical strategy for strengthening the security of AI-generated code.
☆ Compiler.next: A Search-Based Compiler to Power the AI-Native Future of Software Engineering
The rapid advancement of AI-assisted software engineering has brought transformative potential to the field of software engineering, but existing tools and paradigms remain limited by cognitive overload, inefficient tool integration, and the narrow capabilities of AI copilots. In response, we propose Compiler.next, a novel search-based compiler designed to enable the seamless evolution of AI-native software systems as part of the emerging Software Engineering 3.0 era. Unlike traditional static compilers, Compiler.next takes human-written intents and automatically generates working software by searching for an optimal solution. This process involves dynamic optimization of cognitive architectures and their constituents (e.g., prompts, foundation model configurations, and system parameters) while finding the optimal trade-off between several objectives, such as accuracy, cost, and latency. This paper outlines the architecture of Compiler.next and positions it as a cornerstone in democratizing software development by lowering the technical barrier for non-experts, enabling scalable, adaptable, and reliable AI-powered software. We present a roadmap to address the core challenges in intent compilation, including developing quality programming constructs, effective search heuristics, reproducibility, and interoperability between compilers. Our vision lays the groundwork for fully automated, search-driven software development, fostering faster innovation and more efficient AI-driven systems.
comment: 31 pages, 5 figures, submitted to ACM Transactions on Software Engineering and Methodology
♻ ☆ The Design Space of Lockfiles Across Package Managers
Software developers reuse third-party packages that are hosted in package registries. At build time, a package manager resolves and fetches the direct and indirect dependencies of a project. Most package managers also generate a lockfile, which records the exact set of resolved dependency versions. Lockfiles are used to reduce build times; to verify the integrity of resolved packages; and to support build reproducibility across environments and time. Despite these beneficial features, developers often struggle with their maintenance, usage, and interpretation. In this study, we unveil the major challenges related to lockfiles, such that future researchers and engineers can address them. We perform the first comprehensive study of lockfiles across 7 popular package managers, npm, pnpm, Cargo, Poetry, Pipenv, Gradle, and Go. First, we highlight the wide variety of design decisions that package managers make, regarding the generation process as well as the content of lockfiles. Next, we conduct a qualitative analysis based on semi-structured interviews with 15 developers. We capture first-hand insights about the benefits that developers perceive in lockfiles, as well as the challenges they face to manage these files. Following these observations, we make 5 recommendations to further improve lockfiles, for a better developer experience.
♻ ☆ ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement
We present Agentic Retrieval-Augmented Code Synthesis (ARCS), a system that improves LLM-based code generation without fine-tuning. ARCS operates through a budgeted synthesize-execute-repair loop over a frozen model: it retrieves relevant code context before generation, proposes candidates, executes them against tests, and repairs based on execution feedback. This retrieval-before-generation design reduces hallucination and accelerates convergence. We formalize ARCS as a state-action process with provable guarantees on termination, monotonic improvement, and bounded cost. A tiered controller (Small/Medium/Large) trades latency for accuracy predictably. On HumanEval, ARCS achieves up to 87.2% pass@1 with Llama-3.1-405B, surpassing CodeAgent (82.3%) while using simpler control than tree-search methods. On TransCoder, it achieves >= 90% accuracy on most translation pairs. On a LANL scientific corpus, it improves CodeBLEU by +0.115 over baseline RAG. ARCS provides a practical, reproducible approach to reliable code synthesis using existing LLM checkpoints.
♻ ☆ The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management NeurIPS '25
Large Language Model (LLM)-based agents solve complex tasks through iterative reasoning, exploration, and tool-use, a process that can result in long, expensive context histories. While state-of-the-art Software Engineering (SE) agents like OpenHands or Cursor use LLM-based summarization to tackle this issue, it is unclear whether the increased complexity offers tangible performance benefits compared to simply omitting older observations. We present a systematic comparison of these approaches within SWE-agent on SWE-bench Verified across five diverse model configurations. Moreover, we show initial evidence of our findings generalizing to the OpenHands agent scaffold. We find that a simple environment observation masking strategy halves cost relative to the raw agent while matching, and sometimes slightly exceeding, the solve rate of LLM summarization. Additionally, we introduce a novel hybrid approach that further reduces costs by 7% and 11% compared to just observation masking or LLM summarization, respectively. Our findings raise concerns regarding the trend towards pure LLM summarization and provide initial evidence of untapped cost reductions by pushing the efficiency-effectiveness frontier. We release code and data for reproducibility.
comment: v3: DL4C camera-ready version to be presented at the 4th DL4C workshop co-located with NeurIPS '25; added OpenHands generality probe, added hybrid context management strategy
♻ ☆ Formal Analysis of the Contract Automata Runtime Environment with Uppaal: Modelling, Verification and Testing
Recently, a distributed middleware application called contract automata runtime environment (CARE) has been introduced to realise service applications specified using a dialect of finite-state automata. In this paper, we detail the formal modelling, verification and testing of CARE. We provide a formalisation as a network of stochastic timed automata. The model is verified against the desired properties with the tool Uppaal, utilising exhaustive and statistical model checking techniques. Abstract tests are generated from the Uppaal models that are concretised for testing CARE. This research emphasises the advantages of employing formal modelling, verification and testing processes to enhance the dependability of an open-source distributed application. We discuss the methodology used for modelling the application and generating concrete tests from the abstract model, addressing the issues that have been identified and fixed.
♻ ☆ Code Digital Twin: Empowering LLMs with Tacit Knowledge for Complex Software Development
Recent advances in large language models (LLMs) have demonstrated strong capabilities in software engineering tasks, raising expectations of revolutionary productivity gains. However, enterprise software development is largely driven by incremental evolution, where challenges extend far beyond routine coding and depend critically on tacit knowledge, including design decisions at different levels and historical trade-offs. To achieve effective AI-powered support for complex software development, we should align emerging AI capabilities with the practical realities of enterprise development. To this end, we systematically identify challenges from both software and LLM perspectives. Alongside these challenges, we outline opportunities where AI and structured knowledge frameworks can enhance decision-making in tasks such as issue localization and impact analysis. To address these needs, we propose the Code Digital Twin, a living framework that models both the physical and conceptual layers of software, preserves tacit knowledge, and co-evolves with the codebase. By integrating hybrid knowledge representations, multi-stage extraction pipelines, incremental updates, LLM-empowered applications, and human-in-the-loop feedback, the Code Digital Twin transforms fragmented knowledge into explicit and actionable representations. Our vision positions it as a bridge between AI advancements and enterprise software realities, providing a concrete roadmap toward sustainable, intelligent, and resilient development and evolution of ultra-complex systems.
comment: A vision paper that will be continuously updated
♻ ☆ Code Digital Twin: Empowering LLMs with Tacit Knowledge for Complex Software Development
Recent advances in large language models (LLMs) have demonstrated strong capabilities in software engineering tasks, raising expectations of revolutionary productivity gains. However, enterprise software development is largely driven by incremental evolution, where challenges extend far beyond routine coding and depend critically on tacit knowledge, including design decisions at different levels and historical trade-offs. To achieve effective AI-powered support for complex software development, we should align emerging AI capabilities with the practical realities of enterprise development. To this end, we systematically identify challenges from both software and LLM perspectives. Alongside these challenges, we outline opportunities where AI and structured knowledge frameworks can enhance decision-making in tasks such as issue localization and impact analysis. To address these needs, we propose the Code Digital Twin, a living framework that models both the physical and conceptual layers of software, preserves tacit knowledge, and co-evolves with the codebase. By integrating hybrid knowledge representations, multi-stage extraction pipelines, incremental updates, LLM-empowered applications, and human-in-the-loop feedback, the Code Digital Twin transforms fragmented knowledge into explicit and actionable representations. Our vision positions it as a bridge between AI advancements and enterprise software realities, providing a concrete roadmap toward sustainable, intelligent, and resilient development and evolution of ultra-complex systems.
comment: This should be a replacement of another article (arXiv:2503.07967), I mis-submitted it as a new article
♻ ☆ SBAN: A Framework & Multi-Dimensional Dataset for Large Language Model Pre-Training and Software Code Mining
This paper introduces SBAN (Source code, Binary, Assembly, and Natural Language Description), a large-scale, multi-dimensional dataset designed to advance the pre-training and evaluation of large language models (LLMs) for software code analysis. SBAN comprises more than 3 million samples, including 2.9 million benign and 672,000 malware respectively, each represented across four complementary layers: binary code, assembly instructions, natural language descriptions, and source code. This unique multimodal structure enables research on cross-representation learning, semantic understanding of software, and automated malware detection. Beyond security applications, SBAN supports broader tasks such as code translation, code explanation, and other software mining tasks involving heterogeneous data. It is particularly suited for scalable training of deep models, including transformers and other LLM architectures. By bridging low-level machine representations and high-level human semantics, SBAN provides a robust foundation for building intelligent systems that reason about code. We believe that this dataset opens new opportunities for mining software behavior, improving security analytics, and enhancing LLM capabilities in pre-training and fine-tuning tasks for software code mining.
♻ ☆ Context Engineering for AI Agents in Open-Source Software
GenAI-based coding assistants have disrupted software development. Their next generation is agent-based, operating with more autonomy and potentially without human oversight. One challenge is to provide AI agents with sufficient context about the software projects they operate in. Like humans, AI agents require contextual information to develop solutions that are in line with the target architecture, interface specifications, coding guidelines, standard workflows, and other project-specific policies. Popular AI agents for software development (e.g., Claude Code) advocate for maintaining tool-specific version-controlled Markdown files that cover aspects such as the project structure, building and testing, or code style. The content of these files is automatically added to each prompt. AGENTS$.$md has emerged as a potential standard that consolidates tool-specific formats. However, little is known about whether and how developers adopt this format. Therefore, in this paper, we present the results of a preliminary study investigating the adoption of AI configuration files in 466 open-source software projects, what information developers provide in these files, how they present that information, and how the files evolve over time. Our findings indicate that there is no established structure yet, and that there is a lot of variation in terms of how context is provided (descriptive, prescriptive, prohibitive, explanatory, conditional). We see great potential in studying which modifications in structure or presentation can positively affect the quality of the generated content. Finally, our analysis of commits modifying AGENTS$.$md files provides first insights into how projects continuously extend and maintain these files. We conclude the paper by outlining how the adoption of AI configuration files provides a unique opportunity to study real-world prompt and context engineering.
comment: 6 pages, 1 figure, 2 tables
♻ ☆ FMI Meets SystemC: A Framework for Cross-Tool Virtual Prototyping
As systems become more complex, the demand for thorough testing and virtual prototyping grows. To simulate whole systems, multiple tools are usually needed to cover different parts. These parts include the hardware of a system and the environment with which the system interacts. The Functional Mock-up Interface (FMI) standard for co-simulation can be used to connect these tools. The control part of modern systems is usually a computing unit, such as a System-on-a-Chip (SoC) or Microcontroller Unit (MCU), which executes software from a connected memory and interacts with peripherals. To develop software without requiring access to physical hardware, full-system simulators, the so-called Virtual Platforms (VPs), are commonly used. The IEEE-standardized framework for VP development is SystemC TLM. SystemC provides interfaces and concepts that enable modular design and model exchange. However, SystemC lacks native FMI support, which limits the integration into broader co-simulation environments. This paper presents a novel framework to control and interact with SystemC-based VPs using the FMI. We present a case study showing how a simulated temperature sensor in a SystemC simulation can obtain temperature values from an external tool via FMI. This approach allows the unmodified target software to run on the VP and receive realistic environmental input data such as temperature, velocity, or acceleration values from other tools. Thus, extensive software testing and verification is enabled. By having tests ready and the software pre-tested using a VP once the physical hardware is available, certifications like ISO 26262 can be done earlier.
comment: PREPRINT - accepted by the 16th International Modelica and FMI Conference 2025
♻ ☆ Your Compiler is Backdooring Your Model: Understanding and Exploiting Compilation Inconsistency Vulnerabilities in Deep Learning Compilers
Deep learning (DL) compilers are core infrastructure in modern DL systems, offering flexibility and scalability beyond vendor-specific libraries. This work uncovers a fundamental vulnerability in their design: can an official, unmodified compiler alter a model's semantics during compilation and introduce hidden backdoors? We study both adversarial and natural settings. In the adversarial case, we craft benign models where triggers have no effect pre-compilation but become effective backdoors after compilation. Tested on six models, three commercial compilers, and two hardware platforms, our attack yields 100% success on triggered inputs while preserving normal accuracy and remaining undetected by state-of-the-art detectors. The attack generalizes across compilers, hardware, and floating-point settings. In the natural setting, we analyze the top 100 HuggingFace models (including one with 220M+ downloads) and find natural triggers in 31 models. This shows that compilers can introduce risks even without adversarial manipulation. Our results reveal an overlooked threat: unmodified DL compilers can silently alter model semantics. To our knowledge, this is the first work to expose inherent security risks in DL compiler design, opening a new direction for secure and trustworthy ML.
comment: This paper is accepted to IEEE S&P 2026, the code is available at https://github.com/SeekingDream/DLCompilerAttack
♻ ☆ SBEST: Spectrum-Based Fault Localization Without Fault-Triggering Tests
Fault localization is a critical step in software maintenance. Yet, many existing techniques, such as Spectrum-Based Fault Localization (SBFL), rely heavily on the availability of fault-triggering tests to be effective. In practice, especially for crash-related bugs, such tests are frequently unavailable. Meanwhile, bug reports containing stack traces often serve as the only available evidence of runtime failures and provide valuable context for debugging. This study investigates the feasibility of using stack traces from crash reports as proxies for fault-triggering tests in SBFL. Our empirical analysis of 60 crash-report bugs in Defects4J reveals that only 3.33% of these bugs have fault-triggering tests available at the time of the bug report creation. However, 98.3% of bug fixes directly address the exception observed in the stack trace, and 78.3% of buggy methods are reachable within an average of 0.34 method calls from the stack trace. These findings underscore the diagnostic value of stack traces in the absence of failing tests. Motivated by these findings, we propose SBEST, a novel approach that integrates stack trace information with test coverage data to perform fault localization when fault-triggering tests are missing. SBEST shows an improvement, with a 32.22% increase in Mean Average Precision (MAP) and a 17.43% increase in Mean Reciprocal Rank (MRR) compared to baseline approaches under the scenario where fault-triggering tests are absent.
comment: Empirical Software Engineering (EMSE) journal
Systems and Control 45
☆ From Zonal to Nodal Capacity Expansion Planning: Spatial Aggregation Impacts on a Realistic Test-Case SC
Solving power system capacity expansion planning (CEP) problems at realistic spatial resolutions is computationally challenging. Thus, a common practice is to solve CEP over zonal models with low spatial resolution rather than over full-scale nodal power networks. Due to improvements in solving large-scale stochastic mixed integer programs, these computational limitations are becoming less relevant, and the assumption that zonal models are realistic and useful approximations of nodal CEP is worth revisiting. This work is the first to conduct a systematic computational study on the assumption that spatial aggregation can reasonably be used for ISO- and interconnect-scale CEP. By considering a realistic, large-scale test network based on the state of California with over 8,000 buses and 10,000 transmission lines, we demonstrate that well-designed small spatial aggregations can yield good approximations but that coarser zonal models result in large distortions of investment decisions.
comment: 10 pages, 4 figures, 6 tables, submitted to 2026 Power Systems Computation Conference (PSCC)
☆ Towards Stochastic (N-1)-Secure Redispatch
The intermittent nature of renewable power availability is one of the major sources of uncertainty in power systems. While markets can guarantee that the demand is covered by the available generation, transmission system operators have to often intervene via economic redispatch to ensure that the physical constraints of the network are satisfied. To account for uncertainty, the underlying optimal power flow (OPF) routines have to be modified. Recently, polynomial chaos expansion (PCE) has been suggested in the literature as a tool for stochastic OPF problems. However, the usage of PCE-based methods in security-constrained OPF for (N-1)-secure operations has not yet been explored. In this paper, we propose a procedure that iteratively solves a PCE-overloaded stochastic OPF problem by including line outage constraints until an (N-1)-secure solution is achieved. We demonstrate the efficacy of our method by comparing it with a Monte-Carlo simulation on a 118-bus example system.
comment: 7 pages, 1 figure
☆ An Error-Based Safety Buffer for Safe Adaptive Control (Extended Version)
We consider the problem of adaptive control of a class of feedback linearizable plants with matched parametric uncertainties whose states are accessible, subject to state constraints, which often arise due to safety considerations. In this paper, we combine adaptation and control barrier functions into a real-time control architecture that guarantees stability, ensures control performance, and remains safe even with the parametric uncertainties. Two problems are considered, differing in the nature of the parametric uncertainties. In both cases, the control barrier function is assumed to have an arbitrary relative degree. In addition to guaranteeing stability, it is proved that both the control objective and safety objective are met with near-zero conservatism. No excitation conditions are imposed on the command signal. Simulation results demonstrate the non-conservatism of all of the theoretical developments.
comment: Submitted to IEEE Transactions on Automatic Control
☆ IoT-Driven Smart Management in Broiler Farming: Simulation of Remote Sensing and Control Systems SC
Parameter monitoring and control systems are crucial in the industry as they enable automation processes that improve productivity and resource optimization. These improvements also help to manage environmental factors and the complex interactions between multiple inputs and outputs required for production management. This paper proposes an automation system for broiler management based on a simulation scenario that involves sensor networks and embedded systems. The aim is to create a transmission network for monitoring and controlling broiler temperature and feeding using the Internet of Things (IoT), complemented by a dashboard and a cloud-based service database to track improvements in broiler management. We look forward this work will serve as a guide for stakeholders and entrepreneurs in the animal production industry, fostering sustainable development through simple and cost-effective automation solutions. The goal is for them to scale and integrate these recommendations into their existing operations, leading to more efficient decision-making at the management level.
comment: 2025 IEEE Technology and Engineering Management Society Conference (TEMSCON LATAM), Cartagena, Colombia
☆ Flexibility aggregation via set projection for distribution grids with multiple interconnections
With the increasing number of flexible energy devices in distribution grids, coordination between Transmission System Operators (TSOs) and Distribution System Operators (DSOs) becomes critical for optimal system operation. One form of coordination is to solve the overall system operation problem in a hierarchical way, computing Feasible Operational Regions (FORs) for the interconnection between TSO/DSO. Most methods for computing FORs rely on the assumption of only one interconnection point between TSO and DSOs, which is often violated in practice. In this work, we propose a method for computing FORs in distribution grids with multiple interconnection points to the transmission grid. We test our method in a grid with two interconnecting points and analyze the properties of the resulting high-dimensional FOR from a power systems perspective.
☆ Payload trajectory tracking control for aerial transportation systems with cable length online optimization
Cable-suspended aerial transportation systems are employed extensively across various industries. The capability to flexibly adjust the relative position between the multirotor and the payload has spurred growing interest in the system equipped with variable-length cable, promising broader application potential. Compared to systems with fixed-length cables, introducing the variable-length cable adds a new degree of freedom. However, it also results in increased nonlinearity and more complex dynamic coupling among the multirotor, the cable and the payload, posing significant challenges in control design. This paper introduces a backstepping control strategy tailored for aerial transportation systems with variable-length cable, designed to precisely track the payload trajectory while dynamically adjusting cable length. Then, a cable length generator has been developed that achieves online optimization of the cable length while satisfying state constraints, thus balancing the multirotor's motion and cable length changes without the need for manual trajectory planning. The asymptotic stability of the closed-loop system is guaranteed through Lyapunov techniques and the growth restriction condition. Finally, simulation results confirm the efficacy of the proposed method in managing trajectory tracking and cable length adjustments effectively.
☆ Inertia Partitioning Modular Control Framework for Reconfigurable Multibody Systems
A novel modular control framework for reconfigurable rigid multibody systems is proposed, motivated by the challenges of modular control of systems with closed kinematic chains. In the framework, modularity is defined in the sense of degrees of freedom, and the inertial properties of each body are partitioned with respect to how they are reflected in the kinetic energy of the system through the motion induced by each degree of freedom. This approach inherently handles closed chains in the same manner as tree-like structures, eliminating the need for explicit constraint force calculations or formulations based on differential-algebraic equations. The proposed framework is implemented via simulation on a three-degree-of-freedom series-parallel manipulator, with the results being consistent with the expected stability and tracking performance, and indicating the framework's potential for scalability in trajectory-tracking control of multibody systems.
☆ Neural Networks for AC Optimal Power Flow: Improving Worst-Case Guarantees during Training SC
The AC Optimal Power Flow (AC-OPF) problem is central to power system operation but challenging to solve efficiently due to its nonconvex and nonlinear nature. Neural networks (NNs) offer fast surrogates, yet their black-box behavior raises concerns about constraint violations that can compromise safety. We propose a verification-informed NN framework that incorporates worst-case constraint violations directly into training, producing models that are both accurate and provably safer. Through post-hoc verification, we achieve substantial reductions in worst-case violations and, for the first time, verify all operational constraints of large-scale AC-OPF proxies. Practical feasibility is further enhanced via restoration and warm-start strategies for infeasible operating points. Experiments on systems ranging from 57 to 793 buses demonstrate scalability, speed, and reliability, bridging the gap between ML acceleration and safe, real-time deployment of AC-OPF solutions - and paving the way toward data-driven optimal control.
comment: Submitted to PSCC 2026 (under review)
☆ Embroidery Actuator Utilizing Embroidery Patterns to Generate Diverse Fabric Deformations
This paper presents a novel Embroidery Actuator, a fabric-integrated pneumatic actuator that enables diverse and controllable deformations through embroidery pattern design. Unlike conventional fabric actuators that rely on fiber- or thread-shaped actuators, the proposed actuator is fabricated by directly stitching an inflatable tube onto the fabric using a cord-embroidery technique. The embroidered thread and the fabric jointly form a sleeve that constrains the expansion of the inflatable tube, converting internal pressure into targeted bending or stretching deformations. By varying the embroidery pattern, such as zigzag or cross configurations, different geometric constraints can be realized, allowing for flexible control of deformation direction and magnitude. Analytical deformation models based on the Neo-Hookean model and Lagrange's equations were developed to predict the relationship between pneumatic pressure and bending angle, and were experimentally validated using motion-capture measurements. The results demonstrated that the actuator achieves strong agreement with the analytical deformation model.
comment: 8 pages, 8 figures. This work has been submitted to the IEEE for possible publication
☆ Combining High Level Scheduling and Low Level Control to Manage Fleets of Mobile Robots
The deployment of mobile robots for material handling in industrial environments requires scalable coordination of large fleets in dynamic settings. This paper presents a two-layer framework that combines high-level scheduling with low-level control. Tasks are assigned and scheduled using the compositional algorithm ComSat, which generates time-parameterized routes for each robot. These schedules are then used by a distributed Model Predictive Control (MPC) system in real time to compute local reference trajectories, accounting for static and dynamic obstacles. The approach ensures safe, collision-free operation, and supports rapid rescheduling in response to disruptions such as robot failures or environmental changes. We evaluate the method in simulated 2D environments with varying road capacities and traffic conditions, demonstrating high task completion rates and robust behavior even under congestion. The modular structure of the framework allows for computational tractability and flexibility, making it suitable for deployment in complex, real-world industrial scenarios.
☆ Context-awareness for Dependable Low-Power IoT
Dependability is the ability to consistently deliver trusted and uninterrupted service in the face of operational uncertainties. Ensuring dependable operation in large-scale, energy-constrained Internet of Things (IoT) deployments is as crucial as challenging, and calls for context-aware protocols where context refers to situational or state information. In this paper, we identify four critical context dimensions for IoT networks, namely energy status, information freshness, task relevance, and physical/medium conditions, and show how each one underpins core dependability attributes. Building on these insights, we propose a two-step protocol design framework that incorporates operation-specific context fields. Through three representative use cases, we demonstrate how context awareness can significantly enhance system dependability while imposing only minimal control-plane overhead.
☆ NeuroDOB: A Deep Neural Observer-Based Controller for Vehicle Lateral Dynamics
This paper proposes NeuroDOB, a deep neural network based observer controller for vehicle lateral dynamics, which replaces the conventional disturbance observer (DOB) with a deep neural network (DNN) to enhance personalized lateral control. Unlike conventional DOBs that compensate for general disturbances such as road friction variation and crosswind, NeuroDOB explicitly addresses unmodeled vehicle dynamics and driver-specific behaviors by learning the steering compensation signal from driver-in-the-loop simulations using CarSim's embedded controller as a surrogate driver. The proposed architecture integrates NeuroDOB with a linear quadratic regulator (LQR), where the DNN outputs a delta error correction added to the baseline LQR steering input to produce the final control command. Input features to the DNN include lateral position and yaw angle errors, and the LQR control input. Experimental validation using a lateral dynamic bicycle model within CarSim demonstrates that NeuroDOB effectively adapts to individual driving habits, improving lateral control performance beyond what conventional LQR controllers achieve. The results indicate the potential of deep neural network based observer to enable personalized and adaptive autonomous vehicle control. In cognitive terms, the proposed architecture can be viewed as a dual-system control structure. The baseline LQR corresponds to System 1, a model-based, fast, and analytic reasoning layer ensuring stability. The NeuroDOB acts as System 2, a reflective, data-driven layer that learns compensation from experience and corrects the analytical bias of System 1. Together, they form an integrated decision process analogous to human intuition-reflection interaction, enabling both stability and adaptability in lateral control.
comment: 12 pages, 16 figures
☆ zkSTAR: A zero knowledge system for time series attack detection enforcing regulatory compliance in critical infrastructure networks
Industrial control systems (ICS) form the operational backbone of critical infrastructure networks (CIN) such as power grids, water supply systems, and gas pipelines. As cyber threats to these systems escalate, regulatory agencies are imposing stricter compliance requirements to ensure system-wide security and reliability. A central challenge, however, is enabling regulators to verify the effectiveness of detection mechanisms without requiring utilities to disclose sensitive operational data. In this paper, we introduce zkSTAR, a cyberattack detection framework that leverages zk-SNARKs to reconcile these requirements and enable provable detection guarantees while preserving data confidentiality. Our approach builds on established residual-based statistical hypothesis testing methods applied to state-space detection models. Specifically, we design a two-pronged zk-SNARK architecture that enforces temporal consistency of the state-space dynamics and statistical consistency of the detection tests, allowing regulators to temporally verify alarm correctness without visibility into utility-level data. We formally analyze the soundness and zero knowledge properties of our framework and validate its practical feasibility through computational experiments on real-world ICS datasets. As a result, our work demonstrates a scalable, privacy-preserving alternative for regulatory compliance for ICS driven critical infrastructure networks.
☆ Seq-DeepIPC: Sequential Sensing for End-to-End Control in Legged Robot Navigation
We present Seq-DeepIPC, a sequential end-to-end perception-to-control model for legged robot navigation in realworld environments. Seq-DeepIPC advances intelligent sensing for autonomous legged navigation by tightly integrating multi-modal perception (RGB-D + GNSS) with temporal fusion and control. The model jointly predicts semantic segmentation and depth estimation, giving richer spatial features for planning and control. For efficient deployment on edge devices, we use EfficientNet-B0 as the encoder, reducing computation while maintaining accuracy. Heading estimation is simplified by removing the noisy IMU and instead computing the bearing angle directly from consecutive GNSS positions. We collected a larger and more diverse dataset that includes both road and grass terrains, and validated Seq-DeepIPC on a robot dog. Comparative and ablation studies show that sequential inputs improve perception and control in our models, while other baselines do not benefit. Seq-DeepIPC achieves competitive or better results with reasonable model size; although GNSS-only heading is less reliable near tall buildings, it is robust in open areas. Overall, Seq-DeepIPC extends end-to-end navigation beyond wheeled robots to more versatile and temporally-aware systems. To support future research, we will release the codes to our GitHub repository at https://github.com/oskarnatan/Seq-DeepIPC.
comment: Preprint notice, this manuscript has been submitted to IEEE sensors journal for possible publication
☆ Planning Oriented Integrated Sensing and Communication
Integrated sensing and communication (ISAC) enables simultaneous localization, environment perception, and data exchange for connected autonomous vehicles. However, most existing ISAC designs prioritize sensing accuracy and communication throughput, treating all targets uniformly and overlooking the impact of critical obstacles on motion efficiency. To overcome this limitation, we propose a planning-oriented ISAC (PISAC) framework that reduces the sensing uncertainty of planning-bottleneck obstacles and expands the safe navigable path for the ego-vehicle, thereby bridging the gap between physical-layer optimization and motion-level planning. The core of PISAC lies in deriving a closed-form safety bound that explicitly links ISAC transmit power to sensing uncertainty, based on the Cram\'er-Rao Bound and occupancy inflation principles. Using this model, we formulate a bilevel power allocation and motion planning (PAMP) problem, where the inner layer optimizes the ISAC beam power distribution and the outer layer computes a collision-free trajectory under uncertainty-aware safety constraints. Comprehensive simulations in high-fidelity urban driving environments demonstrate that PISAC achieves up to 40% higher success rates and over 5% shorter traversal times than existing ISAC-based and communication-oriented benchmarks, validating its effectiveness in enhancing both safety and efficiency.
☆ An Intelligent Water-Saving Irrigation System Based on Multi-Sensor Fusion and Visual Servoing Control
This paper introduces an intelligent water-saving irrigation system designed to address critical challenges in precision agriculture, such as inefficient water use and poor terrain adaptability. The system integrates advanced computer vision, robotic control, and real-time stabilization technologies via a multi-sensor fusion approach. A lightweight YOLO model, deployed on an embedded vision processor (K210), enables real-time plant container detection with over 96% accuracy under varying lighting conditions. A simplified hand-eye calibration algorithm-designed for 'handheld camera' robot arm configurations-ensures that the end effector can be precisely positioned, with a success rate exceeding 90%. The active leveling system, driven by the STM32F103ZET6 main control chip and JY901S inertial measurement data, can stabilize the irrigation platform on slopes up to 10 degrees, with a response time of 1.8 seconds. Experimental results across three simulated agricultural environments (standard greenhouse, hilly terrain, complex lighting) demonstrate a 30-50% reduction in water consumption compared to conventional flood irrigation, with water use efficiency exceeding 92% in all test cases.
☆ End-to-End Design and Validation of a Low-Cost Stewart Platform with Nonlinear Estimation and Control
This paper presents the complete design, control, and experimental validation of a low-cost Stewart platform prototype developed as an affordable yet capable robotic testbed for research and education. The platform combines off the shelf components with 3D printed and custom fabricated parts to deliver full six degrees of freedom motions using six linear actuators connecting a moving platform to a fixed base. The system software integrates dynamic modeling, data acquisition, and real time control within a unified framework. A robust trajectory tracking controller based on feedback linearization, augmented with an LQR scheme, compensates for the platform's nonlinear dynamics to achieve precise motion control. In parallel, an Extended Kalman Filter fuses IMU and actuator encoder feedback to provide accurate and reliable state estimation under sensor noise and external disturbances. Unlike prior efforts that emphasize only isolated aspects such as modeling or control, this work delivers a complete hardware-software platform validated through both simulation and experiments on static and dynamic trajectories. Results demonstrate effective trajectory tracking and real-time state estimation, highlighting the platform's potential as a cost effective and versatile tool for advanced research and educational applications.
comment: 24 pages, journal
☆ Never Too Rigid to Reach: Adaptive Virtual Model Control with LLM- and Lyapunov-Based Reinforcement Learning
Robotic arms are increasingly deployed in uncertain environments, yet conventional control pipelines often become rigid and brittle when exposed to perturbations or incomplete information. Virtual Model Control (VMC) enables compliant behaviors by embedding virtual forces and mapping them into joint torques, but its reliance on fixed parameters and limited coordination among virtual components constrains adaptability and may undermine stability as task objectives evolve. To address these limitations, we propose Adaptive VMC with Large Language Model (LLM)- and Lyapunov-Based Reinforcement Learning (RL), which preserves the physical interpretability of VMC while supporting stability-guaranteed online adaptation. The LLM provides structured priors and high-level reasoning that enhance coordination among virtual components, improve sample efficiency, and facilitate flexible adjustment to varying task requirements. Complementarily, Lyapunov-based RL enforces theoretical stability constraints, ensuring safe and reliable adaptation under uncertainty. Extensive simulations on a 7-DoF Panda arm demonstrate that our approach effectively balances competing objectives in dynamic tasks, achieving superior performance while highlighting the synergistic benefits of LLM guidance and Lyapunov-constrained adaptation.
☆ Secure Control of Connected and Autonomous Electrified Vehicles Under Adversarial Cyber-Attacks
Connected and Autonomous Electrified Vehicles (CAEV) is the solution to the future smart mobility having benefits of efficient traffic flow and cleaner environmental impact. Although CAEV has advantages they are still susceptible to adversarial cyber attacks due to their autonomous electric operation and the involved connectivity. To alleviate this issue, we propose a secure control architecture of CAEV. Particularly, we design an additional control input using Reinforcement Learning (RL) to be applied to the vehicle powertrain along with the input commanded by the battery. We present simulation case studies to demonstrate the potential of the proposed approach in keeping the CAEV platoon operating safely without collisions by curbing the effect of adversarial attacks.
☆ Dynamical Modeling of Temperature and Smoke Evolution in a Thermal-Runaway Event of a Large-Format Lithium-ion Battery in a Mine Tunnel
Large-format lithium-ion batteries (LIBs) provide effective energy storage solutions for high-power equipment used in underground mining operations. They have high Columbic efficiency and minimal heat and emission footprints. However, improper use of LIBs, accidents, or other factors may increase the probability of thermal runaway (TR), a rapid combustion reaction that discharges toxic and flammable substances. Several such incidents have been documented in mines. Since repeatable TR experiments to uncover the transient-state propagation of TR are expensive and hazardous, high-fidelity models are usually developed to mimic the impact of these events. They are resource-intensive and are impractical to develop for many scenarios that could be observed in a mine. Therefore, dynamic models within a reduced-order framework were constructed to represent the transient-state combustion event. Reduced order models (ROMs) reasonably replicate trends in temperature and smoke, showing strong alignment with the ground-truth dataset.
☆ Modeling and Scheduling of Fusion Patterns in Autonomous Driving Systems (Extended Version)
In Autonomous Driving Systems (ADS), Directed Acyclic Graphs (DAGs) are widely used to model complex data dependencies and inter-task communication. However, existing DAG scheduling approaches oversimplify data fusion tasks by assuming fixed triggering mechanisms, failing to capture the diverse fusion patterns found in real-world ADS software stacks. In this paper, we propose a systematic framework for analyzing various fusion patterns and their performance implications in ADS. Our framework models three distinct fusion task types: timer-triggered, wait-for-all, and immediate fusion, which comprehensively represent real-world fusion behaviors. Our Integer Linear Programming (ILP)-based approach enables an optimization of multiple real-time performance metrics, including reaction time, time disparity, age of information, and response time, while generating deterministic offline schedules directly applicable to real platforms. Evaluation using real-world ADS case studies, Raspberry Pi implementation, and randomly generated DAGs demonstrates that our framework handles diverse fusion patterns beyond the scope of existing work, and achieves substantial performance improvements in comparable scenarios.
☆ Carbon-Aware Optimal Power Flow with Data-Driven Carbon Emission Tracing
Quantifying locational carbon emissions in power grids is crucial for implementing effective carbon reduction strategies for customers relying on electricity. This paper presents a carbon-aware optimal power flow (OPF) framework that incorporates data-driven carbon tracing, enabling rapid estimation of nodal carbon emissions from electric loads. By developing generator-to-load carbon emission distribution factors through data-driven technique, the analytical formulas for both average and marginal carbon emissions can be derived and integrated seamlessly into DC OPF models as linear constraints. The proposed carbon-aware OPF model enables market operators to optimize energy dispatch while reducing greenhouse gas emissions. Simulations on IEEE test systems confirm the accuracy and computational efficiency of the proposed approach, highlighting its applicability for real-time carbon-aware system operations.
☆ A Spatio-Temporal Graph Learning Approach to Real-Time Economic Dispatch with Multi-Transmission-Node DER Aggregation
The integration of distributed energy resources (DERs) into wholesale electricity markets, as mandated by FERC Order 2222, imposes new challenges on system operations. To remain consistent with existing market structures, regional transmission organizations (RTOs) have advanced the aggregation of transmission-node-level DERs (T-DERs), where a nodal virtual power plant (VPP) represents the mapping of all distribution-level DERs to their respective transmission nodes. This paper develops a real-time economic dispatch (RTED) framework that enables multi-transmission-node DER aggregation while addressing computational efficiency. To this end, we introduce a spatio-temporal graph convolutional network (ST-GCN) for adaptive prediction of distribution factors (DFs), thereby capturing the dynamic influence of individual T-DERs across the transmission system. Furthermore, an iterative constraint identification strategy is incorporated to alleviate transmission security constraints without compromising system reliability. Together, these innovations accelerate the market clearing process and support the effective participation of T-DER aggregators under current market paradigms. The proposed approach is validated on large-scale test systems, including modified 118-, 2383-, and 3012-bus networks under a rolling RTED setting with real demand data. Numerical results demonstrate significant improvements in reducing operational costs and maintaining transmission network feasibility, underscoring the scalability and practicality of the proposed framework.
☆ Neural Two-Stage Stochastic Volt-VAR Optimization for Three-Phase Unbalanced Distribution Systems with Network Reconfiguration
The increasing integration of intermittent distributed energy resources (DERs) has introduced significant variability in distribution networks, posing challenges to voltage regulation and reactive power management. This paper presents a novel neural two-stage stochastic Volt-VAR optimization (2S-VVO) method for three-phase unbalanced distribution systems considering network reconfiguration under uncertainty. To address the computational intractability associated with solving large-scale scenario-based 2S-VVO problems, a learning-based acceleration strategy is introduced, wherein the second-stage recourse model is approximated by a neural network. This neural approximation is embedded into the optimization model as a mixed-integer linear program (MILP), enabling effective enforcement of operational constraints related to the first-stage decisions. Numerical simulations on a 123-bus unbalanced distribution system demonstrate that the proposed approach achieves over 50 times speedup compared to conventional solvers and decomposition methods, while maintaining a typical optimality gap below 0.30%. These results underscore the method's efficacy and scalability in addressing large-scale stochastic VVO problems under practical operating conditions.
☆ MDP-based Energy-aware Task Scheduling for Battery-less IoT
Realizing high long-term task completion rates represents a fundamental challenge in battery-less Internet of Things (IoT) devices powered by ambient energy harvesting. This difficulty is primarily due to the stochastic and time-varying characteristics of the available energy, which significantly complicate the design of optimal task scheduling policies. In this paper, we consider a battery-less IoT device that must periodically report sensing measurements to a monitoring center. We adopt the Markov decision process (MDP) framework to handle energy variability while aiming to maximize the long-term task completion rate. For this, we first identify its components and then define two appropriate reward functions. We demonstrate the inherent properties associated with the MDP formulation and the related optimal policy. Subsequently, we solve the resulting optimization problem, leading to the optimal stationary threshold-based (OSTB) scheduling. Simulation results demonstrate that OSTB outperforms the well-known ``as late as possible'' (ALAP) scheduling strategy. For instance, an $8.6\%$ increase in the task completion rate, along with a $65\%$ reduction in power failures and a $86.29\%$ decrease in execution delays during task execution are registered assuming a $4.7$ mF capacitor.
comment: 13 pages, 11 figures
☆ A Simultaneous ECG-PCG Acquisition System with Real-Time Burst-Adaptive Noise Cancellation ISCA
Cardiac auscultation is an essential clinical skill, requiring excellent hearing to distinguish subtle differences in timing and pitch of heart sounds. However, diagnosing solely from these sounds is often challenging due to interference from surrounding noise, and the information may be limited. Existing solutions that adaptively cancel external noise are either not real-time or are computationally intensive, making them unsuitable for implementation in a portable system. This work proposes an end-to-end system with a real-time adaptive noise cancellation pipeline integrated into a device that simultaneously acquires electrocardiogram (ECG) and phonocardiogram (PCG) signals. The performance of the system is validated using real-world hospital noise datasets and recordings captured with the dual-modality device. For PCG and ECG signals recorded from the device in noisy hospital settings, the proposed algorithms achieved signal-to-noise ratio improvements of 37.01 dB and 30.32 dB, respectively. These results demonstrate the systems effectiveness in enabling reliable and accessible cardiac screening, including noisy hospital environments typical of resource-constrained settings.
comment: Paper submitted to IEEE International Symposium on Circuits and Systems (ISCAS) 2026
☆ Maximal Load Shedding Verification for Neural Network Models of AC Line Switching
Solving for globally optimal line switching decisions in AC transmission grids can be intractability slow. Machine learning (ML) models, meanwhile, can be trained to predict near-optimal decisions at a fraction of the speed. Verifying the performance and impact of these ML models on network operation, however, is a critically important step prior to their actual deployment. In this paper, we train a Neural Network (NN) to solve the optimal power shutoff line switching problem. To assess the worst-case load shedding induced by this model, we propose a bilevel attacker-defender verification approach that finds the NN line switching decisions that cause the highest quantity of network load shedding. Solving this problem to global optimality is challenging (due to AC power flow and NN nonconvexities), so our approach exploits a convex relaxation of the AC physics, combined with a local NN search, to find a guaranteed lower bound on worst--case load shedding. These under-approximation bounds are solved via MathOptAI.jl. We benchmark against a random sampling approach, and we find that our optimization-based approach always finds larger load shedding. Test results are collected on multiple PGLib test cases and on trained NN models which contain more than 10 million model parameters.
☆ Switching Network System Identification via Convex Optimizations
This paper introduces a convex optimization framework for identifying switched network systems, in which both the node dynamics and the underlying graph topology switch between a finite number of configurations. Building on our recent convex identification method for general switching systems, we extend the formulation to structured network systems where each mode corresponds to a distinct adjacency matrix. We show that both the continuous node dynamics and binary network topologies can be identified from sampled state-velocity data by solving a sequence of convex programs. The proposed framework provides a unified and scalable way to recover piecewise network structures from data without a prior knowledge of mode labels at each state. Numerical results on diffusively coupled oscillators demonstrate accurate recovery of both mode dynamics and switching graphs.
comment: 6 pages, 5 figures
♻ ☆ Data-Driven Soft Robot Control via Adiabatic Spectral Submanifolds
The mechanical complexity of soft robots creates significant challenges for their model-based control. Specifically, linear data-driven models have struggled to control soft robots on complex, spatially extended paths that explore regions with significant nonlinear behavior. To account for these nonlinearities, we develop here a model-predictive control strategy based on the recent theory of adiabatic spectral submanifolds (aSSMs). This theory is applicable because the internal vibrations of heavily overdamped robots decay at a speed that is much faster than the desired speed of the robot along its intended path. In that case, low-dimensional attracting invariant manifolds (aSSMs) emanate from the path and carry the dominant dynamics of the robot. Aided by this recent theory, we devise an aSSM-based model-predictive control scheme purely from data. We demonstrate our data-driven model's effectiveness in tracking dynamic trajectories across diverse tasks, validated on a high-fidelity, high-dimensional finite-element model of a soft trunk robot and a Cosserat rod-based elastic soft arm. Notably, we find that five- or six-dimensional aSSM-reduced models outperform the tracking performance of other data-driven modeling methods by a factor up to $10$ across all closed-loop control tasks.
comment: 41 pages, 24 figures
♻ ☆ iWalker: Imperative Visual Planning for Walking Humanoid Robot
Humanoid robots, designed to operate in human-centric environments, serve as a fundamental platform for a broad range of tasks. Although humanoid robots have been extensively studied for decades, a majority of existing humanoid robots still heavily rely on complex modular frameworks, leading to inflexibility and potential compounded errors from independent sensing, planning, and acting components. In response, we propose an end-to-end humanoid sense-plan-act walking system, enabling vision-based obstacle avoidance and footstep planning for whole body balancing simultaneously. We designed two imperative learning (IL)-based bilevel optimizations for model-predictive step planning and whole body balancing, respectively, to achieve self-supervised learning for humanoid robot walking. This enables the robot to learn from arbitrary unlabeled data, improving its adaptability and generalization capabilities. We refer to our method as iWalker and demonstrate its effectiveness in both simulated and real-world environments, representing a significant advancement toward autonomous humanoid robots.
♻ ☆ ExAMPC: the Data-Driven Explainable and Approximate NMPC with Physical Insights IROS
Amidst the surge in the use of Artificial Intelligence (AI) for control purposes, classical and model-based control methods maintain their popularity due to their transparency and deterministic nature. However, advanced controllers like Nonlinear Model Predictive Control (NMPC), despite proven capabilities, face adoption challenges due to their computational complexity and unpredictable closed-loop performance in complex validation systems. This paper introduces ExAMPC, a methodology bridging classical control and explainable AI by augmenting the NMPC with data-driven insights to improve the trustworthiness and reveal the optimization solution and closed-loop performance's sensitivities to physical variables and system parameters. By employing a low-order spline embedding, we reduce the open-loop trajectory dimensionality by over 95%, and integrate it with SHAP and Symbolic Regression from eXplainable AI (XAI) for an approximate NMPC, enabling intuitive physical insights into the NMPC's optimization routine. The prediction accuracy of the approximate NMPC is enhanced through physics-inspired continuous-time constraints penalties, reducing the predicted continuous trajectory violations by 93%. ExAMPC also enables accurate forecasting of the NMPC's computational requirements with explainable insights on worst-case scenarios. Experimental validation on automated valet parking and autonomous racing with lap-time optimization, demonstrates the methodology's practical effectiveness for potential real-world applications.
comment: This paper has been accepted for publication in the 2025 IEEE/RSJ IROS Conference
♻ ☆ Formally Verified Neural Network Controllers for Incremental Input-to-State Stability of Unknown Discrete-Time Systems
This work aims to synthesize a controller that ensures that an unknown discrete-time system is incrementally input-to-state stable ($\delta$-ISS). In this work, we introduce the notion of $\delta$-ISS control Lyapunov function ($\delta$-ISS-CLF), which, in conjunction with the controller, ensures that the closed-loop system is incrementally ISS. To address the unknown dynamics of the system, we parameterize the controller as well as the $\delta$-ISS-CLF as neural networks and learn them by utilizing the sampled data from the state space of the unknown system. To formally verify the obtained $\delta$-ISS-CLF, we develop a validity condition and incorporate the condition into the training framework to ensure a provable correctness guarantee at the end of the training process. Finally, the usefulness of the proposed approach is proved using multiple case studies - the first one is a scalar system with a non-affine non-polynomial structure, the second example is a one-link manipulator system, the third system is a nonlinear Moore-Grietzer model of the jet engine and the final one is a rotating rigid spacecraft model.
♻ ☆ A Single Motor Nano Aerial Vehicle with Novel Peer-to-Peer Communication and Sensing Mechanism
Communication and position sensing are among the most important capabilities for swarm robots to interact with their peers and perform tasks collaboratively. However, the hardware required to facilitate communication and position sensing is often too complicated, expensive, and bulky to be carried on swarm robots. Here we present Maneuverable Piccolissimo 3 (MP3), a minimalist, single motor drone capable of executing inter-robot communication via infrared light and triangulation-based sensing of relative bearing, distance, and elevation using message arrival time. Thanks to its novel design, MP3 can communicate with peers and localize itself using simple components, keeping its size and mass small and making it inherently safe for human interaction. We present the hardware and software design of MP3 and demonstrate its capability to localize itself, fly stably, and maneuver in the environment using peer-to-peer communication and sensing.
♻ ☆ Opinion Dynamics on Signed Graphs and Graphons
In this paper, we make use of graphon theory to study opinion dynamics on large undirected networks. The opinion dynamics models that we take into consideration allow for negative interactions between the individuals, whose opinions can thus grow apart. We consider both the repelling and the opposing models of negative interactions, which have been studied in the literature. We define the repelling and the opposing dynamics on signed graphons and we show that their initial value problem solutions exist and are unique. We then show that, in a suitable sense, the graphon dynamics is a good approximation of the dynamics on large graphs that converge to a graphon. This result applies to large random graphs that are sampled according to a graphon (W-random graphs), for which we provide a new convergence result under very general assumptions.
♻ ☆ FlightKooba: A Fast Interpretable FTP Model
Flight trajectory prediction (FTP) and similar time series tasks typically require capturing smooth latent dynamics hidden within noisy signals. However, existing deep learning models face significant challenges of high computational cost and insufficient interpretability due to their complex black-box nature. This paper introduces FlightKooba, a novel modeling approach designed to extract such underlying dynamics analytically. Our framework uniquely integrates HiPPO theory, Koopman operator theory, and control theory. By leveraging Legendre polynomial bases, it constructs Koopman operators analytically, thereby avoiding large-scale parameter training. The method's core strengths lie in its exceptional computational efficiency and inherent interpretability. Experiments on multiple public datasets validate our design philosophy: for signals exhibiting strong periodicity or clear physical laws (e.g., in aviation, meteorology, and traffic flow), FlightKooba delivers competitive prediction accuracy while reducing trainable parameters by several orders of magnitude and achieving the fastest training speed. Furthermore, we analyze the model's theoretical boundaries, clarifying its inherent low-pass filtering characteristics that render it unsuitable for sequences dominated by high-frequency noise. In summary, FlightKooba offers a powerful, efficient, and interpretable new alternative for time series analysis, particularly in resource-constrained environments.
comment: Version 2: Major revision of the manuscript to refine the narrative, clarify the model's theoretical limitations and application scope, and improve overall presentation for journal submission
♻ ☆ A Linear Parameter-Varying Approach to Data Predictive Control
By means of the linear parameter-varying (LPV) Fundamental Lemma, we derive novel data-driven predictive control (DPC) methods for LPV systems. In particular, we present output-feedback and state-feedback-based LPV-DPC methods with terminal ingredients, which guarantee exponential stability and recursive feasibility. We provide methods for the data-based computation of these terminal ingredients. Furthermore, an in-depth analysis of the application and implementation aspects of the LPV-DPC schemes is given, including application for nonlinear systems and handling noisy data. We compare and demonstrate the performance of the proposed methods in a detailed simulation example involving a nonlinear unbalanced disc system.
comment: To appear in IEEE Transactions on Automatic Control. Final Author Copy. Extended version (Section VI.C, Appendix C & D not in original version). 18 pages
♻ ☆ Composite Learning Adaptive Control under Non-Persistent Partial Excitation
This paper focuses on relaxing the excitation conditions for the adaptive control of uncertain nonlinear systems. By adopting the spectral decomposition technique, a linear regression equation (LRE) is constructed to quantitatively collect historical excitation information, based on which the parameter estimation error is decomposed into the excited component and the unexcited component. By sufficiently utilizing the collected excitation information, the composite learning and {\mu}-modification terms are designed and incorporated into the "Lyapunov-based" parameter update law. By developing a novel Lyapunov function, it is demonstrated that under non-persistent partial excitation, the control error and the excited parameter estimation error component converge to zero, while the unexcited component remains bounded. Furthermore, the proposed adaptive control scheme can effectively eliminate the effects of parametric uncertainties and enhance the robustness of the closed-loop systems. Simulation results are provided to verify the theoretical findings.
comment: 16 pages, 15 figures
♻ ☆ A Hybrid GNN-IZR Framework for Fast and Empirically Robust AC Power Flow Analysis in Radial Distribution Systems
The Alternating Current Power Flow (ACPF) problem forces a trade-off between the speed of data-driven models and the reliability of analytical solvers. This paper introduces a hybrid framework that synergizes a Graph Neural Network (GNN) with the Implicit Z-Bus Recursive (IZR) method, a robust, non-iterative solver for radial distribution networks. The framework employs a physics-informed GNN for rapid initial predictions and invokes the IZR solver as a failsafe for stressed cases identified by a two-stage trigger. A failure is defined as any solution with a maximum power mismatch exceeding 0.1 p.u., a significant operational deviation. On a challenging test set of 7,500 stressed scenarios for the IEEE 33-bus system, the GNN-only model failed on 13.11 % of cases. In contrast, the hybrid framework identified all potential failures, delegating them to the IZR solver to achieve a 0.00 % failure rate, empirically matching the 100 % success rate of the analytical solver on this specific test set. An expanded ablation study confirms that both physics-informed training and Z-bus sensitivity features are critical, collaboratively reducing the GNN's failure rate from 98.72 % (data-only) to 13.11 %. The hybrid approach demonstrates a pragmatic path to achieving the empirical reliability of an analytical solver while leveraging GNN speed, enabling a significant increase in the number of scenarios analyzable in near real-time.
♻ ☆ CIVIL: Causal and Intuitive Visual Imitation Learning
Today's robots attempt to learn new tasks by imitating human examples. These robots watch the human complete the task, and then try to match the actions taken by the human expert. However, this standard approach to visual imitation learning is fundamentally limited: the robot observes what the human does, but not why the human chooses those behaviors. Without understanding which features of the system or environment factor into the human's decisions, robot learners often misinterpret the human's examples. In practice, this results in causal confusion, inefficient learning, and robot policies that fail when the environment changes. We therefore propose a shift in perspective: instead of asking human teachers just to show what actions the robot should take, we also enable humans to intuitively indicate why they made those decisions. Under our paradigm human teachers attach markers to task-relevant objects and use natural language prompts to describe their state representation. Our proposed algorithm, CIVIL, leverages this augmented demonstration data to filter the robot's visual observations and extract a feature representation that aligns with the human teacher. CIVIL then applies these causal features to train a transformer-based policy that -- when tested on the robot -- is able to emulate human behaviors without being confused by visual distractors or irrelevant items. Our simulations and real-world experiments demonstrate that robots trained with CIVIL learn both what actions to take and why to take those actions, resulting in better performance than state-of-the-art baselines. From the human's perspective, our user study reveals that this new training paradigm actually reduces the total time required for the robot to learn the task, and also improves the robot's performance in previously unseen scenarios. See videos at our project website: https://civil2025.github.io
♻ ☆ Learning Robust Satellite Attitude Dynamics with Physics-Informed Normalising Flow
Attitude control is a fundamental aspect of spacecraft operations. Model Predictive Control (MPC) has emerged as a powerful strategy for these tasks, relying on accurate models of the system dynamics to optimize control actions over a prediction horizon. In scenarios where physics models are incomplete, difficult to derive, or computationally expensive, machine learning offers a flexible alternative by learning the system behavior directly from data. However, purely data-driven models often struggle with generalization and stability, especially when applied to inputs outside their training domain. To address these limitations, we investigate the benefits of incorporating Physics-Informed Neural Networks (PINNs) into the learning of spacecraft attitude dynamics, comparing their performance with that of purely data-driven approaches. Using a Real-valued Non-Volume Preserving (Real NVP) neural network architecture with a self-attention mechanism, we trained several models on simulated data generated with the Basilisk simulator. Two training strategies were considered: a purely data-driven baseline and a physics-informed variant to improve robustness and stability. Our results demonstrate that the inclusion of physics-based information significantly enhances the performance in terms of the mean relative error with the best architectures found by 27.08%. These advantages are particularly evident when the learned models are integrated into an MPC framework, where PINN-based models consistently outperform their purely data-driven counterparts in terms of control accuracy and robustness, and achieve improved settling times when compared to traditional MPC approaches, yielding improvements of up to 62%, when subject to observation noise and RWs friction.
♻ ☆ Design and Optimization of EV Charging Infrastructure with Battery in Commercial Buildings
The installation of electric vehicle (EV) charging stations in buildings is inevitable, as states push for increased EV adoption to support decarbonization efforts. This transition could force the need for grid infrastructure upgrades and enhanced controls to support reliable power delivery to end-use loads, and overall economic operation. This paper evaluates strategies that address these needs on two fronts: i) optimal sizing of service transformers and battery energy storage systems (BESS), and ii) optimized coordination between EV charging, BESS operation, and building demand. These strategies are applied to a school campus setting, consisting of building and EV charging loads, to provide an illustration of energy management in commercial buildings with EV fleets. A rolling-window optimization approach is applied to determine i) optimal sizing of the service transformer and BESS and ii) optimal control of EV charging and BESS charge/discharge schedules. The design and control strategies are validated in a 20-year time horizon with an annually increasing number of EVs (buses and vans). In addition, an economic analysis is also carried out to show the costs and benefits of each design as a medium- and long-term investment.
comment: This paper contains references to several concepts that are not aligned with current DOE guidance (decarbonization, state mandates, green energy, etc.)
♻ ☆ Dynamic Dimensioning of Frequency Containment Reserves: The Case of the Nordic Grid
One of the main responsibilities of a Transmission System Operator (TSO) operating an electric grid is to maintain a designated frequency (e.g., 50 Hz in Europe). To achieve this, TSOs have created several products called frequency-supporting ancillary services. The Frequency Containment Reserve (FCR) is one of these ancillary service products. This article focuses on the TSO problem of determining the volume procured for FCR. Specifically, we investigate the potential benefits and impact on grid security when transitioning from a traditionally \textit{static} procurement method to a \textit{dynamic} strategy for FCR volume. We take the Nordic synchronous area in Europe as a case study and use a diffusion model to capture its frequency development. We introduce a controlled mean reversal parameter to assess changes in FCR obligations, in particular for the Nordic FCR-N ancillary service product. We establish closed-form expressions for exceedance probabilities and use historical frequency data as input to calibrate the model. We show that a dynamic dimensioning approach for FCR has the potential to significantly reduce the exceedance probabilities (up to $37\%$) while maintaining the total yearly procured FCR volume equal to that of the current static approach. Alternatively, a dynamic dimensioning approach could significantly increase security at limited extra cost.
comment: 12 pages, 12 figures, Accepted for publication at IEEE Transactions on Power Systems
♻ ☆ The Limits of Fairness of the Variational Generalized Nash Equilibrium
Generalized Nash equilibrium (GNE) problems are commonly used to model strategic interactions between self-interested agents who are coupled in cost and constraints. Specifically, the variational GNE, a refinement of the GNE, is often selected as the solution concept due to its non-discriminatory treatment of agents by charging a uniform ``shadow price" for shared resources. We study the fairness concept of v-GNEs from a comparability perspective and show that it makes an implicit assumption of unit comparability of agent's cost functions, one of the strongest comparability notions. Further, we introduce a new solution concept, f-GNE in which a fairness metric is chosen a priori which is compatible with the comparability at hand. We introduce an electric vehicle charging game to demonstrate the fragility of v-GNE fairness and compare it to the f-GNE under various fairness metrics.
♻ ☆ Carbon-Aware Computing for Data Centers with Probabilistic Performance Guarantees
Data centers are significant contributors to carbon emissions and can strain power systems due to their high electricity consumption. To mitigate this impact and to participate in demand response programs, cloud computing companies strive to balance and optimize operations across their global fleets by making strategic decisions about when and where to place compute jobs for execution. In this paper, we introduce a load shaping scheme which reacts to time-varying grid signals by leveraging both temporal and spatial flexibility of compute jobs to provide risk-aware management guidelines and job placement with provable performance guarantees based on distributionally robust optimization. Our approach divides the problem into two key components: (i) day-ahead planning, which generates an optimal scheduling strategy based on historical load data, and (ii) real-time job placement and (time) scheduling, which dynamically tracks the optimal strategy generated in (i). We validate our method in simulation using normalized load profiles from randomly selected Google clusters, incorporating time-varying grid signals. We can demonstrate significant reductions in carbon cost and peak power with our approach compared to myopic greedy policies, while maintaining computational efficiency and abiding to system and grid constraints.
♻ ☆ A Single Motor Nano Aerial Vehicle with Novel Peer-to-Peer Communication and Sensing Mechanism
Communication and position sensing are among the most important capabilities for swarm robots to interact with their peers and perform tasks collaboratively. However, the hardware required to facilitate communication and position sensing is often too complicated, expensive, and bulky to be carried on swarm robots. Here we present Maneuverable Piccolissimo 3 (MP3), a minimalist, single motor drone capable of executing inter-robot communication via infrared light and triangulation-based sensing of relative bearing, distance, and elevation using message arrival time. Thanks to its novel design, MP3 can communicate with peers and localize itself using simple components, keeping its size and mass small and making it inherently safe for human interaction. We present the hardware and software design of MP3 and demonstrate its capability to localize itself, fly stably, and maneuver in the environment using peer-to-peer communication and sensing.
comment: Proceedings of Robotics: Science and Systems (RSS), 2024
Graphics 8
☆ Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling NeurIPS 2025
Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or few images of a specific subject (also known as Personalization or Subject-driven generation) allows generating visual content that align with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. Then, we adopt a subject-driven 2D inpainting model for progressively infilling the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while still maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at https://zsh2000.github.io/track-inpaint-resplat.github.io/.
comment: NeurIPS 2025, 38 pages, 22 figures
☆ Yesnt: Are Diffusion Relighting Models Ready for Capture Stage Compositing? A Hybrid Alternative to Bridge the Gap
Volumetric video relighting is essential for bringing captured performances into virtual worlds, but current approaches struggle to deliver temporally stable, production-ready results. Diffusion-based intrinsic decomposition methods show promise for single frames, yet suffer from stochastic noise and instability when extended to sequences, while video diffusion models remain constrained by memory and scale. We propose a hybrid relighting framework that combines diffusion-derived material priors with temporal regularization and physically motivated rendering. Our method aggregates multiple stochastic estimates of per-frame material properties into temporally consistent shading components, using optical-flow-guided regularization. For indirect effects such as shadows and reflections, we extract a mesh proxy from Gaussian Opacity Fields and render it within a standard graphics pipeline. Experiments on real and synthetic captures show that this hybrid strategy achieves substantially more stable relighting across sequences than diffusion-only baselines, while scaling beyond the clip lengths feasible for video diffusion. These results indicate that hybrid approaches, which balance learned priors with physically grounded constraints, are a practical step toward production-ready volumetric video relighting.
☆ FlowCapX: Physics-Grounded Flow Capture with Long-Term Consistency
We present FlowCapX, a physics-enhanced framework for flow reconstruction from sparse video inputs, addressing the challenge of jointly optimizing complex physical constraints and sparse observational data over long time horizons. Existing methods often struggle to capture turbulent motion while maintaining physical consistency, limiting reconstruction quality and downstream tasks. Focusing on velocity inference, our approach introduces a hybrid framework that strategically separates representation and supervision across spatial scales. At the coarse level, we resolve sparse-view ambiguities via a novel optimization strategy that aligns long-term observation with physics-grounded velocity fields. By emphasizing vorticity-based physical constraints, our method enhances physical fidelity and improves optimization stability. At the fine level, we prioritize observational fidelity to preserve critical turbulent structures. Extensive experiments demonstrate state-of-the-art velocity reconstruction, enabling velocity-aware downstream tasks, e.g., accurate flow analysis, scene augmentation with tracer visualization and re-simulation.
☆ VoMP: Predicting Volumetric Mechanical Property Fields
Physical simulation relies on spatially-varying mechanical properties, often laboriously hand-crafted. VoMP is a feed-forward method trained to predict Young's modulus ($E$), Poisson's ratio ($\nu$), and density ($\rho$) throughout the volume of 3D objects, in any representation that can be rendered and voxelized. VoMP aggregates per-voxel multi-view features and passes them to our trained Geometry Transformer to predict per-voxel material latent codes. These latents reside on a manifold of physically plausible materials, which we learn from a real-world dataset, guaranteeing the validity of decoded per-voxel materials. To obtain object-level training data, we propose an annotation pipeline combining knowledge from segmented 3D datasets, material databases, and a vision-language model, along with a new benchmark. Experiments show that VoMP estimates accurate volumetric properties, far outperforming prior art in accuracy and speed.
comment: hi-res paper and other details at: https://research.nvidia.com/labs/sil/projects/vomp
☆ TRELLISWorld: Training-Free World Generation from Object Generators
Text-driven 3D scene generation holds promise for a wide range of applications, from virtual prototyping to AR/VR and simulation. However, existing methods are often constrained to single-object generation, require domain-specific training, or lack support for full 360-degree viewability. In this work, we present a training-free approach to 3D scene synthesis by repurposing general-purpose text-to-3D object diffusion models as modular tile generators. We reformulate scene generation as a multi-tile denoising problem, where overlapping 3D regions are independently generated and seamlessly blended via weighted averaging. This enables scalable synthesis of large, coherent scenes while preserving local semantic control. Our method eliminates the need for scene-level datasets or retraining, relies on minimal heuristics, and inherits the generalization capabilities of object-level priors. We demonstrate that our approach supports diverse scene layouts, efficient generation, and flexible editing, establishing a simple yet powerful foundation for general-purpose, language-driven 3D scene construction.
♻ ☆ Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images ICCV 2025
We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This "Capture, Canonicalize, Splat" pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.
comment: This work received the Best Paper Honorable Mention at the AMFG Workshop, ICCV 2025
♻ ☆ Improving Video Generation with Human Feedback
Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multi-dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs.
comment: https://github.com/KwaiVGI/VideoAlign
♻ ☆ Capturing Non-Linear Human Perspective in Line Drawings
Artist-drawn sketches only loosely conform to analytical models of perspective projection; the deviation of human-drawn perspective from analytical perspective models is persistent and well documented, but has yet to be algorithmically replicated. We encode this deviation between human and analytic perspectives as a continuous function in 3D space and develop a method to learn it. We seek deviation functions that (i)mimic artist deviation on our training data; (ii)generalize to other shapes; (iii)are consistent across different views of the same shape; and (iv)produce outputs that appear human-drawn. The natural data for learning this deviation is pairs of artist sketches of 3D shapes and best-matching analytical camera views of the same shapes. However, a core challenge in learning perspective deviation is the heterogeneity of human drawing choices, combined with relative data paucity (the datasets we rely on have only a few dozen training pairs). We sidestep this challenge by learning perspective deviation from an individual pair of an artist sketch of a 3D shape and the contours of the same shape rendered from a best-matching analytical camera view. We first match contours of the depicted shape to artist strokes, then learn a spatially continuous local perspective deviation function that modifies the camera perspective projecting the contours to their corresponding strokes. This function retains key geometric properties that artists strive to preserve when depicting 3D content, thus satisfying (i) and (iv) above. We generalize our method to alternative shapes and views (ii, iii) via a self-augmentation approach that algorithmically generates training data for nearby views, and enforces spatial smoothness and consistency across all views. We compare our results to potential alternatives, demonstrating the superiority of the proposed approach.
Computation and Language 129
☆ Variational Masked Diffusion Models
Masked diffusion models have recently emerged as a flexible framework for discrete generative modeling. However, a key limitation of standard masked diffusion is its inability to effectively capture dependencies among tokens that are predicted concurrently, leading to degraded generation quality when dependencies among tokens are important. To explicitly model dependencies among tokens, we propose Variational Masked Diffusion (VMD), a framework that introduces latent variables into the masked diffusion process. Through controlled experiments on synthetic datasets, we demonstrate that VMD successfully learns dependencies that conventional masked diffusion fails to capture. We further validate the effectiveness of our approach on Sudoku puzzles and text datasets, where learning of dependencies among tokens improves global consistency. Across these domains, VMD enhances both generation quality and dependency awareness, highlighting the value of integrating variational inference into masked diffusion. Our code is available at: https://riccizz.github.io/VMD.
comment: Project Page: https://riccizz.github.io/VMD
☆ Think Twice: Branch-and-Rethink Reasoning Reward Model
Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-oncescoringintofocused, second-lookreasoning, BR-RMreducesjudgmentdiffusionandimproves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains. The code and the model will be released soon.
☆ Hope Speech Detection in Social Media English Corpora: Performance of Traditional and Transformer Models
The identification of hope speech has become a promised NLP task, considering the need to detect motivational expressions of agency and goal-directed behaviour on social media platforms. This proposal evaluates traditional machine learning models and fine-tuned transformers for a previously split hope speech dataset as train, development and test set. On development test, a linear-kernel SVM and logistic regression both reached a macro-F1 of 0.78; SVM with RBF kernel reached 0.77, and Na\"ive Bayes hit 0.75. Transformer models delivered better results, the best model achieved weighted precision of 0.82, weighted recall of 0.80, weighted F1 of 0.79, macro F1 of 0.79, and 0.80 accuracy. These results suggest that while optimally configured traditional machine learning models remain agile, transformer architectures detect some subtle semantics of hope to achieve higher precision and recall in hope speech detection, suggesting that larges transformers and LLMs could perform better in small datasets.
☆ ReCode: Unify Plan and Action for Universal Granularity Control
Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-level form of action. However, current Large Language Model (LLM)-based agents lack this crucial capability to operate fluidly across decision granularities. This limitation stems from existing paradigms that enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization. We propose ReCode (Recursive Code Generation), a novel paradigm that addresses this limitation by unifying planning and action within a single code representation. In this representation, ReCode treats high-level plans as abstract placeholder functions, which the agent then recursively decomposes into finer-grained sub-functions until reaching primitive actions. This recursive approach dissolves the rigid boundary between plan and action, enabling the agent to dynamically control its decision granularity. Furthermore, the recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes. Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training, validating our core insight that unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control. The code is available at https://github.com/FoundationAgents/ReCode.
☆ ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models
Large Audio Language Models (LALMs), which couple acoustic perception with large language models (LLMs) to extract and understand diverse information from audio, have attracted intense interest from both academic and industrial communities. However, existing LALMs are highly sensitive to how instructions are phrased, affecting both (i) instruction-following rates and (ii) task performance. Yet, no existing benchmarks offer a systematic and comprehensive evaluation of this sensitivity. We introduce ISA-Bench, a dynamic benchmark evaluating instruction sensitivity for LALMs along three axes: instruction description, output format, and task composition. We assess recent open-source and proprietary LALMs using ISA-Bench, profiling both compliance and accuracy under controlled instruction variations. Experimental results reveal that even state-of-the-art LALMs suffer significant instruction sensitivity, leading to degraded performance on fundamental audio understanding tasks. To mitigate this issue, we fine-tune Qwen2-Audio on a specifically constructed complex instruction-variant dataset, achieving a marked improvement in instruction-following performance. However, this also induces nontrivial catastrophic forgetting: the model loses some previously mastered task capabilities when exposed to new instruction styles. Our benchmark provides a standardized basis for assessing and improving instruction sensitivity in LALMs, underscoring the need for instruction-robust audio understanding in real-world pipelines.
comment: submitted to icassp 2026
☆ A U-Net and Transformer Pipeline for Multilingual Image Translation
This paper presents an end-to-end multilingual translation pipeline that integrates a custom U-Net for text detection, the Tesseract engine for text recognition, and a from-scratch sequence-to-sequence (Seq2Seq) Transformer for Neural Machine Translation (NMT). Our approach first utilizes a U-Net model, trained on a synthetic dataset , to accurately segment and detect text regions from an image. These detected regions are then processed by Tesseract to extract the source text. This extracted text is fed into a custom Transformer model trained from scratch on a multilingual parallel corpus spanning 5 languages. Unlike systems reliant on monolithic pre-trained models, our architecture emphasizes full customization and adaptability. The system is evaluated on its text detection accuracy, text recognition quality, and translation performance via BLEU scores. The complete pipeline demonstrates promising results, validating the viability of a custom-built system for translating text directly from images.
comment: 6 pages, 3 figures, 5 tables, and 2 algorithms. Prepared in IEEE double-column format
☆ LimRank: Less is More for Reasoning-Intensive Information Reranking EMNLP 2025
Existing approaches typically rely on large-scale fine-tuning to adapt LLMs for information reranking tasks, which is computationally expensive. In this work, we demonstrate that modern LLMs can be effectively adapted using only minimal, high-quality supervision. To enable this, we design LIMRANK-SYNTHESIZER, a reusable and open-source pipeline for generating diverse, challenging, and realistic reranking examples. Using this synthetic data, we fine-tune our reranker model, LIMRANK. We evaluate LIMRANK on two challenging benchmarks, i.e., BRIGHT for reasoning-intensive retrieval and FollowIR for instruction-following retrieval. Our experiments demonstrate that LIMRANK achieves competitive performance, while being trained on less than 5% of the data typically used in prior work. Further ablation studies demonstrate the effectiveness of LIMRANK-SYNTHESIZER and the strong generalization capabilities of LIMRANK across downstream tasks, including scientific literature search and retrieval-augmented generation for knowledge-intensive problem solving.
comment: EMNLP 2025 Main (Short)
☆ JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints will are available at https://github.com/InternLM/JanusCoder.
comment: Work in progress
☆ IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering
Intent identification serves as the foundation for generating appropriate responses in personalized question answering (PQA). However, existing benchmarks evaluate only response quality or retrieval performance without directly measuring intent identification capabilities. This gap is critical because without understanding which intents users prioritize, systems cannot generate responses satisfying individual information needs. To address this, we introduce the concept of core intents: intents users prioritize when selecting answers to satisfy their information needs. To evaluate these core intents, we propose IPQA, a benchmark for core Intent identification in Personalized Question Answering. Since users do not explicitly state their prioritized intents, we derive core intents from observable behavior patterns in answer selection, grounded in satisficing theory where users choose answers meeting their acceptance thresholds. We construct a dataset with various domains through systematic filtering, LLM-based annotation, and rigorous quality control combining automated verification with human validation. Experimental evaluations across state-of-the-art language models reveal that current systems struggle with core intent identification in personalized contexts. Models fail to identify core intents from user histories, with performance degrading as question complexity increases. The code and dataset will be made publicly available to facilitate future research in this direction.
☆ M4FC: a Multimodal, Multilingual, Multicultural, Multitask Real-World Fact-Checking Dataset
Existing real-world datasets for multimodal automated fact-checking have multiple limitations: they contain few instances, focus on only one or two languages and tasks, suffer from evidence leakage, or depend on external sets of news articles for sourcing true claims. To address these shortcomings, we introduce M4FC, a new real-world dataset comprising 4,982 images paired with 6,980 claims. The images, verified by professional fact-checkers from 22 organizations, represent diverse cultural and geographic contexts. Each claim is available in one or two out of ten languages. M4FC spans six multimodal fact-checking tasks: visual claim extraction, claimant intent prediction, fake detection, image contextualization, location verification, and verdict prediction. We provide baseline results for all tasks and analyze how combining intermediate tasks influence downstream verdict prediction performance. We make our dataset and code available.
comment: Preprint under review. Code and data available at: https://github.com/UKPLab/M4FC
☆ MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring
Effective math tutoring requires not only solving problems but also diagnosing students' difficulties and guiding them step by step. While multimodal large language models (MLLMs) show promise, existing benchmarks largely overlook these tutoring skills. We introduce MMTutorBench, the first benchmark for AI math tutoring, consisting of 685 problems built around pedagogically significant key-steps. Each problem is paired with problem-specific rubrics that enable fine-grained evaluation across six dimensions, and structured into three tasks-Insight Discovery, Operation Formulation, and Operation Execution. We evaluate 12 leading MLLMs and find clear performance gaps between proprietary and open-source systems, substantial room compared to human tutors, and consistent trends across input variants: OCR pipelines degrade tutoring quality, few-shot prompting yields limited gains, and our rubric-based LLM-as-a-Judge proves highly reliable. These results highlight both the difficulty and diagnostic value of MMTutorBench for advancing AI tutoring.
☆ Evaluating Large Language Models for Stance Detection on Financial Targets from SEC Filing Reports and Earnings Call Transcripts
Financial narratives from U.S. Securities and Exchange Commission (SEC) filing reports and quarterly earnings call transcripts (ECTs) are very important for investors, auditors, and regulators. However, their length, financial jargon, and nuanced language make fine-grained analysis difficult. Prior sentiment analysis in the financial domain required a large, expensive labeled dataset, making the sentence-level stance towards specific financial targets challenging. In this work, we introduce a sentence-level corpus for stance detection focused on three core financial metrics: debt, earnings per share (EPS), and sales. The sentences were extracted from Form 10-K annual reports and ECTs, and labeled for stance (positive, negative, neutral) using the advanced ChatGPT-o3-pro model under rigorous human validation. Using this corpus, we conduct a systematic evaluation of modern large language models (LLMs) using zero-shot, few-shot, and Chain-of-Thought (CoT) prompting strategies. Our results show that few-shot with CoT prompting performs best compared to supervised baselines, and LLMs' performance varies across the SEC and ECT datasets. Our findings highlight the practical viability of leveraging LLMs for target-specific stance in the financial domain without requiring extensive labeled data.
☆ BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents
Confidence in LLMs is a useful indicator of model uncertainty and answer reliability. Existing work mainly focused on single-turn scenarios, while research on confidence in complex multi-turn interactions is limited. In this paper, we investigate whether LLM-based search agents have the ability to communicate their own confidence through verbalized confidence scores after long sequences of actions, a significantly more challenging task compared to outputting confidence in a single interaction. Experimenting on open-source agentic models, we first find that models exhibit much higher task accuracy at high confidence while having near-zero accuracy when confidence is low. Based on this observation, we propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality, encourage the model to try again until reaching a satisfactory confidence level. Results show that our proposed methods significantly reduce token consumption while demonstrating competitive performance compared to baseline fixed budget TTS methods.
comment: 25 pages
☆ Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs are mainly focused on text and image modalities, offering limited support for video, audio, and other modalities; and (2) Preference Rigidity, where training on fixed binary preference pairs fails to capture the complexity and diversity of personalized preferences. To address the above challenges, we propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of: (1) Evaluation: We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities including text, image, video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.
comment: 48 pages, 17 figures
☆ A Neuro-Symbolic Multi-Agent Approach to Legal-Cybersecurity Knowledge Integration
The growing intersection of cybersecurity and law creates a complex information space where traditional legal research tools struggle to deal with nuanced connections between cases, statutes, and technical vulnerabilities. This knowledge divide hinders collaboration between legal experts and cybersecurity professionals. To address this important gap, this work provides a first step towards intelligent systems capable of navigating the increasingly intricate cyber-legal domain. We demonstrate promising initial results on multilingual tasks.
comment: 7 pages
☆ EMTSF:Extraordinary Mixture of SOTA Models for Time Series Forecasting
The immense success of the Transformer architecture in Natural Language Processing has led to its adoption in Time Se ries Forecasting (TSF), where superior performance has been shown. However, a recent important paper questioned their effectiveness by demonstrating that a simple single layer linear model outperforms Transformer-based models. This was soon shown to be not as valid, by a better transformer-based model termed PatchTST. More re cently, TimeLLM demonstrated even better results by repurposing a Large Language Model (LLM) for the TSF domain. Again, a follow up paper challenged this by demonstrating that removing the LLM component or replacing it with a basic attention layer in fact yields better performance. One of the challenges in forecasting is the fact that TSF data favors the more recent past, and is sometimes subject to unpredictable events. Based upon these recent insights in TSF, we propose a strong Mixture of Experts (MoE) framework. Our method combines the state-of-the-art (SOTA) models including xLSTM, en hanced Linear, PatchTST, and minGRU, among others. This set of complimentary and diverse models for TSF are integrated in a Trans former based MoE gating network. Our proposed model outperforms all existing TSF models on standard benchmarks, surpassing even the latest approaches based on MoE frameworks.
☆ Detecting Religious Language in Climate Discourse
Religious language continues to permeate contemporary discourse, even in ostensibly secular domains such as environmental activism and climate change debates. This paper investigates how explicit and implicit forms of religious language appear in climate-related texts produced by secular and religious nongovernmental organizations (NGOs). We introduce a dual methodological approach: a rule-based model using a hierarchical tree of religious terms derived from ecotheology literature, and large language models (LLMs) operating in a zero-shot setting. Using a dataset of more than 880,000 sentences, we compare how these methods detect religious language and analyze points of agreement and divergence. The results show that the rule-based method consistently labels more sentences as religious than LLMs. These findings highlight not only the methodological challenges of computationally detecting religious language but also the broader tension over whether religious language should be defined by vocabulary alone or by contextual meaning. This study contributes to digital methods in religious studies by demonstrating both the potential and the limitations of approaches for analyzing how the sacred persists in climate discourse.
☆ How AI Forecasts AI Jobs: Benchmarking LLM Predictions of Labor Market Changes
Artificial intelligence is reshaping labor markets, yet we lack tools to systematically forecast its effects on employment. This paper introduces a benchmark for evaluating how well large language models (LLMs) can anticipate changes in job demand, especially in occupations affected by AI. Existing research has shown that LLMs can extract sentiment, summarize economic reports, and emulate forecaster behavior, but little work has assessed their use for forward-looking labor prediction. Our benchmark combines two complementary datasets: a high-frequency index of sector-level job postings in the United States, and a global dataset of projected occupational changes due to AI adoption. We format these data into forecasting tasks with clear temporal splits, minimizing the risk of information leakage. We then evaluate LLMs using multiple prompting strategies, comparing task-scaffolded, persona-driven, and hybrid approaches across model families. We assess both quantitative accuracy and qualitative consistency over time. Results show that structured task prompts consistently improve forecast stability, while persona prompts offer advantages on short-term trends. However, performance varies significantly across sectors and horizons, highlighting the need for domain-aware prompting and rigorous evaluation protocols. By releasing our benchmark, we aim to support future research on labor forecasting, prompt design, and LLM-based economic reasoning. This work contributes to a growing body of research on how LLMs interact with real-world economic data, and provides a reproducible testbed for studying the limits and opportunities of AI as a forecasting tool in the context of labor markets.
comment: 8 pages + Limitations + References
☆ LightKGG: Simple and Efficient Knowledge Graph Generation from Textual Data
The scarcity of high-quality knowledge graphs (KGs) remains a critical bottleneck for downstream AI applications, as existing extraction methods rely heavily on error-prone pattern-matching techniques or resource-intensive large language models (LLMs). While recent tools leverage LLMs to generate KGs, their computational demands limit accessibility for low-resource environments. Our paper introduces LightKGG, a novel framework that enables efficient KG extraction from textual data using small-scale language models (SLMs) through two key technical innovations: (1) Context-integrated Graph extraction integrates contextual information with nodes and edges into a unified graph structure, reducing the reliance on complex semantic processing while maintaining more key information; (2) Topology-enhanced relationship inference leverages the inherent topology of the extracted graph to efficiently infer relationships, enabling relationship discovery without relying on complex language understanding capabilities of LLMs. By enabling accurate KG construction with minimal hardware requirements, this work bridges the gap between automated knowledge extraction and practical deployment scenarios while introducing scientifically rigorous methods for optimizing SLM efficiency in structured NLP tasks.
☆ Planning Ahead with RSA: Efficient Signalling in Dynamic Environments by Projecting User Awareness across Future Timesteps
Adaptive agent design offers a way to improve human-AI collaboration on time-sensitive tasks in rapidly changing environments. In such cases, to ensure the human maintains an accurate understanding of critical task elements, an assistive agent must not only identify the highest priority information but also estimate how and when this information can be communicated most effectively, given that human attention represents a zero-sum cognitive resource where focus on one message diminishes awareness of other or upcoming information. We introduce a theoretical framework for adaptive signalling which meets these challenges by using principles of rational communication, formalised as Bayesian reference resolution using the Rational Speech Act (RSA) modelling framework, to plan a sequence of messages which optimise timely alignment between user belief and a dynamic environment. The agent adapts message specificity and timing to the particulars of a user and scenario based on projections of how prior-guided interpretation of messages will influence attention to the interface and subsequent belief update, across several timesteps out to a fixed horizon. In a comparison to baseline methods, we show that this effectiveness depends crucially on combining multi-step planning with a realistic model of user awareness. As the first application of RSA for communication in a dynamic environment, and for human-AI interaction in general, we establish theoretical foundations for pragmatic communication in human-agent teams, highlighting how insights from cognitive science can be capitalised to inform the design of assistive agents.
comment: 11 pages, 3 figures
☆ BaZi-Based Character Simulation Benchmark: Evaluating AI on Temporal and Persona Reasoning
Human-like virtual characters are crucial for games, storytelling, and virtual reality, yet current methods rely heavily on annotated data or handcrafted persona prompts, making it difficult to scale up and generate realistic, contextually coherent personas. We create the first QA dataset for BaZi-based persona reasoning, where real human experiences categorized into wealth, health, kinship, career, and relationships are represented as life-event questions and answers. Furthermore, we propose the first BaZi-LLM system that integrates symbolic reasoning with large language models to generate temporally dynamic and fine-grained virtual personas. Compared with mainstream LLMs such as DeepSeek-v3 and GPT-5-mini, our method achieves a 30.3%-62.6% accuracy improvement. In addition, when incorrect BaZi information is used, our model's accuracy drops by 20%-45%, showing the potential of culturally grounded symbolic-LLM integration for realistic character simulation.
☆ Adaptive Blockwise Search: Inference-Time Alignment for Large Language Models
LLM alignment remains a critical challenge. Inference-time methods provide a flexible alternative to fine-tuning, but their uniform computational effort often yields suboptimal alignment. We hypothesize that for many alignment tasks, the initial tokens of a response are disproportionately more critical. To leverage this principle, we introduce AdaSearch, a novel blockwise search strategy. It adaptively allocates a fixed computational budget using a sampling schedule, focusing search effort on these critical tokens. We apply AdaSearch to sequential decoding and introduce its tree-search counterpart, AdaBeam. Our comprehensive evaluation across eight LLMs demonstrates that AdaSearch outperforms strong Best-of-N and fine-tuning baselines. Specifically, win-rates improve by over 10% for harmlessness generation, controlled sentiment generation, and for mathematical reasoning tasks relative to Best-of-N.
☆ LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization LREC 2026
We introduce LibriConvo, a simulated multi-speaker conversational dataset based on speaker-aware conversation simulation (SASC), designed to support training and evaluation of speaker diarization and automatic speech recognition (ASR) systems. Unlike prior resources that mostly rely on semantically disconnected utterances and implausible temporal gaps, LibriConvo ensures semantic coherence and realistic conversational timing. Our pipeline leverages CallHome with external VAD for reliable boundaries, applies compression to reduce unnaturally long silences, and organizes LibriTTS utterances by book to maintain contextual consistency. Acoustic realism is enhanced via a novel room impulse response selection procedure that ranks speaker-microphone configurations by spatial plausibility, balancing realism and diversity. The dataset comprises 240.1 hours across 1,496 dialogues with 830 unique speakers, split in a speaker-disjoint manner for robust evaluation. Baselines show that the sortformer model outperforms the pyannote pipeline in diarization, while a fine-tuned Fast Conformer-CTC XLarge with Serialized Output Training achieves 7.29\% WER for ASR, surpassing zero-shot Whisper-large-v3. LibriConvo provides a valuable resource for advancing multi-speaker speech processing research with realistic conversational dynamics and controlled experimental conditions.
comment: Submitted to LREC 2026
☆ Arabic Little STT: Arabic Children Speech Recognition Dataset
The performance of Artificial Intelligence (AI) systems fundamentally depends on high-quality training data. However, low-resource languages like Arabic suffer from severe data scarcity. Moreover, the absence of child-specific speech corpora is an essential gap that poses significant challenges. To address this gap, we present our created dataset, Arabic Little STT, a dataset of Levantine Arabic child speech recorded in classrooms, containing 355 utterances from 288 children (ages 6 - 13). We further conduct a systematic assessment of Whisper, a state-of-the-art automatic speech recognition (ASR) model, on this dataset and compare its performance with adult Arabic benchmarks. Our evaluation across eight Whisper variants reveals that even the best-performing model (Large_v3) struggles significantly, achieving a 0.66 word error rate (WER) on child speech, starkly contrasting with its sub 0.20 WER on adult datasets. These results align with other research on English speech. Results highlight the critical need for dedicated child speech benchmarks and inclusive training data in ASR development. Emphasizing that such data must be governed by strict ethical and privacy frameworks to protect sensitive child information. We hope that this study provides an initial step for future work on equitable speech technologies for Arabic-speaking children. We hope that our publicly available dataset enrich the children's demographic representation in ASR datasets.
☆ DCMM-SQL: Automated Data-Centric Pipeline and Multi-Model Collaboration Training for Text-to-SQL Model
Text-to-SQL tasks have gained attractive improvements since the release of ChatGPT. Among them, agent-based frameworks have been widely used in this field. However, the impact of data-centric strategies on text-to-SQL tasks has rarely been explored. In this paper, we systemically design a fully automated data-centric pipeline for text-to-SQL tasks, including \emph{adaptive data repair}, which can automatically find and fix errors in the training dataset; and \emph{error data augmentation}, where we specifically diffuse and enhance erroneous data predicted by the initially trained models. Meanwhile, we propose a Multi-Model collaboration training schema, aiming to train multiple models with different augmented data, enabling them to possess distinct capabilities and work together to complement each other, because it has been found that the capability of a single fine-tuned model is very limited. Furthermore, we utilize an ensemble strategy to integrate the capabilities of multiple models to solve a multiple-choice question, aiming to further improve the accuracy of text-to-SQL tasks. The experiment results and ablation study have demonstrated the effectiveness of data-centric pipeline and Multi-Model(MM) interactive iterative strategies, achieving first place in lightweight text-to-SQL models (within 70B).
☆ A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results ICASSP 2026
We introduce the task of Multi-Modal Context-Aware Recognition (MCoRec) in the ninth CHiME Challenge, which addresses the cocktail-party problem of overlapping conversations in a single-room setting using audio, visual, and contextual cues. MCoRec captures natural multi-party conversations where the recordings focus on unscripted, casual group chats, leading to extreme speech overlap of up to 100% and highly fragmented conversational turns. The task requires systems to answer the question "Who speaks when, what, and with whom?" by jointly transcribing each speaker's speech and clustering them into their respective conversations from audio-visual recordings. Audio-only baselines exceed 100% word error rate, whereas incorporating visual cues yields substantial 50% improvements, highlighting the importance of multi-modality. In this manuscript, we present the motivation behind the task, outline the data collection process, and report the baseline systems developed for the MCoRec.
comment: Submitted to ICASSP 2026
☆ Code Aesthetics with Agentic Reward Feedback
Large Language Models (LLMs) have become valuable assistants for developers in code-related tasks. While LLMs excel at traditional programming tasks such as code generation and bug fixing, they struggle with visually-oriented coding tasks, often producing suboptimal aesthetics. In this paper, we introduce a new pipeline to enhance the aesthetic quality of LLM-generated code. We first construct AesCode-358K, a large-scale instruction-tuning dataset focused on code aesthetics. Next, we propose agentic reward feedback, a multi-agent system that evaluates executability, static aesthetics, and interactive aesthetics. Building on this, we develop GRPO-AR, which integrates these signals into the GRPO algorithm for joint optimization of functionality and code aesthetics. Finally, we develop OpenDesign, a benchmark for assessing code aesthetics. Experimental results show that combining supervised fine-tuning on AesCode-358K with reinforcement learning using agentic reward feedback significantly improves performance on OpenDesign and also enhances results on existing benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT-4o and GPT-4.1, and achieves performance comparable to large open-source models with 480B-685B parameters, underscoring the effectiveness of our approach.
comment: 30 pages, 7 figures
☆ Mubeen AI: A Specialized Arabic Language Model for Heritage Preservation and User Intent Understanding
Mubeen is a proprietary Arabic language model developed by MASARAT SA, optimized for deep understanding of Arabic linguistics, Islamic studies, and cultural heritage. Trained on an extensive collection of authentic Arabic sources significantly expanded by digitizing historical manuscripts via a proprietary Arabic OCR engine, the model incorporates seminal scholarly works in linguistics, jurisprudence, hadith, and Quranic exegesis, alongside thousands of academic theses and peer-reviewed research papers. Conditioned through a deep linguistic engineering framework, Mubeen masters not just the meaning but the eloquence of Arabic, enabling precise understanding across classical texts, contemporary writing, and regional dialects with focus on comprehending user intent and delivering accurate, contextually relevant responses. Unlike other Arabic models relying on translated English data that often fail in intent detection or retrieval-augmented generation (RAG), Mubeen uses native Arabic sources to ensure cultural authenticity and accuracy. Its core innovation is the Practical Closure Architecture, designed to solve the "Utility Gap Crisis" where factually correct answers fail to resolve users' core needs, forcing them into frustrating cycles of re-prompting. By prioritizing clarity and decisive guidance, Mubeen transforms from an information repository into a decisive guide, aligning with Saudi Vision 2030. The model's architecture combines deep heritage specialization with multi-disciplinary expert modules, enabling robust performance across both cultural preservation and general knowledge domains.
comment: 21 pages, 2 figures, 3 tables. Includes appendices on ethical guidelines and training framework. Submitted September 04, 2025
☆ Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages? AACL
Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations but dialect specific model training alleviates the issue. Our dataset also serves as a out of-distribution (OOD) resource for ASR modeling under constrained resources in ASR algorithms. The dataset and code developed for this project are publicly available
comment: This manuscript contains 11 pages, 5 tables and 16 figures This was accepted at International Joint Conference on Natural Language Processing & Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL) 2025
☆ Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports
Automating radiology report generation with Large Vision-Language Models (LVLMs) holds great potential, yet these models often produce clinically critical hallucinations, posing serious risks. Existing hallucination detection methods frequently lack the necessary sentence-level granularity or robust generalization across different LVLM generators. We introduce a novel approach: a sentence-level Process Reward Model (PRM) adapted for this vision-language task. Our PRM predicts the factual correctness of each generated sentence, conditioned on clinical context and preceding text. When fine-tuned on MIMIC-CXR with weakly-supervised labels, a lightweight 0.5B-parameter PRM outperforms existing verification techniques, demonstrating, for instance, relative improvements of 7.5% in Matthews Correlation Coefficient and 1.8% in AUROC over strong white-box baselines on outputs from one LVLM. Unlike methods reliant on internal model states, our PRM demonstrates strong generalization to an unseen LVLM. We further show its practical utility: PRM scores effectively filter low-quality reports, improving F1-CheXbert scores by 4.5% (when discarding the worst 10% of reports). Moreover, when guiding a novel weighted best-of-N selection process on the MIMIC-CXR test set, our PRM show relative improvements in clinical metrics of 7.4% for F1-CheXbert and 0.6% for BERTScore. These results demonstrate that a lightweight, context-aware PRM provides a model-agnostic safety layer for clinical LVLMs without access to internal activations
☆ PTPP-Aware Adaptation Scaling Laws: Predicting Domain-Adaptation Performance at Unseen Pre-Training Budgets
Continual pre-training (CPT) for domain adaptation must balance target-domain gains with stability on the base domain. Existing CPT scaling laws typically assume a fixed pre-training budget, which limits their ability to forecast adaptation outcomes for models trained at different tokens-per-parameter (PTPP). We present \emph{PTPP-aware} adaptation scaling laws that make the pre-training budget an explicit variable, enabling accurate \emph{prediction} of adaptation loss at unseen \ptpp. On a multilingual setup (English/Arabic $\rightarrow$ French), PTPP-aware formulations trained on early stages (\ptpp{}=\{15,31\}) predict target loss at \ptpp{}=279 and outperform a PTPP-agnostic \dcpt{} transfer baseline on metrics (Huber-on-log, MAE$_\mathrm{rel}$, calibration slope); full diagnostics (RMSE, MAPE) are in the appendix. Beyond forecasting, we show a practical use case: planning replay ratios and adaptation token budgets that satisfy target and forgetting constraints under compute limits.
☆ DREaM: Drug-Drug Relation Extraction via Transfer Learning Method
Relation extraction between drugs plays a crucial role in identifying drug drug interactions and predicting side effects. The advancement of machine learning methods in relation extraction, along with the development of large medical text databases, has enabled the low cost extraction of such relations compared to other approaches that typically require expert knowledge. However, to the best of our knowledge, there are limited datasets specifically designed for drug drug relation extraction currently available. Therefore, employing transfer learning becomes necessary to apply machine learning methods in this domain. In this study, we propose DREAM, a method that first employs a trained relation extraction model to discover relations between entities and then applies this model to a corpus of medical texts to construct an ontology of drug relationships. The extracted relations are subsequently validated using a large language model. Quantitative results indicate that the LLM agreed with 71 of the relations extracted from a subset of PubMed abstracts. Furthermore, our qualitative analysis indicates that this approach can uncover ambiguities in the medical domain, highlighting the challenges inherent in relation extraction in this field.
☆ SI-Bench: Benchmarking Social Intelligence of Large Language Models in Human-to-Human Conversations
As large language models (LLMs) develop anthropomorphic abilities, they are increasingly being deployed as autonomous agents to interact with humans. However, evaluating their performance in realistic and complex social interactions remains a significant challenge. Most previous research built datasets through simulated agent-to-agent interactions, which fails to capture the authentic linguistic styles and relational dynamics found in real human conversations. To address this gap, we introduce SI-Bench, a novel benchmark designed to evaluate aspects of social intelligence in LLMs. Grounded in broad social science theories, SI-Bench contains 2,221 authentic multi-turn dialogues collected from a social networking application. We further selected a subset of 312 dialogues for manual annotation across 8 major models. The experiments show that SOTA models have surpassed the human expert in process reasoning under complex social situations, yet they still fall behind humans in reply quality. Moreover, introducing Chain-of-Thought (CoT) reasoning may degrade the performance of LLMs in social dialogue tasks. All datasets are openly available at https://github.com/SI-Bench/SI-Bench.git.
comment: 17 pages, 9 figures
☆ MATCH: Task-Driven Code Evaluation through Contrastive Learning
AI-based code generation is increasingly prevalent, with GitHub Copilot estimated to generate 46% of the code on GitHub. Accurately evaluating how well generated code aligns with developer intent remains a critical challenge. Traditional evaluation methods, such as unit tests, are often unscalable and costly. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code functionality, and metrics like CodeBERTScore require reference code, which is not always available. To address the gap in reference-free evaluation, with few alternatives such as ICE-Score, this paper introduces MATCH, a novel reference-free metric. MATCH uses Contrastive Learning to generate meaningful embeddings for code and natural language task descriptions, enabling similarity scoring that reflects how well generated code implements the task. We show that MATCH achieves stronger correlations with functional correctness and human preference than existing metrics across multiple programming languages.
☆ Beyond Direct Generation: A Decomposed Approach to Well-Crafted Screenwriting with LLMs
The screenplay serves as the foundation for television production, defining narrative structure, character development, and dialogue. While Large Language Models (LLMs) show great potential in creative writing, direct end-to-end generation approaches often fail to produce well-crafted screenplays. We argue this failure stems from forcing a single model to simultaneously master two disparate capabilities: creative narrative construction and rigid format adherence. The resulting outputs may mimic superficial style but lack the deep structural integrity and storytelling substance required for professional use. To enable LLMs to generate high-quality screenplays, we introduce Dual-Stage Refinement (DSR), a decomposed framework that decouples creative narrative generation from format conversion. The first stage transforms a brief outline into rich, novel-style prose. The second stage refines this narrative into a professionally formatted screenplay. This separation enables the model to specialize in one distinct capability at each stage. A key challenge in implementing DSR is the scarcity of paired outline-to-novel training data. We address this through hybrid data synthesis: reverse synthesis deconstructs existing screenplays into structured inputs, while forward synthesis leverages these inputs to generate high-quality narrative texts as training targets. Blind evaluations by professional screenwriters show that DSR achieves a 75% win rate against strong baselines like Gemini-2.5-Pro and reaches 82.7% of human-level performance. Our work demonstrates that decomposed generation architecture with tailored data synthesis effectively specializes LLMs in complex creative domains.
☆ ENTP: Enhancing Low-Quality SFT Data via Neural-Symbolic Text Purge-Mix
Supervised Fine-Tuning (SFT) adapts pre-trained Large Language Models (LLMs) to domain-specific instructions by training on a carefully curated subset of high-quality instruction-response pairs, typically drawn from a larger dataset that often contains many low-quality or noisy samples. However, existing quality-first paradigms often overlook valuable signals in discarded low-quality data and rely on imperfect quality filters. We introduce ENTP (Enhancing low-quality SFT data via Neural-symbolic Text Purge-Mix), a framework that revitalizes low-quality corpora through symbolic purification and neural reconstruction. The symbolic module identifies and prunes noisy samples based on statistical priors, while the neural component synthesizes enriched instruction-response pairs by leveraging latent representations and model knowledge. This neural-symbolic synergy enhances data informativeness and diversity. Experiments show that ENTP-augmented datasets, constructed exclusively from low-quality data, outperform 13 established data-selection baselines across five instruction-following benchmarks, and even surpass fine-tuning on the full original dataset (approximately 300K examples). Our results highlight the untapped potential of low-quality data and underscore the importance of intelligent purification and synthesis for efficient instruction alignment.
☆ Rethinking GSPO: The Perplexity-Entropy Equivalence
We provide a new perspective on GSPO's length-normalized importance ratios by establishing their connection to information-theoretic quantities. We show that GSPO's sequence-level weight $s(\theta) = (\pi_\theta/\pi_{\theta_{\text{old}}})^{1/|y|}$ can be equivalently expressed as the inverse perplexity ratio $\text{PPL}_{\theta_{\text{old}}}/\text{PPL}_\theta$ and as the exponential cross-entropy change $\exp(\Delta H)$. While the perplexity-entropy relationship follows from standard definitions, this observation provides a useful lens for understanding GSPO: the algorithm weights policy gradient updates by perplexity ratios, offering an information-theoretic interpretation of the importance weights. This perspective helps explain GSPO's empirical properties, including log-domain variance reduction through geometric averaging and stability in training mixture-of-experts models. We validate the mathematical equivalences and variance predictions through controlled experiments on mathematical reasoning tasks.
comment: 10 pages, 2 figures
☆ Corpus Frequencies in Morphological Inflection: Do They Matter?
The traditional approach to morphological inflection (the task of modifying a base word (lemma) to express grammatical categories) has been, for decades, to consider lexical entries of lemma-tag-form triples uniformly, lacking any information about their frequency distribution. However, in production deployment, one might expect the user inputs to reflect a real-world distribution of frequencies in natural texts. With future deployment in mind, we explore the incorporation of corpus frequency information into the task of morphological inflection along three key dimensions during system development: (i) for train-dev-test split, we combine a lemma-disjoint approach, which evaluates the model's generalization capabilities, with a frequency-weighted strategy to better reflect the realistic distribution of items across different frequency bands in training and test sets; (ii) for evaluation, we complement the standard type accuracy (often referred to simply as accuracy), which treats all items equally regardless of frequency, with token accuracy, which assigns greater weight to frequent words and better approximates performance on running text; (iii) for training data sampling, we introduce a method novel in the context of inflection, frequency-aware training, which explicitly incorporates word frequency into the sampling process. We show that frequency-aware training outperforms uniform sampling in 26 out of 43 languages.
comment: Published in the proceedings of ITAT 2025.15 pages, 1 figure, 4 tables
☆ Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation NeurIPS 2025
Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method widely used in large language models (LLMs). LoRA essentially describes the projection of an input space into a low-dimensional output space, with the dimensionality determined by the LoRA rank. In standard LoRA, all input tokens share the same weights and undergo an identical input-output projection. This limits LoRA's ability to capture token-specific information due to the inherent semantic differences among tokens. To address this limitation, we propose Token-wise Projected Low-Rank Adaptation (TopLoRA), which dynamically adjusts LoRA weights according to the input token, thereby learning token-wise input-output projections in an end-to-end manner. Formally, the weights of TopLoRA can be expressed as $B\Sigma_X A$, where $A$ and $B$ are low-rank matrices (as in standard LoRA), and $\Sigma_X$ is a diagonal matrix generated from each input token $X$. Notably, TopLoRA does not increase the rank of LoRA weights but achieves more granular adaptation by learning token-wise LoRA weights (i.e., token-wise input-output projections). Extensive experiments across multiple models and datasets demonstrate that TopLoRA consistently outperforms LoRA and its variants. The code is available at https://github.com/Leopold1423/toplora-neurips25.
comment: Accepted by NeurIPS 2025
☆ Flexing in 73 Languages: A Single Small Model for Multilingual Inflection
We present a compact, single-model approach to multilingual inflection, the task of generating inflected word forms from base lemmas to express grammatical categories. Our model, trained jointly on data from 73 languages, is lightweight, robust to unseen words, and outperforms monolingual baselines in most languages. This demonstrates the effectiveness of multilingual modeling for inflection and highlights its practical benefits: simplifying deployment by eliminating the need to manage and retrain dozens of separate monolingual models. In addition to the standard SIGMORPHON shared task benchmarks, we evaluate our monolingual and multilingual models on 73 Universal Dependencies (UD) treebanks, extracting lemma-tag-form triples and their frequency counts. To ensure realistic data splits, we introduce a novel frequency-weighted, lemma-disjoint train-dev-test resampling procedure. Our work addresses the lack of an open-source, general-purpose, multilingual morphological inflection system capable of handling unseen words across a wide range of languages, including Czech. All code is publicly released at: https://github.com/tomsouri/multilingual-inflection.
comment: Published in the proceedings of TSD 2025. 12 pages, 1 figure, 4 tables
☆ Leveraging Hierarchical Organization for Medical Multi-document Summarization
Medical multi-document summarization (MDS) is a complex task that requires effectively managing cross-document relationships. This paper investigates whether incorporating hierarchical structures in the inputs of MDS can improve a model's ability to organize and contextualize information across documents compared to traditional flat summarization methods. We investigate two ways of incorporating hierarchical organization across three large language models (LLMs), and conduct comprehensive evaluations of the resulting summaries using automated metrics, model-based metrics, and domain expert evaluation of preference, understandability, clarity, complexity, relevance, coverage, factuality, and coherence. Our results show that human experts prefer model-generated summaries over human-written summaries. Hierarchical approaches generally preserve factuality, coverage, and coherence of information, while also increasing human preference for summaries. Additionally, we examine whether simulated judgments from GPT-4 align with human judgments, finding higher agreement along more objective evaluation facets. Our findings demonstrate that hierarchical structures can improve the clarity of medical summaries generated by models while maintaining content coverage, providing a practical way to improve human preference for generated summaries.
☆ MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models
Recent advances have investigated the use of pretrained large language models (LLMs) for time-series forecasting by aligning numerical inputs with LLM embedding spaces. However, existing multimodal approaches often overlook the distinct statistical properties and temporal dependencies that are fundamental to time-series data. To bridge this gap, we propose MAP4TS, a novel Multi-Aspect Prompting Framework that explicitly incorporates classical time-series analysis into the prompt design. Our framework introduces four specialized prompt components: a Global Domain Prompt that conveys dataset-level context, a Local Domain Prompt that encodes recent trends and series-specific behaviors, and a pair of Statistical and Temporal Prompts that embed handcrafted insights derived from autocorrelation (ACF), partial autocorrelation (PACF), and Fourier analysis. Multi-Aspect Prompts are combined with raw time-series embeddings and passed through a cross-modality alignment module to produce unified representations, which are then processed by an LLM and projected for final forecasting. Extensive experiments across eight diverse datasets show that MAP4TS consistently outperforms state-of-the-art LLM-based methods. Our ablation studies further reveal that prompt-aware designs significantly enhance performance stability and that GPT-2 backbones, when paired with structured prompts, outperform larger models like LLaMA in long-term forecasting tasks.
☆ A Survey on LLM Mid-training
Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.
☆ Fast-MIA: Efficient and Scalable Membership Inference for LLMs
We propose Fast-MIA (https://github.com/Nikkei/fast-mia), a Python library for efficiently evaluating membership inference attacks (MIA) against Large Language Models (LLMs). MIA against LLMs has emerged as a crucial challenge due to growing concerns over copyright, security, and data privacy, and has attracted increasing research attention. However, the progress of this research is significantly hindered by two main obstacles: (1) the high computational cost of inference in LLMs, and (2) the lack of standardized and maintained implementations of MIA methods, which makes large-scale empirical comparison difficult. To address these challenges, our library provides fast batch inference and includes implementations of representative MIA methods under a unified evaluation framework. This library supports easy implementation of reproducible benchmarks with simple configuration and extensibility. We release Fast-MIA as an open-source (Apache License 2.0) tool to support scalable and transparent research on LLMs.
☆ Quality-Aware Translation Tagging in Multilingual RAG system EMNLP 2025
Multilingual Retrieval-Augmented Generation (mRAG) often retrieves English documents and translates them into the query language for low-resource settings. However, poor translation quality degrades response generation performance. Existing approaches either assume sufficient translation quality or utilize the rewriting method, which introduces factual distortion and hallucinations. To mitigate these problems, we propose Quality-Aware Translation Tagging in mRAG (QTT-RAG), which explicitly evaluates translation quality along three dimensions-semantic equivalence, grammatical accuracy, and naturalness&fluency-and attach these scores as metadata without altering the original content. We evaluate QTT-RAG against CrossRAG and DKM-RAG as baselines in two open-domain QA benchmarks (XORQA, MKQA) using six instruction-tuned LLMs ranging from 2.4B to 14B parameters, covering two low-resource languages (Korean and Finnish) and one high-resource language (Chinese). QTT-RAG outperforms the baselines by preserving factual integrity while enabling generator models to make informed decisions based on translation reliability. This approach allows for effective usage of cross-lingual documents in low-resource settings with limited native language documents, offering a practical and robust solution across multilingual domains.
comment: EMNLP 2025 MRL Workshop
☆ Knocking-Heads Attention
Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens individual head capacity, and existing attention mechanisms - whether standard MHA or its variants like grouped-query attention (GQA) and grouped-tied attention (GTA) - simply concatenate outputs from isolated heads without strong interaction. To address this limitation, we propose knocking-heads attention (KHA), which enables attention heads to "knock" on each other - facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally-initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to baseline attention mechanisms, KHA brings superior and more stable training dynamics, achieving better performance across downstream tasks.
☆ Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that bootstraps directly from the initial model without distillation. On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameters. Remarkably, TIR-Judge-Zero - trained entirely without distilled judge trajectories, matches the performance of distilled variants, demonstrating that tool-augmented judges can self-evolve through iterative reinforcement learning.
comment: Work in Progress
☆ Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts
Recent advances in reinforcement learning (RL) have substantially improved the training of large-scale language models, leading to significant gains in generation quality and reasoning ability. However, most existing research focuses on dense models, while RL training for Mixture-of-Experts (MoE) architectures remains underexplored. To address the instability commonly observed in MoE training, we propose a novel router-aware approach to optimize importance sampling (IS) weights in off-policy RL. Specifically, we design a rescaling strategy guided by router logits, which effectively reduces gradient variance and mitigates training divergence. Experimental results demonstrate that our method significantly improves both the convergence stability and the final performance of MoE models, highlighting the potential of RL algorithmic innovations tailored to MoE architectures and providing a promising direction for efficient training of large-scale expert models.
☆ UniAIDet: A Unified and Universal Benchmark for AI-Generated Image Content Detection and Localization
With the rapid proliferation of image generative models, the authenticity of digital images has become a significant concern. While existing studies have proposed various methods for detecting AI-generated content, current benchmarks are limited in their coverage of diverse generative models and image categories, often overlooking end-to-end image editing and artistic images. To address these limitations, we introduce UniAIDet, a unified and comprehensive benchmark that includes both photographic and artistic images. UniAIDet covers a wide range of generative models, including text-to-image, image-to-image, image inpainting, image editing, and deepfake models. Using UniAIDet, we conduct a comprehensive evaluation of various detection methods and answer three key research questions regarding generalization capability and the relation between detection and localization. Our benchmark and analysis provide a robust foundation for future research.
☆ M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark
Text-to-image models are known to struggle with generating images that perfectly align with textual prompts. Several previous studies have focused on evaluating image-text alignment in text-to-image generation. However, these evaluations either address overly simple scenarios, especially overlooking the difficulty of prompts with multiple different instances belonging to the same category, or they introduce metrics that do not correlate well with human evaluation. In this study, we introduce M$^3$T2IBench, a large-scale, multi-category, multi-instance, multi-relation along with an object-detection-based evaluation metric, $AlignScore$, which aligns closely with human evaluation. Our findings reveal that current open-source text-to-image models perform poorly on this challenging benchmark. Additionally, we propose the Revise-Then-Enforce approach to enhance image-text alignment. This training-free post-editing method demonstrates improvements in image-text alignment across a broad range of diffusion models. \footnote{Our code and data has been released in supplementary material and will be made publicly available after the paper is accepted.}
☆ LangLingual: A Personalised, Exercise-oriented English Language Learning Tool Leveraging Large Language Models
Language educators strive to create a rich experience for learners, while they may be restricted in the extend of feedback and practice they can provide. We present the design and development of LangLingual, a conversational agent built using the LangChain framework and powered by Large Language Models. The system is specifically designed to provide real-time, grammar-focused feedback, generate context-aware language exercises and track learner proficiency over time. The paper discusses the architecture, implementation and evaluation of LangLingual in detail. The results indicate strong usability, positive learning outcomes and encouraging learner engagement.
comment: 14 pages
☆ Leveraging LLMs for Early Alzheimer's Prediction
We present a connectome-informed LLM framework that encodes dynamic fMRI connectivity as temporal sequences, applies robust normalization, and maps these data into a representation suitable for a frozen pre-trained LLM for clinical prediction. Applied to early Alzheimer's detection, our method achieves sensitive prediction with error rates well below clinically recognized margins, with implications for timely Alzheimer's intervention.
☆ Auto prompting without training labels: An LLM cascade for product quality assessment in e-commerce catalogs
We introduce a novel, training free cascade for auto-prompting Large Language Models (LLMs) to assess product quality in e-commerce. Our system requires no training labels or model fine-tuning, instead automatically generating and refining prompts for evaluating attribute quality across tens of thousands of product category-attribute pairs. Starting from a seed of human-crafted prompts, the cascade progressively optimizes instructions to meet catalog-specific requirements. This approach bridges the gap between general language understanding and domain-specific knowledge at scale in complex industrial catalogs. Our extensive empirical evaluations shows the auto-prompt cascade improves precision and recall by $8-10\%$ over traditional chain-of-thought prompting. Notably, it achieves these gains while reducing domain expert effort from 5.1 hours to 3 minutes per attribute - a $99\%$ reduction. Additionally, the cascade generalizes effectively across five languages and multiple quality assessment tasks, consistently maintaining performance gains.
☆ Latent Chain-of-Thought for Visual Reasoning NeurIPS 2025
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
comment: NeurIPS 2025
☆ Agent-based Automated Claim Matching with Instruction-following LLMs
We present a novel agent-based approach for the automated claim matching task with instruction-following LLMs. We propose a two-step pipeline that first generates prompts with LLMs, to then perform claim matching as a binary classification task with LLMs. We demonstrate that LLM-generated prompts can outperform SOTA with human-generated prompts, and that smaller LLMs can do as well as larger ones in the generation process, allowing to save computational resources. We also demonstrate the effectiveness of using different LLMs for each step of the pipeline, i.e. using an LLM for prompt generation, and another for claim matching. Our investigation into the prompt generation process in turn reveals insights into the LLMs' understanding of claim matching.
comment: Accepted for the International Joint Conference on Natural Language Processing & Asia-Pacific Chapter of the Association for Computational Linguistics (2025) Findings
☆ Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation
Large Language Models have been shown to demonstrate stereotypical biases in their representations and behavior due to the discriminative nature of the data that they have been trained on. Despite significant progress in the development of methods and models that refrain from using stereotypical information in their decision-making, recent work has shown that approaches used for bias alignment are brittle. In this work, we introduce a novel and general augmentation framework that involves three plug-and-play steps and is applicable to a number of fairness evaluation benchmarks. Through application of augmentation to a fairness evaluation dataset (Bias Benchmark for Question Answering (BBQ)), we find that Large Language Models (LLMs), including state-of-the-art open and closed weight models, are susceptible to perturbations to their inputs, showcasing a higher likelihood to behave stereotypically. Furthermore, we find that such models are more likely to have biased behavior in cases where the target demographic belongs to a community less studied by the literature, underlining the need to expand the fairness and safety research to include more diverse communities.
comment: 9 pages, 3 figures, 3 tables
☆ AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages
Text embeddings are an essential building component of several NLP tasks such as retrieval-augmented generation which is crucial for preventing hallucinations in LLMs. Despite the recent release of massively multilingual MTEB (MMTEB), African languages remain underrepresented, with existing tasks often repurposed from translation benchmarks such as FLORES clustering or SIB-200. In this paper, we introduce AfriMTEB -- a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets. Unlike many MMTEB datasets that include fewer than five languages, the new additions span 14 to 56 African languages and introduce entirely new tasks, such as hate speech detection, intent detection, and emotion classification, which were not previously covered. Complementing this, we present AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation. Our evaluation shows that AfriE5 achieves state-of-the-art performance, outperforming strong baselines such as Gemini-Embeddings and mE5.
☆ Language Models for Longitudinal Clinical Prediction
We explore a lightweight framework that adapts frozen large language models to analyze longitudinal clinical data. The approach integrates patient history and context within the language model space to generate accurate forecasts without model fine-tuning. Applied to neuropsychological assessments, it achieves accurate and reliable performance even with minimal training data, showing promise for early-stage Alzheimer's monitoring.
☆ OraPlan-SQL: A Planning-Centric Framework for Complex Bilingual NL2SQL Reasoning
We present OraPlan-SQL, our system for the Archer NL2SQL Evaluation Challenge 2025, a bilingual benchmark requiring complex reasoning such as arithmetic, commonsense, and hypothetical inference. OraPlan-SQL ranked first, exceeding the second-best system by more than 6% in execution accuracy (EX), with 55.0% in English and 56.7% in Chinese, while maintaining over 99% SQL validity (VA). Our system follows an agentic framework with two components: Planner agent that generates stepwise natural language plans, and SQL agent that converts these plans into executable SQL. Since SQL agent reliably adheres to the plan, our refinements focus on the planner. Unlike prior methods that rely on multiple sub-agents for planning and suffer from orchestration overhead, we introduce a feedback-guided meta-prompting strategy to refine a single planner. Failure cases from a held-out set are clustered with human input, and an LLM distills them into corrective guidelines that are integrated into the planner's system prompt, improving generalization without added complexity. For the multilingual scenario, to address transliteration and entity mismatch issues, we incorporate entity-linking guidelines that generate alternative surface forms for entities and explicitly include them in the plan. Finally, we enhance reliability through plan diversification: multiple candidate plans are generated for each query, with the SQL agent producing a query for each plan, and final output selected via majority voting over their executions.
☆ GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA
I propose \textbf{G}roup-relative \textbf{I}mplicit \textbf{F}ine \textbf{T}uning (GIFT), a novel reinforcement learning framework for aligning LLMs. Instead of directly maximizing cumulative rewards like PPO or GRPO, GIFT minimizes the discrepancy between implicit and explicit reward models. It combines three key ideas: (1) the online multi-response generation and normalization of GRPO, (2) the implicit reward formulation of DPO, and (3) the implicit-explicit reward alignment principle of UNA. By jointly normalizing the implicit and explicit rewards, GIFT eliminates an otherwise intractable term that prevents effective use of implicit rewards. This normalization transforms the complex reward maximization objective into a simple mean squared error (MSE) loss between the normalized reward functions, converting a non-convex optimization problem into a convex, stable, and analytically differentiable formulation. Unlike offline methods such as DPO and UNA, GIFT remains on-policy and thus retains exploration capability. Compared to GRPO, it requires fewer hyperparameters, converges faster, and generalizes better with significantly reduced training overfitting. Empirically, GIFT achieves superior reasoning and alignment performance on mathematical benchmarks while remaining computationally efficient.
☆ Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs EMNLP 2025
In modern industry systems like multi-turn chat agents, Text-to-SQL technology bridges natural language (NL) questions and database (DB) querying. The conversion of tabular DB results into NL representations (NLRs) enables the chat-based interaction. Currently, NLR generation is typically handled by large language models (LLMs), but information loss or errors in presenting tabular results in NL remains largely unexplored. This paper introduces a novel evaluation method - Combo-Eval - for judgment of LLM-generated NLRs that combines the benefits of multiple existing methods, optimizing evaluation fidelity and achieving a significant reduction in LLM calls by 25-61%. Accompanying our method is NLR-BIRD, the first dedicated dataset for NLR benchmarking. Through human evaluations, we demonstrate the superior alignment of Combo-Eval with human judgments, applicable across scenarios with and without ground truth references.
comment: Accepted at EMNLP 2025
☆ Temporal Blindness in Multi-Turn LLM Agents: Misaligned Tool Use vs. Human Time Perception
Large language model agents are increasingly used in multi-turn conversational settings to interact with and execute tasks in dynamic environments. However, a key limitation is their temporal blindness: they, by default, operate with a stationary context, failing to account for the real-world time elapsed between messages. This becomes a critical liability when an agent must decide whether to invoke a tool based on how much time has passed since the last observation. Without temporal awareness, agents often either over-rely on previous context (skipping necessary tool calls), or under-rely on it (unnecessarily repeating tool calls). To study this challenge, we introduce TicToc-v1, a test set of multi-turn user-agent trajectories across 34 scenarios with varying time sensitivity. Each trajectory ends with a user question, where the need for a tool call depends on the amount of time elapsed since the last message. To give LLMs temporal context, we augment dialogue messages with explicit timestamps, bridging the gap between static dialogue and evolving environments. We then collected human preferences for these samples, creating two subsets: one where humans preferred relying on the previous observation (prefer-noTool), and another where they preferred a new tool call (prefer-Tool). We evaluated how well LLM tool-calling decisions align with human preferences under varying time intervals on TicToc-v1. Our analysis show that without time information, most models perform only slightly better than random, with the top alignment rate being just over 60%. While adding timestamps leads to a slight improvement, particularly for larger models, the improvement is modest, peaking at around 65%. We also show that naive, prompt-based alignment have limited effectiveness. Our findings highlight the need for specific post-training alignment to align multi-turn LLM tool use with human temporal perception.
comment: preliminary work in progress
☆ A Neural Model for Contextual Biasing Score Learning and Filtering
Contextual biasing improves automatic speech recognition (ASR) by integrating external knowledge, such as user-specific phrases or entities, during decoding. In this work, we use an attention-based biasing decoder to produce scores for candidate phrases based on acoustic information extracted by an ASR encoder, which can be used to filter out unlikely phrases and to calculate bonus for shallow-fusion biasing. We introduce a per-token discriminative objective that encourages higher scores for ground-truth phrases while suppressing distractors. Experiments on the Librispeech biasing benchmark show that our method effectively filters out majority of the candidate phrases, and significantly improves recognition accuracy under different biasing conditions when the scores are used in shallow fusion biasing. Our approach is modular and can be used with any ASR system, and the filtering mechanism can potentially boost performance of other biasing methods.
comment: Accepted to IEEE ASRU 2025
☆ CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection
Detecting mental health crisis situations such as suicide ideation, rape, domestic violence, child abuse, and sexual harassment is a critical yet underexplored challenge for language models. When such situations arise during user--model interactions, models must reliably flag them, as failure to do so can have serious consequences. In this work, we introduce CRADLE BENCH, a benchmark for multi-faceted crisis detection. Unlike previous efforts that focus on a limited set of crisis types, our benchmark covers seven types defined in line with clinical standards and is the first to incorporate temporal labels. Our benchmark provides 600 clinician-annotated evaluation examples and 420 development examples, together with a training corpus of around 4K examples automatically labeled using a majority-vote ensemble of multiple language models, which significantly outperforms single-model annotation. We further fine-tune six crisis detection models on subsets defined by consensus and unanimous ensemble agreement, providing complementary models trained under different agreement criteria.
☆ How Pragmatics Shape Articulation: A Computational Case Study in STEM ASL Discourse
Most state-of-the-art sign language models are trained on interpreter or isolated vocabulary data, which overlooks the variability that characterizes natural dialogue. However, human communication dynamically adapts to contexts and interlocutors through spatiotemporal changes and articulation style. This specifically manifests itself in educational settings, where novel vocabularies are used by teachers, and students. To address this gap, we collect a motion capture dataset of American Sign Language (ASL) STEM (Science, Technology, Engineering, and Mathematics) dialogue that enables quantitative comparison between dyadic interactive signing, solo signed lecture, and interpreted articles. Using continuous kinematic features, we disentangle dialogue-specific entrainment from individual effort reduction and show spatiotemporal changes across repeated mentions of STEM terms. On average, dialogue signs are 24.6%-44.6% shorter in duration than the isolated signs, and show significant reductions absent in monologue contexts. Finally, we evaluate sign embedding models on their ability to recognize STEM signs and approximate how entrained the participants become over time. Our study bridges linguistic analysis and computational modeling to understand how pragmatics shape sign articulation and its representation in sign language technologies.
☆ Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language
We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural nuance. Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English. We evaluate 22 open- and closed-source LLMs on Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. Our results show a consistent hierarchy: the average accuracy for Arabic proverbs is 4.29% lower than for English proverbs, and performance for Egyptian idioms is 10.28% lower than for Arabic proverbs. For the pragmatic use task, accuracy drops by 14.07% relative to understanding, though providing contextual idiomatic sentences improves accuracy by 10.66%. Models also struggle with connotative meaning, reaching at most 85.58% agreement with human annotators on idioms with 100% inter-annotator agreement. These findings demonstrate that figurative language serves as an effective diagnostic for cultural reasoning: while LLMs can often interpret figurative meaning, they face challenges in using it appropriately. To support future research, we release Kinayat, the first dataset of Egyptian Arabic idioms designed for both figurative understanding and pragmatic use evaluation.
☆ BitSkip: An Empirical Analysis of Quantization and Early Exit Composition
The pursuit of efficient Large Language Models (LLMs) has led to increasingly complex techniques like extreme quantization and dynamic routing. While individual benefits of these methods are well-documented, their compositional effects remain poorly understood. This paper introduces BitSkip, a hybrid architectural framework for systematically exploring these interactions. Counter-intuitively, our findings reveal that a simple 8-bit quantized model without Hadamard transform (BitSkip-V1) not only outperforms its more complex 4-bit and Hadamard-enhanced counterparts but also competes the full-precision baseline in quality (perplexity of 1.13 vs 1.19) . The introduction of Hadamard transforms, even at 8-bit precision, catastrophically degraded performance by over 37,000%, tracing fundamental training instability. Our BitSkip-V1 recipe demonstrates superior early-exit characteristics, with layer 18 providing optimal 32.5% speed gain for minimal 4% quality loss.
comment: Submitted to JMLR
☆ RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision-Language-Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.
☆ Evaluating Long-Term Memory for Long-Context Question Answering
In order for large language models to achieve true conversational continuity and benefit from experiential learning, they need memory. While research has focused on the development of complex memory systems, it remains unclear which types of memory are most effective for long-context conversational tasks. We present a systematic evaluation of memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks that require diverse reasoning strategies. We analyse full-context prompting, semantic memory through retrieval-augmented generation and agentic memory, episodic memory through in-context learning, and procedural memory through prompt optimization. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory architecture complexity should scale with model capability, with small foundation models benefitting most from RAG, and strong instruction-tuned reasoning model gaining from episodic learning through reflections and more complex agentic semantic memory. In particular, episodic memory can help LLMs recognise the limits of their own knowledge.
comment: 14 pages including appendix, 3 figures. Submitted to October ARR and to Metacognition in Generative AI EurIPS workshop (under review for both)
♻ ☆ Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models
Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods typically formulate forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. This often leads to unstable optimization and degraded performance on retained data, especially under aggressive forgetting. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced via a novel logit-margin flattening loss that explicitly drives the output distribution toward uniformity on a designated forget set, while retention is preserved through a hard constraint on a separate retain set. Compared to entropy-based objectives, our loss is softmax-free, numerically stable, and maintains non-vanishing gradients, enabling more efficient and robust optimization. We solve the constrained problem using a scalable primal-dual algorithm that exposes the trade-off between forgetting and retention through the dynamics of the dual variable, all without any extra computational overhead. Evaluations on the TOFU and MUSE benchmarks across diverse LLM architectures demonstrate that our approach consistently matches or exceeds state-of-the-art baselines, effectively removing targeted information while preserving downstream utility.
comment: The Thirty-Ninth Annual Conference on Neural Information Processing Systems
♻ ☆ LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology
Large language models (LLMs) and emerging agentic frameworks are beginning to transform single-cell biology by enabling natural-language reasoning, generative annotation, and multimodal data integration. However, progress remains fragmented across data modalities, architectures, and evaluation standards. LLM4Cell presents the first unified survey of 58 foundation and agentic models developed for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. We categorize these methods into five families-foundation, text-bridge, spatial, multimodal, epigenomic, and agentic-and map them to eight key analytical tasks including annotation, trajectory and perturbation modeling, and drug-response prediction. Drawing on over 40 public datasets, we analyze benchmark suitability, data diversity, and ethical or scalability constraints, and evaluate models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability. By linking datasets, models, and evaluation domains, LLM4Cell provides the first integrated view of language-driven single-cell intelligence and outlines open challenges in interpretability, standardization, and trustworthy model development.
comment: 34 pages, 5 figures, 7 tables
♻ ☆ SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
Fine-tuning large language models (LLMs) is a common practice to adapt generalist models to specialized domains. However, recent studies show that fine-tuning can erode safety alignment, causing LLMs to respond to harmful or unethical prompts. Many methods to realign safety have been proposed, but often introduce custom algorithms that are difficult to implement or compromise task utility. In this work, we propose SafeMERGE, a lightweight, post-fine-tuning framework that preserves safety while maintaining downstream performance. SafeMERGE selectively merges fine-tuned with safety-aligned model layers only when they deviate from safe behavior, measured by a cosine similarity criterion. Across three LLMs and two tasks, SafeMERGE consistently reduces harmful outputs compared to other defenses, with negligible or even positive impact on utility. Our results demonstrate that selective layer-wise merging offers an effective safeguard against the inadvertent loss of safety during fine-tuning, establishing SafeMERGE as a simple post-fine-tuning defense.
♻ ☆ Superficial Self-Improved Reasoners Benefit from Model Merging EMNLP 2025
As scaled language models (LMs) approach human-level reasoning capabilities, self-improvement emerges as a solution to synthesizing high-quality data corpus. While previous research has identified model collapse as a risk in self-improvement, where model outputs become increasingly deterministic, we discover a more fundamental challenge: the superficial self-improved reasoners phenomenon. In particular, our analysis reveals that even when LMs show improved in-domain (ID) reasoning accuracy, they actually compromise their generalized reasoning capabilities on out-of-domain (OOD) tasks due to memorization rather than genuine. Through a systematic investigation of LM architecture, we discover that during self-improvement, LM weight updates are concentrated in less reasoning-critical layers, leading to superficial learning. To address this, we propose Iterative Model Merging (IMM), a method that strategically combines weights from original and self-improved models to preserve generalization while incorporating genuine reasoning improvements. Our approach effectively mitigates both LM collapse and superficial learning, moving towards more stable self-improving systems.
comment: EMNLP 2025
♻ ☆ SafeCOMM: A Study on Safety Degradation in Fine-Tuned Telecom Large Language Models
Fine-tuning large language models (LLMs) on telecom datasets is a common practice to adapt general-purpose models to the telecom domain. However, little attention has been paid to how this process may compromise model safety. Recent research has shown that even benign fine-tuning can degrade the safety alignment of LLMs, causing them to respond to harmful or unethical user queries. In this paper, we investigate this issue by fine-tuning LLMs on three representative telecom datasets and show that safety degrades even for light telecom domain adaptation. To this end, we introduce TeleHarm, the first telecom-specific red-teaming benchmark, which we use alongside established Direct-Harm and HexPhi datasets to systematically assess harmful behavior. We further extend our analysis to publicly available TeleLLMs that were continually pre-trained on large telecom corpora, revealing that safety alignment is severely lacking, primarily due to the omission of safety-focused instruction tuning. To address these issues, we evaluate three realignment defenses: SafeInstruct, SafeLoRA, SafeMERGE. We show that, across all settings, the proposed defenses can effectively restore safety without compromising telecom task performance, leading to Safe teleCOMMunication (SafeCOMM) models. Our work serves as both a diagnostic study and practical guide for safety realignment in telecom-tuned LLMs, underscoring the need for safety-aware instruction and fine-tuning in the telecom domain.
♻ ☆ Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance
Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain challenging due to the significant computational cost of conducting them rigorously at scale, and are therefore largely absent. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance when assessing data quality. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Based on these insights, we design a principled curation recipe that produces a new data mixture, TuluTalk, which contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Our findings offer actionable insights for constructing more effective post-training datasets that improve model performance within practical resource limits. To support future research, we publicly release both the annotated source datasets and our curated TuluTalk mixture.
♻ ☆ Human-Aligned Faithfulness in Toxicity Explanations of LLMs
The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs' reasoning about toxicity -- from their explanations that justify a stance -- to enhance their trustworthiness in downstream tasks. Despite extensive research on explainability, it is not straightforward to adopt existing methods to evaluate free-form toxicity explanation due to their over-reliance on input text perturbations, among other challenges. To account for these, we propose a novel, theoretically-grounded multi-dimensional criterion, Human-Aligned Faithfulness (HAF), that measures the extent to which LLMs' free-form toxicity explanations align with those of a rational human under ideal conditions. We develop six metrics, based on uncertainty quantification, to comprehensively evaluate HAF of LLMs' toxicity explanations with no human involvement, and highlight how "non-ideal" the explanations are. We conduct several experiments on three Llama models (of size up to 70B) and an 8B Ministral model on five diverse toxicity datasets. Our results show that while LLMs generate plausible explanations to simple prompts, their reasoning about toxicity breaks down when prompted about the nuanced relations between the complete set of reasons, the individual reasons, and their toxicity stances, resulting in inconsistent and irrelevant responses. We open-source our code at https://github.com/uofthcdslab/HAF and LLM-generated explanations at https://huggingface.co/collections/uofthcdslab/haf.
comment: 23 pages, 5 figures, 7 tables
♻ ☆ AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation
While RAG demonstrates remarkable capabilities in LLM applications, its effectiveness is hindered by the ever-increasing length of retrieved contexts, which introduces information redundancy and substantial computational overhead. Existing context pruning methods, such as LLMLingua, lack contextual awareness and offer limited flexibility in controlling compression rates, often resulting in either insufficient pruning or excessive information loss. In this paper, we propose AttentionRAG, an attention-guided context pruning method for RAG systems. The core idea of AttentionRAG lies in its attention focus mechanism, which reformulates RAG queries into a next-token prediction paradigm. This mechanism isolates the query's semantic focus to a single token, enabling precise and efficient attention calculation between queries and retrieved contexts. Extensive experiments on LongBench and Babilong benchmarks show that AttentionRAG achieves up to 6.3$\times$ context compression while outperforming LLMLingua methods by around 10\% in key metrics.
♻ ☆ Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions
Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with patient details. In this paper, we first have three hematology-oncology physicians evaluate cancer-related questions drawn from real patients. While LLM responses are generally accurate, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM -- including GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet -- corrects these false presuppositions more than $43\%$ of the time. To study mitigation strategies, we further construct a 150-question Cancer-Myth-NFP set, in which physicians confirm the absence of false presuppositions. We find typical mitigation strategies, such as adding precautionary prompts with GEPA optimization, can raise accuracy on Cancer-Myth to $80\%$, but at the cost of misidentifying presuppositions in $41\%$ of Cancer-Myth-NFP questions and causing a $10\%$ relative performance drop on other medical benchmarks. These findings highlight a critical gap in the reliability of LLMs, show that prompting alone is not a reliable remedy for false presuppositions, and underscore the need for more robust safeguards in medical AI systems.
♻ ☆ Less is More: Local Intrinsic Dimensions of Contextual Language Models NeurIPS 2025
Understanding the internal mechanisms of large language models (LLMs) remains a challenging and complex endeavor. Even fundamental questions, such as how fine-tuning affects model behavior, often require extensive empirical evaluation. In this paper, we introduce a novel perspective based on the geometric properties of contextual latent embeddings to study the effects of training and fine-tuning. To that end, we measure the local dimensions of a contextual language model's latent space and analyze their shifts during training and fine-tuning. We show that the local dimensions provide insights into the model's training dynamics and generalization ability. Specifically, the mean of the local dimensions predicts when the model's training capabilities are exhausted, as exemplified in a dialogue state tracking task, overfitting, as demonstrated in an emotion recognition task, and grokking, as illustrated with an arithmetic task. Furthermore, our experiments suggest a practical heuristic: reductions in the mean local dimension tend to accompany and predict subsequent performance gains. Through this exploration, we aim to provide practitioners with a deeper understanding of the implications of fine-tuning on embedding spaces, facilitating informed decisions when configuring models for specific applications. The results of this work contribute to the ongoing discourse on the interpretability, adaptability, and generalizability of LLMs by bridging the gap between intrinsic model mechanisms and geometric properties in the respective embeddings.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025; in press). 10 pages, with an additional 17 pages in the appendix. Our code is available at https://github.com/aidos-lab/Topo_LLM_public and https://github.com/aidos-lab/grokking-via-lid
♻ ☆ Computational-Assisted Systematic Review and Meta-Analysis (CASMA): Effect of a Subclass of GnRH-a on Endometriosis Recurrence
Background: Evidence synthesis facilitates evidence-based medicine. This task becomes increasingly difficult to accomplished with applying computational solutions, since the medical literature grows at astonishing rates. Objective: This study evaluates an information retrieval-driven workflow, CASMA, to enhance the efficiency, transparency, and reproducibility of systematic reviews. Endometriosis recurrence serves as the ideal case due to its complex and ambiguous literature. Methods: The hybrid approach integrates PRISMA guidelines with fuzzy matching and regular expression (regex) to facilitate semi-automated deduplication and filtered records before manual screening. The workflow synthesised evidence from randomised controlled trials on the efficacy of a subclass of gonadotropin-releasing hormone agonists (GnRH-a). A modified splitting method addressed unit-of-analysis errors in multi-arm trials. Results: The workflow sharply reduced the screening workload, taking only 11 days to fetch and filter 33,444 records. Seven eligible RCTs were synthesized (841 patients). The pooled random-effects model yielded a Risk Ratio (RR) of $0.64$ ($95\%$ CI $0.48$ to $0.86$), demonstrating a $36\%$ reduction in recurrence, with non-significant heterogeneity ($I^2=0.00\%$, $\tau^2=0.00$). The findings were robust and stable, as they were backed by sensitivity analyses. Conclusion: This study demonstrates an application of an information-retrieval-driven workflow for medical evidence synthesis. The approach yields valuable clinical results and a generalisable framework to scale up the evidence synthesis, bridging the gap between clinical research and computer science.
comment: 15 pages, 12 figures and 4 tables. This work describes an information retrieval-driven workflow for medical evidence synthesis, with an application to endometriosis recurrence. The method can be generalized to other systematic reviews. The preregistered protocol is available: https://doi.org/10.17605/OSF.IO/R2DFA
♻ ☆ How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?
Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary, resulting in higher usage costs to non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, previous work on vocabulary expansion has focused on high-resource settings assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion in low-resource settings has yet to be explored. In this article, we investigate vocabulary expansion in low-resource settings by considering embedding initialization methods and continual pre-training strategies. Through extensive experiments across typologically diverse languages, tasks and models, we establish a set of strategies to perform vocabulary expansion for faster inference, while striving to maintain competitive downstream performance to baselines. This is achieved with only 30K sentences ($\sim$0.01GB text data) from the target language.
comment: Accepted to Computational Linguistics
♻ ☆ A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving
With the rapid adoption of Large Language Models (LLMs), LLM-adapters have become increasingly common, providing lightweight specialization of large-scale models. Serving hundreds or thousands of these adapters on a single GPU allows request aggregation, increasing throughput, but may also cause request starvation if GPU memory limits are exceeded. To address this issue, this study focuses on determining the joint configuration of concurrent and parallel adapters that maximizes GPU throughput without inducing starvation, given heterogeneous adapter and traffic properties. We propose a data-driven ML approach leveraging interpretable models to tackle this caching problem and introduce the first Digital Twin capable of reproducing an LLM-adapter serving system, enabling efficient training data generation. Experiments with the vLLM framework and LoRA adapters show that the Digital Twin reproduces throughput within 5.1% of real results, while the ML approach predicts optimal numbers of concurrent and parallel adapters with an error of at most 7.2% under heterogeneous, real-world workloads.
comment: Accepted in a computer science workshop
♻ ☆ Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents with factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. We find that activation steering can suppress evaluation awareness and make the model act like it is deployed even when the cue is present. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.
♻ ☆ Estimating LLM Consistency: A User Baseline vs Surrogate Metrics EMNLP 2025
Large language models (LLMs) are prone to hallucinations and sensitiveto prompt perturbations, often resulting in inconsistent or unreliablegenerated text. Different methods have been proposed to mitigate suchhallucinations and fragility, one of which is to measure theconsistency of LLM responses -- the model's confidence in the responseor likelihood of generating a similar response when resampled. Inprevious work, measuring LLM response consistency often relied oncalculating the probability of a response appearing within a pool of resampledresponses, analyzing internal states, or evaluating logits of resopnses.However, it was not clear how well theseapproaches approximated users' perceptions of consistency of LLMresponses. To find out, we performed a user study ($n=2,976$)demonstrating that current methods for measuring LLM responseconsistency typically do not align well with humans' perceptions of LLMconsistency. We propose a logit-based ensemble method for estimatingLLM consistency and show that our method matches the performance of thebest-performing existing metric in estimating human ratings of LLMconsistency. Our results suggest that methods for estimating LLMconsistency without human evaluation are sufficiently imperfect towarrant broader use of evaluation with human input; this would avoidmisjudging the adequacy of models because of the imperfections ofautomated consistency metrics.
comment: Published as a main conference paper at EMNLP 2025
♻ ☆ Can Large Language Models Unlock Novel Scientific Research Ideas? EMNLP 2025
The widespread adoption of Large Language Models (LLMs) and publicly available ChatGPT have marked a significant turning point in the integration of Artificial Intelligence (AI) into people's everyday lives. This study examines the ability of Large Language Models (LLMs) to generate future research ideas from scientific papers. Unlike tasks such as summarization or translation, idea generation lacks a clearly defined reference set or structure, making manual evaluation the default standard. However, human evaluation in this setting is extremely challenging ie: it requires substantial domain expertise, contextual understanding of the paper, and awareness of the current research landscape. This makes it time-consuming, costly, and fundamentally non-scalable, particularly as new LLMs are being released at a rapid pace. Currently, there is no automated evaluation metric specifically designed for this task. To address this gap, we propose two automated evaluation metrics: Idea Alignment Score (IAScore) and Idea Distinctness Index. We further conducted human evaluation to assess the novelty, relevance, and feasibility of the generated future research ideas. This investigation offers insights into the evolving role of LLMs in idea generation, highlighting both its capability and limitations. Our work contributes to the ongoing efforts in evaluating and utilizing language models for generating future research ideas. We make our datasets and codes publicly available
comment: EMNLP 2025 (Main)
♻ ☆ ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation
Legal claims refer to the plaintiff's demands in a case and are essential to guiding judicial reasoning and case resolution. While many works have focused on improving the efficiency of legal professionals, the research on helping non-professionals (e.g., plaintiffs) remains unexplored. This paper explores the problem of legal claim generation based on the given case's facts. First, we construct ClaimGen-CN, the first dataset for Chinese legal claim generation task, from various real-world legal disputes. Additionally, we design an evaluation metric tailored for assessing the generated claims, which encompasses two essential dimensions: factuality and clarity. Building on this, we conduct a comprehensive zero-shot evaluation of state-of-the-art general and legal-domain large language models. Our findings highlight the limitations of the current models in factual precision and expressive clarity, pointing to the need for more targeted development in this domain. To encourage further exploration of this important task, we will make the dataset publicly available.
♻ ☆ Are LLMs Empathetic to All? Investigating the Influence of Multi-Demographic Personas on a Model's Empathy EMNLP 2025
Large Language Models' (LLMs) ability to converse naturally is empowered by their ability to empathetically understand and respond to their users. However, emotional experiences are shaped by demographic and cultural contexts. This raises an important question: Can LLMs demonstrate equitable empathy across diverse user groups? We propose a framework to investigate how LLMs' cognitive and affective empathy vary across user personas defined by intersecting demographic attributes. Our study introduces a novel intersectional analysis spanning 315 unique personas, constructed from combinations of age, culture, and gender, across four LLMs. Results show that attributes profoundly shape a model's empathetic responses. Interestingly, we see that adding multiple attributes at once can attenuate and reverse expected empathy patterns. We show that they broadly reflect real-world empathetic trends, with notable misalignments for certain groups, such as those from Confucian culture. We complement our quantitative findings with qualitative insights to uncover model behaviour patterns across different demographic groups. Our findings highlight the importance of designing empathy-aware LLMs that account for demographic diversity to promote more inclusive and equitable model behaviour.
comment: 9 pages, 4 figures, 4 tables, EMNLP 2025 Findings
♻ ☆ Bootstrapping Referring Multi-Object Tracking
Referring understanding is a fundamental task that bridges natural language and visual content by localizing objects described in free-form expressions. However, existing works are constrained by limited language expressiveness, lacking the capacity to model object dynamics in spatial numbers and temporal states. To address these limitations, we introduce a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking, comprehensively accounting for variations in object quantity and temporal semantics. Along with RMOT, we introduce a RMOT benchmark named Refer-KITTI-V2, featuring scalable and diverse language expressions. To efficiently generate high-quality annotations covering object dynamics with minimal manual effort, we propose a semi-automatic labeling pipeline that formulates a total of 9,758 language prompts. In addition, we propose TempRMOT, an elegant end-to-end Transformer-based framework for RMOT. At its core is a query-driven Temporal Enhancement Module that represents each object as a Transformer query, enabling long-term spatial-temporal interactions with other objects and past frames to efficiently refine these queries. TempRMOT achieves state-of-the-art performance on both Refer-KITTI and Refer-KITTI-V2, demonstrating the effectiveness of our approach. The source code and dataset is available at https://github.com/zyn213/TempRMOT.
♻ ☆ Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices
Large Multimodal Models (LMMs) are inherently modular, consisting of vision and audio encoders, projectors, and large language models. Yet, they are almost always executed monolithically, which underutilizes the heterogeneous accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end latency. In this paper, we present NANOMIND, a hardware--software co-design inference framework for Large Multimodal Models (LMMs) that breaks large models into modular ``bricks'' (vision, language, audio, etc.) and maps each to its ideal accelerator. The key insight is that large models can be broken into modular components and scheduled to run on the most appropriate compute units. It performs module-level dynamic offloading across accelerators on unified-memory SoCs. By combining customized hardware design, system-level scheduling, and optimized low-bit computation kernels, we demonstrate our framework with a compact, battery-powered device capable of running LMMs entirely on device. This prototype functions as a self-contained intelligent assistant that requires no network connectivity, while achieving higher throughput and superior power efficiency under strict resource constraints. The design further bypasses CPU bottlenecks and reduces redundant memory usage through token-aware buffer management and module-level coordination. Our system outperforms existing implementations in resource efficiency, cutting energy consumption by 42.3\% and GPU memory usage by 11.2\%. This enables a battery-powered device to run LLaVA-OneVision with a camera for nearly half a day and LLaMA-3-8B for voice interactions up to almost 20.8 hours.
♻ ☆ SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
comment: Project Website: http://simbench.tiancheng.hu/ Data: https://huggingface.co/datasets/pitehu/SimBench
♻ ☆ MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses ICLR 2025
Scientific discovery plays a pivotal role in advancing human society, and recent progress in large language models (LLMs) suggests their potential to accelerate this process. However, it remains unclear whether LLMs can autonomously generate novel and valid hypotheses in chemistry. In this work, we investigate whether LLMs can discover high-quality chemistry hypotheses given only a research background-comprising a question and/or a survey-without restriction on the domain of the question. We begin with the observation that hypothesis discovery is a seemingly intractable task. To address this, we propose a formal mathematical decomposition grounded in a fundamental assumption: that most chemistry hypotheses can be composed from a research background and a set of inspirations. This decomposition leads to three practical subtasks-retrieving inspirations, composing hypotheses with inspirations, and ranking hypotheses - which together constitute a sufficient set of subtasks for the overall scientific discovery task. We further develop an agentic LLM framework, MOOSE-Chem, that is a direct implementation of this mathematical decomposition. To evaluate this framework, we construct a benchmark of 51 high-impact chemistry papers published and online after January 2024, each manually annotated by PhD chemists with background, inspirations, and hypothesis. The framework is able to rediscover many hypotheses with high similarity to the groundtruth, successfully capturing the core innovations-while ensuring no data contamination since it uses an LLM with knowledge cutoff date prior to 2024. Finally, based on LLM's surprisingly high accuracy on inspiration retrieval, a task with inherently out-of-distribution nature, we propose a bold assumption: that LLMs may already encode latent scientific knowledge associations not yet recognized by humans.
comment: Accepted by ICLR 2025
♻ ☆ Prompting is not Enough: Exploring Knowledge Integration and Controllable Generation
Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP), primarily focused on extracting answers from unstructured textual data. With the rapid advancements in Large Language Models (LLMs), LLM-based OpenQA methods have reaped the benefits of emergent understanding and answering capabilities enabled by massive parameters compared to traditional methods. However, most of these methods encounter two critical challenges: how to integrate knowledge into LLMs effectively and how to adaptively generate results with specific answer formats for various task situations. To address these challenges, we propose a novel framework named GenKI, which aims to improve the OpenQA performance by exploring Knowledge Integration and controllable Generation on LLMs simultaneously. Specifically, we first train a dense passage retrieval model to retrieve associated knowledge from a given knowledge base. Subsequently, we introduce a novel knowledge integration model that incorporates the retrieval knowledge into instructions during fine-tuning to intensify the model. Furthermore, to enable controllable generation in LLMs, we leverage a certain fine-tuned LLM and an ensemble based on text consistency incorporating all coherence, fluency, and answer format assurance. Finally, extensive experiments conducted on the TriviaQA, MSMARCO, and CMRC2018 datasets, featuring diverse answer formats, have demonstrated the effectiveness of GenKI with comparison of state-of-the-art baselines. Moreover, ablation studies have disclosed a linear relationship between the frequency of retrieved knowledge and the model's ability to recall knowledge accurately against the ground truth. Our code of GenKI is available at https://github.com/USTC-StarTeam/GenKI
comment: 13 pages, 5 figures
♻ ☆ TrajAgent: An LLM-Agent Framework for Trajectory Modeling via Large-and-Small Model Collaboration NeurIPS 2025
Trajectory modeling, which includes research on trajectory data pattern mining and future prediction, has widespread applications in areas such as life services, urban transportation, and public administration. Numerous methods have been proposed to address specific problems within trajectory modeling. However, the heterogeneity of data and the diversity of trajectory tasks make effective and reliable trajectory modeling an important yet highly challenging endeavor, even for domain experts. \fix In this paper, we propose \textit{TrajAgent}, a agent framework powered by large language models (LLMs), designed to facilitate robust and efficient trajectory modeling through automation modeling. This framework leverages and optimizes diverse specialized models to address various trajectory modeling tasks across different datasets effectively. \unfix~In \textit{TrajAgent}, we first develop \textit{UniEnv}, an execution environment with a unified data and model interface, to support the execution and training of various models. Building on \textit{UniEnv}, we introduce an agentic workflow designed for automatic trajectory modeling across various trajectory tasks and data. Furthermore, we introduce collaborative learning schema between LLM-based agents and small speciallized models, to enhance the performance of the whole framework effectively. Extensive experiments on four tasks using four real-world datasets demonstrate the effectiveness of \textit{TrajAgent} in automated trajectory modeling, achieving a performance improvement of \fix 2.38\%-69.91\% \unfix over baseline methods. The codes and data can be accessed via https://github.com/tsinghua-fib-lab/TrajAgent.
comment: Accepted by NeurIPS 2025, https://github.com/tsinghua-fib-lab/TrajAgent
♻ ☆ LLMs can hide text in other text of the same length
A meaningful text can be hidden inside another, completely different yet still coherent and plausible, text of the same length. For example, a tweet containing a harsh political critique could be embedded in a tweet that celebrates the same political leader, or an ordinary product review could conceal a secret manuscript. This uncanny state of affairs is now possible thanks to Large Language Models, and in this paper we present a simple and efficient protocol to achieve it. We show that even modest 8-billion-parameter open-source LLMs are sufficient to obtain high-quality results, and a message as long as this abstract can be encoded and decoded locally on a laptop in seconds. The existence of such a protocol demonstrates a radical decoupling of text from authorial intent, further eroding trust in written communication, already shaken by the rise of LLM chatbots. We illustrate this with a concrete scenario: a company could covertly deploy an unfiltered LLM by encoding its answers within the compliant responses of a safe model. This possibility raises urgent questions for AI safety and challenges our understanding of what it means for a Large Language Model to know something.
comment: 21 pages, main paper 9 pages
♻ ☆ MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search NeurIPS 2025
Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the new task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs' capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM's internal heuristics to formulate the fine-grained hypothesis it itself would judge as the most promising among all the possible hypotheses it might generate, based on its own internal scoring-thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape using an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent literature show that our method consistently outperforms strong baselines.
comment: Accepted by NeurIPS 2025
♻ ☆ The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation NeurIPS 2025
Large language models are able to exploit in-context learning to access external knowledge beyond their training data through retrieval-augmentation. While promising, its inner workings remain unclear. In this work, we shed light on the mechanism of in-context retrieval augmentation for question answering by viewing a prompt as a composition of informational components. We propose an attribution-based method to identify specialized attention heads, revealing in-context heads that comprehend instructions and retrieve relevant contextual information, and parametric heads that store entities' relational knowledge. To better understand their roles, we extract function vectors and modify their attention weights to show how they can influence the answer generation process. Finally, we leverage the gained insights to trace the sources of knowledge used during inference, paving the way towards more safe and transparent language models.
comment: Accepted at NeurIPS 2025
♻ ☆ TaoSR1: The Thinking Model for E-commerce Relevance Search
Query-product relevance prediction is a core task in e-commerce search. BERT-based models excel at semantic matching but lack complex reasoning capabilities. While Large Language Models (LLMs) are explored, most still use discriminative fine-tuning or distill to smaller models for deployment. We propose a framework to directly deploy LLMs for this task, addressing key challenges: Chain-of-Thought (CoT) error accumulation, discriminative hallucination, and deployment feasibility. Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) Offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3) Difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO) to mitigate discriminative hallucination. Additionally, post-CoT processing and a cumulative probability-based partitioning method enable efficient online deployment. TaoSR1 significantly outperforms baselines on offline datasets and achieves substantial gains in online side-by-side human evaluations, introducing a novel paradigm for applying CoT reasoning to relevance classification.
♻ ☆ Thought Anchors: Which LLM Reasoning Steps Matter?
Current frontier large-language models rely on reasoning to achieve state-of-the-art performance. Many existing interpretability are limited in this area, as standard methods have been designed to study single forward passes of a model rather than the multi-token computational steps that unfold during reasoning. We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We introduce a black-box method that measures each sentence's counterfactual importance by repeatedly sampling replacement sentences from the model, filtering for semantically different ones, and continuing the chain of thought from that point onwards to quantify the sentence's impact on the distribution of final answers. We discover that certain sentences can have an outsized impact on the trajectory of the reasoning trace and final answer. We term these sentences \textit{thought anchors}. These are generally planning or uncertainty management sentences, and specialized attention heads consistently attend from subsequent sentences to thought anchors. We further show that examining sentence-sentence causal links within a reasoning trace gives insight into a model's behavior. Such information can be used to predict a problem's difficulty and the extent different question domains involve sequential or diffuse reasoning. As a proof-of-concept, we demonstrate that our techniques together provide a practical toolkit for analyzing reasoning models by conducting a detailed case study of how the model solves a difficult math problem, finding that our techniques yield a consistent picture of the reasoning trace's structure. We provide an open-source tool (thought-anchors.com) for visualizing the outputs of our methods on further problems. The convergence across our methods shows the potential of sentence-level analysis for a deeper understanding of reasoning models.
comment: Paul C. Bogdan and Uzay Macar contributed equally to this work, and their listed order was determined by coinflip. Neel Nanda and Arthur Conmy contributed equally to this work as senior authors, and their listed order was determined by coinflip
♻ ☆ ThinkBrake: Mitigating Overthinking in Tool Reasoning
Small reasoning models (SRMs) often overthink during tool use: they reach a correct tool-argument configuration, then continue reasoning and overwrite it with an incorrect final call. We diagnose overthinking via oracle rollouts that inject at sentence boundaries. On the Berkeley Function Calling Leaderboard (BFCL), this oracle termination lifts average accuracy from 85.8\% to 94.2\% while reducing tokens by 80-94\%, revealing substantial recoverable headroom and potential redundant reasoning. While prior work on concise reasoning has largely targeted mathematics, tool reasoning remains underexplored. We adapt various early-termination baselines to tool use and introduce ThinkBrake, a training-free decoding heuristic. ThinkBrake monitors the log-probability margin between and the current top token at sentence boundaries and triggers termination when this margin becomes small. Across BFCL's single turn, non-live and live splits, ThinkBrake preserves or improves accuracy while reducing tokens up to 25\%, outperforming various baselines.
♻ ☆ OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model
Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S
comment: Technical Report, Update on OpenS2S_v1.5
♻ ☆ When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning
While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users, overlooking diverse human values and minority viewpoints. Although personalized preference learning addresses this by tailoring separate preferences for individual users, the field lacks standardized methods to assess its effectiveness. We present a multi-faceted evaluation framework that measures not only performance but also fairness, unintended effects, and adaptability across varying levels of preference divergence. Through extensive experiments comparing eight personalization methods across three preference datasets, we demonstrate that performance differences between methods could reach 36% when users strongly disagree, and personalization can introduce up to 20% safety misalignment. These findings highlight the critical need for holistic evaluation approaches to advance the development of more effective and inclusive preference learning systems.
♻ ☆ Input Matters: Evaluating Input Structure's Impact on LLM Summaries of Sports Play-by-Play
A major concern when deploying LLMs in accuracy-critical domains such as sports reporting is that the generated text may not faithfully reflect the input data. We quantify how input structure affects hallucinations and other factual errors in LLM-generated summaries of NBA play-by-play data, across three formats: row-structured, JSON and unstructured. We manually annotated 3,312 factual errors across 180 game summaries produced by two models, Llama-3.1-70B and Qwen2.5-72B. Input structure has a strong effect: JSON input reduces error rates by 69% for Llama and 65% for Qwen compared to unstructured input, while row-structured input reduces errors by 54% for Llama and 51% for Qwen. A two-way repeated measures ANOVA shows that input structure accounts for over 80% of the variance in error rates, with Tukey HSD post hoc tests confirming statistically significant differences between all input formats.
comment: Accepted at INLG 2025
♻ ☆ LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora
Retrieval-Augmented Generation (RAG) is widely used to mitigate hallucinations of Large Language Models (LLMs) by leveraging external knowledge. While effective for simple queries, traditional RAG systems struggle with large-scale, unstructured corpora where information is fragmented. Recent advances incorporate knowledge graphs to capture relational structures, enabling more comprehensive retrieval for complex, multi-hop reasoning tasks. However, existing graph-based RAG (GraphRAG) methods rely on unstable and costly relation extraction for graph construction, often producing noisy graphs with incorrect or inconsistent relations that degrade retrieval quality. In this paper, we revisit the pipeline of existing GraphRAG systems and propose LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient framework that enables reliable graph construction and precise passage retrieval. Specifically, LinearRAG constructs a relation-free hierarchical graph, termed Tri-Graph, using only lightweight entity extraction and semantic linking, avoiding unstable relation modeling. This new paradigm of graph construction scales linearly with corpus size and incurs no extra token consumption, providing an economical and reliable indexing of the original passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant entity activation via local semantic bridging, followed by (ii) passage retrieval through global importance aggregation. Extensive experiments on four datasets demonstrate that LinearRAG significantly outperforms baseline models.
♻ ☆ Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning
The reasoning capabilities of Large Language Models (LLMs) are typically developed through the single-turn reinforcement learning, whereas real-world applications often involve multi-turn interactions with human feedback, leading to a potential mismatch between training and deployment conditions. In this work, we study whether multi-turn training with human feedback is necessary for reasoning tasks. We compare conventional single-turn training with three multi-turn strategies and reach contrary conclusions to previous research. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations, while models trained with multi-turn strategies exhibit a significant degradation in single-turn reasoning performance. These results suggest that for tasks with complete information, robust single-turn training remains more effective and reliable, as multi-turn training with basic feedback provides limited benefits and can even degrade reasoning capabilities.
♻ ☆ StereoDetect: Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings
Stereotypes are known to have very harmful effects, making their detection critically important. However, current research predominantly focuses on detecting and evaluating stereotypical biases, thereby leaving the study of stereotypes in its early stages. Our study revealed that many works have failed to clearly distinguish between stereotypes and stereotypical biases, which has significantly slowed progress in advancing research in this area. Stereotype and Anti-stereotype detection is a problem that requires social knowledge; hence, it is one of the most difficult areas in Responsible AI. This work investigates this task, where we propose a five-tuple definition and provide precise terminologies disentangling stereotypes, anti-stereotypes, stereotypical bias, and general bias. We provide a conceptual framework grounded in social psychology for reliable detection. We identify key shortcomings in existing benchmarks for this task of stereotype and anti-stereotype detection. To address these gaps, we developed StereoDetect, a well curated, definition-aligned benchmark dataset designed for this task. We show that sub-10B language models and GPT-4o frequently misclassify anti-stereotypes and fail to recognize neutral overgeneralizations. We demonstrate StereoDetect's effectiveness through multiple qualitative and quantitative comparisons with existing benchmarks and models fine-tuned on them. The dataset and code is available at https://github.com/KaustubhShejole/StereoDetect.
♻ ☆ Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment
Recent advances in LLMs have greatly improved general-domain NLP tasks. Yet, their adoption in critical domains, such as clinical trial recruitment, remains limited. As trials are designed in natural language and patient data is represented as both structured and unstructured text, the task of matching trials and patients benefits from knowledge aggregation and reasoning abilities of LLMs. Classical approaches are trial-specific and LLMs with their ability to consolidate distributed knowledge hold the potential to build a more general solution. Yet recent applications of LLM-assisted methods rely on proprietary models and weak evaluation benchmarks. In this survey, we are the first to analyze the task of trial-patient matching and contextualize emerging LLM-based approaches in clinical trial recruitment. We critically examine existing benchmarks, approaches and evaluation frameworks, the challenges to adopting LLM technologies in clinical research and exciting future directions.
♻ ☆ Can Confidence Estimates Decide When Chain-of-Thought Is Necessary for LLMs?
Chain-of-thought (CoT) prompting has emerged as a common technique for enhancing the reasoning abilities of large language models (LLMs). While extended reasoning can boost accuracy on complex tasks, it is often unnecessary and substantially increases token usage, limiting the practicality of reasoning models in many scenarios. Recent models, such as GPT-OSS and Qwen3, expose controls that enable users to adjust the length of CoT or determine whether it is used at all. Yet, it remains unclear when CoT should be used: on some tasks it improves performance, while on others it provides little benefit or even harms performance. We address this challenge with confidence-gated CoT, where a model invokes reasoning only when confidence in its direct answer is low. To this end, we present the first systematic study of training-free confidence estimation methods for CoT gating. Specifically, we evaluate four training-free confidence estimation methods and compare them to a random baseline and an oracle that always knows when CoT is needed. Through extensive experiments, we show that existing training-free confidence measures can reduce redundant CoT and outperform randomly invoked CoT. However, the utility of individual confidence measures is inconsistent, varying with both the dataset and the model, underscoring the difficulty of deploying confidence-gated CoT in practice. By analysing both strengths and failure modes, our study highlights the potential and limitations of current methods and paves the way toward more reliable adaptive gating of CoT.
comment: Under Review
♻ ☆ First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training NeurIPS 2025
Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL), which require expensive and manually annotated multi-modal data--an ultimately unsustainable resource. This limitation has motivated a growing interest in unsupervised paradigms as a third stage of post-training after SFT and RL. While recent efforts have explored this direction, their methods are complex and difficult to iterate. To address this, we propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs, enabling continual self-improvement without any external supervision. The training method of MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that such training method effectively improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3\%$\rightarrow$72.9\% on MathVista, 62.9\%$\rightarrow$68.7\% on We-Math), using standard dataset without ground truth labels. To further explore scalability, we extend our framework to a data self-generation setting, designing two strategies that prompt the MLLM to synthesize new training samples on its own. Additional experiments show that combining these synthetic data with the unsupervised training method can also boost performance, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for autonomous enhancement of MLLMs, serving as a critical third step after initial SFT and RL in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.
comment: Accepted by NeurIPS 2025
♻ ☆ DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE
Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than using a separate speech decoder. This integration also results in lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is insufficient to support the pretraining of MLLMs compared to the vast amount of text data required to pretrain text LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk first adaptively distinguishes modality experts according to their modality load within the LLM. Each modality expert then undergoes specialized single-modality training, followed by joint multimodal collaborative training. As a result, DeepTalk incurs only a 5.5% performance drop compared to the original LLM, which is significantly lower than the average performance drop of over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within 0.5 seconds, ensuring a seamless and intelligent speech interaction experience. Code and models are released at https://github.com/talkking/DeepTalk.
comment: Under Review
♻ ☆ COUNTDOWN: Contextually Sparse Activation Filtering Out Unnecessary Weights in Down Projection EMNLP 2025
The growing size of large language models has created significant computational inefficiencies. To address this challenge, sparse activation methods selectively deactivates non-essential parameters during inference, reducing computational costs in FFNN layers. While existing methods focus on non-linear gating mechanisms, we hypothesize that the sparsity of the FFNN layer lies globally in the form of a linear combination over its internal down projection matrix. Based on this insight, we propose two methods: M-COUNTDOWN, leveraging indirect coefficients, and D-COUNTDOWN, utilizing direct coefficients of the linear combination. Experimental results demonstrate that D-COUNTDOWN can omit 90% of computations with performance loss as low as 5.5% ideally, while M-COUNTDOWN provides a predictor-free solution with up to 29.4% better performance preservation compared to existing methods. Our specialized kernel implementations effectively realize these theoretical gains into substantial real-world acceleration.
comment: EMNLP 2025 (Main Track)
♻ ☆ GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability
Improving the general capabilities of large language models (LLMs) is an active research topic. As a common data structure in many real-world domains, understanding graph data is a crucial part of advancing general intelligence. To this end, we propose a dynamic benchmark named GraphInstruct in this paper, which comprehensively includes 21 classical graph reasoning tasks, providing diverse graph generation pipelines and detailed intermediate reasoning steps for each sample. Based on GraphInstruct, we develop GraphSolver via efficient instruction-tuning, which demonstrates prominent graph understanding capability compared to other open-sourced LLMs. To further endow LLMs with multi-step graph reasoning capability, we propose a label-mask training strategy and build GraphSolver+, which leverages masked supervision on intermediate reasoning tokens to emphasize crucial node-identification signals. As one of the pioneering efforts to enhance the graph understanding and reasoning abilities of LLMs, extensive experiments have demonstrated the superiority of GraphSolver and GraphSolver+ over other LLMs. We sincerely hope GraphInstruct will facilitate further research on applying LLMs to graph-structured data. Our code and data are released publicly at: https://github.com/CGCL-codes/GraphInstruct.
comment: Accepted by Frontiers of Computer Science
♻ ☆ The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. This enables a systematic study of multilingual retrieval behavior. Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with substantial performance drops occurring when the user query and supporting document languages differ. A key insight is that these failures stem primarily from the retriever's difficulty in ranking documents across languages. Finally, we propose two simple retrieval strategies that address this source of failure by enforcing equal retrieval from both languages or by translating the query, resulting in substantial improvements in cross-lingual and overall performance. These results highlight meaningful opportunities for improving multilingual retrieval, particularly in practical, real-world RAG applications.
comment: Accepted to ArabicNLP 2025
♻ ☆ ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions NeurIPS 2025
Recent advances in Large Language Models (LLMs) have propelled intelligent agents from reactive responses to proactive support. While promising, existing proactive agents either rely exclusively on observations from enclosed environments (e.g., desktop UIs) with direct LLM inference or employ rule-based proactive notifications, leading to suboptimal user intent understanding and limited functionality for proactive service. In this paper, we introduce ContextAgent, the first context-aware proactive agent that incorporates extensive sensory contexts surrounding humans to enhance the proactivity of LLM agents. ContextAgent first extracts multi-dimensional contexts from massive sensory perceptions on wearables (e.g., video and audio) to understand user intentions. ContextAgent then leverages the sensory contexts and personas from historical data to predict the necessity for proactive services. When proactive assistance is needed, ContextAgent further automatically calls the necessary tools to assist users unobtrusively. To evaluate this new task, we curate ContextAgentBench, the first benchmark for evaluating context-aware proactive LLM agents, covering 1,000 samples across nine daily scenarios and twenty tools. Experiments on ContextAgentBench show that ContextAgent outperforms baselines by achieving up to 8.5% and 6.0% higher accuracy in proactive predictions and tool calling, respectively. We hope our research can inspire the development of more advanced, human-centric, proactive AI assistants. The code and dataset are publicly available at https://github.com/openaiotlab/ContextAgent.
comment: Accepted by NeurIPS 2025
♻ ☆ ColorEcosystem: Powering Personalized, Standardized, and Trustworthy Agentic Service in massive-agent Ecosystem
With the rapid development of (multimodal) large language model-based agents, the landscape of agentic service management has evolved from single-agent systems to multi-agent systems, and now to massive-agent ecosystems. Current massive-agent ecosystems face growing challenges, including impersonal service experiences, a lack of standardization, and untrustworthy behavior. To address these issues, we propose ColorEcosystem, a novel blueprint designed to enable personalized, standardized, and trustworthy agentic service at scale. Concretely, ColorEcosystem consists of three key components: agent carrier, agent store, and agent audit. The agent carrier provides personalized service experiences by utilizing user-specific data and creating a digital twin, while the agent store serves as a centralized, standardized platform for managing diverse agentic services. The agent audit, based on the supervision of developer and user activities, ensures the integrity and credibility of both service providers and users. Through the analysis of challenges, transitional forms, and practical considerations, the ColorEcosystem is poised to power personalized, standardized, and trustworthy agentic service across massive-agent ecosystems. Meanwhile, we have also implemented part of ColorEcosystem's functionality, and the relevant code is open-sourced at https://github.com/opas-lab/color-ecosystem.
♻ ☆ Computational Analysis of Character Development in Holocaust Testimonies
This work presents a computational approach to analyze character development along the narrative timeline. The analysis characterizes the inner and outer changes the protagonist undergoes within a narrative, and the interplay between them. We consider transcripts of Holocaust survivor testimonies as a test case, each telling the story of an individual in first-person terms. We focus on the survivor's religious trajectory, examining the evolution of their disposition toward religious belief and practice along the testimony. Clustering the resulting trajectories in the dataset, we identify common sequences in the data. Our findings highlight multiple common structures of religiosity across the narratives: in terms of belief, most present a constant disposition, while for practice, most present an oscillating structure, serving as valuable material for historical and sociological research. This work demonstrates the potential of natural language processing techniques for analyzing character evolution through thematic trajectories in narratives.
♻ ☆ FaithLM: Towards Faithful Explanations for Large Language Models
Large language models (LLMs) increasingly produce natural language explanations, yet these explanations often lack faithfulness, and they do not reliably reflect the evidence the model uses to decide. We introduce FaithLM, a model-agnostic framework that evaluates and improves the faithfulness of LLM explanations without token masking or task-specific heuristics. FaithLM formalizes explanation faithfulness as an intervention property: a faithful explanation should yield a prediction shift when its content is contradicted. Theoretical analysis shows that the resulting contrary-hint score is a sound and discriminative estimator of faithfulness. Building on this principle, FaithLM iteratively refines both the elicitation prompt and the explanation to maximize the measured score. Experiments on three multi-domain datasets and multiple LLM backbones demonstrate that FaithLM consistently increases faithfulness and produces explanations more aligned with human rationales than strong self-explanation baselines. These findings highlight that intervention-based evaluation, coupled with iterative optimization, provides a principled route toward faithful and reliable LLM explanations.
♻ ☆ Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
AI agent frameworks operate in isolation, forcing agents to rediscover solutions and repeat mistakes across different systems. Despite valuable problem-solving experiences accumulated by frameworks like smolagents, OpenHands, and OWL, this knowledge remains trapped within individual systems, preventing the emergence of collective intelligence. Current memory systems focus on individual agents or framework-specific demonstrations, failing to enable cross-architecture knowledge transfer. We introduce AGENT KB, a universal memory infrastructure enabling seamless experience sharing across heterogeneous agent frameworks without retraining. AGENT KB aggregates trajectories into a structured knowledge base and serves lightweight APIs. At inference time, hybrid retrieval operates through two stages: planning seeds agents with cross-domain workflows, while feedback applies targeted diagnostic fixes. A disagreement gate ensures retrieved knowledge enhances rather than disrupts reasoning, addressing knowledge interference in cross-framework transfer. We validate AGENT KB across major frameworks on GAIA, Humanity's Last Exam, GPQA, and SWE-bench. Results show substantial improvements across diverse model families: compared to baseline pass@1, smolagents with AGENT KB achieve up to 18.7pp gains at pass@3 (55.2% -> 73.9%), while OpenHands improves 4.0pp on SWE-bench pass@1 (24.3% -> 28.3%). Similar improvements are observed across all base model families. Ablations confirm that hybrid retrieval and feedback stages are essential, with automatically generated experiences matching manual curation. This establishes the foundation for collective agent intelligence through shared memory infrastructures.
♻ ☆ Detecting and Rectifying Noisy Labels: A Similarity-based Approach
Label noise in datasets could significantly damage the performance and robustness of deep neural networks (DNNs) trained on these datasets. As the size of modern DNNs grows, there is a growing demand for automated tools for detecting such errors. In this paper, we propose post-hoc, model-agnostic noise detection and rectification methods utilizing the penultimate feature from a DNN. Our idea is based on the observation that the similarity between the penultimate feature of a mislabeled data point and its true class data points is higher than that for data points from other classes, making the probability of label occurrence within a tight, similar cluster informative for detecting and rectifying errors. Through theoretical and empirical analyses, we demonstrate that our approach achieves high detection performance across diverse, realistic noise scenarios and can automatically rectify these errors to improve dataset quality. Our implementation is available at https://anonymous.4open.science/r/noise-detection-and-rectification-AD8E.
♻ ☆ EQ-Negotiator: Emotion Policing Personas for Anti-Manipulation in Credit Collection Dialogues
Persona modeling in large language models typically focuses on static character traits, but overlooks the dynamic emotional intelligence required for real-time adversarial negotiations. In financial dialogues, this limitation creates critical vulnerabilities: debtors exploit predictable empathetic responses through emotional manipulation tactics like aggression, feigned distress, and guilt-tripping. To bridge this gap, we present EQ-Negotiator, a novel framework that grounds persona behavior in emotion dynamics rather than static personality profiles. Unlike naive empathy-centric agents, EQ-Negotiator integrates emotion memory and game-theoretic reasoning, powered by a Hidden Markov Model (HMM) to track and predict debtor emotional states. By analyzing both real-time and historical emotional cues, EQ-Negotiator strategically counters negative emotions (e.g., aggression, feigned distress) while preserving productive debtor relationships. This work advances persona modeling from descriptive character profiles to functional emotional architectures, establishing emotion as the critical link between persona design and tactical execution. Through agent-to-agent validation across 20 credit negotiation scenarios, we demonstrate that emotion-driven personas enable robust defensive capabilities against manipulation while maintaining strategic effectiveness.
♻ ☆ Seeing Symbols, Missing Cultures: Probing Vision-Language Models' Reasoning on Fire Imagery and Cultural Meaning COLING 2025
Vision-Language Models (VLMs) often appear culturally competent but rely on superficial pattern matching rather than genuine cultural understanding. We introduce a diagnostic framework to probe VLM reasoning on fire-themed cultural imagery through both classification and explanation analysis. Testing multiple models on Western festivals, non-Western traditions, and emergency scenes reveals systematic biases: models correctly identify prominent Western festivals but struggle with underrepresented cultural events, frequently offering vague labels or dangerously misclassifying emergencies as celebrations. These failures expose the risks of symbolic shortcuts and highlight the need for cultural evaluation beyond accuracy metrics to ensure interpretable and fair multimodal systems.
comment: 8 pages, 5 figures, 4 tables. Submitted to WiNLP 2025 Workshop at COLING 2025
♻ ☆ Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector NeurIPS 2025
LLM-as-a-Judge has emerged as a promising tool for automatically evaluating generated outputs, but its reliability is often undermined by potential biases in judgment. Existing efforts to mitigate these biases face key limitations: in-context learning-based methods fail to address rooted biases due to the evaluator's limited capacity for self-reflection, whereas fine-tuning is not applicable to all evaluator types, especially closed-source models. To address this challenge, we introduce the Reasoning-based Bias Detector (RBD), which is a plug-in module that identifies biased evaluations and generates structured reasoning to guide evaluator self-correction. Rather than modifying the evaluator itself, RBD operates externally and engages in an iterative process of bias detection and feedback-driven revision. To support its development, we design a complete pipeline consisting of biased dataset construction, supervision collection, distilled reasoning-based fine-tuning of RBD, and integration with LLM evaluators. We fine-tune four sizes of RBD models, ranging from 1.5B to 14B, and observe consistent performance improvements across all scales. Experimental results on 4 bias types--verbosity, position, bandwagon, and sentiment--evaluated using 8 LLM evaluators demonstrate RBD's strong effectiveness. For example, the RBD-8B model improves evaluation accuracy by an average of 18.5% and consistency by 10.9%, and surpasses prompting-based baselines and fine-tuned judges by 12.8% and 17.2%, respectively. These results highlight RBD's effectiveness and scalability. Additional experiments further demonstrate its strong generalization across biases and domains, as well as its efficiency.
comment: Accepted at NeurIPS 2025 (Camera-Ready Version)
♻ ☆ A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications
The advent of large language models (LLMs) has transformed information access and reasoning through open-ended natural language interaction. However, LLMs remain limited by static knowledge, factual hallucinations, and the inability to retrieve real-time or domain-specific information. Retrieval-Augmented Generation (RAG) mitigates these issues by grounding model outputs in external evidence, but traditional RAG pipelines are often single turn and heuristic, lacking adaptive control over retrieval and reasoning. Recent advances in agentic search address these limitations by enabling LLMs to plan, retrieve, and reflect through multi-step interaction with search environments. Within this paradigm, reinforcement learning (RL) offers a powerful mechanism for adaptive and self-improving search behavior. This survey provides the first comprehensive overview of \emph{RL-based agentic search}, organizing the emerging field along three complementary dimensions: (i) What RL is for (functional roles), (ii) How RL is used (optimization strategies), and (iii) Where RL is applied (scope of optimization). We summarize representative methods, evaluation protocols, and applications, and discuss open challenges and future directions toward building reliable and scalable RL driven agentic search systems. We hope this survey will inspire future research on the integration of RL and agentic search. Our repository is available at https://github.com/ventr1c/Awesome-RL-based-Agentic-Search-Papers.
comment: 38 pages, 4 figures, 7 tables
♻ ☆ Science Hierarchography: Hierarchical Organization of Science Literature
Scientific knowledge is growing rapidly, making it difficult to track progress and high-level conceptual links across broad disciplines. While tools like citation networks and search engines help retrieve related papers, they lack the abstraction needed to capture the needed to represent the density and structure of activity across subfields. We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature into a high-quality hierarchical structure that spans multiple levels of abstraction -- from broad domains to specific studies. Such a representation can provide insights into which fields are well-explored and which are under-explored. To achieve this goal, we develop a hybrid approach that combines efficient embedding-based clustering with LLM-based prompting, striking a balance between scalability and semantic precision. Compared to LLM-heavy methods like iterative tree construction, our approach achieves superior quality-speed trade-offs. Our hierarchies capture different dimensions of research contributions, reflecting the interdisciplinary and multifaceted nature of modern science. We evaluate its utility by measuring how effectively an LLM-based agent can navigate the hierarchy to locate target papers. Results show that our method improves interpretability and offers an alternative pathway for exploring scientific literature beyond traditional search methods. Code, data and demo are available: https://github.com/JHU-CLSP/science-hierarchography
♻ ☆ OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.
comment: Technical Report. Code: https://github.com/NVlabs/OmniVinci
♻ ☆ Semantic Agreement Enables Efficient Open-Ended LLM Cascades EMNLP
Cascade systems route computational requests to smaller models when possible and defer to larger models only when necessary, offering a promising approach to balance cost and quality in LLM deployment. However, they face a fundamental challenge in open-ended text generation: determining output reliability when generation quality lies on a continuous spectrum, often with multiple valid responses. To address this, we propose semantic agreement -- meaning-level consensus between ensemble outputs -- as a training-free signal for reliable deferral. We show that when diverse model outputs agree semantically, their consensus is a stronger reliability signal than token-level confidence. Evaluated from 500M to 70B-parameter models, we find that semantic cascades match or surpass target-model quality at 40% of the cost and reduce latency by up to 60%. Our method requires no model internals, works across black-box APIs, and remains robust to model updates, making it a practical baseline for real-world LLM deployment.
comment: 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP) Industry Track
♻ ☆ Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization NeurIPS 2025
Offline reinforcement learning (RL) is a variant of RL where the policy is learned from a previously collected dataset of trajectories and rewards. In our work, we propose a practical approach to offline RL with large language models (LLMs). We recast the problem as reward-weighted fine-tuning, which can be solved using similar techniques to supervised fine-tuning (SFT). To showcase the value of our approach, we apply it to learning short-horizon question-answering policies of a fixed length, where the agent reasons about potential answers or asks clarifying questions. Our work stands in a stark contrast to state-of-the-art methods in this domain, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize for rewards. We compare to them empirically, and report major gains in both optimized rewards and language quality.
comment: Accepted at NeurIPS 2025 (main conference)
♻ ☆ Learned, Lagged, LLM-splained: LLM Responses to End User Security Questions
Answering end user security questions is challenging. While large language models (LLMs) like GPT, LLAMA, and Gemini are far from error-free, they have shown promise in answering a variety of questions outside of security. We studied LLM performance in the area of end user security by qualitatively evaluating 3 popular LLMs on 900 systematically collected end user security questions. While LLMs demonstrate broad generalist ``knowledge'' of end user security information, there are patterns of errors and limitations across LLMs consisting of stale and inaccurate answers, and indirect or unresponsive communication styles, all of which impacts the quality of information received. Based on these patterns, we suggest directions for model improvement and recommend user strategies for interacting with LLMs when seeking assistance with security.
comment: 17 pages, 7 tables
♻ ☆ RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems
Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicting between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single and multi-hop relations from the customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 527 expert-level time-sensitive finance, economics, and policy documents and 48295 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model's ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our findings reveal that RAG systems are unexpectedly sensitive to perturbations. Moreover, they consistently demonstrate lower robustness on multi-hop queries compared to single-hop queries across all domains.
♻ ☆ DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans
The emerging large language model role-playing agents (LLM RPAs) aim to simulate individual human behaviors, but the persona fidelity is often undermined by manually-created profiles (e.g., cherry-picked information and personality characteristics) without validating the alignment with the target individuals. To address this limitation, our work introduces the Dynamic Persona Refinement Framework (DPRF).DPRF aims to optimize the alignment of LLM RPAs' behaviors with those of target individuals by iteratively identifying the cognitive divergence, either through free-form or theory-grounded, structured analysis, between generated behaviors and human ground truth, and refining the persona profile to mitigate these divergences.We evaluate DPRF with five LLMs on four diverse behavior-prediction scenarios: formal debates, social media posts with mental health issues, public interviews, and movie reviews.DPRF can consistently improve behavioral alignment considerably over baseline personas and generalizes across models and scenarios.Our work provides a robust methodology for creating high-fidelity persona profiles and enhancing the validity of downstream applications, such as user simulation, social studies, and personalized AI.
comment: In Submission
Information Retrieval 20
☆ LimRank: Less is More for Reasoning-Intensive Information Reranking EMNLP 2025
Existing approaches typically rely on large-scale fine-tuning to adapt LLMs for information reranking tasks, which is computationally expensive. In this work, we demonstrate that modern LLMs can be effectively adapted using only minimal, high-quality supervision. To enable this, we design LIMRANK-SYNTHESIZER, a reusable and open-source pipeline for generating diverse, challenging, and realistic reranking examples. Using this synthetic data, we fine-tune our reranker model, LIMRANK. We evaluate LIMRANK on two challenging benchmarks, i.e., BRIGHT for reasoning-intensive retrieval and FollowIR for instruction-following retrieval. Our experiments demonstrate that LIMRANK achieves competitive performance, while being trained on less than 5% of the data typically used in prior work. Further ablation studies demonstrate the effectiveness of LIMRANK-SYNTHESIZER and the strong generalization capabilities of LIMRANK across downstream tasks, including scientific literature search and retrieval-augmented generation for knowledge-intensive problem solving.
comment: EMNLP 2025 Main (Short)
☆ Accurate and Scalable Multimodal Pathology Retrieval via Attentive Vision-Language Alignment
The rapid digitization of histopathology slides has opened up new possibilities for computational tools in clinical and research workflows. Among these, content-based slide retrieval stands out, enabling pathologists to identify morphologically and semantically similar cases, thereby supporting precise diagnoses, enhancing consistency across observers, and assisting example-based education. However, effective retrieval of whole slide images (WSIs) remains challenging due to their gigapixel scale and the difficulty of capturing subtle semantic differences amid abundant irrelevant content. To overcome these challenges, we present PathSearch, a retrieval framework that unifies fine-grained attentive mosaic representations with global-wise slide embeddings aligned through vision-language contrastive learning. Trained on a corpus of 6,926 slide-report pairs, PathSearch captures both fine-grained morphological cues and high-level semantic patterns to enable accurate and flexible retrieval. The framework supports two key functionalities: (1) mosaic-based image-to-image retrieval, ensuring accurate and efficient slide research; and (2) multi-modal retrieval, where text queries can directly retrieve relevant slides. PathSearch was rigorously evaluated on four public pathology datasets and three in-house cohorts, covering tasks including anatomical site retrieval, tumor subtyping, tumor vs. non-tumor discrimination, and grading across diverse organs such as breast, lung, kidney, liver, and stomach. External results show that PathSearch outperforms traditional image-to-image retrieval frameworks. A multi-center reader study further demonstrates that PathSearch improves diagnostic accuracy, boosts confidence, and enhances inter-observer agreement among pathologists in real clinical scenarios. These results establish PathSearch as a scalable and generalizable retrieval solution for digital pathology.
☆ Leveraging Hierarchical Organization for Medical Multi-document Summarization
Medical multi-document summarization (MDS) is a complex task that requires effectively managing cross-document relationships. This paper investigates whether incorporating hierarchical structures in the inputs of MDS can improve a model's ability to organize and contextualize information across documents compared to traditional flat summarization methods. We investigate two ways of incorporating hierarchical organization across three large language models (LLMs), and conduct comprehensive evaluations of the resulting summaries using automated metrics, model-based metrics, and domain expert evaluation of preference, understandability, clarity, complexity, relevance, coverage, factuality, and coherence. Our results show that human experts prefer model-generated summaries over human-written summaries. Hierarchical approaches generally preserve factuality, coverage, and coherence of information, while also increasing human preference for summaries. Additionally, we examine whether simulated judgments from GPT-4 align with human judgments, finding higher agreement along more objective evaluation facets. Our findings demonstrate that hierarchical structures can improve the clarity of medical summaries generated by models while maintaining content coverage, providing a practical way to improve human preference for generated summaries.
☆ Think before Recommendation: Autonomous Reasoning-enhanced Recommender NeurIPS 2025
The core task of recommender systems is to learn user preferences from historical user-item interactions. With the rapid development of large language models (LLMs), recent research has explored leveraging the reasoning capabilities of LLMs to enhance rating prediction tasks. However, existing distillation-based methods suffer from limitations such as the teacher model's insufficient recommendation capability, costly and static supervision, and superficial transfer of reasoning ability. To address these issues, this paper proposes RecZero, a reinforcement learning (RL)-based recommendation paradigm that abandons the traditional multi-model and multi-stage distillation approach. Instead, RecZero trains a single LLM through pure RL to autonomously develop reasoning capabilities for rating prediction. RecZero consists of two key components: (1) "Think-before-Recommendation" prompt construction, which employs a structured reasoning template to guide the model in step-wise analysis of user interests, item features, and user-item compatibility; and (2) rule-based reward modeling, which adopts group relative policy optimization (GRPO) to compute rewards for reasoning trajectories and optimize the LLM. Additionally, the paper explores a hybrid paradigm, RecOne, which combines supervised fine-tuning with RL, initializing the model with cold-start reasoning samples and further optimizing it with RL. Experimental results demonstrate that RecZero and RecOne significantly outperform existing baseline methods on multiple benchmark datasets, validating the superiority of the RL paradigm in achieving autonomous reasoning-enhanced recommender systems.
comment: NeurIPS 2025 poster
☆ Multi-Stage Field Extraction of Financial Documents with OCR and Compact Vision-Language Models
Financial documents are essential sources of information for regulators, auditors, and financial institutions, particularly for assessing the wealth and compliance of Small and Medium-sized Businesses. However, SMB documents are often difficult to parse. They are rarely born digital and instead are distributed as scanned images that are none machine readable. The scans themselves are low in resolution, affected by skew or rotation, and often contain noisy backgrounds. These documents also tend to be heterogeneous, mixing narratives, tables, figures, and multilingual content within the same report. Such characteristics pose major challenges for automated information extraction, especially when relying on end to end large Vision Language Models, which are computationally expensive, sensitive to noise, and slow when applied to files with hundreds of pages. We propose a multistage pipeline that leverages traditional image processing models and OCR extraction, together with compact VLMs for structured field extraction of large-scale financial documents. Our approach begins with image pre-processing, including segmentation, orientation detection, and size normalization. Multilingual OCR is then applied to recover page-level text. Upon analyzing the text information, pages are retrieved for coherent sections. Finally, compact VLMs are operated within these narrowed-down scopes to extract structured financial indicators. Our approach is evaluated using an internal corpus of multi-lingual, scanned financial documents. The results demonstrate that compact VLMs, together with a multistage pipeline, achieves 8.8 times higher field level accuracy relative to directly feeding the whole document into large VLMs, only at 0.7 percent of the GPU cost and 92.6 percent less end-to-end service latency.
☆ Improving Product Search Relevance with EAR-MP: A Solution for the CIKM 2025 AnalytiCup
Multilingual e-commerce search is challenging due to linguistic diversity and the noise inherent in user-generated queries. This paper documents the solution employed by our team (EAR-MP) for the CIKM 2025 AnalytiCup, which addresses two core tasks: Query-Category (QC) relevance and Query-Item (QI) relevance. Our approach first normalizes the multilingual dataset by translating all text into English, then mitigates noise through extensive data cleaning and normalization. For model training, we build on DeBERTa-v3-large and improve performance with label smoothing, self-distillation, and dropout. In addition, we introduce task-specific upgrades, including hierarchical token injection for QC and a hybrid scoring mechanism for QI. Under constrained compute, our method achieves competitive results, attaining an F1 score of 0.8796 on QC and 0.8744 on QI. These findings underscore the importance of systematic data preprocessing and tailored training strategies for building robust, resource-efficient multilingual relevance systems.
☆ Tagging-Augmented Generation: Assisting Language Models in Finding Intricate Knowledge In Long Contexts EMNLP 2025
Recent investigations into effective context lengths of modern flagship large language models (LLMs) have revealed major limitations in effective question answering (QA) and reasoning over long and complex contexts for even the largest and most impressive cadre of models. While approaches like retrieval-augmented generation (RAG) and chunk-based re-ranking attempt to mitigate this issue, they are sensitive to chunking, embedding and retrieval strategies and models, and furthermore, rely on extensive pre-processing, knowledge acquisition and indexing steps. In this paper, we propose Tagging-Augmented Generation (TAG), a lightweight data augmentation strategy that boosts LLM performance in long-context scenarios, without degrading and altering the integrity and composition of retrieved documents. We validate our hypothesis by augmenting two challenging and directly relevant question-answering benchmarks -- NoLima and NovelQA -- and show that tagging the context or even just adding tag definitions into QA prompts leads to consistent performance gains over the baseline -- up to 17% for 32K token contexts, and 2.9% in complex reasoning question-answering for multi-hop queries requiring knowledge across a wide span of text. Additional details are available at https://sites.google.com/view/tag-emnlp.
comment: Paper accepted at EMNLP 2025
☆ GTR-Mamba: Geometry-to-Tangent Routing for Hyperbolic POI Recommendation ICDE 2026
Next Point-of-Interest (POI) recommendation is a critical task in modern Location-Based Social Networks (LBSNs), aiming to model the complex decision-making process of human mobility to provide personalized recommendations for a user's next check-in location. Existing POI recommendation models, predominantly based on Graph Neural Networks and sequential models, have been extensively studied. However, these models face a fundamental limitation: they struggle to simultaneously capture the inherent hierarchical structure of spatial choices and the dynamics and irregular shifts of user-specific temporal contexts. To overcome this limitation, we propose GTR-Mamba, a novel framework for cross-manifold conditioning and routing. GTR-Mamba leverages the distinct advantages of different mathematical spaces for different tasks: it models the static, tree-like preference hierarchies in hyperbolic geometry, while routing the dynamic sequence updates to a novel Mamba layer in the computationally stable and efficient Euclidean tangent space. This process is coordinated by a cross-manifold channel that fuses spatio-temporal information to explicitly steer the State Space Model (SSM), enabling flexible adaptation to contextual changes. Extensive experiments on three real-world datasets demonstrate that GTR-Mamba consistently outperforms state-of-the-art baseline models in next POI recommendation.
comment: 14 pages, 8 figures, 4 tables, submitted to ICDE 2026
☆ MGFRec: Towards Reinforced Reasoning Recommendation with Multiple Groundings and Feedback
The powerful reasoning and generative capabilities of large language models (LLMs) have inspired researchers to apply them to reasoning-based recommendation tasks, which require in-depth reasoning about user interests and the generation of recommended items. However, previous reasoning-based recommendation methods have typically performed inference within the language space alone, without incorporating the actual item space. This has led to over-interpreting user interests and deviating from real items. Towards this research gap, we propose performing multiple rounds of grounding during inference to help the LLM better understand the actual item space, which could ensure that its reasoning remains aligned with real items. Furthermore, we introduce a user agent that provides feedback during each grounding step, enabling the LLM to better recognize and adapt to user interests. Comprehensive experiments conducted on three Amazon review datasets demonstrate the effectiveness of incorporating multiple groundings and feedback. These findings underscore the critical importance of reasoning within the actual item space, rather than being confined to the language space, for recommendation tasks.
♻ ☆ Computational-Assisted Systematic Review and Meta-Analysis (CASMA): Effect of a Subclass of GnRH-a on Endometriosis Recurrence
Background: Evidence synthesis facilitates evidence-based medicine. This task becomes increasingly difficult to accomplished with applying computational solutions, since the medical literature grows at astonishing rates. Objective: This study evaluates an information retrieval-driven workflow, CASMA, to enhance the efficiency, transparency, and reproducibility of systematic reviews. Endometriosis recurrence serves as the ideal case due to its complex and ambiguous literature. Methods: The hybrid approach integrates PRISMA guidelines with fuzzy matching and regular expression (regex) to facilitate semi-automated deduplication and filtered records before manual screening. The workflow synthesised evidence from randomised controlled trials on the efficacy of a subclass of gonadotropin-releasing hormone agonists (GnRH-a). A modified splitting method addressed unit-of-analysis errors in multi-arm trials. Results: The workflow sharply reduced the screening workload, taking only 11 days to fetch and filter 33,444 records. Seven eligible RCTs were synthesized (841 patients). The pooled random-effects model yielded a Risk Ratio (RR) of $0.64$ ($95\%$ CI $0.48$ to $0.86$), demonstrating a $36\%$ reduction in recurrence, with non-significant heterogeneity ($I^2=0.00\%$, $\tau^2=0.00$). The findings were robust and stable, as they were backed by sensitivity analyses. Conclusion: This study demonstrates an application of an information-retrieval-driven workflow for medical evidence synthesis. The approach yields valuable clinical results and a generalisable framework to scale up the evidence synthesis, bridging the gap between clinical research and computer science.
comment: 15 pages, 12 figures and 4 tables. This work describes an information retrieval-driven workflow for medical evidence synthesis, with an application to endometriosis recurrence. The method can be generalized to other systematic reviews. The preregistered protocol is available: https://doi.org/10.17605/OSF.IO/R2DFA
♻ ☆ An Ecosystem for Ontology Interoperability
Ontology interoperability is one of the complicated issues that restricts the use of ontologies in knowledge graphs (KGs). Different ontologies with conflicting and overlapping concepts make it difficult to design, develop, and deploy an interoperable ontology for downstream tasks. We propose an ecosystem for ontology interoperability. The ecosystem employs three state-of-the-art semantic techniques in different phases of the ontology engineering life cycle: ontology design patterns (ODPs) in the design phase, ontology matching and versioning (OM\&OV) in the develop phase, and ontology-compliant knowledge graphs (OCKGs) in the deploy phase, to achieve better ontology interoperability and data integration in real-world applications. A case study of sensor observation in the building domain validates the usefulness of the proposed ecosystem.
comment: 5 pages, 8 figures
♻ ☆ The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation NeurIPS 2025
Large language models are able to exploit in-context learning to access external knowledge beyond their training data through retrieval-augmentation. While promising, its inner workings remain unclear. In this work, we shed light on the mechanism of in-context retrieval augmentation for question answering by viewing a prompt as a composition of informational components. We propose an attribution-based method to identify specialized attention heads, revealing in-context heads that comprehend instructions and retrieve relevant contextual information, and parametric heads that store entities' relational knowledge. To better understand their roles, we extract function vectors and modify their attention weights to show how they can influence the answer generation process. Finally, we leverage the gained insights to trace the sources of knowledge used during inference, paving the way towards more safe and transparent language models.
comment: Accepted at NeurIPS 2025
♻ ☆ TaoSR1: The Thinking Model for E-commerce Relevance Search
Query-product relevance prediction is a core task in e-commerce search. BERT-based models excel at semantic matching but lack complex reasoning capabilities. While Large Language Models (LLMs) are explored, most still use discriminative fine-tuning or distill to smaller models for deployment. We propose a framework to directly deploy LLMs for this task, addressing key challenges: Chain-of-Thought (CoT) error accumulation, discriminative hallucination, and deployment feasibility. Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) Offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3) Difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO) to mitigate discriminative hallucination. Additionally, post-CoT processing and a cumulative probability-based partitioning method enable efficient online deployment. TaoSR1 significantly outperforms baselines on offline datasets and achieves substantial gains in online side-by-side human evaluations, introducing a novel paradigm for applying CoT reasoning to relevance classification.
♻ ☆ SBAN: A Framework & Multi-Dimensional Dataset for Large Language Model Pre-Training and Software Code Mining
This paper introduces SBAN (Source code, Binary, Assembly, and Natural Language Description), a large-scale, multi-dimensional dataset designed to advance the pre-training and evaluation of large language models (LLMs) for software code analysis. SBAN comprises more than 3 million samples, including 2.9 million benign and 672,000 malware respectively, each represented across four complementary layers: binary code, assembly instructions, natural language descriptions, and source code. This unique multimodal structure enables research on cross-representation learning, semantic understanding of software, and automated malware detection. Beyond security applications, SBAN supports broader tasks such as code translation, code explanation, and other software mining tasks involving heterogeneous data. It is particularly suited for scalable training of deep models, including transformers and other LLM architectures. By bridging low-level machine representations and high-level human semantics, SBAN provides a robust foundation for building intelligent systems that reason about code. We believe that this dataset opens new opportunities for mining software behavior, improving security analytics, and enhancing LLM capabilities in pre-training and fine-tuning tasks for software code mining.
♻ ☆ Understanding Embedding Scaling in Collaborative Filtering
Scaling recommendation models into large recommendation models has become one of the most widely discussed topics. Recent efforts focus on components beyond the scaling embedding dimension, as it is believed that scaling embedding may lead to performance degradation. Although there have been some initial observations on embedding, the root cause of their non-scalability remains unclear. Moreover, whether performance degradation occurs across different types of models and datasets is still an unexplored area. Regarding the effect of embedding dimensions on performance, we conduct large-scale experiments across 10 datasets with varying sparsity levels and scales, using 4 representative classical architectures. We surprisingly observe two novel phenomena: double-peak and logarithmic. For the former, as the embedding dimension increases, performance first improves, then declines, rises again, and eventually drops. For the latter, it exhibits a perfect logarithmic curve. Our contributions are threefold. First, we discover two novel phenomena when scaling collaborative filtering models. Second, we gain an understanding of the underlying causes of the double-peak phenomenon. Lastly, we theoretically analyze the noise robustness of collaborative filtering models, with results matching empirical observations.
♻ ☆ The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. This enables a systematic study of multilingual retrieval behavior. Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with substantial performance drops occurring when the user query and supporting document languages differ. A key insight is that these failures stem primarily from the retriever's difficulty in ranking documents across languages. Finally, we propose two simple retrieval strategies that address this source of failure by enforcing equal retrieval from both languages or by translating the query, resulting in substantial improvements in cross-lingual and overall performance. These results highlight meaningful opportunities for improving multilingual retrieval, particularly in practical, real-world RAG applications.
comment: Accepted to ArabicNLP 2025
♻ ☆ CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection
Multimodal large language models (MLLMs) have demonstrated impressive capabilities in visual reasoning and text generation. While previous studies have explored the application of MLLM for detecting out-of-context (OOC) misinformation, our empirical analysis reveals two persisting challenges of this paradigm. Evaluating the representative GPT-4o model on direct reasoning and evidence augmented reasoning, results indicate that MLLM struggle to capture the deeper relationships-specifically, cases in which the image and text are not directly connected but are associated through underlying semantic links. Moreover, noise in the evidence further impairs detection accuracy. To address these challenges, we propose CMIE, a novel OOC misinformation detection framework that incorporates a Coexistence Relationship Generation (CRG) strategy and an Association Scoring (AS) mechanism. CMIE identifies the underlying coexistence relationships between images and text, and selectively utilizes relevant evidence to enhance misinformation detection. Experimental results demonstrate that our approach outperforms existing methods.
♻ ☆ Membership Inference Attacks on Recommender System: A Survey
Recommender systems (RecSys) have been widely applied to various applications, including E-commerce, finance, healthcare, social media and have become increasingly influential in shaping user behavior and decision-making, highlighting their growing impact in various domains. However, recent studies have shown that RecSys are vulnerable to membership inference attacks (MIAs), which aim to infer whether user interaction record was used to train a target model or not. MIAs on RecSys models can directly lead to a privacy breach. For example, via identifying the fact that a purchase record that has been used to train a RecSys associated with a specific user, an attacker can infer that user's special quirks. In recent years, MIAs have been shown to be effective on other ML tasks, e.g., classification models and natural language processing. However, traditional MIAs are ill-suited for RecSys due to the unseen posterior probability. Although MIAs on RecSys form a newly emerging and rapidly growing research area, there has been no systematic survey on this topic yet. In this article, we conduct the first comprehensive survey on RecSys MIAs. This survey offers a comprehensive review of the latest advancements in RecSys MIAs, exploring the design principles, challenges, attack and defense associated with this emerging field. We provide a unified taxonomy that categorizes different RecSys MIAs based on their characterizations and discuss their pros and cons. Based on the limitations and gaps identified in this survey, we point out several promising future research directions to inspire the researchers who wish to follow this area. This survey not only serves as a reference for the research community but also provides a clear description for researchers outside this research domain.
comment: under review
♻ ☆ From ID-based to ID-free: Rethinking ID Effectiveness in Multimodal Collaborative Filtering Recommendation
Most existing multimodal collaborative filtering recommendation (MCFRec) methods rely heavily on ID features and multimodal content to enhance recommendation performance. However, this paper reveals that ID features are effective but have limited benefits in multimodal collaborative filtering recommendation. Therefore, this paper systematically deconstruct the pros and cons of ID features: (i) they provide initial embedding but lack semantic richness, (ii) they provide a unique identifier for each user and item but hinder generalization to untrained data, and (iii) they assist in aligning and fusing multimodal features but may lead to representation shift. Based on these insights, this paper proposes IDFREE, an ID-free multimodal collaborative Filtering REcommEndation baseline. IDFREE replaces ID features with multimodal features and positional encodings to generate semantically meaningful ID-free embeddings. For ID-free multimodal collaborative filtering, it further proposes an adaptive similarity graph module to construct dynamic user-user and item-item graphs based on multimodal features. Then, an augmented user-item graph encoder is proposed to construct more effective user and item encoding. Finally, IDFREE achieves inter-multimodal alignment based on the contrastive learning and uses Softmax loss as recommendation loss. Basic experiments on three public datasets demonstrate that IDFREE outperforms existing ID-based MCFRec methods, achieving an average performance gain of 72.24% across standard metrics (Recall@5, 10, 20, 50 and NDCG@5, 10, 20, 50). Exploratory and extended experiments further validate our findings on the limitations of ID features in MCFRec. The code is released at https://github.com/G-H-Li/IDFREE.
comment: We identified that our current approach achieves its reported performance only under specific data conditions, and its robustness is weaker than we initially expected
♻ ☆ LIME: Link-based user-item Interaction Modeling with decoupled xor attention for Efficient test time scaling
Scaling large recommendation systems requires advancing three major frontiers: processing longer user histories, expanding candidate sets, and increasing model capacity. While promising, transformers' computational cost scales quadratically with the user sequence length and linearly with the number of candidates. This trade-off makes it prohibitively expensive to expand candidate sets or increase sequence length at inference, despite the significant performance improvements. We introduce \textbf{LIME}, a novel architecture that resolves this trade-off. Through two key innovations, LIME fundamentally reduces computational complexity. First, low-rank ``link embeddings" enable pre-computation of attention weights by decoupling user and candidate interactions, making the inference cost nearly independent of candidate set size. Second, a linear attention mechanism, \textbf{LIME-XOR}, reduces the complexity with respect to user sequence length from quadratic ($O(N^2)$) to linear ($O(N)$). Experiments on public and industrial datasets show LIME achieves near-parity with state-of-the-art transformers but with a 10$\times$ inference speedup on large candidate sets or long sequence lengths. When tested on a major recommendation platform, LIME improved user engagement while maintaining minimal inference costs with respect to candidate set size and user history length, establishing a new paradigm for efficient and expressive recommendation systems.
comment: 16 pages
Multimedia 8
☆ MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection
Despite progress in multimodal sarcasm detection, existing datasets and methods predominantly focus on single-image scenarios, overlooking potential semantic and affective relations across multiple images. This leaves a gap in modeling cases where sarcasm is triggered by multi-image cues in real-world settings. To bridge this gap, we introduce MMSD3.0, a new benchmark composed entirely of multi-image samples curated from tweets and Amazon reviews. We further propose the Cross-Image Reasoning Model (CIRM), which performs targeted cross-image sequence modeling to capture latent inter-image connections. In addition, we introduce a relevance-guided, fine-grained cross-modal fusion mechanism based on text-image correspondence to reduce information loss during integration. We establish a comprehensive suite of strong and representative baselines and conduct extensive experiments, showing that MMSD3.0 is an effective and reliable benchmark that better reflects real-world conditions. Moreover, CIRM demonstrates state-of-the-art performance across MMSD, MMSD2.0 and MMSD3.0, validating its effectiveness in both single-image and multi-image scenarios.
☆ Enabling American Sign Language Communication Under Low Data Rates
In recent years, video conferencing applications have become increasingly prevalent, relying heavily on high-speed internet connectivity. When such connectivity is lacking, users often default to audio-only communication, a mode that significantly disadvantages American Sign Language (ASL) users, whose communication relies on hand gestures, body movement, and facial expressions. In this work, we introduce VC4ASL, a system designed to enable ASL communication over the audio channel of existing video conferencing applications, even in the absence of reliable video. VC4ASL integrates seamlessly with current platforms without requiring any modifications. Our approach establishes a communication channel through audio by encoding and transmitting human pose information, which is then rendered to reconstruct signed content. We propose novel receive-side error detection and correction mechanisms that exploit the inherent structural constraints of human pose data. To evaluate the system, we simulate network-degraded environments, generate pose-based ASL video sequences, and conduct user studies to assess comprehension among ASL users. Experimental results demonstrate that VC4ASL effectively facilitates intelligible ASL communication over audio in low-bandwidth scenarios where video transmission is impaired.
♻ ☆ TEn-CATG:Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph
Audio-visual video parsing (AVVP) aims to detect event categories and their temporal boundaries in videos, typically under weak supervision. Existing methods mainly focus on (i) improving temporal modeling using attention-based architectures or (ii) generating richer pseudo-labels to address the absence of frame-level annotations. However, attention-based models often overfit noisy pseudo-labels, leading to cumulative training errors, while pseudo-label generation approaches distribute attention uniformly across frames, weakening temporal localization accuracy. To address these challenges, we propose TEn-CATG, a text-enriched AVVP framework that combines semantic calibration with category-aware temporal reasoning. More specifically, we design a bi-directional text fusion (BiT) module by leveraging audio-visual features as semantic anchors to refine text embeddings, which departs from conventional text-to-feature alignment, thereby mitigating noise and enhancing cross-modal consistency. Furthermore, we introduce the category-aware temporal graph (CATG) module to model temporal relationships by selecting multi-scale temporal neighbors and learning category-specific temporal decay factors, enabling effective event-dependent temporal reasoning. Extensive experiments demonstrate that TEn-CATG achieves state-of-the-art results across multiple evaluation metrics on benchmark datasets LLP and UnAV-100, highlighting its robustness and superior ability to capture complex temporal and semantic dependencies in weakly supervised AVVP tasks.
♻ ☆ Robust Modality-incomplete Anomaly Detection: A Modality-instructive Framework with Benchmark
Multimodal Industrial Anomaly Detection (MIAD), which utilizes 3D point clouds and 2D RGB images to identify abnormal regions in products, plays a crucial role in industrial quality inspection. However, traditional MIAD settings assume that all 2D and 3D modalities are paired, ignoring the fact that multimodal data collected from the real world is often imperfect due to missing modalities. Additionally, models trained on modality-incomplete data are prone to overfitting. Therefore, MIAD models that demonstrate robustness against modality-incomplete data are highly desirable in practice. To address this, we introduce a pioneering study that comprehensively investigates Modality-Incomplete Industrial Anomaly Detection (MIIAD), and under the guidance of experts, we construct the MIIAD Bench with rich modality-missing settings to account for imperfect learning environments with incomplete multimodal information. As expected, we find that most existing MIAD methods perform poorly on the MIIAD Bench, leading to significant performance degradation. To tackle this challenge, we propose a novel two-stage Robust modAlity-aware fusing and Detecting framewoRk, abbreviated as RADAR. Specifically: i) We propose Modality-incomplete Instruction to guide the multimodal Transformer to robustly adapt to various modality-incomplete scenarios, and implement adaptive parameter learning based on HyperNetwork. ii) Then, we construct a Double-Pseudo Hybrid Module to highlight the uniqueness of modality combinations, mitigating overfitting issues and further enhancing the robustness of the MIIAD model. Our experimental results demonstrate that the proposed RADAR significantly outperforms traditional MIAD methods on our newly created MIIAD dataset, proving its practical application value.
♻ ☆ CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection
Multimodal large language models (MLLMs) have demonstrated impressive capabilities in visual reasoning and text generation. While previous studies have explored the application of MLLM for detecting out-of-context (OOC) misinformation, our empirical analysis reveals two persisting challenges of this paradigm. Evaluating the representative GPT-4o model on direct reasoning and evidence augmented reasoning, results indicate that MLLM struggle to capture the deeper relationships-specifically, cases in which the image and text are not directly connected but are associated through underlying semantic links. Moreover, noise in the evidence further impairs detection accuracy. To address these challenges, we propose CMIE, a novel OOC misinformation detection framework that incorporates a Coexistence Relationship Generation (CRG) strategy and an Association Scoring (AS) mechanism. CMIE identifies the underlying coexistence relationships between images and text, and selectively utilizes relevant evidence to enhance misinformation detection. Experimental results demonstrate that our approach outperforms existing methods.
♻ ☆ From ID-based to ID-free: Rethinking ID Effectiveness in Multimodal Collaborative Filtering Recommendation
Most existing multimodal collaborative filtering recommendation (MCFRec) methods rely heavily on ID features and multimodal content to enhance recommendation performance. However, this paper reveals that ID features are effective but have limited benefits in multimodal collaborative filtering recommendation. Therefore, this paper systematically deconstruct the pros and cons of ID features: (i) they provide initial embedding but lack semantic richness, (ii) they provide a unique identifier for each user and item but hinder generalization to untrained data, and (iii) they assist in aligning and fusing multimodal features but may lead to representation shift. Based on these insights, this paper proposes IDFREE, an ID-free multimodal collaborative Filtering REcommEndation baseline. IDFREE replaces ID features with multimodal features and positional encodings to generate semantically meaningful ID-free embeddings. For ID-free multimodal collaborative filtering, it further proposes an adaptive similarity graph module to construct dynamic user-user and item-item graphs based on multimodal features. Then, an augmented user-item graph encoder is proposed to construct more effective user and item encoding. Finally, IDFREE achieves inter-multimodal alignment based on the contrastive learning and uses Softmax loss as recommendation loss. Basic experiments on three public datasets demonstrate that IDFREE outperforms existing ID-based MCFRec methods, achieving an average performance gain of 72.24% across standard metrics (Recall@5, 10, 20, 50 and NDCG@5, 10, 20, 50). Exploratory and extended experiments further validate our findings on the limitations of ID features in MCFRec. The code is released at https://github.com/G-H-Li/IDFREE.
comment: We identified that our current approach achieves its reported performance only under specific data conditions, and its robustness is weaker than we initially expected
♻ ☆ ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations EMNLP
This work demonstrates that diffusion models can achieve font-controllable multilingual text rendering using just raw images without font label annotations.Visual text rendering remains a significant challenge. While recent methods condition diffusion on glyphs, it is impossible to retrieve exact font annotations from large-scale, real-world datasets, which prevents user-specified font control. To address this, we propose a data-driven solution that integrates the conditional diffusion model with a text segmentation model, utilizing segmentation masks to capture and represent fonts in pixel space in a self-supervised manner, thereby eliminating the need for any ground-truth labels and enabling users to customize text rendering with any multilingual font of their choice. The experiment provides a proof of concept of our algorithm in zero-shot text and font editing across diverse fonts and languages, providing valuable insights for the community and industry toward achieving generalized visual text rendering. Code is available at github.com/bowen-upenn/ControlText.
comment: The 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP) Findings
♻ ☆ Adaptive 3D Mesh Steganography Based on Feature-Preserving Distortion
Current 3D mesh steganography algorithms relying on geometric modification are prone to detection by steganalyzers. In traditional steganography, adaptive steganography has proven to be an efficient means of enhancing steganography security. Taking inspiration from this, we propose a highly adaptive embedding algorithm, guided by the principle of minimizing a carefully crafted distortion through efficient steganography codes. Specifically, we tailor a payload-limited embedding optimization problem for 3D settings and devise a feature-preserving distortion (FPD) to measure the impact of message embedding. The distortion takes on an additive form and is defined as a weighted difference of the effective steganalytic subfeatures utilized by the current 3D steganalyzers. With practicality in mind, we refine the distortion to enhance robustness and computational efficiency. By minimizing the FPD, our algorithm can preserve mesh features to a considerable extent, including steganalytic and geometric features, while achieving a high embedding capacity. During the practical embedding phase, we employ the Q-layered syndrome trellis code (STC). However, calculating the bit modification probability (BMP) for each layer of the Q-layered STC, given the variation of Q, can be cumbersome. To address this issue, we design a universal and automatic approach for the BMP calculation. The experimental results demonstrate that our algorithm achieves state-of-the-art performance in countering 3D steganalysis. Code is available at https://github.com/zjhJOJO/3D-steganography-based-on-FPD.git.
comment: IEEE TVCG, corresponding author Jiahao Zhu, code is available at https://github.com/zjhJOJO/3D-steganography-based-on-FPD.git
Image and Video Processing 16
☆ Revising Second Order Terms in Deep Animation Video Coding
First Order Motion Model is a generative model that animates human heads based on very little motion information derived from keypoints. It is a promising solution for video communication because first it operates at very low bitrate and second its computational complexity is moderate compared to other learning based video codecs. However, it has strong limitations by design. Since it generates facial animations by warping source-images, it fails to recreate videos with strong head movements. This works concentrates on one specific kind of head movements, namely head rotations. We show that replacing the Jacobian transformations in FOMM by a global rotation helps the system to perform better on items with head-rotations while saving 40% to 80% of bitrate on P-frames. Moreover, we apply state-of-the-art normalization techniques to the discriminator to stabilize the adversarial training which is essential for generating visually appealing videos. We evaluate the performance by the learned metics LPIPS and DISTS to show the success our optimizations.
☆ KongNet: A Multi-headed Deep Learning Model for Detection and Classification of Nuclei in Histopathology Images
Accurate detection and classification of nuclei in histopathology images are critical for diagnostic and research applications. We present KongNet, a multi-headed deep learning architecture featuring a shared encoder and parallel, cell-type-specialised decoders. Through multi-task learning, each decoder jointly predicts nuclei centroids, segmentation masks, and contours, aided by Spatial and Channel Squeeze-and-Excitation (SCSE) attention modules and a composite loss function. We validate KongNet in three Grand Challenges. The proposed model achieved first place on track 1 and second place on track 2 during the MONKEY Challenge. Its lightweight variant (KongNet-Det) secured first place in the 2025 MIDOG Challenge. KongNet pre-trained on the MONKEY dataset and fine-tuned on the PUMA dataset ranked among the top three in the PUMA Challenge without further optimisation. Furthermore, KongNet established state-of-the-art performance on the publicly available PanNuke and CoNIC datasets. Our results demonstrate that the specialised multi-decoder design is highly effective for nuclei detection and classification across diverse tissue and stain types. The pre-trained model weights along with the inference code have been publicly released to support future research.
comment: Submitted to Medical Image Analysis, currently under review
☆ Equivariance2Inverse: A Practical Self-Supervised CT Reconstruction Method Benchmarked on Real, Limited-Angle, and Blurred Data
Deep learning has shown impressive results in reducing noise and artifacts in X-ray computed tomography (CT) reconstruction. Self-supervised CT reconstruction methods are especially appealing for real-world applications because they require no ground truth training examples. However, these methods involve a simplified X-ray physics model during training, which may make inaccurate assumptions, for example, about scintillator blurring, the scanning geometry, or the distribution of the noise. As a result, they can be less robust to real-world imaging circumstances. In this paper, we review the model assumptions of six recent self-supervised CT reconstruction methods. Moreover, we benchmark these methods on the real-world 2DeteCT dataset and on synthetic data with and without scintillator blurring and a limited-angle scanning geometry. The results of our benchmark show that methods that assume that the noise is pixel-wise independent do not perform well on data with scintillator blurring, and that assuming rotation invariance improves results on limited-angle reconstructions. Based on these findings, we combined successful concepts of the Robust Equivariant Imaging and Sparse2Inverse methods in a new self-supervised CT reconstruction method called Equivariance2Inverse.
comment: 11 pages, 4 figures
☆ Privacy-Preserving Semantic Communication over Wiretap Channels with Learnable Differential Privacy
While semantic communication (SemCom) improves transmission efficiency by focusing on task-relevant information, it also raises critical privacy concerns. Many existing secure SemCom approaches rely on restrictive or impractical assumptions, such as favorable channel conditions for the legitimate user or prior knowledge of the eavesdropper's model. To address these limitations, this paper proposes a novel secure SemCom framework for image transmission over wiretap channels, leveraging differential privacy (DP) to provide approximate privacy guarantees. Specifically, our approach first extracts disentangled semantic representations from source images using generative adversarial network (GAN) inversion method, and then selectively perturbs private semantic representations with approximate DP noise. Distinct from conventional DP-based protection methods, we introduce DP noise with learnable pattern, instead of traditional white Gaussian or Laplace noise, achieved through adversarial training of neural networks (NNs). This design mitigates the inherent non-invertibility of DP while effectively protecting private information. Moreover, it enables explicitly controllable security levels by adjusting the privacy budget according to specific security requirements, which is not achieved in most existing secure SemCom approaches. Experimental results demonstrate that, compared with the previous DP-based method and direct transmission, the proposed method significantly degrades the reconstruction quality for the eavesdropper, while introducing only slight degradation in task performance. Under comparable security levels, our approach achieves an LPIPS advantage of 0.06-0.29 and an FPPSR advantage of 0.10-0.86 for the legitimate user compared with the previous DP-based method.
☆ Adapting Interleaved Encoders with PPO for Language-Guided Reinforcement Learning in BabyAI
Deep reinforcement learning agents often struggle when tasks require understanding both vision and language. Conventional architectures typically isolate perception (for example, CNN-based visual encoders) from decision-making (policy networks). This separation can be inefficient, since the policy's failures do not directly help the perception module learn what is important. To address this, we implement the Perception-Decision Interleaving Transformer (PDiT) architecture introduced by Mao et al. (2023), a model that alternates between perception and decision layers within a single transformer. This interleaving allows feedback from decision-making to refine perceptual features dynamically. In addition, we integrate a contrastive loss inspired by CLIP to align textual mission embeddings with visual scene features. We evaluate the PDiT encoders on the BabyAI GoToLocal environment and find that the approach achieves more stable rewards and stronger alignment compared to a standard PPO baseline. The results suggest that interleaved transformer encoders are a promising direction for developing more integrated autonomous agents.
comment: Undergraduate research project, IIT Palakkad, 2025
☆ Seq-DeepIPC: Sequential Sensing for End-to-End Control in Legged Robot Navigation
We present Seq-DeepIPC, a sequential end-to-end perception-to-control model for legged robot navigation in realworld environments. Seq-DeepIPC advances intelligent sensing for autonomous legged navigation by tightly integrating multi-modal perception (RGB-D + GNSS) with temporal fusion and control. The model jointly predicts semantic segmentation and depth estimation, giving richer spatial features for planning and control. For efficient deployment on edge devices, we use EfficientNet-B0 as the encoder, reducing computation while maintaining accuracy. Heading estimation is simplified by removing the noisy IMU and instead computing the bearing angle directly from consecutive GNSS positions. We collected a larger and more diverse dataset that includes both road and grass terrains, and validated Seq-DeepIPC on a robot dog. Comparative and ablation studies show that sequential inputs improve perception and control in our models, while other baselines do not benefit. Seq-DeepIPC achieves competitive or better results with reasonable model size; although GNSS-only heading is less reliable near tall buildings, it is robust in open areas. Overall, Seq-DeepIPC extends end-to-end navigation beyond wheeled robots to more versatile and temporally-aware systems. To support future research, we will release the codes to our GitHub repository at https://github.com/oskarnatan/Seq-DeepIPC.
comment: Preprint notice, this manuscript has been submitted to IEEE sensors journal for possible publication
☆ USF-MAE: Ultrasound Self-Supervised Foundation Model with Masked Autoencoding
Ultrasound imaging is one of the most widely used diagnostic modalities, offering real-time, radiation-free assessment across diverse clinical domains. However, interpretation of ultrasound images remains challenging due to high noise levels, operator dependence, and limited field of view, resulting in substantial inter-observer variability. Current Deep Learning approaches are hindered by the scarcity of large labeled datasets and the domain gap between general and sonographic images, which limits the transferability of models pretrained on non-medical data. To address these challenges, we introduce the Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), the first large-scale self-supervised MAE framework pretrained exclusively on ultrasound data. The model was pre-trained on 370,000 2D and 3D ultrasound images curated from 46 open-source datasets, collectively termed OpenUS-46, spanning over twenty anatomical regions. This curated dataset has been made publicly available to facilitate further research and reproducibility. Using a Vision Transformer encoder-decoder architecture, USF-MAE reconstructs masked image patches, enabling it to learn rich, modality-specific representations directly from unlabeled data. The pretrained encoder was fine-tuned on three public downstream classification benchmarks: BUS-BRA (breast cancer), MMOTU-2D (ovarian tumors), and GIST514-DB (gastrointestinal stromal tumors). Across all tasks, USF-MAE consistently outperformed conventional CNN and ViT baselines, achieving F1-scores of 81.6%, 79.6%, and 82.4%, respectively. Despite not using labels during pretraining, USF-MAE approached the performance of the supervised foundation model UltraSam on breast cancer classification and surpassed it on the other tasks, demonstrating strong cross-anatomical generalization.
☆ Explainable Detection of AI-Generated Images with Artifact Localization Using Faster-Than-Lies and Vision-Language Models for Edge Devices
The increasing realism of AI-generated imagery poses challenges for verifying visual authenticity. We present an explainable image authenticity detection system that combines a lightweight convolutional classifier ("Faster-Than-Lies") with a Vision-Language Model (Qwen2-VL-7B) to classify, localize, and explain artifacts in 32x32 images. Our model achieves 96.5% accuracy on the extended CiFAKE dataset augmented with adversarial perturbations and maintains an inference time of 175ms on 8-core CPUs, enabling deployment on local or edge devices. Using autoencoder-based reconstruction error maps, we generate artifact localization heatmaps, which enhance interpretability for both humans and the VLM. We further categorize 70 visual artifact types into eight semantic groups and demonstrate explainable text generation for each detected anomaly. This work highlights the feasibility of combining visual and linguistic reasoning for interpretable authenticity detection in low-resolution imagery and outlines potential cross-domain applications in forensics, industrial inspection, and social media moderation.
☆ Gut decisions based on the liver: A radiomics approach to boost colorectal cancer screening
Non-invasive colorectal cancer (CRC) screening represents a key opportunity to improve colonoscopy participation rates and reduce CRC mortality. This study explores the potential of the gut-liver axis for predicting colorectal neoplasia through liver-derived radiomic features extracted from routine CT images as a novel opportunistic screening approach. In this retrospective study, we analyzed data from 1,997 patients who underwent colonoscopy and abdominal CT. Patients either had no colorectal neoplasia (n=1,189) or colorectal neoplasia (n_total=808; adenomas n=423, CRC n=385). Radiomics features were extracted from 3D liver segmentations using the Radiomics Processing ToolKit (RPTK), which performed feature extraction, filtering, and classification. The dataset was split into training (n=1,397) and test (n=600) cohorts. Five machine learning models were trained with 5-fold cross-validation on the 20 most informative features, and the best model ensemble was selected based on the validation AUROC. The best radiomics-based XGBoost model achieved a test AUROC of 0.810, clearly outperforming the best clinical-only model (test AUROC: 0.457). Subclassification between colorectal cancer and adenoma showed lower accuracy (test AUROC: 0.674). Our findings establish proof-of-concept that liver-derived radiomics from routine abdominal CT can predict colorectal neoplasia. Beyond offering a pragmatic, widely accessible adjunct to CRC screening, this approach highlights the gut-liver axis as a novel biomarker source for opportunistic screening and sparks new mechanistic hypotheses for future translational research.
comment: Equal contribution between first, second, fifteenth, and sixteenth authors
☆ Semantic Communications with World Models
Semantic communication is a promising technique for emerging wireless applications, which reduces transmission overhead by transmitting only task-relevant features instead of raw data. However, existing methods struggle under extremely low bandwidth and varying channel conditions, where corrupted or missing semantics lead to severe reconstruction errors. To resolve this difficulty, we propose a world foundation model (WFM)-aided semantic video transmission framework that leverages the predictive capability of WFMs to generate future frames based on the current frame and textual guidance. This design allows transmissions to be omitted when predictions remain reliable, thereby saving bandwidth. Through WFM's prediction, the key semantics are preserved, yet minor prediction errors tend to amplify over time. To mitigate issue, a lightweight depth-based feedback module is introduced to determine whether transmission of the current frame is needed. Apart from transmitting the entire frame, a segmentation-assisted partial transmission method is proposed to repair degraded frames, which can further balance performance and bandwidth cost. Furthermore, an active transmission strategy is developed for mobile scenarios by exploiting camera trajectory information and proactively scheduling transmissions before channel quality deteriorates. Simulation results show that the proposed framework significantly reduces transmission overhead while maintaining task performances across varying scenarios and channel conditions.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ ReXGroundingCT: A 3D Chest CT Dataset for Segmentation of Findings from Free-Text Reports
We introduce ReXGroundingCT, the first publicly available dataset linking free-text findings to pixel-level 3D segmentations in chest CT scans. The dataset includes 3,142 non-contrast chest CT scans paired with standardized radiology reports from CT-RATE. Construction followed a structured three-stage pipeline. First, GPT-4 was used to extract and standardize findings, descriptors, and metadata from reports originally written in Turkish and machine-translated into English. Second, GPT-4o-mini categorized each finding into a hierarchical ontology of lung and pleural abnormalities. Third, 3D annotations were produced for all CT volumes: the training set was quality-assured by board-certified radiologists, and the validation and test sets were fully annotated by board-certified radiologists. Additionally, a complementary chain-of-thought dataset was created to provide step-by-step hierarchical anatomical reasoning for localizing findings within the CT volume, using GPT-4o and localization coordinates derived from organ segmentation models. ReXGroundingCT contains 16,301 annotated entities across 8,028 text-to-3D-segmentation pairs, covering diverse radiological patterns from 3,142 non-contrast CT scans. About 79% of findings are focal abnormalities and 21% are non-focal. The dataset includes a public validation set of 50 cases and a private test set of 100 cases, both annotated by board-certified radiologists. The dataset establishes a foundation for enabling free-text finding segmentation and grounded radiology report generation in CT imaging. Model performance on the private test set is hosted on a public leaderboard at https://rexrank.ai/ReXGroundingCT. The dataset is available at https://huggingface.co/datasets/rajpurkarlab/ReXGroundingCT.
♻ ☆ RWKV-UNet: Improving UNet with Long-Range Cooperation for Effective Medical Image Segmentation
In recent years, significant advancements have been made in deep learning for medical image segmentation, particularly with convolutional neural networks (CNNs) and transformer models. However, CNNs face limitations in capturing long-range dependencies, while transformers suffer from high computational complexity. To address this, we propose RWKV-UNet, a novel model that integrates the RWKV (Receptance Weighted Key Value) structure into the U-Net architecture. This integration enhances the model's ability to capture long-range dependencies and to improve contextual understanding, which is crucial for accurate medical image segmentation. We build a strong encoder with developed Global-Local Spatial Perception (GLSP) blocks combining CNNs and RWKVs. We also propose a Cross-Channel Mix (CCM) module to improve skip connections with multi-scale feature fusion, achieving global channel information integration. Experiments on 11 benchmark datasets show that the RWKV-UNet achieves state-of-the-art performance on various types of medical image segmentation tasks. Additionally, smaller variants, RWKV-UNet-S and RWKV-UNet-T, balance accuracy and computational efficiency, making them suitable for broader clinical applications.
♻ ☆ Smartphone-based iris recognition through high-quality visible-spectrum iris image capture.V2
Smartphone-based iris recognition in the visible spectrum (VIS) remains difficult due to illumination variability, pigmentation differences, and the absence of standardized capture controls. This work presents a compact end-to-end pipeline that enforces ISO/IEC 29794-6 quality compliance at acquisition and demonstrates that accurate VIS iris recognition is feasible on commodity devices. Using a custom Android application performing real-time framing, sharpness evaluation, and feedback, we introduce the CUVIRIS dataset of 752 compliant images from 47 subjects. A lightweight MobileNetV3-based multi-task segmentation network (LightIrisNet) is developed for efficient on-device processing, and a transformer matcher (IrisFormer) is adapted to the VIS domain. Under a standardized protocol and comparative benchmarking against prior CNN baselines, OSIRIS attains a TAR of 97.9% at FAR=0.01 (EER=0.76%), while IrisFormer, trained only on UBIRIS.v2, achieves an EER of 0.057% on CUVIRIS. The acquisition app, trained models, and a public subset of the dataset are released to support reproducibility. These results confirm that standardized capture and VIS-adapted lightweight models enable accurate and practical iris recognition on smartphones.
comment: This submission has been withdrawn because it duplicates significant content from another version of the paper already available on arXiv as arXiv:2412.13063
♻ ☆ Neural Stereo Video Compression with Hybrid Disparity Compensation
Disparity compensation represents the primary strategy in stereo video compression (SVC) for exploiting cross-view redundancy. These mechanisms can be broadly categorized into two types: one that employs explicit horizontal shifting, and another that utilizes an implicit cross-attention mechanism to reduce cross-view disparity redundancy. In this work, we propose a hybrid disparity compensation (HDC) strategy that leverages explicit pixel displacement as a robust prior feature to simplify optimization and perform implicit cross-attention mechanisms for subsequent warping operations, thereby capturing a broader range of disparity information. Specifically, HDC first computes a similarity map by fusing the horizontally shifted cross-view features to capture pixel displacement information. This similarity map is then normalized into an "explicit pixel-wise attention score" to perform the cross-attention mechanism, implicitly aligning features from one view to another. Building upon HDC, we introduce a novel end-to-end optimized neural stereo video compression framework, which integrates HDC-based modules into key coding operations, including cross-view feature extraction and reconstruction (HDC-FER) and cross-view entropy modeling (HDC-EM). Extensive experiments on SVC benchmarks, including KITTI 2012, KITTI 2015, and Nagoya, which cover both autonomous driving and general scenes, demonstrate that our framework outperforms both neural and traditional SVC methodologies.
♻ ☆ Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion NeurIPS 2025
Incomplete multi-view data, where certain views are entirely missing for some samples, poses significant challenges for traditional multi-view clustering methods. Existing deep incomplete multi-view clustering approaches often rely on static fusion strategies or two-stage pipelines, leading to suboptimal fusion results and error propagation issues. To address these limitations, this paper proposes a novel incomplete multi-view clustering framework based on Hierarchical Semantic Alignment and Cooperative Completion (HSACC). HSACC achieves robust cross-view fusion through a dual-level semantic space design. In the low-level semantic space, consistency alignment is ensured by maximizing mutual information across views. In the high-level semantic space, adaptive view weights are dynamically assigned based on the distributional affinity between individual views and an initial fused representation, followed by weighted fusion to generate a unified global representation. Additionally, HSACC implicitly recovers missing views by projecting aligned latent representations into high-dimensional semantic spaces and jointly optimizes reconstruction and clustering objectives, enabling cooperative learning of completion and clustering. Experimental results demonstrate that HSACC significantly outperforms state-of-the-art methods on five benchmark datasets. Ablation studies validate the effectiveness of the hierarchical alignment and dynamic weighting mechanisms, while parameter analysis confirms the model's robustness to hyperparameter variations.
comment: 13 pages, conference paper. Accepted to NeurIPS 2025, not yet published
♻ ☆ Servo navigation and phase equalization enhanced by run-time stabilization (PEERS) for 3D EPI time series
Purpose: To enhance time-resolved segmented imaging by synergy of run-time stabilization and retrospective, data-driven phase correction. Methods: A segmented 3D EPI sequence for fMRI time series is equipped with servo navigation based on short orbital navigators and a linear perturbation model, enabling run-time correction for rigid-body motion as well as bulk phase and frequency fluctuation. Complementary retrospective phase correction is based on the repetitive structure of the time series and serves to address residual phase and frequency offsets. The combined approach is termed phase equalization enhanced by run-time stabilization (PEERS). Results: The proposed strategy is evaluated in a phantom and in-vivo. Servo navigation is found to diminish motion confound in raw data and maintain k-space consistency over time series. In turn, retrospective phase equalization is found to eliminate shot-wise phase and frequency offsets relative to the navigator, which are attributed to eddy-currents and vibrations from phase encoding. Retrospective phase equalization reduces the precision requirements for run-time frequency control, supporting the use of short navigators. Relative to conventional volume realignment, PEERS achieved tSNR improvements up to $30\%$ for small motion and in the order of $10\%$ when volunteers tried to hold still. Retrospective phase equalization is found to clearly outperform phase correction based solely on navigator-based frequency estimates. Conclusion: Servo navigation achieves high-precision run-time motion correction for 3D EPI fMRI. Coarse frequency tracking based on short navigators is supplemented by precise retrospective frequency and phase correction. Fully automatic and self-calibrated, PEERS offers effective plug-and-play motion and phase correction for 3D fMRI.
comment: to be submitted to Magnetic Resonance in Medicine (MRM)
Human-Computer Interaction 7
☆ Storycaster: An AI System for Immersive Room-Based Storytelling
While Cave Automatic Virtual Environment (CAVE) systems have long enabled room-scale virtual reality and various kinds of interactivity, their content has largely remained predetermined. We present \textit{Storycaster}, a generative AI CAVE system that transforms physical rooms into responsive storytelling environments. Unlike headset-based VR, \textit{Storycaster} preserves spatial awareness, using live camera feeds to augment the walls with cylindrical projections, allowing users to create worlds that blend with their physical surroundings. Additionally, our system enables object-level editing, where physical items in the room can be transformed to their virtual counterparts in a story. A narrator agent guides participants, enabling them to co-create stories that evolve in response to voice commands, with each scene enhanced by generated ambient audio, dialogue, and imagery. Participants in our study ($n=13$) found the system highly immersive and engaging, with narrator and audio most impactful, while also highlighting areas for improvement in latency and image resolution.
☆ How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations
AI agents are continually optimized for tasks related to human work, such as software engineering and professional writing, signaling a pressing trend with significant impacts on the human workforce. However, these agent developments have often not been grounded in a clear understanding of how humans execute work, to reveal what expertise agents possess and the roles they can play in diverse workflows. In this work, we study how agents do human work by presenting the first direct comparison of human and agent workers across multiple essential work-related skills: data analysis, engineering, computation, writing, and design. To better understand and compare heterogeneous computer-use activities of workers, we introduce a scalable toolkit to induce interpretable, structured workflows from either human or agent computer-use activities. Using such induced workflows, we compare how humans and agents perform the same tasks and find that: (1) While agents exhibit promise in their alignment to human workflows, they take an overwhelmingly programmatic approach across all work domains, even for open-ended, visually dependent tasks like design, creating a contrast with the UI-centric methods typically used by humans. (2) Agents produce work of inferior quality, yet often mask their deficiencies via data fabrication and misuse of advanced tools. (3) Nonetheless, agents deliver results 88.3% faster and cost 90.4-96.2% less than humans, highlighting the potential for enabling efficient collaboration by delegating easily programmable tasks to agents.
☆ Everything counts: the managed omnirelevance of speech in 'human - voice agent' interaction
To this day, turn-taking models determining voice agents' conduct have been examined from a technical point of view, while the interactional constraints or resources they constitute for human conversationalists have not been empirically described. From the detailed analysis of corpora of naturalistic data, we document how, whether in interaction with rule-based robots from a 'pre-LLM era' or with the most recent voice agents, humans' conduct was produced in reference to the ever-present risk that, each time they spoke, their talk may trigger a new uncalled-for contribution from the artificial agent. We argue that this 'omnirelevance of human speech' is a constitutive feature of current human-agent interaction that, due to recent improvements in voice capture technology, weighs on human practices even more today than in the past. Specifically, we document how, in multiparty settings, humans shaped their conduct in such a way as to remain undetected by the machine's sensors.
☆ Emotion Recognition with Minimal Wearable Sensing: Multi-domain Feature, Hybrid Feature Selection, and Personalized vs. Generalized Ensemble Model Analysis
Negative emotions are linked to the onset of neurodegenerative diseases and dementia, yet they are often difficult to detect through observation. Physiological signals from wearable devices offer a promising noninvasive method for continuous emotion monitoring. In this study, we propose a lightweight, resource-efficient machine learning approach for binary emotion classification, distinguishing between negative (sadness, disgust, anger) and positive (amusement, tenderness, gratitude) affective states using only electrocardiography (ECG) signals. The method is designed for deployment in resource-constrained systems, such as Internet of Things (IoT) devices, by reducing battery consumption and cloud data transmission through the avoidance of computationally expensive multimodal inputs. We utilized ECG data from 218 CSV files extracted from four studies in the Psychophysiology of Positive and Negative Emotions (POPANE) dataset, which comprises recordings from 1,157 healthy participants across seven studies. Each file represents a unique subject emotion, and the ECG signals, recorded at 1000 Hz, were segmented into 10-second epochs to reflect real-world usage. Our approach integrates multidomain feature extraction, selective feature fusion, and a voting classifier. We evaluated it using a participant-exclusive generalized model and a participant-inclusive personalized model. The personalized model achieved the best performance, with an average accuracy of 95.59%, outperforming the generalized model, which reached 69.92% accuracy. Comparisons with other studies on the POPANE and similar datasets show that our approach consistently outperforms existing methods. This work highlights the effectiveness of personalized models in emotion recognition and their suitability for wearable applications that require accurate, low-power, and real-time emotion tracking.
♻ ☆ Scideator: Human-LLM Scientific Idea Generation Grounded in Research-Paper Facet Recombination
The scientific ideation process often involves blending salient aspects of existing papers to create new ideas -- a framework known as facet-based ideation. To see how large language models (LLMs) might assist in this process, we contribute Scideator, the first human-LLM interface for facet-based scientific ideation. Starting from a user-provided set of scientific papers, Scideator extracts key facets -- purposes, mechanisms, and evaluations -- from these and related papers, allowing users to explore the idea space by interactively recombining facets to synthesize inventive ideas. Scideator also helps users gauge idea originality by searching the literature for overlaps, assessing idea novelty based on an explicit facet-based definition. To support these tasks, Scideator introduces three LLM-powered retrieval-augmented generation (RAG) modules: Analogous Paper Facet Finder, Faceted Idea Generator, and Idea Novelty Checker. In a within-subjects user study (N=22) with computer-science researchers comparing Scideator to a strong baseline, our tool provided significantly more creativity support, particularly with respect to exploration and expressiveness.
comment: Updated for clarity
♻ ☆ Efficiency with Rigor! A Trustworthy LLM-powered Workflow for Qualitative Data Analysis
Qualitative data analysis (QDA) emphasizes trustworthiness, requiring sustained human engagement and reflexivity. Recently, large language models (LLMs) have been applied in QDA to improve efficiency. However, their use raises concerns about unvalidated automation and displaced sensemaking, which can undermine trustworthiness. To address these issues, we employed two strategies: transparency and human involvement. Through a literature review and formative interviews, we identified six design requirements for transparent automation and meaningful human involvement. Guided by these requirements, we developed MindCoder, an LLM-powered workflow that delegates mechanical tasks, such as grouping and validation, to the system, while enabling humans to conduct meaningful interpretation. MindCoder also maintains comprehensive logs of users' step-by-step interactions to ensure transparency and support trustworthy results. In an evaluation with 12 users and two external evaluators, MindCoder supported active interpretation, offered flexible control, and produced more trustworthy codebooks. We further discuss design implications for building human-AI collaborative QDA workflows.
♻ ☆ NVS-SQA: Exploring Self-Supervised Quality Representation Learning for Neurally Synthesized Scenes without References
Neural View Synthesis (NVS), such as NeRF and 3D Gaussian Splatting, effectively creates photorealistic scenes from sparse viewpoints, typically evaluated by quality assessment methods like PSNR, SSIM, and LPIPS. However, these full-reference methods, which compare synthesized views to reference views, may not fully capture the perceptual quality of neurally synthesized scenes (NSS), particularly due to the limited availability of dense reference views. Furthermore, the challenges in acquiring human perceptual labels hinder the creation of extensive labeled datasets, risking model overfitting and reduced generalizability. To address these issues, we propose NVS-SQA, a NSS quality assessment method to learn no-reference quality representations through self-supervision without reliance on human labels. Traditional self-supervised learning predominantly relies on the "same instance, similar representation" assumption and extensive datasets. However, given that these conditions do not apply in NSS quality assessment, we employ heuristic cues and quality scores as learning objectives, along with a specialized contrastive pair preparation process to improve the effectiveness and efficiency of learning. The results show that NVS-SQA outperforms 17 no-reference methods by a large margin (i.e., on average 109.5% in SRCC, 98.6% in PLCC, and 91.5% in KRCC over the second best) and even exceeds 16 full-reference methods across all evaluation metrics (i.e., 22.9% in SRCC, 19.1% in PLCC, and 18.6% in KRCC over the second best).
comment: Accepted by TPAMI
Software Engineering 14
☆ On the Freshness of Pinned Dependencies in Maven
Library dependencies in software ecosystems play a crucial role in the development of software. As newer releases of these libraries are published, developers may opt to pin their dependencies to a particular version. While pinning may have benefits in ensuring reproducible builds and avoiding breaking changes, it bears larger risks in using outdated dependencies that may contain bugs and security vulnerabilities. To understand the frequency and consequences of dependency pinning, we first define the concepts of stale and fresh pins, which are distinguished based on how outdated the dependency is relative to the release date of the project. We conduct an empirical study to show that over 60% of consumers of popular Maven libraries contain stale pins to their dependencies, with some outdated versions over a year old. These pinned versions often miss out on security fixes; we find that 10% of all dependency upgrades in our dataset to the latest minor or patch version would reduce security vulnerabilities. We prototype an approach called Pin-Freshener that can encourage developers to freshen their pins by leveraging the insight that crowdsourced tests of peer projects can provide additional signal for the safety of an upgrade. Running Pin-Freshener on dependency upgrades shows that just 1-5 additional test suites can provide 35-100% more coverage of a dependency, compared to that of a single consumer test suite. Our evaluation on real-world pins to the top 500 popular libraries in Maven shows that Pin-Freshener can provide an additional signal of at least 5 passing crowdsourced test suites to over 3,000 consumers to safely perform an upgrade that reduces security vulnerabilities. Pin-Freshener can provide practical confidence to developers by offering additional signal beyond their own test suites, representing an improvement over current practices.
☆ Collaborative LLM Agents for C4 Software Architecture Design Automation
Software architecture design is a fundamental part of creating every software system. Despite its importance, producing a C4 software architecture model, the preferred notation for such architecture, remains manual and time-consuming. We introduce an LLM-based multi-agent system that automates this task by simulating a dialogue between role-specific experts who analyze requirements and generate the Context, Container, and Component views of the C4 model. Quality is assessed with a hybrid evaluation framework: deterministic checks for structural and syntactic integrity and C4 rule consistency, plus semantic and qualitative scoring via an LLM-as-a-Judge approach. Tested on five canonical system briefs, the workflow demonstrates fast C4 model creation, sustains high compilation success, and delivers semantic fidelity. A comparison of four state-of-the-art LLMs shows different strengths relevant to architectural design. This study contributes to automated software architecture design and its evaluation methods.
comment: This paper has been accepted for the upcoming 59th Hawaii International Conference on System Sciences (HICSS-59), 2026, Hawaii, USA. The final published version will appear in the official conference proceedings
☆ Scalable Supervising Software Agents with Patch Reasoner
While large language model agents have advanced software engineering tasks, the unscalable nature of existing test-based supervision is limiting the potential improvement of data scaling. The reason is twofold: (1) building and running test sandbox is rather heavy and fragile, and (2) data with high-coverage tests is naturally rare and threatened by test hacking via edge cases. In this paper, we propose R4P, a patch verifier model to provide scalable rewards for training and testing SWE agents via reasoning. We consider that patch verification is fundamentally a reasoning task, mirroring how human repository maintainers review patches without writing and running new reproduction tests. To obtain sufficient reference and reduce the risk of reward hacking, R4P uses a group-wise objective for RL training, enabling it to verify multiple patches against each other's modification and gain a dense reward for stable training. R4P achieves 72.2% Acc. for verifying patches from SWE-bench-verified, surpassing OpenAI o3. To demonstrate R4P's practicality, we design and train a lite scaffold, Mini-SE, with pure reinforcement learning where all rewards are derived from R4P. As a result, Mini-SE achieves 26.2% Pass@1 on SWE-bench-verified, showing a 10.0% improvement over the original Qwen3-32B. This can be further improved to 32.8% with R4P for test-time scaling. Furthermore, R4P verifies patches within a second, 50x faster than testing on average. The stable scaling curves of rewards and accuracy along with high efficiency reflect R4P's practicality.
☆ Does In-IDE Calibration of Large Language Models work at Scale?
The introduction of large language models into integrated development environments (IDEs) is revolutionizing software engineering, yet it poses challenges to the usefulness and reliability of Artificial Intelligence-generated code. Post-hoc calibration of internal model confidences aims to align probabilities with an acceptability measure. Prior work suggests calibration can improve alignment, but at-scale evidence is limited. In this work, we investigate the feasibility of applying calibration of code models to an in-IDE context. We study two aspects of the problem: (1) the technical method for implementing confidence calibration and improving the reliability of code generation models, and (2) the human-centered design principles for effectively communicating reliability signal to developers. First, we develop a scalable and flexible calibration framework which can be used to obtain calibration weights for open-source models using any dataset, and evaluate whether calibrators improve the alignment between model confidence and developer acceptance behavior. Through a large-scale analysis of over 24 million real-world developer interactions across multiple programming languages, we find that a general, post-hoc calibration model based on Platt-scaling does not, on average, improve the reliability of model confidence signals. We also find that while dynamically personalizing calibration to individual users can be effective, its effectiveness is highly dependent on the volume of user interaction data. Second, we conduct a multi-phase design study with 3 expert designers and 153 professional developers, combining scenario-based design, semi-structured interviews, and survey validation, revealing a clear preference for presenting reliability signals via non-numerical, color-coded indicators within the in-editor code generation workflow.
comment: Under Review
☆ DynaCausal: Dynamic Causality-Aware Root Cause Analysis for Distributed Microservices
Cloud-native microservices enable rapid iteration and scalable deployment but also create complex, fast-evolving dependencies that challenge reliable diagnosis. Existing root cause analysis (RCA) approaches, even with multi-modal fusion of logs, traces, and metrics, remain limited in capturing dynamic behaviors and shifting service relationships. Three critical challenges persist: (i) inadequate modeling of cascading fault propagation, (ii) vulnerability to noise interference and concept drift in normal service behavior, and (iii) over-reliance on service deviation intensity that obscures true root causes. To address these challenges, we propose DynaCausal, a dynamic causality-aware framework for RCA in distributed microservice systems. DynaCausal unifies multi-modal dynamic signals to capture time-varying spatio-temporal dependencies through interaction-aware representation learning. It further introduces a dynamic contrastive mechanism to disentangle true fault indicators from contextual noise and adopts a causal-prioritized pairwise ranking objective to explicitly optimize causal attribution. Comprehensive evaluations on public benchmarks demonstrate that DynaCausal consistently surpasses state-of-the-art methods, attaining an average AC@1 of 0.63 with absolute gains from 0.25 to 0.46, and delivering both accurate and interpretable diagnoses in highly dynamic microservice environments.
☆ Finding the Needle in the Crash Stack: Industrial-Scale Crash Root Cause Localization with AutoCrashFL
Fault Localization (FL) aims to identify root causes of program failures. FL typically targets failures observed from test executions, and as such, often involves dynamic analyses to improve accuracy, such as coverage profiling or mutation testing. However, for large industrial software, measuring coverage for every execution is prohibitively expensive, making the use of such techniques difficult. To address these issues and apply FL in an industrial setting, this paper proposes AutoCrashFL, an LLM agent for the localization of crashes that only requires the crashdump from the Program Under Test (PUT) and access to the repository of the corresponding source code. We evaluate AutoCrashFL against real-world crashes of SAP HANA, an industrial software project consisting of more than 35 million lines of code. Experiments reveal that AutoCrashFL is more effective in localization, as it identified 30% crashes at the top, compared to 17% achieved by the baseline. Through thorough analysis, we find that AutoCrashFL has attractive practical properties: it is relatively more effective for complex bugs, and it can indicate confidence in its results. Overall, these results show the practicality of LLM agent deployment on an industrial scale.
comment: 11 pages, 8 figures, under review
☆ Agentsway -- Software Development Methodology for AI Agents-based Teams
The emergence of Agentic AI is fundamentally transforming how software is designed, developed, and maintained. Traditional software development methodologies such as Agile, Kanban, ShapeUp, etc, were originally designed for human-centric teams and are increasingly inadequate in environments where autonomous AI agents contribute to planning, coding, testing, and continuous learning. To address this methodological gap, we present "Agentsway" a novel software development framework designed for ecosystems where AI agents operate as first-class collaborators. Agentsway introduces a structured lifecycle centered on human orchestration, and privacy-preserving collaboration among specialized AI agents. The framework defines distinct roles for planning, prompting, coding, testing, and fine-tuning agents, each contributing to iterative improvement and adaptive learning throughout the development process. By integrating fine-tuned LLMs that leverage outputs and feedback from different agents throughout the development cycle as part of a retrospective learning process, Agentsway enhances domain-specific reasoning, and explainable decision-making across the entire software development lifecycle. Responsible AI principles are further embedded across the agents through the coordinated use of multiple fine-tuned LLMs and advanced reasoning models, ensuring balanced, transparent, and accountable decision-making. This work advances software engineering by formalizing agent-centric collaboration, integrating privacy-by-design principles, and defining measurable metrics for productivity and trust. Agentsway represents a foundational step toward the next generation of AI-native, self-improving software development methodologies. To the best of our knowledge, this is the first research effort to introduce a dedicated methodology explicitly designed for AI agent-based software engineering teams.
♻ ☆ Leveraging Code Cohesion Analysis to Identify Source Code Supply Chain Attacks
Supply chain attacks significantly threaten software security with malicious code injections within legitimate projects. Such attacks are very rare but may have a devastating impact. Detecting spurious code injections using automated tools is further complicated as it often requires deciphering the intention of both the inserted code and its context. In this study, we propose an unsupervised approach for highlighting spurious code injections by quantifying cohesion disruptions in the source code. Using a name-prediction-based cohesion (NPC) metric, we analyze how function cohesion changes when malicious code is introduced compared to natural cohesion fluctuations. An analysis of 54,707 functions over 369 open-source C++ repositories reveals that code injection reduces cohesion and shifts naming patterns toward shorter, less descriptive names compared to genuine function updates. Considering the sporadic nature of real supply-chain attacks, we evaluate the proposed method with extreme test-set imbalance and show that monitoring high-cohesion functions with NPC can effectively detect functions with injected code, achieving a Precision@100 of 36.41% at a 1:1,000 ratio and 12.47% at 1:10,000. These results suggest that automated cohesion measurements, in general, and name-prediction-based cohesion, in particular, may help identify supply chain attacks, improving source code integrity.
♻ ☆ Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG
Malware family classification aims to identify the specific family (e.g., GuLoader or BitRAT) a malware sample may belong to, in contrast to malware detection or sample classification, which only predicts a Yes/No outcome. Accurate family identification can greatly facilitate automated sample labeling and understanding on crowdsourced malware analysis platforms such as VirusTotal and MalwareBazaar, which generate vast amounts of data daily. In this paper, we explore and assess the feasibility of using traditional binary string features for family classification in the new era of large language models (LLMs) and Retrieval-Augmented Generation (RAG). Specifically, we investigate howFamily-Specific String (FSS) features can be utilized in a manner similar to RAG to facilitate family classification. To this end, we develop a curated evaluation framework covering 4,347 samples from 67 malware families, extract and analyze over 25 million strings, and conduct detailed ablation studies to assess the impact of different design choices in four major modules, with each providing a relative improvement ranging from 8.1% to 120%.
comment: This is a technical report from Lingnan University, Hong Kong. Code is available at https://github.com/AIS2Lab/MalwareGPT
♻ ☆ LLM Agents for Generating Microservice-based Applications: how complex is your specification?
In this paper we evaluate the capabilities of LLM Agents in generating code for real-world problems. Specifically, we explore code synthesis for microservice-based applications, a widely used architectural pattern for building applications. We define a standard template for specifying these applications, and we propose a metric for scoring the difficulty of a specification. The higher the score, the more difficult it is to generate code for the specification. Our experimental results show that agents using strong LLMs (like GPT-3o-mini) do fairly well on medium difficulty specifications but do poorly on those of higher difficulty levels. This is due to more intricate business logic, a greater use of external services, database integration and inclusion of non-functional capabilities such as authentication. We analyzed the errors in LLM-synthesized code and report on the key challenges LLM Agents face in generating code for these specifications. Finally, we show that using a fine-grained approach to code generation improves the correctness of the generated code.
comment: 24 pages + 5 pages appendices. 7 Figures. 10 Tables
♻ ☆ Code Smell
Dependencies between modules can trigger ripple effects when changes are made, making maintenance complex and costly, so minimizing these dependencies is crucial. Consequently, understanding what drives dependencies is important. One potential factor is code smells, which are symptoms in code that indicate design issues and reduce code quality. When multiple code smells interact through static dependencies, their combined impact on quality can be even more severe. While individual code smells have been widely studied, the influence of their interactions remains underexplored. In this study, we aim to investigate whether and how the distribution of static dependencies changes in the presence of code smell interactions. We conducted a dependency analysis on 116 open-source Java systems to quantify these interactions by comparing cases where code smell interactions exist and where they do not. Our results suggest that overall, code smell interactions are linked to a significant increase in total dependencies in 28 out of 36 cases, and that all code smells are associated with a consistent change direction (increase or decrease) in certain dependency types when interacting with other code smells. Consequently, this information can be used to support more accurate code smell detection and prioritization, as well as to develop more effective refactoring strategies.
♻ ☆ An Empirical Study of Bitwise Operators Intuitiveness through Performance Metrics
Objectives: This study aims to investigate the readability and understandability of bitwise operators in programming, with the main hypothesis that there will be a difference in the performance metrics (response time and error rate) between participants exposed to various bitwise operators related questions and those who are not. Participants: Participants in this human research study include people without programming background, novice programmers, and university students with varying programming experience (from freshmen to PhD level). There were 23 participants in this study. Study Methods: This study uses a within-subjects experimental design to assess how people with diverse programming backgrounds understand and use bitwise operators. Participants complete tasks in a JavaScript program, and their task completion times and task accuracy are recorded for analysis. Findings: The results indicate that operators can be one of the factors predicting response time, showing a small but significant effect (R-squared = 0.032, F(1, 494) = 16.5, p < .001). Additionally, operators such as OR, NOT, and Left Shift showed statistical significance in task completion times compared to other operators. Conclusions: While the complexity of bitwise operators did not generally result in longer task completion times, certain operators were found to be less intuitive, suggesting the need for further investigation and potential redesign for improved understandability.
comment: 15 pages, 10 tables, 9 Figures
♻ ☆ CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming NeurIPS 2025
Competitive programming benchmarks are widely used in scenarios such as programming contests and large language model assessments. However, the growing presence of duplicate or highly similar problems raises concerns not only about competition fairness, but also about the validity of competitive programming as a benchmark for model evaluation. In this paper, we propose a new problem, similar question retrieval, to tackle this issue. Due to the lack of both data and models, solving this problem is challenging. To this end, we introduce CPRet, a retrieval-oriented benchmark suite for competitive programming, covering four retrieval tasks: two code-centric (i.e., Text-to-Code, Code-to-Code) and two newly proposed problem-centric tasks (i.e., Problem-to-Duplicate, Simplified-to-Full) built from a combination of automatically crawled problem-solution data and manually curated annotations. Our contribution includes both high-quality training data and temporally separated test sets for reliable evaluation. Besides, we further develop two task-specialized retrievers based on this dataset: CPRetriever-Code, trained with a novel Group-InfoNCE loss for problem-code alignment, and CPRetriever-Prob, fine-tuned for identifying problem-level similarity. Both models achieve strong results and are open-sourced for local use. Finally, we analyze LiveCodeBench and find that high-similarity problems inflate model pass rates and reduce differentiation, underscoring the need for similarity-aware evaluation in future benchmarks. Github: https://github.com/coldchair/CPRet Online Demo: https://www.cpret.online/
comment: Accepted by NeurIPS 2025 Dataset and Benchmark Track
♻ ☆ On Developers' Self-Declaration of AI-Generated Code: An Analysis of Practices
AI code generation tools have gained significant popularity among developers, who use them to assist in software development due to their capability to generate code. Existing studies mainly explored the quality, e.g., correctness and security, of AI-generated code, while in real-world software development, the prerequisite is to distinguish AI-generated code from human-written code, which emphasizes the need to explicitly declare AI-generated code by developers. To this end, this study intends to understand the ways developers use to self-declare AI-generated code and explore the reasons why developers choose to self-declare or not. We conducted a mixed-methods study consisting of two phases. In the first phase, we mined GitHub repositories and collected 613 instances of AI-generated code snippets. In the second phase, we conducted a follow-up practitioners' survey, which received 111 valid responses. Our research revealed the practices followed by developers to self-declare AI-generated code. Most practitioners (76.6%) always or sometimes self-declare AI-generated code. In contrast, other practitioners (23.4%) noted that they never self-declare AI-generated code. The reasons for self-declaring AI-generated code include the need to track and monitor the code for future review and debugging, and ethical considerations. The reasons for not self-declaring AI-generated code include extensive modifications to AI-generated code and the developers' perception that self-declaration is an unnecessary activity. We finally provided guidelines for practitioners to self-declare AI-generated code, addressing ethical and code quality concerns.
comment: Preprint accepted for publication in ACM Transactions on Software Engineering and Methodology (TOSEM), 2025
Graphics 5
☆ Step2Motion: Locomotion Reconstruction from Pressure Sensing Insoles
Human motion is fundamentally driven by continuous physical interaction with the environment. Whether walking, running, or simply standing, the forces exchanged between our feet and the ground provide crucial insights for understanding and reconstructing human movement. Recent advances in wearable insole devices offer a compelling solution for capturing these forces in diverse, real-world scenarios. Sensor insoles pose no constraint on the users' motion (unlike mocap suits) and are unaffected by line-of-sight limitations (in contrast to optical systems). These qualities make sensor insoles an ideal choice for robust, unconstrained motion capture, particularly in outdoor environments. Surprisingly, leveraging these devices with recent motion reconstruction methods remains largely unexplored. Aiming to fill this gap, we present Step2Motion, the first approach to reconstruct human locomotion from multi-modal insole sensors. Our method utilizes pressure and inertial data-accelerations and angular rates-captured by the insoles to reconstruct human motion. We evaluate the effectiveness of our approach across a range of experiments to show its versatility for diverse locomotion styles, from simple ones like walking or jogging up to moving sideways, on tiptoes, slightly crouching, or dancing.
☆ Environment-aware Motion Matching SIGGRAPH
Interactive applications demand believable characters that respond naturally to dynamic environments. Traditional character animation techniques often struggle to handle arbitrary situations, leading to a growing trend of dynamically selecting motion-captured animations based on predefined features. While Motion Matching has proven effective for locomotion by aligning to target trajectories, animating environment interactions and crowd behaviors remains challenging due to the need to consider surrounding elements. Existing approaches often involve manual setup or lack the naturalism of motion capture. Furthermore, in crowd animation, body animation is frequently treated as a separate process from trajectory planning, leading to inconsistencies between body pose and root motion. To address these limitations, we present Environment-aware Motion Matching, a novel real-time system for full-body character animation that dynamically adapts to obstacles and other agents, emphasizing the bidirectional relationship between pose and trajectory. In a preprocessing step, we extract shape, pose, and trajectory features from a motion capture database. At runtime, we perform an efficient search that matches user input and current pose while penalizing collisions with a dynamic environment. Our method allows characters to naturally adjust their pose and trajectory to navigate crowded scenes.
comment: Published in ACM TOG and presented in SIGGRAPH ASIA 2025. Project webpage: https://upc-virvig.github.io/Environment-aware-Motion-Matching/
♻ ☆ GS-ProCams: Gaussian Splatting-based Projector-Camera Systems
We present GS-ProCams, the first Gaussian Splatting-based framework for projector-camera systems (ProCams). GS-ProCams is not only view-agnostic but also significantly enhances the efficiency of projection mapping (PM) that requires establishing geometric and radiometric mappings between the projector and the camera. Previous CNN-based ProCams are constrained to a specific viewpoint, limiting their applicability to novel perspectives. In contrast, NeRF-based ProCams support view-agnostic projection mapping, however, they require an additional co-located light source and demand significant computational and memory resources. To address this issue, we propose GS-ProCams that employs 2D Gaussian for scene representations, and enables efficient view-agnostic ProCams applications. In particular, we explicitly model the complex geometric and photometric mappings of ProCams using projector responses, the projection surface's geometry and materials represented by Gaussians, and the global illumination component. Then, we employ differentiable physically-based rendering to jointly estimate them from captured multi-view projections. Compared to state-of-the-art NeRF-based methods, our GS-ProCams eliminates the need for additional devices, achieving superior ProCams simulation quality. It also uses only 1/10 of the GPU memory for training and is 900 times faster in inference speed. Please refer to our project page for the code and dataset: https://realqingyue.github.io/GS-ProCams/.
♻ ☆ SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent NeurIPS 2025
Indoor scene synthesis has become increasingly important with the rise of Embodied AI, which requires 3D environments that are not only visually realistic but also physically plausible and functionally diverse. While recent approaches have advanced visual fidelity, they often remain constrained to fixed scene categories, lack sufficient object-level detail and physical consistency, and struggle to align with complex user instructions. In this work, we present SceneWeaver, a reflective agentic framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. At its core, SceneWeaver employs a language model-based planner to select from a suite of extensible scene generation tools, ranging from data-driven generative models to visual- and LLM-based methods, guided by self-evaluation of physical plausibility, visual realism, and semantic alignment with user input. This closed-loop reason-act-reflect design enables the agent to identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. Extensive experiments on both common and open-vocabulary room types demonstrate that SceneWeaver not only outperforms prior methods on physical, visual, and semantic metrics, but also generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation. Project website: https://scene-weaver.github.io/.
comment: Accepted by NeurIPS 2025, 26 pages
♻ ☆ A Cycle Ride to HDR: Semantics Aware Self-Supervised Framework for Unpaired LDR-to-HDR Image Reconstruction
Reconstruction of High Dynamic Range (HDR) from Low Dynamic Range (LDR) images is an important computer vision task. There is a significant amount of research utilizing both conventional non-learning methods and modern data-driven approaches, focusing on using both single-exposed and multi-exposed LDR for HDR image reconstruction. However, most current state-of-the-art methods require high-quality paired {LDR;HDR} datasets with limited literature use of unpaired datasets, that is, methods that learn the LDR-HDR mapping between domains. This paper proposes CycleHDR, a method that integrates self-supervision into a modified semantic- and cycle-consistent adversarial architecture that utilizes unpaired LDR and HDR datasets for training. Our method introduces novel artifact- and exposure-aware generators to address visual artifact removal. It also puts forward an encoder and loss to address semantic consistency, another under-explored topic. CycleHDR is the first to use semantic and contextual awareness for the LDR-HDR reconstruction task in a self-supervised setup. The method achieves state-of-the-art performance across several benchmark datasets and reconstructs high-quality HDR images. The official website of this work is available at: https://github.com/HrishavBakulBarua/Cycle-HDR
Information Retrieval 13
☆ Civic Ground Truth in News Recommenders: A Method for Public Value Scoring RecSys 2025
Research in news recommendation systems (NRS) continues to explore the best ways to integrate normative goals such as editorial objectives and public service values into existing systems. Prior efforts have incorporated expert input or audience feedback to quantify these values, laying the groundwork for more civic-minded recommender systems. This paper contributes to that trajectory, introducing a method for embedding civic values into NRS through large-scale, structured audience evaluations. The proposed civic ground truth approach aims to generate value-based labels through a nationally representative survey that are generalisable across a wider news corpus, using automated metadata enrichment.
comment: Presented at NORMalize 2025: The Third Workshop on the Normative Design and Evaluation of Recommender Systems, co-located with the ACM Conference on Recommender Systems 2025 (RecSys 2025), Prague
☆ REVISION:Reflective Intent Mining and Online Reasoning Auxiliary for E-commerce Visual Search System Optimization
In Taobao e-commerce visual search, user behavior analysis reveals a large proportion of no-click requests, suggesting diverse and implicit user intents. These intents are expressed in various forms and are difficult to mine and discover, thereby leading to the limited adaptability and lag in platform strategies. This greatly restricts users' ability to express diverse intents and hinders the scalability of the visual search system. This mismatch between user implicit intent expression and system response defines the User-SearchSys Intent Discrepancy. To alleviate the issue, we propose a novel framework REVISION. This framework integrates offline reasoning mining with online decision-making and execution, enabling adaptive strategies to solve implicit user demands. In the offline stage, we construct a periodic pipeline to mine discrepancies from historical no-click requests. Leveraging large models, we analyze implicit intent factors and infer optimal suggestions by jointly reasoning over query and product metadata. These inferred suggestions serve as actionable insights for refining platform strategies. In the online stage, REVISION-R1-3B, trained on the curated offline data, performs holistic analysis over query images and associated historical products to generate optimization plans and adaptively schedule strategies across the search pipeline. Our framework offers a streamlined paradigm for integrating large models with traditional search systems, enabling end-to-end intelligent optimization across information aggregation and user interaction. Experimental results demonstrate that our approach improves the efficiency of implicit intent mining from large-scale search logs and significantly reduces the no-click rate.
☆ $\text{E}^2\text{Rank}$: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker
Text embedding models serve as a fundamental component in real-world search applications. By mapping queries and documents into a shared embedding space, they deliver competitive retrieval performance with high efficiency. However, their ranking fidelity remains limited compared to dedicated rerankers, especially recent LLM-based listwise rerankers, which capture fine-grained query-document and document-document interactions. In this paper, we propose a simple yet effective unified framework $\text{E}^2\text{Rank}$, means Efficient Embedding-based Ranking (also means Embedding-to-Rank), which extends a single text embedding model to perform both high-quality retrieval and listwise reranking through continued training under a listwise ranking objective, thereby achieving strong effectiveness with remarkable efficiency. By applying cosine similarity between the query and document embeddings as a unified ranking function, the listwise ranking prompt, which is constructed from the original query and its candidate documents, serves as an enhanced query enriched with signals from the top-K documents, akin to pseudo-relevance feedback (PRF) in traditional retrieval models. This design preserves the efficiency and representational quality of the base embedding model while significantly improving its reranking performance. Empirically, $\textrm{E}^2\text{Rank}$ achieves state-of-the-art results on the BEIR reranking benchmark and demonstrates competitive performance on the reasoning-intensive BRIGHT benchmark, with very low reranking latency. We also show that the ranking training process improves embedding performance on the MTEB benchmark. Our findings indicate that a single embedding model can effectively unify retrieval and reranking, offering both computational efficiency and competitive ranking accuracy.
comment: Code and models are avaliable at https://alibaba-nlp.github.io/E2Rank
☆ ATLAS: Actor-Critic Task-Completion with Look-ahead Action Simulation NeurIPS 2025
We observe that current state-of-the-art web-agents are unable to effectively adapt to new environments without neural network fine-tuning, without which they produce inefficient execution plans due to a lack of awareness of the structure and dynamics of the new environment. To address this limitation, we introduce ATLAS (Actor-Critic Task-completion with Look-ahead Action Simulation), a memory-augmented agent that is able to make plans grounded in a model of the environment by simulating the consequences of those actions in cognitive space. Our agent starts by building a "cognitive map" by performing a lightweight curiosity driven exploration of the environment. The planner proposes candidate actions; the simulator predicts their consequences in cognitive space; a critic analyzes the options to select the best roll-out and update the original plan; and a browser executor performs the chosen action. On the WebArena-Lite Benchmark, we achieve a 63% success rate compared to 53.9% success rate for the previously published state-of-the-art. Unlike previous systems, our modular architecture requires no website-specific LLM fine-tuning. Ablations show sizable drops without the world-model, hierarchical planner, and look-ahead-based replanner confirming their complementary roles within the design of our system
comment: 9 pages, NeurIPS 2025 Workshop on Language Agents and World Models
☆ Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation NeurIPS 2025
Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method to generate factual and up-to-date responses of Multimodal Large Language Models (MLLMs) by incorporating non-parametric knowledge from external knowledge bases. However, existing MRAG approaches suffer from static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, leading to three critical challenges: determining when to retrieve, what modality to incorporate, and how to utilize retrieved information effectively. To address these challenges, we introduce Windsock, a query-dependent module making decisions on retrieval necessity and modality selection, effectively reducing computational overhead and improving response quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction Tuning, an adaptive training strategy that enhances MLLMs' ability to utilize retrieved information while maintaining robustness against noise. Moreover, we adopt a self-assessment approach leveraging knowledge within MLLMs to convert question-answering datasets to MRAG training datasets. Extensive experiments demonstrate that our proposed method significantly improves the generation quality by 17.07% while reducing 8.95% retrieval times.
comment: Accepted at NeurIPS 2025 UniReps Workshop
☆ Diversification as Risk Minimization WSDM 2026
Users tend to remember failures of a search session more than its many successes. This observation has led to work on search robustness, where systems are penalized if they perform very poorly on some queries. However, this principle of robustness has been overlooked within a single query. An ambiguous or underspecified query (e.g., ``jaguar'') can have several user intents, where popular intents often dominate the ranking, leaving users with minority intents unsatisfied. Although the diversification literature has long recognized this issue, existing metrics only model the average relevance across intents and provide no robustness guarantees. More surprisingly, we show theoretically and empirically that many well-known diversification algorithms are no more robust than a naive, non-diversified algorithm. To address this critical gap, we propose to frame diversification as a risk-minimization problem. We introduce VRisk, which measures the expected risk faced by the least-served fraction of intents in a query. Optimizing VRisk produces a robust ranking, reducing the likelihood of poor user experiences. We then propose VRisker, a fast greedy re-ranker with provable approximation guarantees. Finally, experiments on NTCIR INTENT-2, TREC Web 2012, and MovieLens show the vulnerability of existing methods. VRisker reduces worst-case intent failures by up to 33% with a minimal 2% drop in average performance.
comment: Preprint, accepted at WSDM 2026 (Full Paper). 16 pages, 8 figures
☆ Tools are under-documented: Simple Document Expansion Boosts Tool Retrieval
Large Language Models (LLMs) have recently demonstrated strong capabilities in tool use, yet progress in tool retrieval remains hindered by incomplete and heterogeneous tool documentation. To address this challenge, we introduce Tool-DE, a new benchmark and framework that systematically enriches tool documentation with structured fields to enable more effective tool retrieval, together with two dedicated models, Tool-Embed and Tool-Rank. We design a scalable document expansion pipeline that leverages both open- and closed-source LLMs to generate, validate, and refine enriched tool profiles at low cost, producing large-scale corpora with 50k instances for embedding-based retrievers and 200k for rerankers. On top of this data, we develop two models specifically tailored for tool retrieval: Tool-Embed, a dense retriever, and Tool-Rank, an LLM-based reranker. Extensive experiments on ToolRet and Tool-DE demonstrate that document expansion substantially improves retrieval performance, with Tool-Embed and Tool-Rank achieving new state-of-the-art results on both benchmarks. We further analyze the contribution of individual fields to retrieval effectiveness, as well as the broader impact of document expansion on both training and evaluation. Overall, our findings highlight both the promise and limitations of LLM-driven document expansion, positioning Tool-DE, along with the proposed Tool-Embed and Tool-Rank, as a foundation for future research in tool retrieval.
☆ ATOM: AdapTive and OptiMized dynamic temporal knowledge graph construction using LLMs
In today's rapidly expanding data landscape, knowledge extraction from unstructured text is vital for real-time analytics, temporal inference, and dynamic memory frameworks. However, traditional static knowledge graph (KG) construction often overlooks the dynamic and time-sensitive nature of real-world data, limiting adaptability to continuous changes. Moreover, recent zero- or few-shot approaches that avoid domain-specific fine-tuning or reliance on prebuilt ontologies often suffer from instability across multiple runs, as well as incomplete coverage of key facts. To address these challenges, we introduce ATOM (AdapTive and OptiMized), a few-shot and scalable approach that builds and continuously updates Temporal Knowledge Graphs (TKGs) from unstructured texts. ATOM splits input documents into minimal, self-contained "atomic" facts, improving extraction exhaustivity and stability. Then, it constructs atomic TKGs from these facts while employing a dual-time modeling that distinguishes when information is observed from when it is valid. The resulting atomic TKGs are subsequently merged in parallel. Empirical evaluations demonstrate that ATOM achieves ~18% higher exhaustivity, ~17% better stability, and over 90% latency reduction compared to baseline methods, demonstrating a strong scalability potential for dynamic TKG construction.
☆ Open Multimodal Retrieval-Augmented Factual Image Generation
Large Multimodal Models (LMMs) have achieved remarkable progress in generating photorealistic and prompt-aligned images, but they often produce outputs that contradict verifiable knowledge, especially when prompts involve fine-grained attributes or time-sensitive events. Conventional retrieval-augmented approaches attempt to address this issue by introducing external information, yet they are fundamentally incapable of grounding generation in accurate and evolving knowledge due to their reliance on static sources and shallow evidence integration. To bridge this gap, we introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG), a new task that requires both visual realism and factual grounding. ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts to guide generation. To support systematic evaluation, we build FIG-Eval, a benchmark spanning ten categories across perceptual, compositional, and temporal dimensions. Experiments demonstrate that ORIG substantially improves factual consistency and overall image quality over strong baselines, highlighting the potential of open multimodal retrieval for factual image generation.
comment: Preprint
♻ ☆ Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems EMNLP 2025
Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendation systems to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present a comprehensive set of insights for training and deploying small language models (SLMs) that deliver high performance for a variety of industry use cases. We focus on two key techniques: (1) knowledge distillation and (2) model compression via structured pruning and quantization. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training/serving costs and latency. We detail the impact of these techniques on a variety of use cases in a large professional social network platform and share deployment lessons, including hardware optimization strategies that improve speed and throughput for both predictive and reasoning-based applications in Recommendation Systems.
comment: Accepted to EMNLP 2025 Industry Track - Oral Presentation
♻ ☆ A Simple but Effective Elaborative Query Reformulation Approach for Natural Language Recommendation
Natural Language (NL) recommender systems aim to retrieve relevant items from free-form user queries and item descriptions. Existing systems often rely on dense retrieval (DR), which struggles to interpret challenging queries that express broad (e.g., "cities for youth friendly activities") or indirect (e.g., "cities for a high school graduation trip") user intents. While query reformulation (QR) has been widely adopted to improve such systems, existing QR methods tend to focus only on expanding the range of query subtopics (breadth) or elaborating on the potential meaning of a query (depth), but not both. In this paper, we propose EQR (Elaborative Subtopic Query Reformulation), a large language model-based QR method that combines both breadth and depth by generating potential query subtopics with information-rich elaborations. We also introduce three new natural language recommendation benchmarks in travel, hotel, and restaurant domains to establish evaluation of NL recommendation with challenging queries. Experiments show EQR substantially outperforms state-of-the-art QR methods in various evaluation metrics, highlighting that a simple yet effective QR approach can significantly improve NL recommender systems for queries with broad and indirect user intents.
comment: 11 pages, 5 figures
♻ ☆ Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals NeurIPS 2025
Retrieval-augmented generation (RAG) has shown impressive capabilities in mitigating hallucinations in large language models (LLMs). However, LLMs struggle to maintain consistent reasoning when exposed to misleading or conflicting evidence, especially in real-world domains such as politics, where information is polarized or selectively framed. Mainstream RAG benchmarks evaluate models under clean retrieval settings, where systems generate answers from gold-standard documents, or under synthetically perturbed settings, where documents are artificially injected with noise. These assumptions fail to reflect real-world conditions, often leading to an overestimation of RAG system performance. To address this gap, we introduce RAGuard, the first benchmark to evaluate the robustness of RAG systems against misleading retrievals. Unlike prior benchmarks that rely on synthetic noise, our fact-checking dataset captures naturally occurring misinformation by constructing its retrieval corpus from Reddit discussions. It categorizes retrieved evidence into three types: supporting, misleading, and unrelated, providing a realistic and challenging testbed for assessing how well RAG systems navigate different types of evidence. Our experiments reveal that, when exposed to potentially misleading retrievals, all tested LLM-powered RAG systems perform worse than their zero-shot baselines (i.e., no retrieval at all), while human annotators consistently perform better, highlighting LLMs' susceptibility to noisy environments. To our knowledge, RAGuard is the first benchmark to systematically assess the robustness of the RAG against misleading evidence. We expect this benchmark to drive future research toward improving RAG systems beyond idealized datasets, making them more reliable for real-world applications. The dataset is available at https://huggingface.co/datasets/UCSC-IRKM/RAGuard.
comment: Advances in Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming NeurIPS 2025
Competitive programming benchmarks are widely used in scenarios such as programming contests and large language model assessments. However, the growing presence of duplicate or highly similar problems raises concerns not only about competition fairness, but also about the validity of competitive programming as a benchmark for model evaluation. In this paper, we propose a new problem, similar question retrieval, to tackle this issue. Due to the lack of both data and models, solving this problem is challenging. To this end, we introduce CPRet, a retrieval-oriented benchmark suite for competitive programming, covering four retrieval tasks: two code-centric (i.e., Text-to-Code, Code-to-Code) and two newly proposed problem-centric tasks (i.e., Problem-to-Duplicate, Simplified-to-Full) built from a combination of automatically crawled problem-solution data and manually curated annotations. Our contribution includes both high-quality training data and temporally separated test sets for reliable evaluation. Besides, we further develop two task-specialized retrievers based on this dataset: CPRetriever-Code, trained with a novel Group-InfoNCE loss for problem-code alignment, and CPRetriever-Prob, fine-tuned for identifying problem-level similarity. Both models achieve strong results and are open-sourced for local use. Finally, we analyze LiveCodeBench and find that high-similarity problems inflate model pass rates and reduce differentiation, underscoring the need for similarity-aware evaluation in future benchmarks. Github: https://github.com/coldchair/CPRet Online Demo: https://www.cpret.online/
comment: Accepted by NeurIPS 2025 Dataset and Benchmark Track
Multimedia 11
☆ LLM-based Fusion of Multi-modal Features for Commercial Memorability Prediction
This paper addresses the prediction of commercial (brand) memorability as part of "Subtask 2: Commercial/Ad Memorability" within the "Memorability: Predicting movie and commercial memorability" task at the MediaEval 2025 workshop competition. We propose a multimodal fusion system with a Gemma-3 LLM backbone that integrates pre-computed visual (ViT) and textual (E5) features by multi-modal projections. The model is adapted using Low-Rank Adaptation (LoRA). A heavily-tuned ensemble of gradient boosted trees serves as a baseline. A key contribution is the use of LLM-generated rationale prompts, grounded in expert-derived aspects of memorability, to guide the fusion model. The results demonstrate that the LLM-based system exhibits greater robustness and generalization performance on the final test set, compared to the baseline. The paper's codebase can be found at https://github.com/dsgt-arc/mediaeval-2025-memorability
☆ Region-Adaptive Learned Hierarchical Encoding for 3D Gaussian Splatting Data
We introduce Region-Adaptive Learned Hierarchical Encoding (RALHE) for 3D Gaussian Splatting (3DGS) data. While 3DGS has recently become popular for novel view synthesis, the size of trained models limits its deployment in bandwidth-constrained applications such as volumetric media streaming. To address this, we propose a learned hierarchical latent representation that builds upon the principles of "overfitted" learned image compression (e.g., Cool-Chic and C3) to efficiently encode 3DGS attributes. Unlike images, 3DGS data have irregular spatial distributions of Gaussians (geometry) and consist of multiple attributes (signals) defined on the irregular geometry. Our codec is designed to account for these differences between images and 3DGS. Specifically, we leverage the octree structure of the voxelized 3DGS geometry to obtain a hierarchical multi-resolution representation. Our approach overfits latents to each Gaussian attribute under a global rate constraint. These latents are decoded independently through a lightweight decoder network. To estimate the bitrate during training, we employ an autoregressive probability model that leverages octree-derived contexts from the 3D point structure. The multi-resolution latents, decoder, and autoregressive entropy coding networks are jointly optimized for each Gaussian attribute. Experiments demonstrate that the proposed RALHE compression framework achieves a rendering PSNR gain of up to 2dB at low bitrates (less than 1 MB) compared to the baseline 3DGS compression methods.
comment: 10 Pages, 5 Figures
☆ Understanding What Is Not Said:Referring Remote Sensing Image Segmentation with Scarce Expressions
Referring Remote Sensing Image Segmentation (RRSIS) aims to segment instances in remote sensing images according to referring expressions. Unlike Referring Image Segmentation on general images, acquiring high-quality referring expressions in the remote sensing domain is particularly challenging due to the prevalence of small, densely distributed objects and complex backgrounds. This paper introduces a new learning paradigm, Weakly Referring Expression Learning (WREL) for RRSIS, which leverages abundant class names as weakly referring expressions together with a small set of accurate ones to enable efficient training under limited annotation conditions. Furthermore, we provide a theoretical analysis showing that mixed-referring training yields a provable upper bound on the performance gap relative to training with fully annotated referring expressions, thereby establishing the validity of this new setting. We also propose LRB-WREL, which integrates a Learnable Reference Bank (LRB) to refine weakly referring expressions through sample-specific prompt embeddings that enrich coarse class-name inputs. Combined with a teacher-student optimization framework using dynamically scheduled EMA updates, LRB-WREL stabilizes training and enhances cross-modal generalization under noisy weakly referring supervision. Extensive experiments on our newly constructed benchmark with varying weakly referring data ratios validate both the theoretical insights and the practical effectiveness of WREL and LRB-WREL, demonstrating that they can approach or even surpass models trained with fully annotated referring expressions.
☆ TVMC: Time-Varying Mesh Compression via Multi-Stage Anchor Mesh Generation
Time-varying meshes, characterized by dynamic connectivity and varying vertex counts, hold significant promise for applications such as augmented reality. However, their practical utilization remains challenging due to the substantial data volume required for high-fidelity representation. While various compression methods attempt to leverage temporal redundancy between consecutive mesh frames, most struggle with topological inconsistency and motion-induced artifacts. To address these issues, we propose Time-Varying Mesh Compression (TVMC), a novel framework built on multi-stage coarse-to-fine anchor mesh generation for inter-frame prediction. Specifically, the anchor mesh is progressively constructed in three stages: initial, coarse, and fine. The initial anchor mesh is obtained through fast topology alignment to exploit temporal coherence. A Kalman filter-based motion estimation module then generates a coarse anchor mesh by accurately compensating inter-frame motions. Subsequently, a Quadric Error Metric-based refinement step optimizes vertex positions to form a fine anchor mesh with improved geometric fidelity. Based on the refined anchor mesh, the inter-frame motions relative to the reference base mesh are encoded, while the residual displacements between the subdivided fine anchor mesh and the input mesh are adaptively quantized and compressed. This hierarchical strategy preserves consistent connectivity and high-quality surface approximation, while achieving an efficient and compact representation of dynamic geometry. Extensive experiments on standard MPEG dynamic mesh sequences demonstrate that TVMC achieves state-of-the-art compression performance. Compared to the latest V-DMC standard, it delivers a significant BD-rate gain of 10.2% ~ 16.9%, while preserving high reconstruction quality. The code is available at https://github.com/H-Huang774/TVMC.
☆ DeepfakeBench-MM: A Comprehensive Benchmark for Multimodal Deepfake Detection
The misuse of advanced generative AI models has resulted in the widespread proliferation of falsified data, particularly forged human-centric audiovisual content, which poses substantial societal risks (e.g., financial fraud and social instability). In response to this growing threat, several works have preliminarily explored countermeasures. However, the lack of sufficient and diverse training data, along with the absence of a standardized benchmark, hinder deeper exploration. To address this challenge, we first build Mega-MMDF, a large-scale, diverse, and high-quality dataset for multimodal deepfake detection. Specifically, we employ 21 forgery pipelines through the combination of 10 audio forgery methods, 12 visual forgery methods, and 6 audio-driven face reenactment methods. Mega-MMDF currently contains 0.1 million real samples and 1.1 million forged samples, making it one of the largest and most diverse multimodal deepfake datasets, with plans for continuous expansion. Building on it, we present DeepfakeBench-MM, the first unified benchmark for multimodal deepfake detection. It establishes standardized protocols across the entire detection pipeline and serves as a versatile platform for evaluating existing methods as well as exploring novel approaches. DeepfakeBench-MM currently supports 5 datasets and 11 multimodal deepfake detectors. Furthermore, our comprehensive evaluations and in-depth analyses uncover several key findings from multiple perspectives (e.g., augmentation, stacked forgery). We believe that DeepfakeBench-MM, together with our large-scale Mega-MMDF, will serve as foundational infrastructures for advancing multimodal deepfake detection.
comment: Preprint
☆ STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models
Object state recognition aims to identify the specific condition of objects, such as their positional states (e.g., open or closed) and functional states (e.g., on or off). While recent Vision-Language Models (VLMs) are capable of performing a variety of multimodal tasks, it remains unclear how precisely they can identify object states. To alleviate this issue, we introduce the STAte and Transition UnderStanding Benchmark (STATUS Bench), the first benchmark for rigorously evaluating the ability of VLMs to understand subtle variations in object states in diverse situations. Specifically, STATUS Bench introduces a novel evaluation scheme that requires VLMs to perform three tasks simultaneously: object state identification (OSI), image retrieval (IR), and state change identification (SCI). These tasks are defined over our fully hand-crafted dataset involving image pairs, their corresponding object state descriptions and state change descriptions. Furthermore, we introduce a large-scale training dataset, namely STATUS Train, which consists of 13 million semi-automatically created descriptions. This dataset serves as the largest resource to facilitate further research in this area. In our experiments, we demonstrate that STATUS Bench enables rigorous consistency evaluation and reveal that current state-of-the-art VLMs still significantly struggle to capture subtle object state distinctions. Surprisingly, under the proposed rigorous evaluation scheme, most open-weight VLMs exhibited chance-level zero-shot performance. After fine-tuning on STATUS Train, Qwen2.5-VL achieved performance comparable to Gemini 2.0 Flash. These findings underscore the necessity of STATUS Bench and Train for advancing object state recognition in VLM research.
☆ Learning Event-guided Exposure-agnostic Video Frame Interpolation via Adaptive Feature Blending BMVC2025
Exposure-agnostic video frame interpolation (VFI) is a challenging task that aims to recover sharp, high-frame-rate videos from blurry, low-frame-rate inputs captured under unknown and dynamic exposure conditions. Event cameras are sensors with high temporal resolution, making them especially advantageous for this task. However, existing event-guided methods struggle to produce satisfactory results on severely low-frame-rate blurry videos due to the lack of temporal constraints. In this paper, we introduce a novel event-guided framework for exposure-agnostic VFI, addressing this limitation through two key components: a Target-adaptive Event Sampling (TES) and a Target-adaptive Importance Mapping (TIM). Specifically, TES samples events around the target timestamp and the unknown exposure time to better align them with the corresponding blurry frames. TIM then generates an importance map that considers the temporal proximity and spatial relevance of consecutive features to the target. Guided by this map, our framework adaptively blends consecutive features, allowing temporally aligned features to serve as the primary cues while spatially relevant ones offer complementary support. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our approach in exposure-agnostic VFI scenarios.
comment: Accepted for BMVC2025
♻ ☆ MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance ICCV 2025
Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Upon this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods only support trajectory control in a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality. Furthermore, we present MagicData, a large-scale trajectory-controlled video dataset, along with an automated pipeline for annotation and filtering. We also introduce MagicBench, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics. Our project page are publicly available at https://quanhaol.github.io/magicmotion-site.
comment: Accepted by ICCV 2025
♻ ☆ Principled Multimodal Representation Learning
Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities to improve multimodal understanding. Traditional methods often depend on pairwise contrastive learning, which relies on a predefined anchor modality, restricting alignment across all modalities. Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain, such as limitations imposed by fixed anchor points and instability arising from optimizing the product of singular values. To address the challenges, in this paper, we propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities without anchor dependency in a more stable manner. Specifically, grounded in the theoretical insight that full alignment corresponds to a rank-1 Gram matrix, PMRL optimizes the dominant singular value of the representation matrix to align modalities along a shared leading direction. We propose a softmax-based loss function that treats singular values as logits to prioritize the largest singular value. Besides, instance-wise contrastive regularization on the leading eigenvectors maintains inter-instance separability and prevents representation collapse. Extensive experiments across diverse tasks demonstrate PMRL's superiority compared to baseline methods. The source code will be publicly available.
comment: Corrected typos and updated experimental results. 32 pages, 9 figures, 10 tables
♻ ☆ NVS-SQA: Exploring Self-Supervised Quality Representation Learning for Neurally Synthesized Scenes without References
Neural View Synthesis (NVS), such as NeRF and 3D Gaussian Splatting, effectively creates photorealistic scenes from sparse viewpoints, typically evaluated by quality assessment methods like PSNR, SSIM, and LPIPS. However, these full-reference methods, which compare synthesized views to reference views, may not fully capture the perceptual quality of neurally synthesized scenes (NSS), particularly due to the limited availability of dense reference views. Furthermore, the challenges in acquiring human perceptual labels hinder the creation of extensive labeled datasets, risking model overfitting and reduced generalizability. To address these issues, we propose NVS-SQA, a NSS quality assessment method to learn no-reference quality representations through self-supervision without reliance on human labels. Traditional self-supervised learning predominantly relies on the "same instance, similar representation" assumption and extensive datasets. However, given that these conditions do not apply in NSS quality assessment, we employ heuristic cues and quality scores as learning objectives, along with a specialized contrastive pair preparation process to improve the effectiveness and efficiency of learning. The results show that NVS-SQA outperforms 17 no-reference methods by a large margin (i.e., on average 109.5% in SRCC, 98.6% in PLCC, and 91.5% in KRCC over the second best) and even exceeds 16 full-reference methods across all evaluation metrics (i.e., 22.9% in SRCC, 19.1% in PLCC, and 18.6% in KRCC over the second best).
comment: Accepted by TPAMI
♻ ☆ GS-ProCams: Gaussian Splatting-based Projector-Camera Systems
We present GS-ProCams, the first Gaussian Splatting-based framework for projector-camera systems (ProCams). GS-ProCams is not only view-agnostic but also significantly enhances the efficiency of projection mapping (PM) that requires establishing geometric and radiometric mappings between the projector and the camera. Previous CNN-based ProCams are constrained to a specific viewpoint, limiting their applicability to novel perspectives. In contrast, NeRF-based ProCams support view-agnostic projection mapping, however, they require an additional co-located light source and demand significant computational and memory resources. To address this issue, we propose GS-ProCams that employs 2D Gaussian for scene representations, and enables efficient view-agnostic ProCams applications. In particular, we explicitly model the complex geometric and photometric mappings of ProCams using projector responses, the projection surface's geometry and materials represented by Gaussians, and the global illumination component. Then, we employ differentiable physically-based rendering to jointly estimate them from captured multi-view projections. Compared to state-of-the-art NeRF-based methods, our GS-ProCams eliminates the need for additional devices, achieving superior ProCams simulation quality. It also uses only 1/10 of the GPU memory for training and is 900 times faster in inference speed. Please refer to our project page for the code and dataset: https://realqingyue.github.io/GS-ProCams/.
Image and Video Processing 12
☆ Region-Adaptive Learned Hierarchical Encoding for 3D Gaussian Splatting Data
We introduce Region-Adaptive Learned Hierarchical Encoding (RALHE) for 3D Gaussian Splatting (3DGS) data. While 3DGS has recently become popular for novel view synthesis, the size of trained models limits its deployment in bandwidth-constrained applications such as volumetric media streaming. To address this, we propose a learned hierarchical latent representation that builds upon the principles of "overfitted" learned image compression (e.g., Cool-Chic and C3) to efficiently encode 3DGS attributes. Unlike images, 3DGS data have irregular spatial distributions of Gaussians (geometry) and consist of multiple attributes (signals) defined on the irregular geometry. Our codec is designed to account for these differences between images and 3DGS. Specifically, we leverage the octree structure of the voxelized 3DGS geometry to obtain a hierarchical multi-resolution representation. Our approach overfits latents to each Gaussian attribute under a global rate constraint. These latents are decoded independently through a lightweight decoder network. To estimate the bitrate during training, we employ an autoregressive probability model that leverages octree-derived contexts from the 3D point structure. The multi-resolution latents, decoder, and autoregressive entropy coding networks are jointly optimized for each Gaussian attribute. Experiments demonstrate that the proposed RALHE compression framework achieves a rendering PSNR gain of up to 2dB at low bitrates (less than 1 MB) compared to the baseline 3DGS compression methods.
comment: 10 Pages, 5 Figures
☆ Understanding What Is Not Said:Referring Remote Sensing Image Segmentation with Scarce Expressions
Referring Remote Sensing Image Segmentation (RRSIS) aims to segment instances in remote sensing images according to referring expressions. Unlike Referring Image Segmentation on general images, acquiring high-quality referring expressions in the remote sensing domain is particularly challenging due to the prevalence of small, densely distributed objects and complex backgrounds. This paper introduces a new learning paradigm, Weakly Referring Expression Learning (WREL) for RRSIS, which leverages abundant class names as weakly referring expressions together with a small set of accurate ones to enable efficient training under limited annotation conditions. Furthermore, we provide a theoretical analysis showing that mixed-referring training yields a provable upper bound on the performance gap relative to training with fully annotated referring expressions, thereby establishing the validity of this new setting. We also propose LRB-WREL, which integrates a Learnable Reference Bank (LRB) to refine weakly referring expressions through sample-specific prompt embeddings that enrich coarse class-name inputs. Combined with a teacher-student optimization framework using dynamically scheduled EMA updates, LRB-WREL stabilizes training and enhances cross-modal generalization under noisy weakly referring supervision. Extensive experiments on our newly constructed benchmark with varying weakly referring data ratios validate both the theoretical insights and the practical effectiveness of WREL and LRB-WREL, demonstrating that they can approach or even surpass models trained with fully annotated referring expressions.
☆ Atlas Urban Index: A VLM-Based Approach for Spatially and Temporally Calibrated Urban Development Monitoring KDD
We introduce the {\em Atlas Urban Index} (AUI), a metric for measuring urban development computed using Sentinel-2 \citep{spoto2012sentinel2} satellite imagery. Existing approaches, such as the {\em Normalized Difference Built-up Index} (NDBI), often struggle to accurately capture urban development due to factors like atmospheric noise, seasonal variation, and cloud cover. These limitations hinder large-scale monitoring of human development and urbanization. To address these challenges, we propose an approach that leverages {\em Vision-Language Models }(VLMs) to provide a development score for regions. Specifically, we collect a time series of Sentinel-2 images for each region. Then, we further process the images within fixed time windows to get an image with minimal cloud cover, which serves as the representative image for that time window. To ensure consistent scoring, we adopt two strategies: (i) providing the VLM with a curated set of reference images representing different levels of urbanization, and (ii) supplying the most recent past image to both anchor temporal consistency and mitigate cloud-related noise in the current image. Together, these components enable AUI to overcome the challenges of traditional urbanization indices and produce more reliable and stable development scores. Our qualitative experiments on Bangalore suggest that AUI outperforms standard indices such as NDBI.
comment: An abridged version of this paper will be presented at and appear in the Proceedings of ACM IKDD CODS 2025
☆ Approximate Signed Multiplier with Sign-Focused Compressor for Edge Detection Applications
This paper presents an approximate signed multiplier architecture that incorporates a sign-focused compressor, specifically designed for edge detection applications in machine learning and signal processing. The multiplier incorporates two types of sign-focused compressors: A + B + C + 1 and A + B + C + D + 1. Both exact and approximate compressor designs are utilized, with a focus on efficiently handling constant value "1" and negative partial products, which frequently appear in the partial product matrices of signed multipliers. To further enhance efficiency, the lower N - 1 columns of the partial product matrix are truncated, followed by an error compensation mechanism. Experimental results show that the proposed 8-bit approximate multiplier achieves a 29.21% reduction in power delay product (PDP) and a 14.39% reduction in power compared to the best of existing multipliers. The proposed multiplier is integrated into a custom convolution layer and performs edge detection, demonstrating its practical utility in real-world applications.
comment: 15 pages
☆ TVMC: Time-Varying Mesh Compression via Multi-Stage Anchor Mesh Generation
Time-varying meshes, characterized by dynamic connectivity and varying vertex counts, hold significant promise for applications such as augmented reality. However, their practical utilization remains challenging due to the substantial data volume required for high-fidelity representation. While various compression methods attempt to leverage temporal redundancy between consecutive mesh frames, most struggle with topological inconsistency and motion-induced artifacts. To address these issues, we propose Time-Varying Mesh Compression (TVMC), a novel framework built on multi-stage coarse-to-fine anchor mesh generation for inter-frame prediction. Specifically, the anchor mesh is progressively constructed in three stages: initial, coarse, and fine. The initial anchor mesh is obtained through fast topology alignment to exploit temporal coherence. A Kalman filter-based motion estimation module then generates a coarse anchor mesh by accurately compensating inter-frame motions. Subsequently, a Quadric Error Metric-based refinement step optimizes vertex positions to form a fine anchor mesh with improved geometric fidelity. Based on the refined anchor mesh, the inter-frame motions relative to the reference base mesh are encoded, while the residual displacements between the subdivided fine anchor mesh and the input mesh are adaptively quantized and compressed. This hierarchical strategy preserves consistent connectivity and high-quality surface approximation, while achieving an efficient and compact representation of dynamic geometry. Extensive experiments on standard MPEG dynamic mesh sequences demonstrate that TVMC achieves state-of-the-art compression performance. Compared to the latest V-DMC standard, it delivers a significant BD-rate gain of 10.2% ~ 16.9%, while preserving high reconstruction quality. The code is available at https://github.com/H-Huang774/TVMC.
☆ Learning Event-guided Exposure-agnostic Video Frame Interpolation via Adaptive Feature Blending BMVC2025
Exposure-agnostic video frame interpolation (VFI) is a challenging task that aims to recover sharp, high-frame-rate videos from blurry, low-frame-rate inputs captured under unknown and dynamic exposure conditions. Event cameras are sensors with high temporal resolution, making them especially advantageous for this task. However, existing event-guided methods struggle to produce satisfactory results on severely low-frame-rate blurry videos due to the lack of temporal constraints. In this paper, we introduce a novel event-guided framework for exposure-agnostic VFI, addressing this limitation through two key components: a Target-adaptive Event Sampling (TES) and a Target-adaptive Importance Mapping (TIM). Specifically, TES samples events around the target timestamp and the unknown exposure time to better align them with the corresponding blurry frames. TIM then generates an importance map that considers the temporal proximity and spatial relevance of consecutive features to the target. Guided by this map, our framework adaptively blends consecutive features, allowing temporally aligned features to serve as the primary cues while spatially relevant ones offer complementary support. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our approach in exposure-agnostic VFI scenarios.
comment: Accepted for BMVC2025
☆ Structure Aware Image Downscaling
Image downscaling is one of the key operations in recent display technology and visualization tools. By this process, the dimension of an image is reduced, aiming to preserve structural integrity and visual fidelity. In this paper, we propose a new image downscaling method which is built on the core ideas of image filtering and edge detection. In particular, we present a structure-informed downscaling algorithm that maintains fine details through edge-aware processing. The proposed method comprises three steps: (i) edge map computation, (ii) edge-guided interpolation, and (iii) texture enhancement. To faithfully retain the strong structures in an image, we first compute the edge maps by applying an efficient edge detection operator. This is followed by an edge-guided interpolation to preserve fine details after resizing. Finally, we fuse local texture enriched component of the original image to the interpolated one to restore high-frequency information. By integrating edge information with adaptive filtering, our approach effectively minimizes artifacts while retaining crucial image features. To demonstrate the effective downscaling capability of our proposed method, we validate on four datasets: DIV2K, BSD100, Urban100, and RealSR. For downscaling by 4x, our method could achieve as high as 39.07 dB PSNR on the DIV2K dataset and 38.71 dB on the RealSR dataset. Extensive experimental results confirm that the proposed image downscaling method is capable of achieving superior performance in terms of both visual quality and performance metrics with reference to recent methods. Most importantly, the downscaled images by our method do not suffer from edge blurring and texture loss, unlike many existing ones.
comment: 11 pages, 1 table and 6 figures
☆ Low-Light Image Enhancement Using Gamma Learning And Attention-Enabled Encoder-Decoder Networks
Images acquired in low-light environments present significant obstacles for computer vision systems and human perception, especially for applications requiring accurate object recognition and scene analysis. Such images typically manifest multiple quality issues: amplified noise, inadequate scene illumination, contrast reduction, color distortion, and loss of details. While recent deep learning methods have shown promise, developing simple and efficient frameworks that naturally integrate global illumination adjustment with local detail refinement continues to be an important objective. To this end, we introduce a dual-stage deep learning architecture that combines adaptive gamma correction with attention-enhanced refinement to address these fundamental limitations. The first stage uses an Adaptive Gamma Correction Module (AGCM) to learn suitable gamma values for each pixel based on both local and global cues, producing a brightened intermediate output. The second stage applies an encoder-decoder deep network with Convolutional Block Attention Modules (CBAM) to this brightened image, in order to restore finer details. We train the network using a composite loss that includes L1 reconstruction, SSIM, total variation, color constancy, and gamma regularization terms to balance pixel accuracy with visual quality. Experiments on LOL-v1, LOL-v2 real, and LOL-v2 synthetic datasets show our method reaches PSNR of upto 29.96 dB and upto 0.9458 SSIM, outperforming existing approaches. Additional tests on DICM, LIME, MEF, and NPE datasets using NIQE, BRISQUE, and UNIQUE metrics confirm better perceptual quality with fewer artifacts, achieving the best NIQE scores across all datasets. Our GAtED (Gamma learned and Attention-enabled Encoder-Decoder) method effectively handles both global illumination adjustment and local detail enhancement, offering a practical solution for low-light enhancement.
comment: 10 pages, 4 figures, and 2 Tables
♻ ☆ Are Pixel-Wise Metrics Reliable for Sparse-View Computed Tomography Reconstruction? NeurIPS 2025
Widely adopted evaluation metrics for sparse-view CT reconstruction--such as Structural Similarity Index Measure and Peak Signal-to-Noise Ratio--prioritize pixel-wise fidelity but often fail to capture the completeness of critical anatomical structures, particularly small or thin regions that are easily missed. To address this limitation, we propose a suite of novel anatomy-aware evaluation metrics designed to assess structural completeness across anatomical structures, including large organs, small organs, intestines, and vessels. Building on these metrics, we introduce CARE, a Completeness-Aware Reconstruction Enhancement framework that incorporates structural penalties during training to encourage anatomical preservation of significant structures. CARE is model-agnostic and can be seamlessly integrated into analytical, implicit, and generative methods. When applied to these methods, CARE substantially improves structural completeness in CT reconstructions, achieving up to +32% improvement for large organs, +22% for small organs, +40% for intestines, and +36% for vessels.
comment: NeurIPS 2025
♻ ☆ Extended Depth-of-Field Lensless Imaging using an Optimized Radial Mask
The freedom of design of coded masks used by mask-based lensless cameras is an advantage these systems have when compared to lens-based ones. We leverage this freedom of design to propose a shape-preserving optimization scheme for a radial-type amplitude coded mask. Due to the depth-independency of the radial mask's point spread function, they can be used for extending the effective depth of field (DOF) of a lensless imaging system. In this paper we optimized a coded mask for improved frequency response, while retaining its radial characteristics and therefore extended-DOF capabilities. We show that our optimized radial mask achieved better overall frequency response when compared to naive implementations of a radial mask. We also quantitatively and qualitatively demonstrated the extended DOF imaging achieved by our optimized radial mask in simulations by comparing it to different non-radial coded masks. Finally, we built a prototype camera to validate the extended DOF capabilities of our coded mask in real scenarios.
comment: 12 pages, 9 figures, IEEE Transactions on computational imaging
♻ ☆ NVS-SQA: Exploring Self-Supervised Quality Representation Learning for Neurally Synthesized Scenes without References
Neural View Synthesis (NVS), such as NeRF and 3D Gaussian Splatting, effectively creates photorealistic scenes from sparse viewpoints, typically evaluated by quality assessment methods like PSNR, SSIM, and LPIPS. However, these full-reference methods, which compare synthesized views to reference views, may not fully capture the perceptual quality of neurally synthesized scenes (NSS), particularly due to the limited availability of dense reference views. Furthermore, the challenges in acquiring human perceptual labels hinder the creation of extensive labeled datasets, risking model overfitting and reduced generalizability. To address these issues, we propose NVS-SQA, a NSS quality assessment method to learn no-reference quality representations through self-supervision without reliance on human labels. Traditional self-supervised learning predominantly relies on the "same instance, similar representation" assumption and extensive datasets. However, given that these conditions do not apply in NSS quality assessment, we employ heuristic cues and quality scores as learning objectives, along with a specialized contrastive pair preparation process to improve the effectiveness and efficiency of learning. The results show that NVS-SQA outperforms 17 no-reference methods by a large margin (i.e., on average 109.5% in SRCC, 98.6% in PLCC, and 91.5% in KRCC over the second best) and even exceeds 16 full-reference methods across all evaluation metrics (i.e., 22.9% in SRCC, 19.1% in PLCC, and 18.6% in KRCC over the second best).
comment: Accepted by TPAMI
♻ ☆ LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models
Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully compressed language models to as low as 1-bit precision without significant performance loss, its effectiveness for multimodal LLMs (MLLMs) remains relatively unexplored. In this paper, we present the first study on ultra-low bit (<4-bit) quantization for multimodal LLMs. Our analysis reveals that multimodal tokens and intermediate layer activations produced by them exhibit significantly higher statistical variance and entropy compared to text tokens, making them less tolerant to ultra-low bit quantization. However, the activation distributions of multimodal tokens varies significantly over different layers, with some layers having lower entropy activation distributions. We empirically show that such layers in these models can better tolerate ultra-low bit quantization. Building on these insights, we propose a novel strategy for MLLM quantization, LUQ: Layerwise Ultra-Low Bit Quantization, which selectively applies ultra-low bit quantization to layers that are more resilient to it. Additionally, we also show that using a mix of multimodal tokens (image and text) for PTQ boosts VQA performance in the ultra-low bit regime. We evaluate our method on LLaVA-1.5 and Qwen-2.5-VL across 9 popular VQA benchmarks. The resulting LUQ models use 40% and 31% less memory than their 4-bit counterparts, respectively, while exhibiting a performance degradation of less than 10% on the MME benchmark.
Systems and Control 19
☆ Transmission Neural Networks: Approximate Receding Horizon Control for Virus Spread on Networks
Transmission Neural Networks (TransNNs) pro- posed by Gao and Caines (2022) serve as both virus spread models over networks and neural network models with tuneable activation functions. This paper establishes that TransNNs provide upper bounds on the infection probability generated from the associated Markovian stochastic Susceptible-Infected- Susceptible (SIS) model with 2^n state configurations where n is the number of nodes in the network, and can be employed as an approximate model for the latter. Based on such an approximation, a TransNN-based receding horizon control approach for mitigating virus spread is proposed and we demonstrate that it allows significant computational savings compared to the dynamic programming solution to Markovian SIS model with 2^n state configurations, as well as providing less conservative control actions compared to the TransNN- based optimal control. Finally, numerical comparisons among (a) dynamic programming solutions for the Markovian SIS model, (b) TransNN-based optimal control and (c) the proposed TransNN-based receding horizon control are presented.
☆ Analytical Swarm Chemistry: Characterization and Analysis of Emergent Swarm Behaviors
Swarm robotics has potential for a wide variety of applications, but real-world deployments remain rare due to the difficulty of predicting emergent behaviors arising from simple local interactions. Traditional engineering approaches design controllers to achieve desired macroscopic outcomes under idealized conditions, while agent-based and artificial life studies explore emergent phenomena in a bottom-up, exploratory manner. In this work, we introduce Analytical Swarm Chemistry, a framework that integrates concepts from engineering, agent-based and artificial life research, and chemistry. This framework combines macrostate definitions with phase diagram analysis to systematically explore how swarm parameters influence emergent behavior. Inspired by concepts from chemistry, the framework treats parameters like thermodynamic variables, enabling visualization of regions in parameter space that give rise to specific behaviors. Applying this framework to agents with minimally viable capabilities, we identify sufficient conditions for behaviors such as milling and diffusion and uncover regions of the parameter space that reliably produce these behaviors. Preliminary validation on real robots demonstrates that these regions correspond to observable behaviors in practice. By providing a principled, interpretable approach, this framework lays the groundwork for predictable and reliable emergent behavior in real-world swarm systems.
comment: 9 pages, 8 figures, 1 table
☆ Residual Bias Compensation Filter for Physics-Based SOC Estimation in Lithium Iron Phosphate Batteries
This paper addresses state of charge (SOC) estimation for lithium iron phosphate (LFP) batteries, where the relatively flat open-circuit voltage (OCV-SOC) characteristic reduces observability. A residual bias compensation dual extended Kalman filter (RBC-DEKF) is developed. Unlike conventional bias compensation methods that treat the bias as an augmented state within a single filter, the proposed dual-filter structure decouples residual bias estimation from electrochemical state estimation. One EKF estimates the system states of a control-oriented parameter-grouped single particle model with thermal effects, while the other EKF estimates a residual bias that continuously corrects the voltage observation equation, thereby refining the model-predicted voltage in real time. Unlike bias-augmented single-filter schemes that enlarge the covariance coupling, the decoupled bias estimator refines the voltage observation without perturbing electrochemical state dynamics. Validation is conducted on an LFP cell from a public dataset under three representative operating conditions: US06 at 0 degC, DST at 25 degC, and FUDS at 50 degC. Compared with a conventional EKF using the same model and identical state filter settings, the proposed method reduces the average SOC RMSE from 3.75% to 0.20% and the voltage RMSE between the filtered model voltage and the measured voltage from 32.8 mV to 0.8 mV. The improvement is most evident in the mid-SOC range where the OCV-SOC curve is flat, confirming that residual bias compensation significantly enhances accuracy for model-based SOC estimation of LFP batteries across a wide temperature range.
comment: This paper has been submitted to the European Control Conference (ECC) 2026 for consideration. This is the authors' version of the work, made available for early dissemination. The copyright remains with the authors. The final version, if accepted, will appear in the ECC 2026 proceedings
☆ Ellipsoidal Set-Theoretic Design of Robust Safety Filters for Constrained Linear Systems
This paper presents an ellipsoidal set-theoretic framework for robust safety filter synthesis in constrained linear systems subject to additive bounded disturbances and input constraints. We formulate the safety filter design as a convex linear matrix inequality (LMI) optimization problem that simultaneously computes a robust controlled invariant (RCI) ellipsoidal set and its associated state-feedback control law. The RCI set is characterized as an ellipsoidal set, enabling computational tractability for high-dimensional systems while providing formal safety guarantees. The safety filter employs a smooth mixing strategy between nominal and backup controllers based on distance to the invariant set boundary, facilitating minimal intervention when the system operates safely. The proposed method extends to nonlinear systems by treating nonlinear terms as bounded disturbances with rigorous approximation bounds. Numerical validation on a six-degree-of-freedom quadrotor system demonstrates the filter's effectiveness in maintaining stability under external disturbances and aggressive maneuvers while preserving nominal performance during safe operation. The approach provides a constructive and computationally efficient solution for safety-critical control applications requiring real-time implementation.
☆ Curriculum-Based Iterative Self-Play for Scalable Multi-Drone Racing
The coordination of multiple autonomous agents in high-speed, competitive environments represents a significant engineering challenge. This paper presents CRUISE (Curriculum-Based Iterative Self-Play for Scalable Multi-Drone Racing), a reinforcement learning framework designed to solve this challenge in the demanding domain of multi-drone racing. CRUISE overcomes key scalability limitations by synergistically combining a progressive difficulty curriculum with an efficient self-play mechanism to foster robust competitive behaviors. Validated in high-fidelity simulation with realistic quadrotor dynamics, the resulting policies significantly outperform both a standard reinforcement learning baseline and a state-of-the-art game-theoretic planner. CRUISE achieves nearly double the planner's mean racing speed, maintains high success rates, and demonstrates robust scalability as agent density increases. Ablation studies confirm that the curriculum structure is the critical component for this performance leap. By providing a scalable and effective training methodology, CRUISE advances the development of autonomous systems for dynamic, competitive tasks and serves as a blueprint for future real-world deployment.
comment: 13 pages, 5 figures. This paper is currently under review at the journal Engineering Applications of Artificial Intelligence. Supplementary video: https://drive.google.com/file/d/1k7necen2DgIxaYT2alKK8-b20sE_AyDA/view Source code and models: https://doi.org/10.5281/zenodo.17256943
☆ SPIRAL: Self-Play Incremental Racing Algorithm for Learning in Multi-Drone Competitions
This paper introduces SPIRAL (Self-Play Incremental Racing Algorithm for Learning), a novel approach for training autonomous drones in multi-agent racing competitions. SPIRAL distinctively employs a self-play mechanism to incrementally cultivate complex racing behaviors within a challenging, dynamic environment. Through this self-play core, drones continuously compete against increasingly proficient versions of themselves, naturally escalating the difficulty of competitive interactions. This progressive learning journey guides agents from mastering fundamental flight control to executing sophisticated cooperative multi-drone racing strategies. Our method is designed for versatility, allowing integration with any state-of-the-art Deep Reinforcement Learning (DRL) algorithms within its self-play framework. Simulations demonstrate the significant advantages of SPIRAL and benchmark the performance of various DRL algorithms operating within it. Consequently, we contribute a versatile, scalable, and self-improving learning framework to the field of autonomous drone racing. SPIRAL's capacity to autonomously generate appropriate and escalating challenges through its self-play dynamic offers a promising direction for developing robust and adaptive racing strategies in multi-agent environments. This research opens new avenues for enhancing the performance and reliability of autonomous racing drones in increasingly complex and competitive scenarios.
comment: \c{opyright} 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
☆ Approximate Gradient Coding for Distributed Learning with Heterogeneous Stragglers
In this paper, we propose an optimally structured gradient coding scheme to mitigate the straggler problem in distributed learning. Conventional gradient coding methods often assume homogeneous straggler models or rely on excessive data replication, limiting performance in real-world heterogeneous systems. To address these limitations, we formulate an optimization problem minimizing residual error while ensuring unbiased gradient estimation by explicitly considering individual straggler probabilities. We derive closed-form solutions for optimal encoding and decoding coefficients via Lagrangian duality and convex optimization, and propose data allocation strategies that reduce both redundancy and computation load. We also analyze convergence behavior for $\lambda$-strongly convex and $\mu$-smooth loss functions. Numerical results show that our approach significantly reduces the impact of stragglers and accelerates convergence compared to existing methods.
☆ Smart Sensor Placement: A Correlation-Aware Attribution Framework (CAAF) for Real-world Data Modeling
Optimal sensor placement (OSP) is critical for efficient, accurate monitoring, control, and inference in complex real-world systems. We propose a machine-learning-based feature attribution framework to identify OSP for the prediction of quantities of interest. Feature attribution quantifies input contributions to a model's output; however, it struggles with highly correlated input data often encountered in real-world applications. To address this, we propose a Correlation-Aware Attribution Framework (CAAF), which introduces a clustering step before performing feature attribution to reduce redundancy and enhance generalizability. We first illustrate the core principles of the proposed framework through a series of validation cases, then demonstrate its effectiveness in real-world dynamical systems, such as structural health monitoring, airfoil lift prediction, and wall-normal velocity estimation for turbulent channel flow. The results show that the CAAF outperforms alternative approaches that typically struggle due to the presence of nonlinear dynamics, chaotic behavior, and multi-scale interactions, and enables the effective application of feature attribution for identifying OSP in real-world environments.
☆ Robust Multi-Agent Safety via Tube-Based Tightened Exponential Barrier Functions
This paper presents a constructive framework for synthesizing provably safe controllers for nonlinear multi-agent systems subject to bounded disturbances. The methodology applies to systems representable in Brunovsky canonical form, accommodating arbitrary-order dynamics in multi-dimensional spaces. The central contribution is a method of constraint tightening that formally couples robust error feedback with nominal trajectory planning. The key insight is that the design of an ancillary feedback law, which confines state errors to a robust positively invariant (RPI) tube, simultaneously provides the exact information needed to ensure the safety of the nominal plan. Specifically, the geometry of the resulting RPI tube is leveraged via its support function to derive state-dependent safety margins. These margins are then used to systematically tighten the high relative-degree exponential control barrier function (eCBF) constraints imposed on the nominal planner. This integrated synthesis guarantees that any nominal trajectory satisfying the tightened constraints corresponds to a provably safe trajectory for the true, disturbed system. We demonstrate the practical utility of this formal synthesis method by implementing the planner within a distributed Model Predictive Control (MPC) scheme, which optimizes performance while inheriting the robust safety guarantees.
comment: This work has been submitted to IFAC for possible publication
☆ Functional Uncertainty Classes, Nonparametric Adaptive Contro Functional Uncertainty Classes for Nonparametric Adaptive Control: the Curse of Dimensionality
This paper derives a new class of vector-valued reproducing kernel Hilbert spaces (vRKHS) defined in terms of operator-valued kernels for the representation of functional uncertainty arising in nonparametric adaptive control methods. These are referred to as maneuver or trajectory vRKHS KM in the paper, and they are introduced to address the curse of dimensionality that can arise for some types of nonparametric adaptive control strategies. The maneuver vRKHSs are derived based on the structure of a compact, l-dimensional, smooth Riemannian manifold M that is regularly embedded in the state space X = Rn, where M is assumed to approximately support the ultimate dynamics of the reference system to be tracked.
☆ A Scenario-based Stochastic Model of using BESS-based Virtual Transmission Lines in Day-Ahead Unit Commitment
The rapid increase in renewable energy sources (RES) implementation in the power system creates more severe network congestion, which may reduce grid operation efficiency and cause renewable curtailment. Deterministic optimization for the unit commitment shows that battery energy storage system (BESS)-based Virtual Transmission Line (VTL), as an alternative to physical transmission lines, can offer a quick solution for congestion relief, reduced operational costs, and lowered renewable curtailment. This paper aims to evaluate the benefits of VTL when considering Renewable Energy Sources uncertainty. Particularly, this work proposes a scenario-based stochastic security-constrained unit commitment model considering VTL, referred to as SSCUC-VTL. It incorporates the forecast error of RES into the commitment decision for systems with VTL. The performance of applying the VTL strategy is compared to that of adding a new physical transmission line and a standalone BESS. A case study has been conducted on an enhanced IEEE 24-bus test system. The simulation results demonstrate that VTL provides 23% more operational cost reduction than the physical transmission line, and up to 67% more congestion relief than the standalone BESS in a power system with solar and wind generation.
comment: This paper is to be published in the IEEE PES Asia-Pacific Power and Energy Engineering Conference (APPEEC) 2025
☆ Data-driven Exponential Framing for Pulsive Temporal Patterns without Repetition or Singularity
Extracting pulsive temporal patterns from a small dataset without their repetition or singularity shows significant importance in manufacturing applications but does not sufficiently attract scientific attention. We propose to quantify how long temporal patterns appear without relying on their repetition or singularity, enabling to extract such temporal patterns from a small dataset. Inspired by the celebrated time delay embedding and data-driven Hankel matrix analysis, we introduce a linear dynamical system model on the time-delay coordinates behind the data to derive the discrete-time bases each of which has a distinct exponential decay constant. The derived bases are fitted onto subsequences that are extracted with a sliding window in order to quantify how long patterns are dominant in the set of subsequences. We call the quantification method Data-driven Exponential Framing (DEF). A toy model-based experiment shows that DEF can identify multiple patterns with distinct lengths. DEF is also applied to electric current measurement on a punching machine, showing its possibility to extract multiple patterns from real-world oscillatory data.
comment: 16 pages
☆ Transmission Neural Networks: Approximate Receding Horizon Control for Virus Spread on Networks
Transmission Neural Networks (TransNNs) proposed by Gao and Caines (2022) serve as both virus spread models over networks and neural network models with tuneable activation functions. This paper establishes that TransNNs provide upper bounds on the infection probability generated from the associated Markovian stochastic Susceptible-Infected-Susceptible (SIS) model with 2^n state configurations where n is the number of nodes in the network, and can be employed as an approximate model for the latter. Based on such an approximation, a TransNN-based receding horizon control approach for mitigating virus spread is proposed and we demonstrate that it allows significant computational savings compared to the dynamic programming solution to Markovian SIS model with 2^n state configurations, as well as providing less conservative control actions compared to the TransNN-based optimal control. Finally, numerical comparisons among (a) dynamic programming solutions for the Markovian SIS model, (b) TransNN-based optimal control and (c) the proposed TransNN-based receding horizon control are presented.
♻ ☆ Predictive Reliability Assessment of Distribution Grids with Residential Distributed Energy Resources
Distribution system end users are transforming from passive to active participants, marked by the push towards widespread adoption of edge-level Distributed Energy Resources (DERs). This paper addresses the challenges in distribution system planning arising from these dynamic changes. We introduce a bottom-up probabilistic approach that integrates these edge-level DERs into the reliability evaluation process. Our methodology leverages joint probability distributions to characterize and model the penetration of rooftop photovoltaic (PV) systems and energy storage across a distribution network at the individual residential level. Employing a scenario-based approach, we showcase the application of our probabilistic method using a Monte Carlo Simulation process to assess average system reliability indices and their variations at the user level. To validate our approach, we applied this methodology to the RBTS test system across various adoption scenarios, effectively showcasing the capability of our proposed method in quantifying the variation in end-user reliability indices for each scenario within the distribution system.
comment: Accepted by CSEE Journal of Power and Energy Systems in Oct. 2025
♻ ☆ Stability Criteria and Motor Performance in Delayed Haptic Dyadic Interactions Mediated by Robots
This paper establishes analytical stability criteria for robot-mediated human-human (dyadic) interaction systems, focusing on haptic communication under network-induced time delays. Through frequency-domain analysis supported by numerical simulations, we identify both delay-independent and delay-dependent stability criteria. The delay-independent criterion guarantees stability irrespective of the delay, whereas the delay-dependent criterion is characterised by a maximum tolerable delay before instability occurs. The criteria demonstrate dependence on controller and robot dynamic parameters, where increasing stiffness reduces the maximum tolerable delay in a non-linear manner, thereby heightening system vulnerability. The proposed criteria can be generalised to a wide range of robot-mediated interactions and serve as design guidelines for stable remote dyadic systems. Experiments with robots performing human-like movements further illustrate the correlation between stability and motor performance. The findings of this paper suggest the prerequisites for effective delay-compensation strategies.
♻ ☆ Koopman Eigenfunction-Based Identification and Optimal Nonlinear Control of Turbojet Engine
Gas turbine engines are complex and highly nonlinear dynamical systems. Deriving their physics-based models can be challenging because it requires performance characteristics that are not always available, often leading to many simplifying assumptions. This paper discusses the limitations of conventional experimental methods used to derive component-level and locally linear parameter-varying models, and addresses these issues by employing identification techniques based on data collected from standard engine operation under closed-loop control. The rotor dynamics are estimated using the sparse identification of nonlinear dynamics. Subsequently, the autonomous part of the dynamics is mapped into an optimally constructed Koopman eigenfunction space. This process involves eigenvalue optimization using metaheuristic algorithms and temporal projection, followed by gradient-based eigenfunction identification. The resulting Koopman model is validated against an in-house reference component-level model. A globally optimal nonlinear feedback controller and a Kalman estimator are then designed within the eigenfunction space and compared to traditional and gain-scheduled proportional-integral controllers, as well as a proposed internal model control approach. The eigenmode structure enables targeting individual modes during optimization, leading to improved performance tuning. Results demonstrate that the Koopman-based controller surpasses other benchmark controllers in both reference tracking and disturbance rejection under sea-level and varying flight conditions, due to its global nature.
comment: 34 pages, 29 figures Under review at Springer Nonlinear Dynamics
♻ ☆ Dynamic financial processes identification using sparse regressive reservoir computers
In this document, we present key findings in structured matrix approximation theory, with applications to the regressive representation of dynamic financial processes. Initially, we explore a comprehensive approach involving generic nonlinear time delay embedding for time series data extracted from a financial or economic system under examination. Subsequently, we employ sparse least-squares and structured matrix approximation methods to discern approximate representations of the output coupling matrices. These representations play a pivotal role in establishing the regressive models corresponding to the recursive structures inherent in a given financial system. The document further introduces prototypical algorithms that leverage the aforementioned techniques. These algorithms are demonstrated through applications in approximate identification and predictive simulation of dynamic financial and economic processes, encompassing scenarios that may or may not exhibit chaotic behavior.
comment: The content of this publication represents the opinion of the researchers affiliated with the Department of Statistics and Research, but not the official opinion of the CNBS
♻ ☆ Zero-Shot Trajectory Planning for Signal Temporal Logic Tasks
Signal Temporal Logic (STL) is a powerful specification language for describing complex temporal behaviors of continuous signals, making it well-suited for high-level robotic task descriptions. However, generating executable plans for STL tasks is challenging, as it requires consideration of the coupling between the task specification and the system dynamics. Existing approaches either follow a model-based setting that explicitly requires knowledge of the system dynamics or adopt a task-oriented data-driven approach to learn plans for specific tasks. In this work, we address the problem of generating executable STL plans for systems with unknown dynamics. We propose a hierarchical planning framework that enables zero-shot generalization to new STL tasks by leveraging only task-agnostic trajectory data during offline training. The framework consists of three key components: (i) decomposing the STL specification into several progresses and time constraints, (ii) searching for timed waypoints that satisfy all progresses under time constraints, and (iii) generating trajectory segments using a pre-trained diffusion model and stitching them into complete trajectories. We formally prove that our method guarantees STL satisfaction, and simulation results demonstrate its effectiveness in generating dynamically feasible trajectories across diverse long-horizon STL tasks.
♻ ☆ A Note on Comparator-Overdrive-Delay Conditioning for Current-Mode Control
Comparator-overdrive-delay conditioning is a new control conditioning approach for high-frequency current-mode control. No existing literature rigorously studies the effect of the comparator overdrive delay on the current-mode control. The results in this paper provide insights into the mechanism of comparator-overdrive-delay conditioning.
comment: Add extra case studies
Human-Computer Interaction 10
☆ SignaApp a modern alternative to support signwriting notation for sign languages
The present work reports the development of an application called Signa App, which was designed following the philosophy of User-Centered De-sign. Signa App aims to provide a mobile platform for editing and creating texts in SignWriting notation. The proposal was based on the lack of a mo-bile application that is usable for Deaf individuals who use sign language. The application was tested with adults, children, and adolescents, and the results showed a high degree of acceptance and ease of use. The app has al-ready been introduced to the SignWriting user community, receiving posi-tive feedback. Likewise, the application is available on the Google Play Store
comment: 13 pages, 11 figures, Conference: VIII International Conference on Inclusive Technologies and Education, 2025
☆ Complementary Human-AI Clinical Reasoning in Ophthalmology
Vision impairment and blindness are a major global health challenge where gaps in the ophthalmology workforce limit access to specialist care. We evaluate AMIE, a medically fine-tuned conversational system based on Gemini with integrated web search and self-critique reasoning, using real-world clinical vignettes that reflect scenarios a general ophthalmologist would be expected to manage. We conducted two complementary evaluations: (1) a human-AI interactive diagnostic reasoning study in which ophthalmologists recorded initial differentials and plans, then reviewed AMIE's structured output and revised their answers; and (2) a masked preference and quality study comparing AMIE's narrative outputs with case author reference answers using a predefined rubric. AMIE showed standalone diagnostic performance comparable to clinicians at baseline. Crucially, after reviewing AMIE's responses, ophthalmologists tended to rank the correct diagnosis higher, reached greater agreement with one another, and enriched their investigation and management plans. Improvements were observed even when AMIE's top choice differed from or underperformed the clinician baseline, consistent with a complementary effect in which structured reasoning support helps clinicians re-rank rather than simply accept the model output. Preferences varied by clinical grade, suggesting opportunities to personalise responses by experience. Without ophthalmology-specific fine-tuning, AMIE matched clinician baseline and augmented clinical reasoning at the point of need, motivating multi-axis evaluation, domain adaptation, and prospective multimodal studies in real-world settings.
☆ Teaching Machine Learning Through Cricket: A Practical Engineering Education Approach
Teaching complex machine learning concepts such as reinforcement learning and Markov Decision Processes remains challenging in engineering education. Students often struggle to connect abstract mathematics to real-world applications. We present LearnML@Cricket, a 12-week curriculum that uses cricket analytics to teach these concepts through practical, hands-on examples. By mapping game scenarios directly to ML algorithms, students learn through doing rather than memorizing. Our curriculum includes coding laboratories, real datasets, and immediate application to engineering problems. We propose an empirical study to measure whether this approach improves both understanding and practical implementation skills compared to traditional teaching methods.
comment: 16 pages, 2 tables, Submitted in IDAA 2025
☆ egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks NeurIPS 2025
Understanding affect is central to anticipating human behavior, yet current egocentric vision benchmarks largely ignore the person's emotional states that shape their decisions and actions. Existing tasks in egocentric perception focus on physical activities, hand-object interactions, and attention modeling - assuming neutral affect and uniform personality. This limits the ability of vision systems to capture key internal drivers of behavior. In this paper, we present egoEMOTION, the first dataset that couples egocentric visual and physiological signals with dense self-reports of emotion and personality across controlled and real-world scenarios. Our dataset includes over 50 hours of recordings from 43 participants, captured using Meta's Project Aria glasses. Each session provides synchronized eye-tracking video, headmounted photoplethysmography, inertial motion data, and physiological baselines for reference. Participants completed emotion-elicitation tasks and naturalistic activities while self-reporting their affective state using the Circumplex Model and Mikels' Wheel as well as their personality via the Big Five model. We define three benchmark tasks: (1) continuous affect classification (valence, arousal, dominance); (2) discrete emotion classification; and (3) trait-level personality inference. We show that a classical learning-based method, as a simple baseline in real-world affect prediction, produces better estimates from signals captured on egocentric vision systems than processing physiological signals. Our dataset establishes emotion and personality as core dimensions in egocentric perception and opens new directions in affect-driven modeling of behavior, intent, and interaction.
comment: Accepted for publication at NeurIPS 2025
RaycastGrasp: Eye-Gaze Interaction with Wearable Devices for Robotic Manipulation
Robotic manipulators are increasingly used to assist individuals with mobility impairments in object retrieval. However, the predominant joystick-based control interfaces can be challenging due to high precision requirements and unintuitive reference frames. Recent advances in human-robot interaction have explored alternative modalities, yet many solutions still rely on external screens or restrictive control schemes, limiting their intuitiveness and accessibility. To address these challenges, we present an egocentric, gaze-guided robotic manipulation interface that leverages a wearable Mixed Reality (MR) headset. Our system enables users to interact seamlessly with real-world objects using natural gaze fixation from a first-person perspective, while providing augmented visual cues to confirm intent and leveraging a pretrained vision model and robotic arm for intent recognition and object manipulation. Experimental results demonstrate that our approach significantly improves manipulation accuracy, reduces system latency, and achieves single-pass intention and object recognition accuracy greater than 88% across multiple real-world scenarios. These results demonstrate the system's effectiveness in enhancing intuitiveness and accessibility, underscoring its practical significance for assistive robotics applications.
comment: 5 pages, 5 figures; Accepted to: 2025 IEEE 4th International Conference on Intelligent Reality (ICIR 2025); Zitiantao Lin and Yongpeng Sang contributed equally to this work (co-first authors). Corresponding author: Yang Ye (y.ye@northeastern.edu)
☆ Beyond Reality: Designing Personal Experiences and Interactive Narratives in AR Theater
Augmented Reality (AR) technologies are redefining how we perceive and interact with the world by seamlessly integrating digital elements into our physical surroundings. These technologies offer personalized experiences and transform familiar spaces by layering new narratives onto the real world. Through increased levels of perceived agency and immersive environments, my work aims to merge the human elements of live theater with the dynamic potential of virtual entities and AI agents. This approach captures the subtlety and magic of storytelling, making theater experiences available anytime and anywhere. The system I am building introduces innovative methods for theatrical production in virtual settings, informed by my research and eight published works. These contributions highlight domain-specific insights that have shaped the design of an immersive AR Theater system. My research in building a well-designed AR stage features avatars and interactive elements that allow users to engage with stories at their own pace, granting them full agency over their experience. However, to ensure a smooth and curated experience that aligns with the director or creator's vision, several factors must be considered, especially in open-world settings that depend on natural user movement. This requires the story to be conveyed in a controlled manner, while the interaction remains intuitive and natural for the user.
comment: PhD Thesis, Media Arts and Technology, University of California, Santa Barbara, defended on December 13, 2024. Available from ProQuest Dissertations & Theses Global (Order No. 31761773). This version is shared on arXiv for broader accessibility
☆ egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks NeurIPS 2025
Understanding affect is central to anticipating human behavior, yet current egocentric vision benchmarks largely ignore the person's emotional states that shape their decisions and actions. Existing tasks in egocentric perception focus on physical activities, hand-object interactions, and attention modeling - assuming neutral affect and uniform personality. This limits the ability of vision systems to capture key internal drivers of behavior. In this paper, we present egoEMOTION, the first dataset that couples egocentric visual and physiological signals with dense self-reports of emotion and personality across controlled and real-world scenarios. Our dataset includes over 50 hours of recordings from 43 participants, captured using Meta's Project Aria glasses. Each session provides synchronized eye-tracking video, headmounted photoplethysmography, inertial motion data, and physiological baselines for reference. Participants completed emotion-elicitation tasks and naturalistic activities while self-reporting their affective state using the Circumplex Model and Mikels' Wheel as well as their personality via the Big Five model. We define three benchmark tasks: (1) continuous affect classification (valence, arousal, dominance); (2) discrete emotion classification; and (3) trait-level personality inference. We show that a classical learning-based method, as a simple baseline in real-world affect prediction, produces better estimates from signals captured on egocentric vision systems than processing physiological signals. Our dataset establishes emotion and personality as core dimensions in egocentric perception and opens new directions in affect-driven modeling of behavior, intent, and interaction.
comment: Accepted for publication at NeurIPS 2025
♻ ☆ Robust Understanding of Human-Robot Social Interactions through Multimodal Distillation
There is a growing need for social robots and intelligent agents that can effectively interact with and support users. For the interactions to be seamless, the agents need to analyse social scenes and behavioural cues from their (robot's) perspective. Works that model human-agent interactions in social situations are few; and even those existing ones are computationally too intensive to be deployed in real time or perform poorly in real-world scenarios when only limited information is available. We propose a knowledge distillation framework that models social interactions through various multimodal cues, and yet is robust against incomplete and noisy information during inference. We train a teacher model with multimodal input (body, face and hand gestures, gaze, raw images) that transfers knowledge to a student model which relies solely on body pose. Extensive experiments on two publicly available human-robot interaction datasets demonstrate that our student model achieves an average accuracy gain of 14.75% over competitive baselines on multiple downstream social understanding tasks, even with up to 51% of its input being corrupted. The student model is also highly efficient - less than 1% in size of the teacher model in terms of parameters and its latency is 11.9% of the teacher model. Our code and related data are available at github.com/biantongfei/SocialEgoMobile.
comment: Accepted by ACM Multimedia 2025, camera-ready version
♻ ☆ PriorWeaver: Prior Elicitation via Iterative Dataset Construction
In Bayesian analysis, prior elicitation, or the process of explicating one's beliefs to inform statistical modeling, is an essential yet challenging step. Analysts often have beliefs about real-world variables and their relationships. However, existing tools require analysts to translate these beliefs and express them indirectly as probability distributions over model parameters. We present PriorWeaver, an interactive visualization system that facilitates prior elicitation through iterative dataset construction and refinement. Analysts visually express their assumptions about individual variables and their relationships. Under the hood, these assumptions create a dataset used to derive statistical priors. Prior predictive checks then help analysts compare the priors to their assumptions. In a lab study with 17 participants new to Bayesian analysis, we compare PriorWeaver to a baseline incorporating existing techniques. Compared to the baseline, PriorWeaver gave participants greater control, clarity, and confidence, leading to priors that were better aligned with their expectations.
♻ ☆ Talk, Listen, Connect: How Humans and AI Evaluate Empathy in Responses to Emotionally Charged Narratives
Social interactions promote well-being, yet barriers like geographic distance, time limitations, and mental health conditions can limit face-to-face interactions. Emotionally responsive AI systems, such as chatbots, offer new opportunities for social and emotional support, but raise critical questions about how empathy is perceived and experienced in human-AI interactions. This study examines how empathy is evaluated in AI-generated versus human responses. Using personal narratives, we explored how persona attributes (e.g., gender, empathic traits, shared experiences) and story qualities affect empathy ratings. We compared responses from standard and fine-tuned AI models with human judgments. Results show that while humans are highly sensitive to emotional vividness and shared experience, AI-responses are less influenced by these cues, often lack nuance in empathic expression. These findings highlight challenges in designing emotionally intelligent systems that respond meaningfully across diverse users and contexts, and informs the design of ethically aware tools to support social connection and well-being.
comment: 18 pages, 4 figures, 6 tables. Title updated from "Talk, Listen, Connect: Navigating Empathy in Human-AI Interactions" to "Talk, Listen, Connect: How Humans and AI Evaluate Empathy in Responses to Emotionally Charged Narratives" in this version. This is version 3 (v3) of the paper. All previous citations of arXiv:2409.15550 with the old title still refer to the same paper
Software Engineering 14
☆ A Multifaceted View on Discrimination in Software Development Careers
Conversations around diversity and inclusion in software engineering often focus on gender and racial disparities. However, the State of the Developer Nation 2025 survey with 8,717 participants revealed that other forms of discrimination are similarly prevalent but receive considerably less attention. This includes discrimination based on age, political perspective, disabilities, or cognitive differences such as neurodivergence. We conducted a secondary analysis of 800 open-ended survey responses to examine patterns of perceived discrimination, as well as related challenges and negative impacts. Our study covers multiple identity facets, including age, gender, race, and disability. We found that age- and gender-related discrimination was the most frequently reported workplace issue, but discrimination based on political and religious views emerged as further notable concerns. Most of the participants who identified as female cited gender as the primary source of discrimination, often accompanied by intersectional factors such as race, political views, age, or sexual orientation. Discrimination related to caregiving responsibilities was reported by all gender identities. Regarding the negative impacts of workplace issues, many participants described modifying their appearance or behavior in response to gender biases. Gender also appeared to influence broader career challenges, as women and non-binary respondents reported experiencing almost all workplace issues at higher rates, particularly discrimination (35%) and mental health challenges (62%). Our goal is to raise awareness in the research community that discrimination in software development is multifaceted, and to encourage researchers to select and assess relevant facets beyond age and gender when designing software engineering studies.
comment: 11 pages, 1 figure, 5 tables
☆ The Cost of Certainty: Shot Budgets in Quantum Program Testing
As quantum computing advances toward early fault-tolerant machines, testing and verification of quantum programs become urgent but costly, since each execution consumes scarce hardware resources. Unlike in classical software testing, every measurement must be carefully budgeted. This paper develops a unified framework for reasoning about how many measurements are required to verify quantum programs. The goal is to connect theoretical error bounds with concrete test strategies and to extend the analysis from individual tests to full program-level verification. We analyze the relationship between error probability, fidelity, trace distance, and the quantum Chernoff bound to establish fundamental shot count limits. These foundations are applied to three representative testing methods: the inverse test, the swap test, and the chi-square test. Both idealized and noisy devices are considered. We also introduce a program-level budgeting approach that allocates verification effort across multiple subroutines. The inverse test is the most measurement efficient, the swap test requires about twice as many shots, and the chi-square test is easiest to implement but often needs orders of magnitude more measurements. In the presence of noise, calibrated baselines may increase measurement requirements beyond theoretical estimates. At the program level, distributing a global fidelity target across many fine-grained functions can cause verification costs to grow rapidly, whereas coarser decompositions or weighted allocations remain more practical. The framework clarifies trade-offs among different testing strategies, noise handling, and program decomposition. It provides practical guidance for budgeting measurement shots in quantum program testing, helping practitioners balance rigour against cost when designing verification strategies.
☆ A First Look at the Self-Admitted Technical Debt in Test Code: Taxonomy and Detection
Self-admitted technical debt (SATD) refers to comments in which developers explicitly acknowledge code issues, workarounds, or suboptimal solutions. SATD is known to significantly increase software maintenance effort. While extensive research has examined SATD in source code, its presence and impact in test code have received no focused attention, leaving a significant gap in our understanding of how SATD manifests in testing contexts. This study, the first of its kind, investigates SATD in test code by manually analyzing 50,000 comments randomly sampled from 1.6 million comments across 1,000 open-source Java projects. From this sample, after manual analysis and filtering, we identified 615 SATD comments and classified them into 15 distinct categories, building a taxonomy of test code SATD. To investigate whether test code SATD can be detected automatically, we evaluated existing SATD detection tools, as well as both open-source and proprietary LLMs. Among the existing tools, MAT performed the best, albeit with moderate recall. To our surprise, both open-source and proprietary LLMs exhibited poor detection accuracy, primarily due to low precision. These results indicate that neither existing approaches nor current LLMs can reliably detect SATD in test code. Overall, this work provides the first large-scale analysis of SATD in test code, a nuanced understanding of its types, and the limitations of current SATD detection methods. Our findings lay the groundwork for future research on test code-specific SATD.
☆ BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles
In this paper, we propose Bootstrapped Language-Image Pretraining-driven Fused State Representation in Proximal Policy Optimization (BLIP-FusePPO), a novel multimodal reinforcement learning (RL) framework for autonomous lane-keeping (LK), in which semantic embeddings generated by a vision-language model (VLM) are directly fused with geometric states, LiDAR observations, and Proportional-Integral-Derivative-based (PID) control feedback within the agent observation space. The proposed method lets the agent learn driving rules that are aware of their surroundings and easy to understand by combining high-level scene understanding from the VLM with low-level control and spatial signals. Our architecture brings together semantic, geometric, and control-aware representations to make policy learning more robust. A hybrid reward function that includes semantic alignment, LK accuracy, obstacle avoidance, and speed regulation helps learning to be more efficient and generalizable. Our method is different from the approaches that only use semantic models to shape rewards. Instead, it directly embeds semantic features into the state representation. This cuts down on expensive runtime inference and makes sure that semantic guidance is always available. The simulation results show that the proposed model is better at LK stability and adaptability than the best vision-based and multimodal RL baselines in a wide range of difficult driving situations. We make our code publicly available.
comment: https://github.com/Amin-A96/BLIP-FusePPO-A-Vision-Language-Deep-Reinforcement-Learning-Framework-for-Lane-Keeping-in-Autonomous.git
☆ Operationalizing Large Language Models with Design-Aware Contexts for Code Comment Generation
Comments are very useful to the flow of code development. With the increasing commonality of code, novice coders have been creating a significant amount of codebases. Due to lack of commenting standards, their comments are often useless, and increase the time taken to further maintain codes. This study intends to find the usefulness of large language models (LLMs) in these cases to generate potentially better comments. This study focuses on the feasibility of design documents as a context for the LLMs to generate more useful comments, as design documents are often used by maintainers to understand code when comments do not suffice.
☆ Harnessing the Power of Large Language Models for Software Testing Education: A Focus on ISTQB Syllabus
Software testing is a critical component in the software engineering field and is important for software engineering education. Thus, it is vital for academia to continuously improve and update educational methods to reflect the current state of the field. The International Software Testing Qualifications Board (ISTQB) certification framework is globally recognized and widely adopted in industry and academia. However, ISTQB-based learning has been rarely applied with recent generative artificial intelligence advances. Despite the growing capabilities of large language models (LLMs), ISTQB-based learning and instruction with LLMs have not been thoroughly explored. This paper explores and evaluates how LLMs can complement the ISTQB framework for higher education. The findings present four key contributions: (i) the creation of a comprehensive ISTQB-aligned dataset spanning over a decade, consisting of 28 sample exams and 1,145 questions; (ii) the development of a domain-optimized prompt that enhances LLM precision and explanation quality on ISTQB tasks; (iii) a systematic evaluation of state-of-the-art LLMs on this dataset; and (iv) actionable insights and recommendations for integrating LLMs into software testing education. These findings highlight the promise of LLMs in supporting ISTQB certification preparation and offer a foundation for their broader use in software engineering at higher education.
comment: 7 pages, 3 figures, 3 tables
☆ Ten Simple Rules for AI-Assisted Coding in Science
While AI coding tools have demonstrated potential to accelerate software development, their use in scientific computing raises critical questions about code quality and scientific validity. In this paper, we provide ten practical rules for AI-assisted coding that balance leveraging capabilities of AI with maintaining scientific and methodological rigor. We address how AI can be leveraged strategically throughout the development cycle with four key themes: problem preparation and understanding, managing context and interaction, testing and validation, and code quality assurance and iterative improvement. These principles serve to emphasize maintaining human agency in coding decisions, establishing robust validation procedures, and preserving the domain expertise essential for methodologically sound research. These rules are intended to help researchers harness AI's transformative potential for faster software development while ensuring that their code meets the standards of reliability, reproducibility, and scientific validity that research integrity demands.
comment: 9 pages of content; 1 table; 1 page appendix
☆ Understanding Self-Admitted Technical Debt in Test Code: An Empirical Study
Developers often opt for easier but non-optimal implementation to meet deadlines or create rapid prototypes, leading to additional effort known as technical debt to improve the code later. Oftentimes, developers explicitly document the technical debt in code comments, referred to as Self-Admitted Technical Debt (SATD). Numerous researchers have investigated the impact of SATD on different aspects of software quality and development processes. However, most of these studies focus on SATD in production code, often overlooking SATD in the test code or assuming that it shares similar characteristics with SATD in production code. In fact, a significant amount of SATD is also present in the test code, with many instances not fitting into existing categories for the production code. This study aims to fill this gap and disclose the nature of SATD in the test code by examining its distribution and types. Moreover, the relation between its presence and test quality is also analyzed. Our empirical study, involving 17,766 SATD comments (14,987 from production code, 2,779 from test code) collected from 50 repositories, demonstrates that while SATD widely exists in test code, it is not directly associated with test smells. Our study also presents comprehensive categories of SATD types in the test code, and machine learning models are developed to automatically classify SATD comments based on their types for easier management. Our results show that the CodeBERT-based model outperforms other machine learning models in terms of recall and F1-score. However, the performance varies on different types of SATD.
☆ Taming Silent Failures: A Framework for Verifiable AI Reliability
The integration of Artificial Intelligence (AI) into safety-critical systems introduces a new reliability paradigm: silent failures, where AI produces confident but incorrect outputs that can be dangerous. This paper introduces the Formal Assurance and Monitoring Environment (FAME), a novel framework that confronts this challenge. FAME synergizes the mathematical rigor of offline formal synthesis with the vigilance of online runtime monitoring to create a verifiable safety net around opaque AI components. We demonstrate its efficacy in an autonomous vehicle perception system, where FAME successfully detected 93.5% of critical safety violations that were otherwise silent. By contextualizing our framework within the ISO 26262 and ISO/PAS 8800 standards, we provide reliability engineers with a practical, certifiable pathway for deploying trustworthy AI. FAME represents a crucial shift from accepting probabilistic performance to enforcing provable safety in next-generation systems.
comment: This preprint has been accepted by IEEE Reliability Magazine. 10 pages, 3 figures
☆ LSPRAG: LSP-Guided RAG for Language-Agnostic Real-Time Unit Test Generation
Automated unit test generation is essential for robust software development, yet existing approaches struggle to generalize across multiple programming languages and operate within real-time development. While Large Language Models (LLMs) offer a promising solution, their ability to generate high coverage test code depends on prompting a concise context of the focal method. Current solutions, such as Retrieval-Augmented Generation, either rely on imprecise similarity-based searches or demand the creation of costly, language-specific static analysis pipelines. To address this gap, we present LSPRAG, a framework for concise-context retrieval tailored for real-time, language-agnostic unit test generation. LSPRAG leverages off-the-shelf Language Server Protocol (LSP) back-ends to supply LLMs with precise symbol definitions and references in real time. By reusing mature LSP servers, LSPRAG provides an LLM with language-aware context retrieval, requiring minimal per-language engineering effort. We evaluated LSPRAG on open-source projects spanning Java, Go, and Python. Compared to the best performance of baselines, LSPRAG increased line coverage by up to 174.55% for Golang, 213.31% for Java, and 31.57% for Python.
comment: 13pages, 6 figures
♻ ☆ CSnake: Detecting Self-Sustaining Cascading Failure via Causal Stitching of Fault Propagations EuroSys 2026
Recent studies have revealed that self-sustaining cascading failures in distributed systems frequently lead to widespread outages, which are challenging to contain and recover from. Existing failure detection techniques struggle to expose such failures prior to deployment, as they typically require a complex combination of specific conditions to be triggered. This challenge stems from the inherent nature of cascading failures, as they typically involve a sequence of fault propagations, each activated by distinct conditions. This paper presents CSnake, a fault injection framework to expose self-sustaining cascading failures in distributed systems. CSnake uses the novel idea of causal stitching, which causally links multiple single-fault injections in different tests to simulate complex fault propagation chains. To identify these chains, CSnake designs a counterfactual causality analysis of fault propagations - fault causality analysis (FCA): FCA compares the execution trace of a fault injection run with its corresponding profile run (i.e., same test w/o the injection) and identifies any additional faults triggered, which are considered to have a causal relationship with the injected fault. To address the large search space of fault and workload combinations, CSnake employs a three-phase allocation protocol of test budget that prioritizes faults with unique and diverse causal consequences, increasing the likelihood of uncovering conditional fault propagations. Furthermore, to avoid incorrectly connecting fault propagations from workloads with incompatible conditions, CSnake performs a local compatibility check that approximately checks the compatibility of the path constraints associated with connected fault propagations with low overhead. CSnake detected 15 bugs that cause self-sustaining cascading failures in five systems, five of which have been confirmed with two fixed.
comment: Accepted by EuroSys 2026
♻ ☆ LLM-Based Detection of Tangled Code Changes for Higher-Quality Method-Level Bug Datasets
Tangled code changes, commits that conflate unrelated modifications such as bug fixes, refactorings, and enhancements, introduce significant noise into bug datasets and adversely affect the performance of bug prediction models. Addressing this issue at a fine-grained, method-level granularity remains unexplored. This is critical to address, as recent bug prediction models, driven by practitioner demand, are increasingly focusing on finer granularity rather than traditional class- or file-level predictions. This study investigates the utility of Large Language Models (LLMs) for detecting tangled code changes by leveraging both commit messages and method-level code diffs. We formulate the problem as a binary classification task and evaluate multiple prompting strategies, including zero-shot, few-shot, and chain-of-thought prompting, using state-of-the-art proprietary LLMs such as GPT-5 and Gemini-2.0-Flash, and open-source models such as GPT-OSS-120B and CodeBERT. Our results demonstrate that combining commit messages with code diffs significantly enhances model performance, with the combined few-shot and chain-of-thought prompting achieving an F1-score of 0.883. Additionally, we explore machine learning models trained on LLM-generated embeddings, where a multi-layer perceptron classifier achieves superior performance (F1-score: 0.906, MCC: 0.807). Applying our approach to 49 open-source projects improves the distributional separability of code metrics between buggy and non-buggy methods, demonstrating the promise of LLMs for method-level commit untangling and potentially contributing to improving the accuracy of future bug prediction models.
♻ ☆ JunoBench: A Benchmark Dataset of Crashes in Python Machine Learning Jupyter Notebooks
Jupyter notebooks are widely used for machine learning (ML) prototyping. Yet few debugging tools are designed for ML code in notebooks, potentially due to the lack of benchmarks. We introduce JunoBench, the first benchmark dataset of real-world crashes in Python-based ML notebooks. JunoBench has 111 curated and reproducible crashes from public Kaggle notebooks, each paired with a verifiable fix, ranging over popular ML libraries, including TensorFlow/Keras, PyTorch, Scikit-learn, Pandas, and NumPy, as well as notebook-specific out-of-order execution issue. To support reproducibility and ease of use, JunoBench offers a unified execution environment where crashes and fixes can be reliably reproduced. By providing realistic crashes and their resolutions, JunoBench facilitates bug detection, localization, diagnosis, and repair tailored to the interactive and iterative nature of notebook-based ML development.
♻ ☆ Innovative Software Development and Project Management Framework for Technology Startups
This paper introduces a new process that integrates inventive problem-solving methods into modern software development. The central research question addresses how tech startups can enhance their software development processes with minimal project management expertise. The Systematic Innovation Mounted Software Development Process, which blends Agile and Systematic Innovation, offers an alternative framework to facilitate idea generation in software products. This intuitive project management framework empowers technology-driven companies to manage their projects more effectively. The study aims to provide guidelines for entrepreneurs to manage projects successfully. By collaborating with the existing Agile model, the Systematic Innovation model fosters creativity and innovative problem-solving. Ultimately, this new software development process and its techniques have the potential to significantly impact the software industry, particularly for startups, as they alleviate managerial burdens and allow companies to focus on their core technologies.
Graphics 1
☆ MOGRAS: Human Motion with Grasping in 3D Scenes
Generating realistic full-body motion interacting with objects is critical for applications in robotics, virtual reality, and human-computer interaction. While existing methods can generate full-body motion within 3D scenes, they often lack the fidelity for fine-grained tasks like object grasping. Conversely, methods that generate precise grasping motions typically ignore the surrounding 3D scene. This gap, generating full-body grasping motions that are physically plausible within a 3D scene, remains a significant challenge. To address this, we introduce MOGRAS (Human MOtion with GRAsping in 3D Scenes), a large-scale dataset that bridges this gap. MOGRAS provides pre-grasping full-body walking motions and final grasping poses within richly annotated 3D indoor scenes. We leverage MOGRAS to benchmark existing full-body grasping methods and demonstrate their limitations in scene-aware generation. Furthermore, we propose a simple yet effective method to adapt existing approaches to work seamlessly within 3D scenes. Through extensive quantitative and qualitative experiments, we validate the effectiveness of our dataset and highlight the significant improvements our proposed method achieves, paving the way for more realistic human-scene interactions.
comment: British Machine Vision Conference Workshop - From Scene Understanding to Human Modeling
Information Retrieval 11
☆ FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation
While Retrieval-Augmented Generation (RAG) mitigates hallucination and knowledge staleness in Large Language Models (LLMs), existing frameworks often falter on complex, multi-hop queries that require synthesizing information from disparate sources. Current advanced RAG methods, employing iterative or adaptive strategies, lack a robust mechanism to systematically identify and fill evidence gaps, often propagating noise or failing to gather a comprehensive context. We introduce FAIR-RAG, a novel agentic framework that transforms the standard RAG pipeline into a dynamic, evidence-driven reasoning process. At its core is an Iterative Refinement Cycle governed by a module we term Structured Evidence Assessment (SEA). The SEA acts as an analytical gating mechanism: it deconstructs the initial query into a checklist of required findings and audits the aggregated evidence to identify confirmed facts and, critically, explicit informational gaps. These gaps provide a precise signal to an Adaptive Query Refinement agent, which generates new, targeted sub-queries to retrieve missing information. This cycle repeats until the evidence is verified as sufficient, ensuring a comprehensive context for a final, strictly faithful generation. We conducted experiments on challenging multi-hop QA benchmarks, including HotpotQA, 2WikiMultiHopQA, and MusiQue. In a unified experimental setup, FAIR-RAG significantly outperforms strong baselines. On HotpotQA, it achieves an F1-score of 0.453 -- an absolute improvement of 8.3 points over the strongest iterative baseline -- establishing a new state-of-the-art for this class of methods on these benchmarks. Our work demonstrates that a structured, evidence-driven refinement process with explicit gap analysis is crucial for unlocking reliable and accurate reasoning in advanced RAG systems for complex, knowledge-intensive tasks.
comment: 30 pages, 5 figures, 5 tables. Keywords: Retrieval-Augmented Generation (RAG), Large Language Models (LLMs), Agentic AI, Multi-hop Question Answering, Faithfulness
☆ PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding
Patent text embeddings enable prior art search, technology landscaping, and patent analysis, yet existing benchmarks inadequately capture patent-specific challenges. We introduce PatenTEB, a comprehensive benchmark comprising 15 tasks across retrieval, classification, paraphrase, and clustering, with 2.06 million examples. PatenTEB employs domain-stratified splits, domain specific hard negative mining, and systematic coverage of asymmetric fragment-to-document matching scenarios absent from general embedding benchmarks. We develop the patembed model family through multi-task training, spanning 67M to 344M parameters with context lengths up to 4096 tokens. External validation shows strong generalization: patembed-base achieves state-of-the-art on MTEB BigPatentClustering.v2 (0.494 V-measure vs. 0.445 previous best), while patembed-large achieves 0.377 NDCG@100 on DAPFAM. Systematic ablations reveal that multi-task training improves external generalization despite minor benchmark costs, and that domain-pretrained initialization provides consistent advantages across task families. All resources will be made available at https://github.com/iliass-y/patenteb. Keywords: patent retrieval, sentence embeddings, multi-task learning, asymmetric retrieval, benchmark evaluation, contrastive learning.
☆ PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading
Large Language Models (LLMs) increasingly serve as research assistants, yet their reliability in scholarly tasks remains under-evaluated. In this work, we introduce PaperAsk, a benchmark that systematically evaluates LLMs across four key research tasks: citation retrieval, content extraction, paper discovery, and claim verification. We evaluate GPT-4o, GPT-5, and Gemini-2.5-Flash under realistic usage conditions-via web interfaces where search operations are opaque to the user. Through controlled experiments, we find consistent reliability failures: citation retrieval fails in 48-98% of multi-reference queries, section-specific content extraction fails in 72-91% of cases, and topical paper discovery yields F1 scores below 0.32, missing over 60% of relevant literature. Further human analysis attributes these failures to the uncontrolled expansion of retrieved context and the tendency of LLMs to prioritize semantically relevant text over task instructions. Across basic tasks, the LLMs display distinct failure behaviors: ChatGPT often withholds responses rather than risk errors, whereas Gemini produces fluent but fabricated answers. To address these issues, we develop lightweight reliability classifiers trained on PaperAsk data to identify unreliable outputs. PaperAsk provides a reproducible and diagnostic framework for advancing the reliability evaluation of LLM-based scholarly assistance systems.
☆ Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy
Retrieval over visually rich documents is essential for tasks such as legal discovery, scientific search, and enterprise knowledge management. Existing approaches fall into two paradigms: single-vector retrieval, which is efficient but coarse, and multi-vector retrieval, which is accurate but computationally expensive. To address this trade-off, we propose HEAVEN, a two-stage hybrid-vector framework. In the first stage, HEAVEN efficiently retrieves candidate pages using a single-vector method over Visually-Summarized Pages (VS-Pages), which assemble representative visual layouts from multiple pages. In the second stage, it reranks candidates with a multi-vector method while filtering query tokens by linguistic importance to reduce redundant computations. To evaluate retrieval systems under realistic conditions, we also introduce ViMDOC, the first benchmark for visually rich, multi-document, and long-document retrieval. Across four benchmarks, HEAVEN attains 99.87% of the Recall@1 performance of multi-vector models on average while reducing per-query computation by 99.82%, achieving efficiency and accuracy. Our code and datasets are available at: https://github.com/juyeonnn/HEAVEN
☆ Scaling Up Efficient Small Language Models Serving and Deployment for Semantic Job Search
Large Language Models (LLMs) have demonstrated impressive quality when applied to predictive tasks such as relevance ranking and semantic search. However, deployment of such LLMs remains prohibitively expensive for industry applications with strict latency and throughput requirements. In this work, we present lessons and efficiency insights from developing a purely text-based decoder-only Small Language Model (SLM) for a semantic search application at LinkedIn. Particularly, we discuss model compression techniques such as pruning that allow us to reduce the model size by up to $40\%$ while maintaining the accuracy. Additionally, we present context compression techniques that allow us to reduce the input context length by up to $10$x with minimal loss of accuracy. Finally, we present practical lessons from optimizing the serving infrastructure for deploying such a system on GPUs at scale, serving millions of requests per second. Taken together, this allows us to increase our system's throughput by $10$x in a real-world deployment, while meeting our quality bar.
♻ ☆ A Survey of Long-Document Retrieval in the PLM and LLM Era
The proliferation of long-form documents presents a fundamental challenge to information retrieval (IR), as their length, dispersed evidence, and complex structures demand specialized methods beyond standard passage-level techniques. This survey provides the first comprehensive treatment of long-document retrieval (LDR), consolidating methods, challenges, and applications across three major eras. We systematize the evolution from classical lexical and early neural models to modern pre-trained (PLM) and large language models (LLMs), covering key paradigms like passage aggregation, hierarchical encoding, efficient attention, and the latest LLM-driven re-ranking and retrieval techniques. Beyond the models, we review domain-specific applications, specialized evaluation resources, and outline critical open challenges such as efficiency trade-offs, multimodal alignment, and faithfulness. This survey aims to provide both a consolidated reference and a forward-looking agenda for advancing long-document retrieval in the era of foundation models.
comment: 32 pages, 6 figures
♻ ☆ Query Expansion in the Age of Pre-trained and Large Language Models: A Comprehensive Survey
Modern information retrieval (IR) must reconcile short, ambiguous queries with increasingly diverse and dynamic corpora. Query expansion (QE) remains central to alleviating vocabulary mismatch, yet the design space has shifted with pre-trained and large language models (PLMs, LLMs). In this survey, we organize recent work along four complementary dimensions: the point of injection (implicit/embedding vs. selection-based explicit), grounding and interaction (from zero-grounding prompts to multi-round retrieve-expand loops), learning and alignment (SFT/PEFT/DPO), and knowledge-graph integration. A model-centric taxonomy is also outlined, spanning encoder-only, encoder-decoder, decoder-only, instruction-tuned, and domain or multilingual variants, with affordances for QE such as contextual disambiguation, controllable generation, and zero-shot or few-shot reasoning. Practice-oriented guidance specifies where neural QE helps most: first-stage retrieval, multi-query fusion, re-ranking, and retrieval-augmented generation (RAG). The survey compares traditional and neural QE across seven aspects and maps applications in web search, biomedicine, e-commerce, open-domain question answering/RAG, conversational and code search, and cross-lingual settings. The survey concludes with an agenda focused on reliable, safe, efficient, and adaptive QE, offering a principled blueprint for deploying and combining techniques under real-world constraints.
comment: 36 pages,3 figures,3 tables
♻ ☆ Unifying Search and Recommendation with Dual-View Representation Learning in a Generative Paradigm
Recommender systems and search engines serve as foundational elements of online platforms, with the former delivering information proactively and the latter enabling users to seek information actively. Unifying both tasks in a shared model is promising since it can enhance user modeling and item understanding. Previous approaches mainly follow a discriminative paradigm, utilizing shared encoders to process input features and task-specific heads to perform each task. However, this paradigm encounters two key challenges: gradient conflict and manual design complexity. From the information theory perspective, these challenges potentially both stem from the same issue -- low mutual information between the input features and task-specific outputs during the optimization process. To tackle these issues, we propose GenSR, a novel generative paradigm for unifying search and recommendation (S&R), which leverages task-specific prompts to partition the model's parameter space into subspaces, thereby enhancing mutual information. To construct effective subspaces for each task, GenSR first prepares informative representations for each subspace and then optimizes both subspaces in one unified model. Specifically, GenSR consists of two main modules: (1) Dual Representation Learning, which independently models collaborative and semantic historical information to derive expressive item representations; and (2) S&R Task Unifying, which utilizes contrastive learning together with instruction tuning to generate task-specific outputs effectively. Extensive experiments on two public datasets show GenSR outperforms state-of-the-art methods across S&R tasks. Our work introduces a new generative paradigm compared with previous discriminative methods and establishes its superiority from the mutual information perspective.
♻ ☆ WildClaims: Information Access Conversations in the Wild(Chat)
The rapid advancement of Large Language Models (LLMs) has transformed conversational systems into practical tools used by millions. However, the nature and necessity of information retrieval in real-world conversations remain largely unexplored, as research has focused predominantly on traditional, explicit information access conversations. The central question is: What do real-world information access conversations look like? To this end, we first conduct an observational study on the WildChat dataset, large-scale user-ChatGPT conversations, finding that users' access to information occurs implicitly as check-worthy factual assertions made by the system, even when the conversation's primary intent is non-informational, such as creative writing. To enable the systematic study of this phenomenon, we release the WildClaims dataset, a novel resource consisting of 121,905 extracted factual claims from 7,587 utterances in 3,000 WildChat conversations, each annotated for check-worthiness. Our preliminary analysis of this resource reveals that conservatively 18% to 51% of conversations contain check-worthy assertions, depending on the methods employed, and less conservatively, as many as 76% may contain such assertions. This high prevalence underscores the importance of moving beyond the traditional understanding of explicit information access, to address the implicit information access that arises in real-world user-system conversations.
♻ ☆ FACE: A Fine-grained Reference Free Evaluator for Conversational Recommender Systems
A systematic, reliable, and low-cost evaluation of Conversational Recommender Systems (CRSs) remains an open challenge. Existing automatic CRS evaluation methods are proven insufficient for evaluating the dynamic nature of recommendation conversations. This work proposes FACE: a Fine-grained, Aspect-based Conversation Evaluation method that provides evaluation scores for diverse turn and dialogue level qualities of recommendation conversations. FACE is reference-free and shows strong correlation with human judgments, achieving system correlation of 0.9 and turn/dialogue-level of 0.5, outperforming state-of-the-art CRS evaluation methods by a large margin. Additionally, unlike existing LLM-based methods that provide single uninterpretable scores, FACE provides insights into the system performance and enables identifying and locating problems within conversations.
♻ ☆ Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval ACM MM 2025
Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to the given query. To tackle this task, most existing VMR methods solely focus on the visual and textual modalities while neglecting the complementary but important audio modality. Although a few recent works try to tackle the joint audio-vision-text reasoning, they treat all modalities equally and simply embed them without fine-grained interaction for moment retrieval. These designs are counter-practical as: Not all audios are helpful for video moment retrieval, and the audio of some videos may be complete noise or background sound that is meaningless to the moment determination. To this end, we propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR. Specifically, after integrating the textual guidance with vision and audio separately, we first design a pseudo-label-supervised audio importance predictor that predicts the importance score of the audio, and accordingly assigns weights to mitigate the interference caused by noisy audio. Then, we design a multi-granularity audio fusion module that adaptively fuses audio and visual modalities at local-, event-, and global-level, fully capturing their complementary contexts. We further propose a cross-modal knowledge distillation strategy to address the challenge of missing audio modality during inference. To evaluate our method, we further construct a new VMR dataset, i.e., Charades-AudioMatter, where audio-related samples are manually selected and re-organized from the original Charades-STA to validate the model's capability in utilizing audio modality. Extensive experiments validate the effectiveness of our method, achieving state-of-the-art with audio-video fusion in VMR methods. Our code is available at https://github.com/HuiGuanLab/IMG.
comment: Accepted to ACM MM 2025
Image and Video Processing 14
☆ TraceTrans: Translation and Spatial Tracing for Surgical Prediction
Image-to-image translation models have achieved notable success in converting images across visual domains and are increasingly used for medical tasks such as predicting post-operative outcomes and modeling disease progression. However, most existing methods primarily aim to match the target distribution and often neglect spatial correspondences between the source and translated images. This limitation can lead to structural inconsistencies and hallucinations, undermining the reliability and interpretability of the predictions. These challenges are accentuated in clinical applications by the stringent requirement for anatomical accuracy. In this work, we present TraceTrans, a novel deformable image translation model designed for post-operative prediction that generates images aligned with the target distribution while explicitly revealing spatial correspondences with the pre-operative input. The framework employs an encoder for feature extraction and dual decoders for predicting spatial deformations and synthesizing the translated image. The predicted deformation field imposes spatial constraints on the generated output, ensuring anatomical consistency with the source. Extensive experiments on medical cosmetology and brain MRI datasets demonstrate that TraceTrans delivers accurate and interpretable post-operative predictions, highlighting its potential for reliable clinical deployment.
☆ Synthetic-to-Real Transfer Learning for Chromatin-Sensitive PWS Microscopy
Chromatin sensitive partial wave spectroscopic (csPWS) microscopy enables label free detection of nanoscale chromatin packing alterations that occur before visible cellular transformation. However, manual nuclear segmentation limits population scale analysis needed for biomarker discovery in early cancer detection. The lack of annotated csPWS imaging data prevents direct use of standard deep learning methods. We present CFU Net, a hierarchical segmentation architecture trained with a three stage curriculum on synthetic multimodal data. CFU Net achieves near perfect performance on held out synthetic test data that represent diverse spectroscopic imaging conditions without manual annotations (Dice 0.9879, IoU 0.9895). Our approach uses physics based rendering that incorporates empirically supported chromatin packing statistics, Mie scattering models, and modality specific noise, combined with a curriculum that progresses from adversarial RGB pretraining to spectroscopic fine tuning and histology validation. CFU Net integrates five architectural elements (ConvNeXt backbone, Feature Pyramid Network, UNet plus plus dense connections, dual attention, and deep supervision) that together improve Dice over a baseline UNet by 8.3 percent. We demonstrate deployment ready INT8 quantization with 74.9 percent compression and 0.15 second inference, giving a 240 times throughput gain over manual analysis. Applied to more than ten thousand automatically segmented nuclei from synthetic test data, the pipeline extracts chromatin biomarkers that distinguish normal from pre cancerous tissue with large effect sizes (Cohens d between 1.31 and 2.98), reaching 94 percent classification accuracy. This work provides a general framework for synthetic to real transfer learning in specialized microscopy and open resources for community validation on clinical specimens.
comment: 24 pages, 5 figures and 4 tables
☆ Expert Validation of Synthetic Cervical Spine Radiographs Generated with a Denoising Diffusion Probabilistic Model
Machine learning in neurosurgery is limited by challenges in assembling large, high-quality imaging datasets. Synthetic data offers a scalable, privacy-preserving solution. We evaluated the feasibility of generating realistic lateral cervical spine radiographs using a denoising diffusion probabilistic model (DDPM) trained on 4,963 images from the Cervical Spine X-ray Atlas. Model performance was monitored via training/validation loss and Frechet inception distance, and synthetic image quality was assessed in a blinded "clinical Turing test" with six neuroradiologists and two spine-fellowship trained neurosurgeons. Experts reviewed 50 quartets containing one real and three synthetic images, identifying the real image and rating realism on a 4-point Likert scale. Experts correctly identified the real image in 29% of trials (Fleiss' kappa=0.061). Mean realism scores were comparable between real (3.323) and synthetic images (3.228, 3.258, and 3.320; p=0.383, 0.471, 1.000). Nearest-neighbor analysis found no evidence of memorization. We also provide a dataset of 20,063 synthetic radiographs. These results demonstrate that DDPM-generated cervical spine X-rays are statistically indistinguishable in realism and quality from real clinical images, offering a novel approach to creating large-scale neuroimaging datasets for ML applications in landmarking, segmentation, and classification.
comment: 10 pages, 4 figures, 1 table
☆ Frequency-Spatial Interaction Driven Network for Low-Light Image Enhancement
Low-light image enhancement (LLIE) aims at improving the perception or interpretability of an image captured in an environment with poor illumination. With the advent of deep learning, the LLIE technique has achieved significant breakthroughs. However, existing LLIE methods either ignore the important role of frequency domain information or fail to effectively promote the propagation and flow of information, limiting the LLIE performance. In this paper, we develop a novel frequency-spatial interaction-driven network (FSIDNet) for LLIE based on two-stage architecture. To be specific, the first stage is designed to restore the amplitude of low-light images to improve the lightness, and the second stage devotes to restore phase information to refine fine-grained structures. Considering that Frequency domain and spatial domain information are complementary and both favorable for LLIE, we further develop two frequency-spatial interaction blocks which mutually amalgamate the complementary spatial and frequency information to enhance the capability of the model. In addition, we construct the Information Exchange Module (IEM) to associate two stages by adequately incorporating cross-stage and cross-scale features to effectively promote the propagation and flow of information in the two-stage network structure. Finally, we conduct experiments on several widely used benchmark datasets (i.e., LOL-Real, LSRW-Huawei, etc.), which demonstrate that our method achieves the excellent performance in terms of visual results and quantitative metrics while preserving good model efficiency.
☆ LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction
Vision-Language Models (VLMs) have shown significant progress in open-set challenges. However, the limited availability of 3D datasets hinders their effective application in 3D scene understanding. We propose LOC, a general language-guided framework adaptable to various occupancy networks, supporting both supervised and self-supervised learning paradigms. For self-supervised tasks, we employ a strategy that fuses multi-frame LiDAR points for dynamic/static scenes, using Poisson reconstruction to fill voids, and assigning semantics to voxels via K-Nearest Neighbor (KNN) to obtain comprehensive voxel representations. To mitigate feature over-homogenization caused by direct high-dimensional feature distillation, we introduce Densely Contrastive Learning (DCL). DCL leverages dense voxel semantic information and predefined textual prompts. This efficiently enhances open-set recognition without dense pixel-level supervision, and our framework can also leverage existing ground truth to further improve performance. Our model predicts dense voxel features embedded in the CLIP feature space, integrating textual and image pixel information, and classifies based on text and semantic similarity. Experiments on the nuScenes dataset demonstrate the method's superior performance, achieving high-precision predictions for known classes and distinguishing unknown classes without additional training data.
☆ FPGA-based Lane Detection System incorporating Temperature and Light Control Units
Intelligent vehicles are one of the most important outcomes gained from the world tendency toward automation. Applications of IVs, whether in urban roads or robot tracks, do prioritize lane path detection. This paper proposes an FPGA-based Lane Detector Vehicle LDV architecture that relies on the Sobel algorithm for edge detection. Operating on 416 x 416 images and 150 MHz, the system can generate a valid output every 1.17 ms. The valid output consists of the number of present lanes, the current lane index, as well as its right and left boundaries. Additionally, the automated light and temperature control units in the proposed system enhance its adaptability to the surrounding environmental conditions.
comment: 5 pages, 8 figures, 3 tables
☆ Cross-Enhanced Multimodal Fusion of Eye-Tracking and Facial Features for Alzheimer's Disease Diagnosis
Accurate diagnosis of Alzheimer's disease (AD) is essential for enabling timely intervention and slowing disease progression. Multimodal diagnostic approaches offer considerable promise by integrating complementary information across behavioral and perceptual domains. Eye-tracking and facial features, in particular, are important indicators of cognitive function, reflecting attentional distribution and neurocognitive state. However, few studies have explored their joint integration for auxiliary AD diagnosis. In this study, we propose a multimodal cross-enhanced fusion framework that synergistically leverages eye-tracking and facial features for AD detection. The framework incorporates two key modules: (a) a Cross-Enhanced Fusion Attention Module (CEFAM), which models inter-modal interactions through cross-attention and global enhancement, and (b) a Direction-Aware Convolution Module (DACM), which captures fine-grained directional facial features via horizontal-vertical receptive fields. Together, these modules enable adaptive and discriminative multimodal representation learning. To support this work, we constructed a synchronized multimodal dataset, including 25 patients with AD and 25 healthy controls (HC), by recording aligned facial video and eye-tracking sequences during a visual memory-search paradigm, providing an ecologically valid resource for evaluating integration strategies. Extensive experiments on this dataset demonstrate that our framework outperforms traditional late fusion and feature concatenation methods, achieving a classification accuracy of 95.11% in distinguishing AD from HC, highlighting superior robustness and diagnostic performance by explicitly modeling inter-modal dependencies and modality-specific contributions.
comment: 35 pages, 8 figures, and 7 tables
☆ CFL-SparseMed: Communication-Efficient Federated Learning for Medical Imaging with Top-k Sparse Updates
Secure and reliable medical image classification is crucial for effective patient treatment, but centralized models face challenges due to data and privacy concerns. Federated Learning (FL) enables privacy-preserving collaborations but struggles with heterogeneous, non-IID data and high communication costs, especially in large networks. We propose \textbf{CFL-SparseMed}, an FL approach that uses Top-k Sparsification to reduce communication overhead by transmitting only the top k gradients. This unified solution effectively addresses data heterogeneity while maintaining model accuracy. It enhances FL efficiency, preserves privacy, and improves diagnostic accuracy and patient care in non-IID medical imaging settings. The reproducibility source code is available on \href{https://github.com/Aniket2241/APK_contruct}{Github}.
♻ ☆ Kernel Density Steering: Inference-Time Scaling via Mode Seeking for Image Restoration
Diffusion models show promise for image restoration, but existing methods often struggle with inconsistent fidelity and undesirable artifacts. To address this, we introduce Kernel Density Steering (KDS), a novel inference-time framework promoting robust, high-fidelity outputs through explicit local mode-seeking. KDS employs an $N$-particle ensemble of diffusion samples, computing patch-wise kernel density estimation gradients from their collective outputs. These gradients steer patches in each particle towards shared, higher-density regions identified within the ensemble. This collective local mode-seeking mechanism, acting as "collective wisdom", steers samples away from spurious modes prone to artifacts, arising from independent sampling or model imperfections, and towards more robust, high-fidelity structures. This allows us to obtain better quality samples at the expense of higher compute by simultaneously sampling multiple particles. As a plug-and-play framework, KDS requires no retraining or external verifiers, seamlessly integrating with various diffusion samplers. Extensive numerical validations demonstrate KDS substantially improves both quantitative and qualitative performance on challenging real-world super-resolution and image inpainting tasks.
♻ ☆ A Poisson-Guided Decomposition Network for Extreme Low-Light Image Enhancement
Low-light image denoising and enhancement are challenging, especially when traditional noise assumptions, such as Gaussian noise, do not hold in majority. In many real-world scenarios, such as low-light imaging, noise is signal-dependent and is better represented as Poisson noise. In this work, we address the problem of denoising images degraded by Poisson noise under extreme low-light conditions. We introduce a light-weight deep learning-based method that integrates Retinex based decomposition with Poisson denoising into a unified encoder-decoder network. The model simultaneously enhances illumination and suppresses noise by incorporating a Poisson denoising loss to address signal-dependent noise. Without prior requirement for reflectance and illumination, the network learns an effective decomposition process while ensuring consistent reflectance and smooth illumination without causing any form of color distortion. The experimental results demonstrate the effectiveness and practicality of the proposed low-light illumination enhancement method. Our method significantly improves visibility and brightness in low-light conditions, while preserving image structure and color constancy under ambient illumination.
comment: 8 pages, 3 figures and 1 table
♻ ☆ Dynamic-Aware Spatio-temporal Representation Learning for Dynamic MRI Reconstruction MICCAI2025
Dynamic MRI reconstruction, one of inverse problems, has seen a surge by the use of deep learning techniques. Especially, the practical difficulty of obtaining ground truth data has led to the emergence of unsupervised learning approaches. A recent promising method among them is implicit neural representation (INR), which defines the data as a continuous function that maps coordinate values to the corresponding signal values. This allows for filling in missing information only with incomplete measurements and solving the inverse problem effectively. Nevertheless, previous works incorporating this method have faced drawbacks such as long optimization time and the need for extensive hyperparameter tuning. To address these issues, we propose Dynamic-Aware INR (DA-INR), an INR-based model for dynamic MRI reconstruction that captures the spatial and temporal continuity of dynamic MRI data in the image domain and explicitly incorporates the temporal redundancy of the data into the model structure. As a result, DA-INR outperforms other models in reconstruction quality even at extreme undersampling ratios while significantly reducing optimization time and requiring minimal hyperparameter tuning.
comment: MICCAI2025
♻ ☆ Continuous and complete liver vessel segmentation with graph-attention guided diffusion
Improving connectivity and completeness are the most challenging aspects of liver vessel segmentation, especially for small vessels. These challenges require both learning the continuous vessel geometry, and focusing on small vessel detection. However, current methods do not explicitly address these two aspects and cannot generalize well when constrained by inconsistent annotations. Here, we take advantage of the generalization of the diffusion model and explicitly integrate connectivity and completeness in our diffusion-based segmentation model. Specifically, we use a graph-attention module that adds knowledge about vessel geometry, and thus adds continuity. Additionally, we perform the graph-attention at multiple-scales, thus focusing on small liver vessels. Our method outperforms eight state-of-the-art medical segmentation methods on two public datasets: 3D-ircadb-01 and LiVS. Our code is available at https://github.com/ZhangXiaotong015/GATSegDiff.
comment: Accepted by Knowledge-Based Systems
♻ ☆ Macro2Micro: A Rapid and Precise Cross-modal Magnetic Resonance Imaging Synthesis using Multi-scale Structural Brain Similarity
The human brain is a complex system requiring both macroscopic and microscopic components for comprehensive understanding. However, mapping nonlinear relationships between these scales remains challenging due to technical limitations and the high cost of multimodal Magnetic Resonance Imaging (MRI) acquisition. To address this, we introduce Macro2Micro, a deep learning framework that predicts brain microstructure from macrostructure using a Generative Adversarial Network (GAN). Based on the hypothesis that microscale structural information can be inferred from macroscale structures, Macro2Micro explicitly encodes multiscale brain information into distinct processing branches. To enhance artifact elimination and output quality, we propose a simple yet effective auxiliary discriminator and learning objective. Extensive experiments demonstrated that Macro2Micro faithfully translates T1-weighted MRIs into corresponding Fractional Anisotropy (FA) images, achieving a 6.8\% improvement in the Structural Similarity Index Measure (SSIM) compared to previous methods, while retaining the individual biological characteristics of the brain. With an inference time of less than 0.01 seconds per MR modality translation, Macro2Micro introduces the potential for real-time multimodal rendering in medical and research applications. The code will be made available upon acceptance.
comment: The code will be made available upon acceptance
♻ ☆ Reducing the Representation Error of GAN Image Priors Using the Deep Decoder
Generative models, such as GANs, learn an explicit low-dimensional representation of a particular class of images, and so they may be used as natural image priors for solving inverse problems such as image restoration and compressive sensing. GAN priors have demonstrated impressive performance on these tasks, but they can exhibit substantial representation error for both in-distribution and out-of-distribution images, because of the mismatch between the learned, approximate image distribution and the data generating distribution. In this paper, we demonstrate a method for reducing the representation error of GAN priors by modeling images as the linear combination of a GAN prior with a Deep Decoder. The deep decoder is an underparameterized and most importantly unlearned natural signal model similar to the Deep Image Prior. No knowledge of the specific inverse problem is needed in the training of the GAN underlying our method. For compressive sensing and image superresolution, our hybrid model exhibits consistently higher PSNRs than both the GAN priors and Deep Decoder separately, both on in-distribution and out-of-distribution images. This model provides a method for extensibly and cheaply leveraging both the benefits of learned and unlearned image recovery priors in inverse problems.
Systems and Control 16
☆ Resilient Composite Control for Stability Enhancement in EV Integrated DC Microgrids
When electric vehicles (EVs) are integrated into standalone DC microgrids (DCMGs), stability issues arise due to their constant power load (CPL) behavior, which provides negative incremental impedance (NII). In addition, the microgrids suffer from an inherent low-inertia problem. Therefore, this study presents a composite controller incorporating a global integral terminal sliding mode controller with a backstepping controller. A virtual capacitor is employed to mitigate the low-inertia issue and strengthen the DC-bus response. An improved fractional power-based reaching law decreases chattering and accelerates convergence. Exact feedback linearization converts the nonlinear boost converter model into Brunovsky's canonical form, mitigating NII effects and non-minimum phase issues. The entire system stability is verified using Lyapunov control theory. Simulation outcomes confirm superior performance, with 34.4-53.3% reduction in overshoot, 52.9-74.9% in undershoot, and 12-47.4% in settling time compared to the existing controller.
☆ A Novel Multi-Timescale Stability-Preserving Hierarchical Reinforcement Learning Controller Framework for Adaptive Control in High-Dimensional Dynamical Systems
Controlling high-dimensional stochastic systems, critical in robotics, autonomous vehicles, and hyperchaotic systems, faces the curse of dimensionality, lacks temporal abstraction, and often fails to ensure stochastic stability. To overcome these limitations, this study introduces the Multi-Timescale Lyapunov-Constrained Hierarchical Reinforcement Learning (MTLHRL) framework. MTLHRL integrates a hierarchical policy within a semi-Markov Decision Process (SMDP), featuring a high-level policy for strategic planning and a low-level policy for reactive control, which effectively manages complex, multi-timescale decision-making and reduces dimensionality overhead. Stability is rigorously enforced using a neural Lyapunov function optimized via Lagrangian relaxation and multi-timescale actor-critic updates, ensuring mean-square boundedness or asymptotic stability in the face of stochastic dynamics. The framework promotes efficient and reliable learning through trust-region constraints and decoupled optimization. Extensive simulations on an 8D hyperchaotic system and a 5-DOF robotic manipulator demonstrate MTLHRL's empirical superiority. It significantly outperforms baseline methods in both stability and performance, recording the lowest error indices (e.g., Integral Absolute Error (IAE): 3.912 in hyperchaotic control and IAE: 1.623 in robotics), achieving faster convergence, and exhibiting superior disturbance rejection. MTLHRL offers a theoretically grounded and practically viable solution for robust control of complex stochastic systems.
☆ Politics, Inequality, and the Robustness of Shared Infrastructure Systems
Our infrastructure systems enable our well-being by allowing us to move, store, and transform materials and information given considerable social and environmental variation. Critically, this ability is shaped by the degree to which society invests in infrastructure, a fundamentally political question in large public systems. There, infrastructure providers are distinguished from users through political processes, such as elections, and there is considerable heterogeneity among users. Previous political economic models have not taken into account (i) dynamic infrastructures, (ii) dynamic user preferences, and (iii) alternatives to rational actor theory. Meanwhile, engineering often neglects politics. We address these gaps with a general dynamic model of shared infrastructure systems that incorporates theories from political economy, social-ecological systems, and political psychology. We use the model to develop propositions on how multiple characteristics of the political process impact the robustness of shared infrastructure systems to capacity shocks and unequal opportunity for private infrastructure investment. Under user fees, inequality decreases robustness, but taxing private infrastructure use can increase robustness if non-elites have equal political influence. Election cycle periods have a nonlinear effect where increasing them increases robustness up to a point but decreases robustness beyond that point. Further, there is a negative relationship between the ideological sensitivity of candidates and robustness. Overall, the biases of voters and candidates (whether they favor tax increases or decreases) mediate these political-economic effects on robustness because biases may or may not match the reality of system needs (whether system recovery requires tax increases).
☆ Vector-Valued Native Space Embedding for Adaptive State Observation
This paper combines vector-valued reproducing kernel Hilbert space (vRKHS) embedding with robust adaptive observation, yielding an algorithm that is both non-parametric and robust. The main contribution of this paper lies in the ability of the proposed system to estimate the state of a plan model whose matched uncertainties are elements of an infinite-dimensional native space. The plant model considered in this paper also suffers from unmatched uncertainties. Finally, the measured output is affected by disturbances as well. Upper bounds on the state observation error are provided in an analytical form. The proposed theoretical results are applied to the problem of estimating the state of a rigid body.
☆ Model-Free Power System Stability Enhancement with Dissipativity-Based Neural Control SC
The integration of converter-interfaced generation introduces new transient stability challenges to modern power systems. Classical Lyapunov- and scalable passivity-based approaches typically rely on restrictive assumptions, and finding storage functions for large grids is generally considered intractable. Furthermore, most methods require an accurate grid dynamics model. To address these challenges, we propose a model-free, nonlinear, and dissipativity-based controller which, when applied to grid-connected virtual synchronous generators (VSGs), enhances power system transient stability. Using input-state data, we train neural networks to learn dissipativity-characterizing matrices that yield stabilizing controllers. Furthermore, we incorporate cost function shaping to improve the performance with respect to the user-specified objectives. Numerical results on a modified, all-VSG Kundur two-area power system validate the effectiveness of the proposed approach.
comment: 7 pages, 6 figures, submitted to the 24th Power Systems Computation Conference (PSCC 2026)
☆ Fair Cost Allocation in Energy Communities: A DLMP-based Bilevel Optimization with a Shapley Value Approach
Energy communities (ECs) are emerging as a promising decentralized model for managing cooperative distributed energy resources (DERs). As these communities expand and their operations become increasingly integrated into the grid, ensuring fairness in allocating operating costs among participants becomes a challenge. In distribution networks, DER operations at the community level can influence Distribution Locational Marginal Prices (DLMPs), which in turn affect system's operation cost. This interdependence between local decisions and system-level pricing introduces new challenges for fair and transparent cost allocation. Despite growing interest in fairness-aware methods, most methods do not account for the impact of DLMPs. To fill this gap, we propose a bilevel optimization model in which a Community Energy Aggregator (CEA) schedules DERs across multiple ECs while a Distribution System Operator (DSO) determines DLMPs through network-constrained dispatch. Leveraging the Karush-Kuhn-Tucker (KKT) conditions and strong duality, the bilevel model is reformulated into a tractable single-level problem. We achieve fairness in the cost allocation by applying the Shapley value to quantify each community's marginal contribution to system-wide cost savings. The effectiveness of the proposed method is validated through simulations on several benchmark distribution systems.
☆ Adapting Noise-Driven PUF and AI for Secure WBG ICS: A Proof-of-Concept Study
Wide-bandgap (WBG) technologies offer unprecedented improvements in power system efficiency, size, and performance, but also introduce unique sensor corruption and cybersecurity risks in industrial control systems (ICS), particularly due to high-frequency noise and sophisticated cyber-physical threats. This proof-of-concept (PoC) study demonstrates the adaptation of a noise-driven physically unclonable function (PUF) and machine learning (ML)-assisted anomaly detection framework to the demanding environment of WBG-based ICS sensor pathways. By extracting entropy from unavoidable WBG switching noise (up to 100 kHz) as a PUF source, and simultaneously using this noise as a real-time threat indicator, the proposed system unites hardware-level authentication and anomaly detection. Our approach integrates hybrid machine learning (ML) models with adaptive Bayesian filtering, providing robust and low-latency detection capabilities resilient to both natural electromagnetic interference (EMI) and active adversarial manipulation. Through detailed simulations of WBG modules under benign and attack scenarios--including EMI injection, signal tampering, and node impersonation--we achieve 95% detection accuracy and sub-millisecond processing latency. These results demonstrate the feasibility of physics-driven, dual-use noise exploitation as a scalable ICS defense primitive. Our findings lay the groundwork for next-generation security strategies that leverage inherent device characteristics, bridging hardware and artificial intelligence (AI) for enhanced protection of critical ICS infrastructure.
☆ Distributed Stochastic Proximal Algorithm on Riemannian Submanifolds for Weakly-convex Functions
This paper aims to investigate the distributed stochastic optimization problems on compact embedded submanifolds (in the Euclidean space) for multi-agent network systems. To address the manifold structure, we propose a distributed Riemannian stochastic proximal algorithm framework by utilizing the retraction and Riemannian consensus protocol, and analyze three specific algorithms: the distributed Riemannian stochastic subgradient, proximal point, and prox-linear algorithms. When the local costs are weakly-convex and the initial points satisfy certain conditions, we show that the iterates generated by this framework converge to a nearly stationary point in expectation while achieving consensus. We further establish the convergence rate of the algorithm framework as $\mathcal{O}(\frac{1+\kappa_g}{\sqrt{k}})$ where $k$ denotes the number of iterations and $\kappa_g$ shows the impact of manifold geometry on the algorithm performance. Finally, numerical experiments are implemented to demonstrate the theoretical results and show the empirical performance.
☆ Taming Silent Failures: A Framework for Verifiable AI Reliability
The integration of Artificial Intelligence (AI) into safety-critical systems introduces a new reliability paradigm: silent failures, where AI produces confident but incorrect outputs that can be dangerous. This paper introduces the Formal Assurance and Monitoring Environment (FAME), a novel framework that confronts this challenge. FAME synergizes the mathematical rigor of offline formal synthesis with the vigilance of online runtime monitoring to create a verifiable safety net around opaque AI components. We demonstrate its efficacy in an autonomous vehicle perception system, where FAME successfully detected 93.5% of critical safety violations that were otherwise silent. By contextualizing our framework within the ISO 26262 and ISO/PAS 8800 standards, we provide reliability engineers with a practical, certifiable pathway for deploying trustworthy AI. FAME represents a crucial shift from accepting probabilistic performance to enforcing provable safety in next-generation systems.
comment: This preprint has been accepted by IEEE Reliability Magazine. 10 pages, 3 figures
☆ TRASE-NODEs: Trajectory Sensitivity-aware Neural Ordinary Differential Equations for Efficient Dynamic Modeling
Modeling dynamical systems is crucial across the science and engineering fields for accurate prediction, control, and decision-making. Recently, machine learning (ML) approaches, particularly neural ordinary differential equations (NODEs), have emerged as a powerful tool for data-driven modeling of continuous-time dynamics. Nevertheless, standard NODEs require a large number of data samples to remain consistent under varying control inputs, posing challenges to generate sufficient simulated data and ensure the safety of control design. To address this gap, we propose trajectory-sensitivity-aware (TRASE-)NODEs, which construct an augmented system for both state and sensitivity, enabling simultaneous learning of their dynamics. This formulation allows the adjoint method to update gradients in a memory-efficient manner and ensures that control-input effects are captured in the learned dynamics. We evaluate TRASE-NODEs using damped oscillator and inverter-based resources (IBRs). The results show that TRASE-NODEs generalize better from the limited training data, yielding lower prediction errors than standard NODEs for both examples. The proposed framework offers a data-efficient, control-oriented modeling approach suitable for dynamic systems that require accurate trajectory sensitivity prediction.
☆ From Time Series to Affine Systems
The paper extends core results of behavioral systems theory from linear to affine time-invariant systems. We characterize the behavior of affine time-invariant systems via kernel, input-output, state-space, and finite-horizon data-driven representations, demonstrating a range of structural parallels with linear time-invariant systems. Building on these representations, we introduce a new persistence of excitation condition tailored to the model class of affine time-invariant systems. The condition yields a new fundamental lemma that parallels the classical result for linear systems while provably reducing data requirements. Our analysis highlights that excitation conditions must be adapted to the model class: overlooking structural differences may lead to unnecessarily conservative data requirements.
comment: Submitted to the IEEE Transactions on Automatic Control
♻ ☆ A general framework for supporting economic feasibility of generator and storage energy systems through capacity and dispatch optimization
Integration of various electricity-generating technologies (such as natural gas, wind, nuclear, etc.) with storage systems (such as thermal, battery electric, hydrogen, etc.) has the potential to improve the economic competitiveness of modern energy systems. Driven by the need to efficiently assess the economic feasibility of various energy system configurations in early system concept development, this work outlines a versatile computational framework for assessing the net present value of various integrated storage technologies. The subsystems' fundamental dynamics are defined, with a particular emphasis on balancing critical physical and economic domains to enable optimal decision-making in the context of capacity and dispatch optimization. In its presented form, the framework formulates a linear, convex optimization problem that can be efficiently solved using a direct transcription approach in the open-source software DTQP. Three case studies demonstrate and validate the framework's capabilities, highlighting its value and computational efficiency in facilitating the economic assessment of various energy system configurations. In particular, natural gas with thermal storage and carbon capture, wind energy with battery storage, and nuclear with hydrogen are demonstrated.
comment: 16 pages, 10 figures
♻ ☆ Accelerated Gradient Methods for Nonconvex Optimization: Escape Trajectories From Strict Saddle Points and Convergence to Local Minima
This paper considers the problem of understanding the behavior of a general class of accelerated gradient methods on smooth nonconvex functions. Motivated by some recent works that have proposed effective algorithms, based on Polyak's heavy ball method and the Nesterov accelerated gradient method, to achieve convergence to a local minimum of nonconvex functions, this work proposes a broad class of Nesterov-type accelerated methods and puts forth a rigorous study of these methods encompassing the escape from saddle points and convergence to local minima through both an asymptotic and a non-asymptotic analysis. In the asymptotic regime, this paper answers an open question of whether Nesterov's accelerated gradient method (NAG) with variable momentum parameter avoids strict saddle points almost surely. This work also develops two metrics of asymptotic rates of convergence and divergence, and evaluates these two metrics for several popular standard accelerated methods such as the NAG and Nesterov's accelerated gradient with constant momentum (NCM) near strict saddle points. In the non-asymptotic regime, this work provides an analysis that leads to the "linear" exit time estimates from strict saddle neighborhoods for trajectories of these accelerated methods as well the necessary conditions for the existence of such trajectories. Finally, this work studies a sub-class of accelerated methods that can converge in convex neighborhoods of nonconvex functions with a near optimal rate to a local minimum and at the same time this sub-class offers superior saddle-escape behavior compared to that of NAG.
comment: 122 pages, 20 figures; accepted for publication in Foundations of Computational Mathematics (FoCM)
♻ ☆ Security of Gradient Tracking Algorithms Against Malicious Agents
Consensus algorithms are fundamental to multi-agent distributed optimization, and their security under adversarial conditions is an active area of research. While prior works primarily establish conditions for successful global consensus under attack, little is known about system behavior when these conditions are violated. This paper addresses this gap by investigating the robustness of the Wang--Elia algorithm, which is a robust to noise version of gradient tracking algorithm, in the presence of malicious agents. We consider a network of agents collaboratively minimizing a global cost function, where a subset of agents may transmit faulty information to disrupt consensus. To quantify resilience, we formulate a security metric as an optimization problem, which is rooted in centralized attack detection literature. We provide a tractable reformulation of the optimization problem, and derive conditions under which the metric becomes unbounded, identifying undetectable attack signals that reveal inherent vulnerabilities. To facilitate design and analysis, we propose a well-posed variant of the metric and propose design methods to enhance network robustness against stealthy adversarial attacks. Numerical examples demonstrate the effectiveness of the proposed framework to enhance the resilience of multi-agent distributed optimization.
comment: under review
♻ ☆ Prognostic Framework for Robotic Manipulators Operating Under Dynamic Task Severities
Robotic manipulators are critical in many applications but are known to degrade over time. This degradation is influenced by the nature of the tasks performed by the robot. Tasks with higher severity, such as handling heavy payloads, can accelerate the degradation process. One way this degradation is reflected is in the position accuracy of the robot's end-effector. In this paper, we present a prognostic modeling framework that predicts a robotic manipulator's Remaining Useful Life (RUL) while accounting for the effects of task severity. Our framework represents the robot's position accuracy as a Brownian motion process with a random drift parameter that is influenced by task severity. The dynamic nature of task severity is modeled using a continuous-time Markov chain (CTMC). To evaluate RUL, we discuss two approaches -- (1) a novel closed-form expression for Remaining Lifetime Distribution (RLD), and (2) Monte Carlo simulations, commonly used in prognostics literature. Theoretical results establish the equivalence between these RUL computation approaches. We validate our framework through experiments using two distinct physics-based simulators for planar and spatial robot fleets. Our findings show that robots in both fleets experience shorter RUL when handling a higher proportion of high-severity tasks.
comment: Accepted for Publication in IEEE Transactions on Systems, Man, and Cybernetics: Systems
♻ ☆ $Δ_T$ Noise in Mesoscopic Hybrid Junctions: Influence of Barrier Strength and Thermal Bias
Quantum noise is a fundamental probe of quantum transport phenomena, offering insights into current correlations and wave-particle duality. A particularly intriguing form of such noise, $\Delta_T$ noise, emerges under a finite temperature difference in the absence of charge current at zero voltage bias. In this work, we investigate $\Delta_T$ noise in mesoscopic hybrid junctions incorporating insulating barriers, where the average charge current remains zero at zero bias. Using quantum shot noise measurements, we demonstrate that $\Delta_T$ noise in metal-insulator-superconductor (NIS) junctions is approximately $16$ times greater than in metal-insulator-metal (NIN) counterparts. Our analysis further reveals that $\Delta_T$ noise exhibits a non-monotonic dependence on barrier strength, rising to a peak before declining, while increasing monotonically with the applied temperature bias. These findings underscore the rich interplay between thermal gradients and barrier properties in determining quantum noise characteristics in hybrid mesoscopic systems.
comment: 19 pages, 6 figures, accepted for publication in J. Phys. Condensed Matter (2025)
Multimedia 2
☆ Frequency-Spatial Interaction Driven Network for Low-Light Image Enhancement
Low-light image enhancement (LLIE) aims at improving the perception or interpretability of an image captured in an environment with poor illumination. With the advent of deep learning, the LLIE technique has achieved significant breakthroughs. However, existing LLIE methods either ignore the important role of frequency domain information or fail to effectively promote the propagation and flow of information, limiting the LLIE performance. In this paper, we develop a novel frequency-spatial interaction-driven network (FSIDNet) for LLIE based on two-stage architecture. To be specific, the first stage is designed to restore the amplitude of low-light images to improve the lightness, and the second stage devotes to restore phase information to refine fine-grained structures. Considering that Frequency domain and spatial domain information are complementary and both favorable for LLIE, we further develop two frequency-spatial interaction blocks which mutually amalgamate the complementary spatial and frequency information to enhance the capability of the model. In addition, we construct the Information Exchange Module (IEM) to associate two stages by adequately incorporating cross-stage and cross-scale features to effectively promote the propagation and flow of information in the two-stage network structure. Finally, we conduct experiments on several widely used benchmark datasets (i.e., LOL-Real, LSRW-Huawei, etc.), which demonstrate that our method achieves the excellent performance in terms of visual results and quantitative metrics while preserving good model efficiency.
♻ ☆ Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval ACM MM 2025
Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to the given query. To tackle this task, most existing VMR methods solely focus on the visual and textual modalities while neglecting the complementary but important audio modality. Although a few recent works try to tackle the joint audio-vision-text reasoning, they treat all modalities equally and simply embed them without fine-grained interaction for moment retrieval. These designs are counter-practical as: Not all audios are helpful for video moment retrieval, and the audio of some videos may be complete noise or background sound that is meaningless to the moment determination. To this end, we propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR. Specifically, after integrating the textual guidance with vision and audio separately, we first design a pseudo-label-supervised audio importance predictor that predicts the importance score of the audio, and accordingly assigns weights to mitigate the interference caused by noisy audio. Then, we design a multi-granularity audio fusion module that adaptively fuses audio and visual modalities at local-, event-, and global-level, fully capturing their complementary contexts. We further propose a cross-modal knowledge distillation strategy to address the challenge of missing audio modality during inference. To evaluate our method, we further construct a new VMR dataset, i.e., Charades-AudioMatter, where audio-related samples are manually selected and re-organized from the original Charades-STA to validate the model's capability in utilizing audio modality. Extensive experiments validate the effectiveness of our method, achieving state-of-the-art with audio-video fusion in VMR methods. Our code is available at https://github.com/HuiGuanLab/IMG.
comment: Accepted to ACM MM 2025
Computer Vision and Pattern Recognition 25
☆ Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent
When a vision model performs image recognition, which visual attributes drive its predictions? Detecting unintended reliance on specific visual features is critical for ensuring model robustness, preventing overfitting, and avoiding spurious correlations. We introduce an automated framework for detecting such dependencies in trained vision models. At the core of our method is a self-reflective agent that systematically generates and tests hypotheses about visual attributes that a model may rely on. This process is iterative: the agent refines its hypotheses based on experimental outcomes and uses a self-evaluation protocol to assess whether its findings accurately explain model behavior. When inconsistencies arise, the agent self-reflects over its findings and triggers a new cycle of experimentation. We evaluate our approach on a novel benchmark of 130 models designed to exhibit diverse visual attribute dependencies across 18 categories. Our results show that the agent's performance consistently improves with self-reflection, with a significant performance increase over non-reflective baselines. We further demonstrate that the agent identifies real-world visual attribute dependencies in state-of-the-art models, including CLIP's vision encoder and the YOLOv8 object detector.
comment: 32 pages, 10 figures, Neurips 2025
☆ Visual Diffusion Models are Geometric Solvers
In this paper we show that visual diffusion models can serve as effective geometric solvers: they can directly reason about geometric problems by working in pixel space. We first demonstrate this on the Inscribed Square Problem, a long-standing problem in geometry that asks whether every Jordan curve contains four points forming a square. We then extend the approach to two other well-known hard geometric problems: the Steiner Tree Problem and the Simple Polygon Problem. Our method treats each problem instance as an image and trains a standard visual diffusion model that transforms Gaussian noise into an image representing a valid approximate solution that closely matches the exact one. The model learns to transform noisy geometric structures into correct configurations, effectively recasting geometric reasoning as image generation. Unlike prior work that necessitates specialized architectures and domain-specific adaptations when applying diffusion to parametric geometric representations, we employ a standard visual diffusion model that operates on the visual representation of the problem. This simplicity highlights a surprising bridge between generative modeling and geometric problem solving. Beyond the specific problems studied here, our results point toward a broader paradigm: operating in image space provides a general and practical framework for approximating notoriously hard problems, and opens the door to tackling a far wider class of challenging geometric tasks.
comment: Project page: https://kariander1.github.io/visual-geo-solver/
☆ BachVid: Training-Free Video Generation with Consistent Background and Character
Diffusion Transformers (DiTs) have recently driven significant progress in text-to-video (T2V) generation. However, generating multiple videos with consistent characters and backgrounds remains a significant challenge. Existing methods typically rely on reference images or extensive training, and often only address character consistency, leaving background consistency to image-to-video models. We introduce BachVid, the first training-free method that achieves consistent video generation without needing any reference images. Our approach is based on a systematic analysis of DiT's attention mechanism and intermediate features, revealing its ability to extract foreground masks and identify matching points during the denoising process. Our method leverages this finding by first generating an identity video and caching the intermediate variables, and then inject these cached variables into corresponding positions in newly generated videos, ensuring both foreground and background consistency across multiple videos. Experimental results demonstrate that BachVid achieves robust consistency in generated videos without requiring additional training, offering a novel and efficient solution for consistent video generation without relying on reference images or additional training.
comment: Project page: https://wolfball.github.io/bachvid
☆ On Thin Ice: Towards Explainable Conservation Monitoring via Attribution and Perturbations NeurIPS
Computer vision can accelerate ecological research and conservation monitoring, yet adoption in ecology lags in part because of a lack of trust in black-box neural-network-based models. We seek to address this challenge by applying post-hoc explanations to provide evidence for predictions and document limitations that are important to field deployment. Using aerial imagery from Glacier Bay National Park, we train a Faster R-CNN to detect pinnipeds (harbor seals) and generate explanations via gradient-based class activation mapping (HiResCAM, LayerCAM), local interpretable model-agnostic explanations (LIME), and perturbation-based explanations. We assess explanations along three axes relevant to field use: (i) localization fidelity: whether high-attribution regions coincide with the animal rather than background context; (ii) faithfulness: whether deletion/insertion tests produce changes in detector confidence; and (iii) diagnostic utility: whether explanations reveal systematic failure modes. Explanations concentrate on seal torsos and contours rather than surrounding ice/rock, and removal of the seals reduces detection confidence, providing model-evidence for true positives. The analysis also uncovers recurrent error sources, including confusion between seals and black ice and rocks. We translate these findings into actionable next steps for model development, including more targeted data curation and augmentation. By pairing object detection with post-hoc explainability, we can move beyond "black-box" predictions toward auditable, decision-supporting tools for conservation monitoring.
comment: NeurIPS Imageomics Workshop 2025
☆ WorldGrow: Generating Infinite 3D World
We tackle the challenge of generating the infinitely extendable 3D world -- large, continuous environments with coherent geometry and realistic appearance. Existing methods face key challenges: 2D-lifting approaches suffer from geometric and appearance inconsistencies across views, 3D implicit representations are hard to scale up, and current 3D foundation models are mostly object-centric, limiting their applicability to scene-level generation. Our key insight is leveraging strong generation priors from pre-trained 3D models for structured scene block generation. To this end, we propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis. Our method features three core components: (1) a data curation pipeline that extracts high-quality scene blocks for training, making the 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context-aware scene extension; and (3) a coarse-to-fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity. Evaluated on the large-scale 3D-FRONT dataset, WorldGrow achieves SOTA performance in geometry reconstruction, while uniquely supporting infinite scene generation with photorealistic and structurally consistent outputs. These results highlight its capability for constructing large-scale virtual environments and potential for building future world models.
comment: Project page: https://world-grow.github.io/ Code: https://github.com/world-grow/WorldGrow
☆ Foundation Models in Dermatopathology: Skin Tissue Classification
The rapid generation of whole-slide images (WSIs) in dermatopathology necessitates automated methods for efficient processing and accurate classification. This study evaluates the performance of two foundation models, UNI and Virchow2, as feature extractors for classifying WSIs into three diagnostic categories: melanocytic, basaloid, and squamous lesions. Patch-level embeddings were aggregated into slide-level features using a mean-aggregation strategy and subsequently used to train multiple machine learning classifiers, including logistic regression, gradient-boosted trees, and random forest models. Performance was assessed using precision, recall, true positive rate, false positive rate, and the area under the receiver operating characteristic curve (AUROC) on the test set. Results demonstrate that patch-level features extracted using Virchow2 outperformed those extracted via UNI across most slide-level classifiers, with logistic regression achieving the highest accuracy (90%) for Virchow2, though the difference was not statistically significant. The study also explored data augmentation techniques and image normalization to enhance model robustness and generalizability. The mean-aggregation approach provided reliable slide-level feature representations. All experimental results and metrics were tracked and visualized using WandB.ai, facilitating reproducibility and interpretability. This research highlights the potential of foundation models for automated WSI classification, providing a scalable and effective approach for dermatopathological diagnosis while paving the way for future advancements in slide-level representation learning.
Self-Supervised Learning of Synapse Types from EM Images
Separating synapses into different classes based on their appearance in EM images has many applications in biology. Examples may include assigning a neurotransmitter to a particular class, or separating synapses whose strength can be modulated from those whose strength is fixed. Traditionally, this has been done in a supervised manner, giving the classification algorithm examples of the different classes. Here we instead separate synapses into classes based only on the observation that nearby synapses in the same neuron are likely more similar than synapses chosen randomly from different cells. We apply our methodology to data from {\it Drosophila}. Our approach has the advantage that the number of synapse types does not need to be known in advance. It may also provide a principled way to select ground-truth that spans the range of synapse structure.
☆ Long-tailed Species Recognition in the NACTI Wildlife Dataset
As most ''in the wild'' data collections of the natural world, the North America Camera Trap Images (NACTI) dataset shows severe long-tailed class imbalance, noting that the largest 'Head' class alone covers >50% of the 3.7M images in the corpus. Building on the PyTorch Wildlife model, we present a systematic study of Long-Tail Recognition methodologies for species recognition on the NACTI dataset covering experiments on various LTR loss functions plus LTR-sensitive regularisation. Our best configuration achieves 99.40% Top-1 accuracy on our NACTI test data split, substantially improving over a 95.51% baseline using standard cross-entropy with Adam. This also improves on previously reported top performance in MLWIC2 at 96.8% albeit using partly unpublished (potentially different) partitioning, optimiser, and evaluation protocols. To evaluate domain shifts (e.g. night-time captures, occlusion, motion-blur) towards other datasets we construct a Reduced-Bias Test set from the ENA-Detection dataset where our experimentally optimised long-tail enhanced model achieves leading 52.55% accuracy (up from 51.20% with WCE loss), demonstrating stronger generalisation capabilities under distribution shift. We document the consistent improvements of LTR-enhancing scheduler choices in this NACTI wildlife domain, particularly when in tandem with state-of-the-art LTR losses. We finally discuss qualitative and quantitative shortcomings that LTR methods cannot sufficiently address, including catastrophic breakdown for 'Tail' classes under severe domain shift. For maximum reproducibility we publish all dataset splits, key code, and full network weights.
☆ Group Inertial Poser: Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging ICCV 2025
Tracking human full-body motion using sparse wearable inertial measurement units (IMUs) overcomes the limitations of occlusion and instrumentation of the environment inherent in vision-based approaches. However, purely IMU-based tracking compromises translation estimates and accurate relative positioning between individuals, as inertial cues are inherently self-referential and provide no direct spatial reference for others. In this paper, we present a novel approach for robustly estimating body poses and global translation for multiple individuals by leveraging the distances between sparse wearable sensors - both on each individual and across multiple individuals. Our method Group Inertial Poser estimates these absolute distances between pairs of sensors from ultra-wideband ranging (UWB) and fuses them with inertial observations as input into structured state-space models to integrate temporal motion patterns for precise 3D pose estimation. Our novel two-step optimization further leverages the estimated distances for accurately tracking people's global trajectories through the world. We also introduce GIP-DB, the first IMU+UWB dataset for two-person tracking, which comprises 200 minutes of motion recordings from 14 participants. In our evaluation, Group Inertial Poser outperforms previous state-of-the-art methods in accuracy and robustness across synthetic and real-world data, showing the promise of IMU+UWB-based multi-human motion capture in the wild. Code, models, dataset: https://github.com/eth-siplab/GroupInertialPoser
comment: Accepted by ICCV 2025, Code: https://github.com/eth-siplab/GroupInertialPoser
☆ A Dynamic Knowledge Distillation Method Based on the Gompertz Curve
This paper introduces a novel dynamic knowledge distillation framework, Gompertz-CNN, which integrates the Gompertz growth model into the training process to address the limitations of traditional knowledge distillation. Conventional methods often fail to capture the evolving cognitive capacity of student models, leading to suboptimal knowledge transfer. To overcome this, we propose a stage-aware distillation strategy that dynamically adjusts the weight of distillation loss based on the Gompertz curve, reflecting the student's learning progression: slow initial growth, rapid mid-phase improvement, and late-stage saturation. Our framework incorporates Wasserstein distance to measure feature-level discrepancies and gradient matching to align backward propagation behaviors between teacher and student models. These components are unified under a multi-loss objective, where the Gompertz curve modulates the influence of distillation losses over time. Extensive experiments on CIFAR-10 and CIFAR-100 using various teacher-student architectures (e.g., ResNet50 and MobileNet_v2) demonstrate that Gompertz-CNN consistently outperforms traditional distillation methods, achieving up to 8% and 4% accuracy gains on CIFAR-10 and CIFAR-100, respectively.
comment: 15 pages, 2 figures
☆ DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning
Compared to 2D data, the scale of point cloud data in different domains available for training, is quite limited. Researchers have been trying to combine these data of different domains for masked autoencoder (MAE) pre-training to leverage such a data scarcity issue. However, the prior knowledge learned from mixed domains may not align well with the downstream 3D point cloud analysis tasks, leading to degraded performance. To address such an issue, we propose the Domain-Adaptive Point Cloud Masked Autoencoder (DAP-MAE), an MAE pre-training method, to adaptively integrate the knowledge of cross-domain datasets for general point cloud analysis. In DAP-MAE, we design a heterogeneous domain adapter that utilizes an adaptation mode during pre-training, enabling the model to comprehensively learn information from point clouds across different domains, while employing a fusion mode in the fine-tuning to enhance point cloud features. Meanwhile, DAP-MAE incorporates a domain feature generator to guide the adaptation of point cloud features to various downstream tasks. With only one pre-training, DAP-MAE achieves excellent performance across four different point cloud analysis tasks, reaching 95.18% in object classification on ScanObjectNN and 88.45% in facial expression recognition on Bosphorus.
comment: 14 pages, 7 figures, conference
☆ Epipolar Geometry Improves Video Generation Models
Video generation models have progressed tremendously through large latent diffusion transformers trained with rectified flow techniques. Yet these models still struggle with geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes. 3D-consistent video generation could significantly impact numerous downstream applications in generation and reconstruction tasks. We explore how epipolar geometry constraints improve modern video diffusion models. Despite massive training data, these models fail to capture fundamental geometric principles underlying visual content. We align diffusion models using pairwise epipolar geometry constraints via preference-based optimization, directly addressing unstable camera trajectories and geometric artifacts through mathematically principled geometric enforcement. Our approach efficiently enforces geometric principles without requiring end-to-end differentiability. Evaluation demonstrates that classical geometric constraints provide more stable optimization signals than modern learned metrics, which produce noisy targets that compromise alignment quality. Training on static scenes with dynamic cameras ensures high-quality measurements while the model generalizes effectively to diverse dynamic content. By bridging data-driven deep learning with classical geometric computer vision, we present a practical method for generating spatially consistent videos without compromising visual quality.
☆ Modest-Align: Data-Efficient Alignment for Vision-Language Models
Cross-modal alignment aims to map heterogeneous modalities into a shared latent space, as exemplified by models like CLIP, which benefit from large-scale image-text pretraining for strong recognition capabilities. However, when operating in resource-constrained settings with limited or low-quality data, these models often suffer from overconfidence and degraded performance due to the prevalence of ambiguous or weakly correlated image-text pairs. Current contrastive learning approaches, which rely on single positive pairs, further exacerbate this issue by reinforcing overconfidence on uncertain samples. To address these challenges, we propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency. Our approach leverages two complementary strategies -- Random Perturbation, which introduces controlled noise to simulate uncertainty, and Embedding Smoothing, which calibrates similarity distributions in the embedding space. These mechanisms collectively reduce overconfidence and improve performance on noisy or weakly aligned samples. Extensive experiments across multiple benchmark datasets demonstrate that Modest-Align outperforms state-of-the-art methods in retrieval tasks, achieving competitive results with over 100x less training data and 600x less GPU time than CLIP. Our method offers a practical and scalable solution for cross-modal alignment in real-world, low-resource scenarios.
☆ S3OD: Towards Generalizable Salient Object Detection with Synthetic Data
Salient object detection exemplifies data-bounded tasks where expensive pixel-precise annotations force separate model training for related subtasks like DIS and HR-SOD. We present a method that dramatically improves generalization through large-scale synthetic data generation and ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created through our multi-modal diffusion pipeline that extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that naturally handles the inherent ambiguity in salient object detection by predicting multiple valid interpretations. Models trained solely on synthetic data achieve 20-50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art performance across DIS and HR-SOD benchmarks.
☆ Automated interictal epileptic spike detection from simple and noisy annotations in MEG data
In drug-resistant epilepsy, presurgical evaluation of epilepsy can be considered. Magnetoencephalography (MEG) has been shown to be an effective exam to inform the localization of the epileptogenic zone through the localization of interictal epileptic spikes. Manual detection of these pathological biomarkers remains a fastidious and error-prone task due to the high dimensionality of MEG recordings, and interrater agreement has been reported to be only moderate. Current automated methods are unsuitable for clinical practice, either requiring extensively annotated data or lacking robustness on non-typical data. In this work, we demonstrate that deep learning models can be used for detecting interictal spikes in MEG recordings, even when only temporal and single-expert annotations are available, which represents real-world clinical practice. We propose two model architectures: a feature-based artificial neural network (ANN) and a convolutional neural network (CNN), trained on a database of 59 patients, and evaluated against a state-of-the-art model to classify short time windows of signal. In addition, we employ an interactive machine learning strategy to iteratively improve our data annotation quality using intermediary model outputs. Both proposed models outperform the state-of-the-art model (F1-scores: CNN=0.46, ANN=0.44) when tested on 10 holdout test patients. The interactive machine learning strategy demonstrates that our models are robust to noisy annotations. Overall, results highlight the robustness of models with simple architectures when analyzing complex and imperfectly annotated data. Our method of interactive machine learning offers great potential for faster data annotation, while our models represent useful and efficient tools for automated interictal spikes detection.
comment: 17 pages, 7 Figures
♻ ☆ A Geometric Approach to Steerable Convolutions
In contrast to the somewhat abstract, group theoretical approach adopted by many papers, our work provides a new and more intuitive derivation of steerable convolutional neural networks in $d$ dimensions. This derivation is based on geometric arguments and fundamental principles of pattern matching. We offer an intuitive explanation for the appearance of the Clebsch--Gordan decomposition and spherical harmonic basis functions. Furthermore, we suggest a novel way to construct steerable convolution layers using interpolation kernels that improve upon existing implementation, and offer greater robustness to noisy data.
♻ ☆ DynamicPAE: Generating Scene-Aware Physical Adversarial Examples in Real-Time
Physical adversarial examples (PAEs) are regarded as whistle-blowers of real-world risks in deep-learning applications, thus worth further investigation. However, current PAE generation studies show limited adaptive attacking ability to diverse and varying scenes, revealing the urgent requirement of dynamic PAEs that are generated in real time and conditioned on the observation from the attacker. The key challenge in generating dynamic PAEs is learning the sparse relation between PAEs and the observation of attackers under the noisy feedback of attack training. To address the challenge, we present DynamicPAE, the first generative framework that enables scene-aware real-time physical attacks. Specifically, to address the noisy feedback problem that obfuscates the exploration of scene-related PAEs, we introduce the residual-guided adversarial pattern exploration technique. Residual-guided training, which relaxes the attack training with a reconstruction task, is proposed to enrich the feedback information, thereby achieving a more comprehensive exploration of PAEs. To address the alignment problem between the trained generator and the real-world scenario, we introduce the distribution-matched attack scenario alignment, consisting of the conditional-uncertainty-aligned data module and the skewness-aligned objective re-weighting module. The former aligns the training environment with the incomplete observation of the real-world attacker. The latter facilitates consistent stealth control across different attack targets with the skewness controller. Extensive digital and physical evaluations demonstrate the superior attack performance of DynamicPAE, attaining a 2.07 $\times$ boost (58.8% average AP drop under attack) on representative object detectors (e.g., DETR) over state-of-the-art static PAE generating methods. Overall, our work opens the door to end-to-end modeling of dynamic PAEs.
♻ ☆ Mixture of Experts in Image Classification: What's the Sweet Spot?
Mixture-of-Experts (MoE) models have shown promising potential for parameter-efficient scaling across domains. However, their application to image classification remains limited, often requiring billion-scale datasets to be competitive. In this work, we explore the integration of MoE layers into image classification architectures using open datasets. We conduct a systematic analysis across different MoE configurations and model scales. We find that moderate parameter activation per sample provides the best trade-off between performance and efficiency. However, as the number of activated parameters increases, the benefits of MoE diminish. Our analysis yields several practical insights for vision MoE design. First, MoE layers most effectively strengthen tiny and mid-sized models, while gains taper off for large-capacity networks and do not redefine state-of-the-art ImageNet performance. Second, a Last-2 placement heuristic offers the most robust cross-architecture choice, with Every-2 slightly better for Vision Transform (ViT), and both remaining effective as data and model scale increase. Third, larger datasets (e.g., ImageNet-21k) allow more experts, up to 16, for ConvNeXt to be utilized effectively without changing placement, as increased data reduces overfitting and promotes broader expert specialization. Finally, a simple linear router performs best, suggesting that additional routing complexity yields no consistent benefit.
comment: Published in Transactions on Machine Learning Research
♻ ☆ Metropolis-Hastings Sampling for 3D Gaussian Reconstruction NeurIPS 2025
We propose an adaptive sampling framework for 3D Gaussian Splatting (3DGS) that leverages comprehensive multi-view photometric error signals within a unified Metropolis-Hastings approach. Vanilla 3DGS heavily relies on heuristic-based density-control mechanisms (e.g., cloning, splitting, and pruning), which can lead to redundant computations or premature removal of beneficial Gaussians. Our framework overcomes these limitations by reformulating densification and pruning as a probabilistic sampling process, dynamically inserting and relocating Gaussians based on aggregated multi-view errors and opacity scores. Guided by Bayesian acceptance tests derived from these error-based importance scores, our method substantially reduces reliance on heuristics, offers greater flexibility, and adaptively infers Gaussian distributions without requiring predefined scene complexity. Experiments on benchmark datasets, including Mip-NeRF360, Tanks and Temples and Deep Blending, show that our approach reduces the number of Gaussians needed, achieving faster convergence while matching or modestly surpassing the view-synthesis quality of state-of-the-art models.
comment: NeurIPS 2025. Project Page: https://hjhyunjinkim.github.io/MH-3DGS
♻ ☆ Lightweight Facial Landmark Detection in Thermal Images via Multi-Level Cross-Modal Knowledge Transfer
Facial Landmark Detection (FLD) in thermal imagery is critical for applications in challenging lighting conditions, but it is hampered by the lack of rich visual cues. Conventional cross-modal solutions, like feature fusion or image translation from RGB data, are often computationally expensive or introduce structural artifacts, limiting their practical deployment. To address this, we propose Multi-Level Cross-Modal Knowledge Distillation (MLCM-KD), a novel framework that decouples high-fidelity RGB-to-thermal knowledge transfer from model compression to create both accurate and efficient thermal FLD models. A central challenge during knowledge transfer is the profound modality gap between RGB and thermal data, where traditional unidirectional distillation fails to enforce semantic consistency across disparate feature spaces. To overcome this, we introduce Dual-Injected Knowledge Distillation (DIKD), a bidirectional mechanism designed specifically for this task. DIKD establishes a connection between modalities: it not only guides the thermal student with rich RGB features but also validates the student's learned representations by feeding them back into the frozen teacher's prediction head. This closed-loop supervision forces the student to learn modality-invariant features that are semantically aligned with the teacher, ensuring a robust and profound knowledge transfer. Experiments show that our approach sets a new state-of-the-art on public thermal FLD benchmarks, notably outperforming previous methods while drastically reducing computational overhead.
♻ ☆ FORLA: Federated Object-centric Representation Learning with Slot Attention
Learning efficient visual representations across heterogeneous unlabeled datasets remains a central challenge in federated learning. Effective federated representations require features that are jointly informative across clients while disentangling domain-specific factors without supervision. We introduce FORLA, a novel framework for federated object-centric representation learning and feature adaptation across clients using unsupervised slot attention. At the core of our method is a shared feature adapter, trained collaboratively across clients to adapt features from foundation models, and a shared slot attention module that learns to reconstruct the adapted features. To optimize this adapter, we design a two-branch student-teacher architecture. In each client, a student decoder learns to reconstruct full features from foundation models, while a teacher decoder reconstructs their adapted, low-dimensional counterpart. The shared slot attention module bridges cross-domain learning by aligning object-level representations across clients. Experiments in multiple real-world datasets show that our framework not only outperforms centralized baselines on object discovery but also learns a compact, universal representation that generalizes well across domains. This work highlights federated slot attention as an effective tool for scalable, unsupervised visual representation learning from cross-domain data with distributed concepts.
comment: Accepted by Neurips2025
♻ ☆ SegMASt3R: Geometry Grounded Segment Matching NeurIPS 2025
Segment matching is an important intermediate task in computer vision that establishes correspondences between semantically or geometrically coherent regions across images. Unlike keypoint matching, which focuses on localized features, segment matching captures structured regions, offering greater robustness to occlusions, lighting variations, and viewpoint changes. In this paper, we leverage the spatial understanding of 3D foundation models to tackle wide-baseline segment matching, a challenging setting involving extreme viewpoint shifts. We propose an architecture that uses the inductive bias of these 3D foundation models to match segments across image pairs with up to 180 degree view-point change rotation. Extensive experiments show that our approach outperforms state-of-the-art methods, including the SAM2 video propagator and local feature matching methods, by up to 30% on the AUPRC metric, on ScanNet++ and Replica datasets. We further demonstrate benefits of the proposed model on relevant downstream tasks, including 3D instance mapping and object-relative navigation. Project Page: https://segmast3r.github.io/
comment: Accepted to The 39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025) as a Spotlight (top 3.5%)
♻ ☆ The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models
Recent advances in multimodal training have significantly improved the integration of image understanding and generation within a unified model. This study investigates how vision-language models (VLMs) handle image-understanding tasks, focusing on how visual information is processed and transferred to the textual domain. We compare native multimodal VLMs, models trained from scratch on multimodal data to generate both text and images, and non-native multimodal VLMs, models adapted from pre-trained large language models or capable of generating only text, highlighting key differences in information flow. We find that in native multimodal VLMs, image and text embeddings are more separated within the residual stream. Moreover, VLMs differ in how visual information reaches text: non-native multimodal VLMs exhibit a distributed communication pattern, where information is exchanged through multiple image tokens, whereas models trained natively for joint image and text generation tend to rely on a single post-image token that acts as a narrow gate for visual information. We show that ablating this single token significantly deteriorates image-understanding performance, whereas targeted, token-level interventions reliably steer image semantics and downstream text with fine-grained control.
♻ ☆ Operational Change Detection for Geographical Information: Overview and Challenges
Rapid evolution of territories due to climate change and human impact requires prompt and effective updates to geospatial databases maintained by the National Mapping Agency. This paper presents a comprehensive overview of change detection methods tailored for the operational updating of large-scale geographic databases. This review first outlines the fundamental definition of change, emphasizing its multifaceted nature, from temporal to semantic characterization. It categorizes automatic change detection methods into four main families: rule-based, statistical, machine learning, and simulation methods. The strengths, limitations, and applicability of every family are discussed in the context of various input data. Then, key applications for National Mapping Agencies are identified, particularly the optimization of geospatial database updating, change-based phenomena, and dynamics monitoring. Finally, the paper highlights the current challenges for leveraging change detection such as the variability of change definition, the missing of relevant large-scale datasets, the diversity of input data, the unstudied no-change detection, the human in the loop integration and the operational constraints. The discussion underscores the necessity for ongoing innovation in change detection techniques to address the future needs of geographic information systems for national mapping agencies.
comment: Preprint under review
♻ ☆ Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning EMNLP 2025
Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Building on observations about the data scaling, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by 2.4% in accuracy using only 3.6% training samples. Specifically, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
comment: EMNLP 2025. The first two authors contributed equally. Project page: https://sites.google.com/cs.unc.edu/videorts2025/
Machine Learning 32
☆ Equivariance by Contrast: Identifiable Equivariant Embeddings from Unlabeled Finite Group Actions NeurIPS 2025
We propose Equivariance by Contrast (EbC) to learn equivariant embeddings from observation pairs $(\mathbf{y}, g \cdot \mathbf{y})$, where $g$ is drawn from a finite group acting on the data. Our method jointly learns a latent space and a group representation in which group actions correspond to invertible linear maps -- without relying on group-specific inductive biases. We validate our approach on the infinite dSprites dataset with structured transformations defined by the finite group $G:= (R_m \times \mathbb{Z}_n \times \mathbb{Z}_n)$, combining discrete rotations and periodic translations. The resulting embeddings exhibit high-fidelity equivariance, with group operations faithfully reproduced in latent space. On synthetic data, we further validate the approach on the non-abelian orthogonal group $O(n)$ and the general linear group $GL(n)$. We also provide a theoretical proof for identifiability. While broad evaluation across diverse group types on real-world data remains future work, our results constitute the first successful demonstration of general-purpose encoder-only equivariant learning from group action observations alone, including non-trivial non-abelian groups and a product group motivated by modeling affine equivariances in computer vision.
comment: Accepted at NeurIPS 2025. The last two authors contributed equally. Code is available at https://github.com/dynamical-inference/ebc
☆ Visual Diffusion Models are Geometric Solvers
In this paper we show that visual diffusion models can serve as effective geometric solvers: they can directly reason about geometric problems by working in pixel space. We first demonstrate this on the Inscribed Square Problem, a long-standing problem in geometry that asks whether every Jordan curve contains four points forming a square. We then extend the approach to two other well-known hard geometric problems: the Steiner Tree Problem and the Simple Polygon Problem. Our method treats each problem instance as an image and trains a standard visual diffusion model that transforms Gaussian noise into an image representing a valid approximate solution that closely matches the exact one. The model learns to transform noisy geometric structures into correct configurations, effectively recasting geometric reasoning as image generation. Unlike prior work that necessitates specialized architectures and domain-specific adaptations when applying diffusion to parametric geometric representations, we employ a standard visual diffusion model that operates on the visual representation of the problem. This simplicity highlights a surprising bridge between generative modeling and geometric problem solving. Beyond the specific problems studied here, our results point toward a broader paradigm: operating in image space provides a general and practical framework for approximating notoriously hard problems, and opens the door to tackling a far wider class of challenging geometric tasks.
comment: Project page: https://kariander1.github.io/visual-geo-solver/
☆ Mechanistic Interpretability for Neural TSP Solvers
Neural networks have advanced combinatorial optimization, with Transformer-based solvers achieving near-optimal solutions on the Traveling Salesman Problem (TSP) in milliseconds. However, these models operate as black boxes, providing no insight into the geometric patterns they learn or the heuristics they employ during tour construction. We address this opacity by applying sparse autoencoders (SAEs), a mechanistic interpretability technique, to a Transformer-based TSP solver, representing the first application of activation-based interpretability methods to operations research models. We train a pointer network with reinforcement learning on 100-node instances, then fit an SAE to the encoder's residual stream to discover an overcomplete dictionary of interpretable features. Our analysis reveals that the solver naturally develops features mirroring fundamental TSP concepts: boundary detectors that activate on convex-hull nodes, cluster-sensitive features responding to locally dense regions, and separator features encoding geometric partitions. These findings provide the first model-internal account of what neural TSP solvers compute before node selection, demonstrate that geometric structure emerges without explicit supervision, and suggest pathways toward transparent hybrid systems that combine neural efficiency with algorithmic interpretability. Interactive feature explorer: https://reubennarad.github.io/TSP_interp
☆ On Uncertainty Calibration for Equivariant Functions
Data-sparse settings such as robotic manipulation, molecular physics, and galaxy morphology classification are some of the hardest domains for deep learning. For these problems, equivariant networks can help improve modeling across undersampled parts of the input space, and uncertainty estimation can guard against overconfidence. However, until now, the relationships between equivariance and model confidence, and more generally equivariance and model calibration, has yet to be studied. Since traditional classification and regression error terms show up in the definitions of calibration error, it is natural to suspect that previous work can be used to help understand the relationship between equivariance and calibration error. In this work, we present a theory relating equivariance to uncertainty estimation. By proving lower and upper bounds on uncertainty calibration errors (ECE and ENCE) under various equivariance conditions, we elucidate the generalization limits of equivariant models and illustrate how symmetry mismatch can result in miscalibration in both classification and regression. We complement our theoretical framework with numerical experiments that clarify the relationship between equivariance and uncertainty using a variety of real and simulated datasets, and we comment on trends with symmetry mismatch, group size, and aleatoric and epistemic uncertainties.
comment: Under review at Transactions on Machine Learning Research (TMLR). Code is available at https://github.com/EdwardBerman/EquiUQ . Excited to share this paper, comments welcome :D
☆ Multimodal Datasets with Controllable Mutual Information
We introduce a framework for generating highly multimodal datasets with explicitly calculable mutual information between modalities. This enables the construction of benchmark datasets that provide a novel testbed for systematic studies of mutual information estimators and multimodal self-supervised learning techniques. Our framework constructs realistic datasets with known mutual information using a flow-based generative model and a structured causal framework for generating correlated latent variables.
comment: 15 pages, 4 figures, 1 table. Our code is publicly available at https://github.com/RKHashmani/MmMi-Datasets
☆ Optimal Graph Clustering without Edge Density Signals
This paper establishes the theoretical limits of graph clustering under the Popularity-Adjusted Block Model (PABM), addressing limitations of existing models. In contrast to the Stochastic Block Model (SBM), which assumes uniform vertex degrees, and to the Degree-Corrected Block Model (DCBM), which applies uniform degree corrections across clusters, PABM introduces separate popularity parameters for intra- and inter-cluster connections. Our main contribution is the characterization of the optimal error rate for clustering under PABM, which provides novel insights on clustering hardness: we demonstrate that unlike SBM and DCBM, cluster recovery remains possible in PABM even when traditional edge-density signals vanish, provided intra- and inter-cluster popularity coefficients differ. This highlights a dimension of degree heterogeneity captured by PABM but overlooked by DCBM: local differences in connectivity patterns can enhance cluster separability independently of global edge densities. Finally, because PABM exhibits a richer structure, its expected adjacency matrix has rank between $k$ and $k^2$, where $k$ is the number of clusters. As a result, spectral embeddings based on the top $k$ eigenvectors may fail to capture important structural information. Our numerical experiments on both synthetic and real datasets confirm that spectral clustering algorithms incorporating $k^2$ eigenvectors outperform traditional spectral approaches.
☆ DEEDEE: Fast and Scalable Out-of-Distribution Dynamics Detection
Deploying reinforcement learning (RL) in safety-critical settings is constrained by brittleness under distribution shift. We study out-of-distribution (OOD) detection for RL time series and introduce DEEDEE, a two-statistic detector that revisits representation-heavy pipelines with a minimal alternative. DEEDEE uses only an episodewise mean and an RBF kernel similarity to a training summary, capturing complementary global and local deviations. Despite its simplicity, DEEDEE matches or surpasses contemporary detectors across standard RL OOD suites, delivering a 600-fold reduction in compute (FLOPs / wall-time) and an average 5% absolute accuracy gain over strong baselines. Conceptually, our results indicate that diverse anomaly types often imprint on RL trajectories through a small set of low-order statistics, suggesting a compact foundation for OOD detection in complex environments.
☆ Few-Shot Knowledge Distillation of LLMs With Counterfactual Explanations NeurIPS 2025
Knowledge distillation is a promising approach to transfer capabilities from complex teacher models to smaller, resource-efficient student models that can be deployed easily, particularly in task-aware scenarios. However, existing methods of task-aware distillation typically require substantial quantities of data which may be unavailable or expensive to obtain in many practical scenarios. In this paper, we address this challenge by introducing a novel strategy called Counterfactual-explanation-infused Distillation CoD for few-shot task-aware knowledge distillation by systematically infusing counterfactual explanations. Counterfactual explanations (CFEs) refer to inputs that can flip the output prediction of the teacher model with minimum perturbation. Our strategy CoD leverages these CFEs to precisely map the teacher's decision boundary with significantly fewer samples. We provide theoretical guarantees for motivating the role of CFEs in distillation, from both statistical and geometric perspectives. We mathematically show that CFEs can improve parameter estimation by providing more informative examples near the teacher's decision boundary. We also derive geometric insights on how CFEs effectively act as knowledge probes, helping the students mimic the teacher's decision boundaries more effectively than standard data. We perform experiments across various datasets and LLMs to show that CoD outperforms standard distillation approaches in few-shot regimes (as low as 8-512 samples). Notably, CoD only uses half of the original samples used by the baselines, paired with their corresponding CFEs and still improves performance.
comment: NeurIPS 2025
☆ DeepAgent: A General Reasoning Agent with Scalable Toolsets
Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow predefined workflows, which limit autonomous and global task completion. In this paper, we introduce DeepAgent, an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single, coherent reasoning process. To address the challenges of long-horizon interactions, particularly the context length explosion from multiple tool calls and the accumulation of interaction history, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories, reducing error accumulation while preserving critical information. To teach general-purpose tool use efficiently and stably, we develop an end-to-end reinforcement learning strategy, namely ToolPO, that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to the tool invocation tokens. Extensive experiments on eight benchmarks, including general tool-use tasks (ToolBench, API-Bank, TMDB, Spotify, ToolHop) and downstream applications (ALFWorld, WebShop, GAIA, HLE), demonstrate that DeepAgent consistently outperforms baselines across both labeled-tool and open-set tool retrieval scenarios. This work takes a step toward more general and capable agents for real-world applications. The code and demo are available at https://github.com/RUC-NLPIR/DeepAgent.
☆ Generative Correlation Manifolds: Generating Synthetic Data with Preserved Higher-Order Correlations
The increasing need for data privacy and the demand for robust machine learning models have fueled the development of synthetic data generation techniques. However, current methods often succeed in replicating simple summary statistics but fail to preserve both the pairwise and higher-order correlation structure of the data that define the complex, multi-variable interactions inherent in real-world systems. This limitation can lead to synthetic data that is superficially realistic but fails when used for sophisticated modeling tasks. In this white paper, we introduce Generative Correlation Manifolds (GCM), a computationally efficient method for generating synthetic data. The technique uses Cholesky decomposition of a target correlation matrix to produce datasets that, by mathematical proof, preserve the entire correlation structure -- from simple pairwise relationships to higher-order interactions -- of the source dataset. We argue that this method provides a new approach to synthetic data generation with potential applications in privacy-preserving data sharing, robust model training, and simulation.
☆ Enhancing Tactile-based Reinforcement Learning for Robotic Control
Achieving safe, reliable real-world robotic manipulation requires agents to evolve beyond vision and incorporate tactile sensing to overcome sensory deficits and reliance on idealised state information. Despite its potential, the efficacy of tactile sensing in reinforcement learning (RL) remains inconsistent. We address this by developing self-supervised learning (SSL) methodologies to more effectively harness tactile observations, focusing on a scalable setup of proprioception and sparse binary contacts. We empirically demonstrate that sparse binary tactile signals are critical for dexterity, particularly for interactions that proprioceptive control errors do not register, such as decoupled robot-object motions. Our agents achieve superhuman dexterity in complex contact tasks (ball bouncing and Baoding ball rotation). Furthermore, we find that decoupling the SSL memory from the on-policy memory can improve performance. We release the Robot Tactile Olympiad (RoTO) benchmark to standardise and promote future research in tactile-based manipulation. Project page: https://elle-miller.github.io/tactile_rl
♻ ☆ Causal Climate Emulation with Bayesian Filtering
Traditional models of climate change use complex systems of coupled equations to simulate physical processes across the Earth system. These simulations are highly computationally expensive, limiting our predictions of climate change and analyses of its causes and effects. Machine learning has the potential to quickly emulate data from climate models, but current approaches are not able to incorporate physically-based causal relationships. Here, we develop an interpretable climate model emulator based on causal representation learning. We derive a novel approach including a Bayesian filter for stable long-term autoregressive emulation. We demonstrate that our emulator learns accurate climate dynamics, and we show the importance of each one of its components on a realistic synthetic dataset and data from two widely deployed climate models.
comment: 37 pages, 26 figures
♻ ☆ Intrinsic Goals for Autonomous Agents: Model-Based Exploration in Virtual Zebrafish Predicts Ethological Behavior and Whole-Brain Dynamics
Autonomy is a hallmark of animal intelligence, enabling adaptive and intelligent behavior in complex environments without relying on external reward or task structure. Existing reinforcement learning approaches to exploration in reward-free environments, including a class of methods known as model-based intrinsic motivation, exhibit inconsistent exploration patterns and do not converge to an exploratory policy, thus failing to capture robust autonomous behaviors observed in animals. Moreover, systems neuroscience has largely overlooked the neural basis of autonomy, focusing instead on experimental paradigms where animals are motivated by external reward rather than engaging in ethological, naturalistic and task-independent behavior. To bridge these gaps, we introduce a novel model-based intrinsic drive explicitly designed after the principles of autonomous exploration in animals. Our method (3M-Progress) achieves animal-like exploration by tracking divergence between an online world model and a fixed prior learned from an ecological niche. To the best of our knowledge, we introduce the first autonomous embodied agent that predicts brain data entirely from self-supervised optimization of an intrinsic goal -- without any behavioral or neural training data -- demonstrating that 3M-Progress agents capture the explainable variance in behavioral patterns and whole-brain neural-glial dynamics recorded from autonomously behaving larval zebrafish, thereby providing the first goal-driven, population-level model of neural-glial computation. Our findings establish a computational framework connecting model-based intrinsic motivation to naturalistic behavior, providing a foundation for building artificial agents with animal-like autonomy.
comment: 17 pages, 7 figures
♻ ☆ System-Embedded Diffusion Bridge Models NeurIPS 2025
Solving inverse problems -- recovering signals from incomplete or noisy measurements -- is fundamental in science and engineering. Score-based generative models (SGMs) have recently emerged as a powerful framework for this task. Two main paradigms have formed: unsupervised approaches that adapt pretrained generative models to inverse problems, and supervised bridge methods that train stochastic processes conditioned on paired clean and corrupted data. While the former typically assume knowledge of the measurement model, the latter have largely overlooked this structural information. We introduce System embedded Diffusion Bridge Models (SDBs), a new class of supervised bridge methods that explicitly embed the known linear measurement system into the coefficients of a matrix-valued SDE. This principled integration yields consistent improvements across diverse linear inverse problems and demonstrates robust generalization under system misspecification between training and deployment, offering a promising solution to real-world applications.
comment: NeurIPS 2025
♻ ☆ SimuRA: A World-Model-Driven Simulative Reasoning Architecture for General Goal-Oriented Agents
AI agents built on foundation models hold enormous promise. Current practice, however, focuses on a one-task-one-agent approach, which not only falls short of scalability and generality, but also faces practical limitations from black-box autoregressive reasoning, where decisions unfold token by token without explicit simulation or counterfactual evaluation of outcomes. Humans, on the other hand, reason and plan by mentally simulating the consequences of actions within an internal model of the world -- a capability that supports flexible, goal-directed behavior across diverse contexts. Moving towards a more general and powerful AI agent, we introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning. Based on a principled formulation of an optimal agent in any general environment, SimuRA addresses the limitations of black-box autoregressive reasoning by incorporating the world model for planning via simulation. Our prototype world model is implemented using LLMs as a substrate, leveraging the natural language as a discrete, hierarchical representation grounded in concepts for planning, while remaining model-agnostic. On complex web-browsing tasks such as flight search, SimuRA improves the success rate from 0% to 32.2% compared to a representative open-web agent baseline. Across tasks, world-model-based planning achieves up to 124% higher task completion rates than a matched black-box autoregressive baseline, demonstrating the advantages of simulative reasoning. We release ReasonerAgent-Web, a web-browsing agent built on SimuRA, as an open-source research demo.
comment: This submission has been updated to adjust the scope and presentation of the work
♻ ☆ Federated Unlearning Made Practical: Seamless Integration via Negated Pseudo-Gradients
The right to be forgotten is a fundamental principle of privacy-preserving regulations and extends to Machine Learning (ML) paradigms such as Federated Learning (FL). While FL enhances privacy by enabling collaborative model training without sharing private data, trained models still retain the influence of training data. Federated Unlearning (FU) methods recently proposed often rely on impractical assumptions for real-world FL deployments, such as storing client update histories or requiring access to a publicly available dataset. To address these constraints, this paper introduces a novel method that leverages negated Pseudo-gradients Updates for Federated Unlearning (PUF). Our approach only uses standard client model updates, which are employed during regular FL rounds, and interprets them as pseudo-gradients. When a client needs to be forgotten, we apply the negation of their pseudo-gradients, appropriately scaled, to the global model. Unlike state-of-the-art mechanisms, PUF seamlessly integrates with FL workflows, incurs no additional computational and communication overhead beyond standard FL rounds, and supports concurrent unlearning requests. We extensively evaluated the proposed method on two well-known benchmark image classification datasets (CIFAR-10 and CIFAR-100) and a real-world medical imaging dataset for segmentation (ProstateMRI), using three different neural architectures: two residual networks and a vision transformer. The experimental results across various settings demonstrate that PUF achieves state-of-the-art forgetting effectiveness and recovery time, without relying on any additional assumptions.
♻ ☆ Reinforcement Learning with Action Chunking NeurIPS 2025
We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
comment: The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025); 36 pages, 17 figures
♻ ☆ Quantum Temporal Fusion Transformer
The \textit{Temporal Fusion Transformer} (TFT), proposed by Lim \textit{et al.}, published in \textit{International Journal of Forecasting} (2021), is a state-of-the-art attention-based deep neural network architecture specifically designed for multi-horizon time series forecasting. It has demonstrated significant performance improvements over existing benchmarks. In this work, we introduce the Quantum Temporal Fusion Transformer (QTFT), a quantum-enhanced hybrid quantum-classical architecture that extends the capabilities of the classical TFT framework. The core idea of this work is inspired by the foundation studies, \textit{The Power of Quantum Neural Networks} by Amira Abbas \textit{et al.} and \textit{Quantum Vision Transformers} by El Amine Cherrat \textit{et al.}, published in \textit{ Nature Computational Science} (2021) and \textit{Quantum} (2024), respectively. A key advantage of our approach lies in its foundation on a variational quantum algorithm, enabling implementation on current noisy intermediate-scale quantum (NISQ) devices without strict requirements on the number of qubits or circuit depth. Our results demonstrate that QTFT is successfully trained on the forecasting datasets and is capable of accurately predicting future values. In particular, our experimental results on two different datasets display that the model outperforms its classical counterpart in terms of both training and test loss. These results indicate the prospect of using quantum computing to boost deep learning architectures in complex machine learning tasks.
♻ ☆ A Gravity-informed Spatiotemporal Transformer for Human Activity Intensity Prediction
Human activity intensity prediction is crucial to many location-based services. Despite tremendous progress in modeling dynamics of human activity, most existing methods overlook physical constraints of spatial interaction, leading to uninterpretable spatial correlations and over-smoothing phenomenon. To address these limitations, this work proposes a physics-informed deep learning framework, namely Gravity-informed Spatiotemporal Transformer (Gravityformer) by integrating the universal law of gravitation to refine transformer attention. Specifically, it (1) estimates two spatially explicit mass parameters based on spatiotemporal embedding feature, (2) models the spatial interaction in end-to-end neural network using proposed adaptive gravity model to learn the physical constraint, and (3) utilizes the learned spatial interaction to guide and mitigate the over-smoothing phenomenon in transformer attention. Moreover, a parallel spatiotemporal graph convolution transformer is proposed for achieving a balance between coupled spatial and temporal learning. Systematic experiments on six real-world large-scale activity datasets demonstrate the quantitative and qualitative superiority of our model over state-of-the-art benchmarks. Additionally, the learned gravity attention matrix can be not only disentangled and interpreted based on geographical laws, but also improved the generalization in zero-shot cross-region inference. This work provides a novel insight into integrating physical laws with deep learning for spatiotemporal prediction.
comment: IEEE TPAMI 2025. 18 pages, 14 figures
♻ ☆ Mixture of Experts in Image Classification: What's the Sweet Spot?
Mixture-of-Experts (MoE) models have shown promising potential for parameter-efficient scaling across domains. However, their application to image classification remains limited, often requiring billion-scale datasets to be competitive. In this work, we explore the integration of MoE layers into image classification architectures using open datasets. We conduct a systematic analysis across different MoE configurations and model scales. We find that moderate parameter activation per sample provides the best trade-off between performance and efficiency. However, as the number of activated parameters increases, the benefits of MoE diminish. Our analysis yields several practical insights for vision MoE design. First, MoE layers most effectively strengthen tiny and mid-sized models, while gains taper off for large-capacity networks and do not redefine state-of-the-art ImageNet performance. Second, a Last-2 placement heuristic offers the most robust cross-architecture choice, with Every-2 slightly better for Vision Transform (ViT), and both remaining effective as data and model scale increase. Third, larger datasets (e.g., ImageNet-21k) allow more experts, up to 16, for ConvNeXt to be utilized effectively without changing placement, as increased data reduces overfitting and promotes broader expert specialization. Finally, a simple linear router performs best, suggesting that additional routing complexity yields no consistent benefit.
comment: Published in Transactions on Machine Learning Research
♻ ☆ CityAQVis: Integrated ML-Visualization Sandbox Tool for Pollutant Estimation in Urban Regions Using Multi-Source Data (Software Article)
Urban air pollution poses significant risks to public health, environmental sustainability, and policy planning. Effective air quality management requires predictive tools that can integrate diverse datasets and communicate complex spatial and temporal pollution patterns. There is a gap in interactive tools with seamless integration of forecasting and visualization of spatial distributions of air pollutant concentrations. We present CityAQVis, an interactive machine learning ML sandbox tool designed to predict and visualize pollutant concentrations at the ground level using multi-source data, which includes satellite observations, meteorological parameters, population density, elevation, and nighttime lights. While traditional air quality visualization tools often lack forecasting capabilities, CityAQVis enables users to build and compare predictive models, visualizing the model outputs and offering insights into pollution dynamics at the ground level. The pilot implementation of the tool is tested through case studies predicting nitrogen dioxide (NO2) concentrations in metropolitan regions, highlighting its adaptability to various pollutants. Through an intuitive graphical user interface (GUI), the user can perform comparative visualizations of the spatial distribution of surface-level pollutant concentration in two different urban scenarios. Our results highlight the potential of ML-driven visual analytics to improve situational awareness and support data-driven decision-making in air quality management.
comment: 19 pages, 10 figures, 2 tables
♻ ☆ Fixed-Point RNNs: Interpolating from Diagonal to Dense NeurIPS 2025
Linear recurrent neural networks (RNNs) and state-space models (SSMs) such as Mamba have become promising alternatives to softmax-attention as sequence mixing layers in Transformer architectures. Current models, however, do not exhibit the full state-tracking expressivity of RNNs because they rely on channel-wise (i.e. diagonal) sequence mixing. In this paper, we investigate parameterizations of a large class of dense linear RNNs as fixed-points of parallelizable diagonal linear RNNs. The resulting models can naturally trade expressivity for efficiency at a fixed number of parameters and achieve state-of-the-art results on the state-tracking benchmarks $A_5$ and $S_5$, while matching performance on copying and other tasks.
comment: NeurIPS 2025 (Spotlight)
♻ ☆ Convergence and Generalization of Anti-Regularization for Parametric Models
Anti-regularization introduces a reward term with a reversed sign into the loss function, deliberately amplifying model expressivity in small-sample regimes while ensuring that the intervention gradually vanishes as the sample size grows through a power-law decay schedule. We formalize spectral safety conditions and trust-region constraints, and we design a lightweight safeguard that combines a projection operator with gradient clipping to guarantee stable intervention. Theoretical analysis extends to linear smoothers and the Neural Tangent Kernel regime, providing practical guidance on the choice of decay exponents through the balance between empirical risk and variance. Empirical results show that Anti-regularization mitigates underfitting in both regression and classification while preserving generalization and improving calibration. Ablation studies confirm that the decay schedule and safeguards are essential to avoiding overfitting and instability. As an alternative, we also propose a degrees-of-freedom targeting schedule that maintains constant per-sample complexity. Anti-regularization constitutes a simple and reproducible procedure that integrates seamlessly into standard empirical risk minimization pipelines, enabling robust learning under limited data and resource constraints by intervening only when necessary and vanishing otherwise.
comment: v3: Revised the paragraph under Theoretical Analysis (English translation and typo corrections)
♻ ☆ BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model
Unlocking deep and interpretable biological reasoning from complex genomic data remains a major AI challenge limiting scientific progress. While current DNA foundation models excel at representing sequences, they struggle with multi-step reasoning and lack transparent, biologically meaningful explanations. BioReason addresses this by tightly integrating a DNA foundation model with a large language model (LLM), enabling the LLM to directly interpret and reason over genomic information. Through supervised fine-tuning and reinforcement learning, BioReason learns to produce logical, biologically coherent deductions. It achieves major performance gains, boosting KEGG-based disease pathway prediction accuracy from 86% to 98% and improving variant effect prediction by an average of 15% over strong baselines. BioReason can reason over unseen biological entities and explain its decisions step by step, offering a transformative framework for interpretable, mechanistic AI in biology. All data, code, and checkpoints are available at https://github.com/bowang-lab/BioReason
comment: 28 pages, 4 figures, 4 tables
♻ ☆ Lightweight Facial Landmark Detection in Thermal Images via Multi-Level Cross-Modal Knowledge Transfer
Facial Landmark Detection (FLD) in thermal imagery is critical for applications in challenging lighting conditions, but it is hampered by the lack of rich visual cues. Conventional cross-modal solutions, like feature fusion or image translation from RGB data, are often computationally expensive or introduce structural artifacts, limiting their practical deployment. To address this, we propose Multi-Level Cross-Modal Knowledge Distillation (MLCM-KD), a novel framework that decouples high-fidelity RGB-to-thermal knowledge transfer from model compression to create both accurate and efficient thermal FLD models. A central challenge during knowledge transfer is the profound modality gap between RGB and thermal data, where traditional unidirectional distillation fails to enforce semantic consistency across disparate feature spaces. To overcome this, we introduce Dual-Injected Knowledge Distillation (DIKD), a bidirectional mechanism designed specifically for this task. DIKD establishes a connection between modalities: it not only guides the thermal student with rich RGB features but also validates the student's learned representations by feeding them back into the frozen teacher's prediction head. This closed-loop supervision forces the student to learn modality-invariant features that are semantically aligned with the teacher, ensuring a robust and profound knowledge transfer. Experiments show that our approach sets a new state-of-the-art on public thermal FLD benchmarks, notably outperforming previous methods while drastically reducing computational overhead.
♻ ☆ MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents
Building general-purpose graphical user interface (GUI) agents has become increasingly promising with the progress in vision language models. However, developing effective mobile GUI agents with reinforcement learning (RL) remains challenging due to the heavy-tailed distribution of task difficulty and the inefficiency of large-scale environment sampling. We present an online agentic reinforcement learning framework MobileRL to enhance GUI agents in mobile environments. Its core component is the Difficulty-ADAptive GRPO (ADAGRPO) algorithm. In ADAGRPO, we design difficulty-adaptive positive replay and failure curriculum filtering to adapt the model to different task difficulties. We introduce the shortest-path reward adjustment strategy to reshape rewards concerning the task length in multi-turn agentic tasks. Those strategies jointly stabilize RL training, improve sample efficiency, and generate strong performance across diverse mobile apps and tasks. We apply MOBILERL to two open models (Qwen2.5-VL-7B-Instruct and GLM-4.1V-9B-Base). The resultant MOBILERL-9B model achieves state-of-the-art results in terms of success rates on both AndroidWorld (80.2%) and AndroidLab (53.6%). The MOBILERL framework is open-sourced at: https://github.com/THUDM/MobileRL.
♻ ☆ Transformer-Gather, Fuzzy-Reconsider: A Scalable Hybrid Framework for Entity Resolution
Entity resolution plays a significant role in enterprise systems where data integrity must be rigorously maintained. Traditional methods often struggle with handling noisy data or semantic understanding, while modern methods suffer from computational costs or the excessive need for parallel computation. In this study, we introduce a scalable hybrid framework, which is designed to address several important problems, including scalability, noise robustness, and reliable results. We utilized a pre-trained language model to encode each structured data into corresponding semantic embedding vectors. Subsequently, after retrieving a semantically relevant subset of candidates, we apply a syntactic verification stage using fuzzy string matching techniques to refine classification on the unlabeled data. This approach was applied to a real-world entity resolution task, which exposed a linkage between a central user management database and numerous shared hosting server records. Compared to other methods, this approach exhibits an outstanding performance in terms of both processing time and robustness, making it a reliable solution for a server-side product. Crucially, this efficiency does not compromise results, as the system maintains a high retrieval recall of approximately 0.97. The scalability of the framework makes it deployable on standard CPU-based infrastructure, offering a practical and effective solution for enterprise-level data integrity auditing.
comment: Accepted at ICCKE 2025 Conference. 6 tables, 7 figures
♻ ☆ Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To mitigate the issue, we explore a series of hybrid models that restore direct access to past tokens. We interleave token mixers with intermediate time and space complexity between linear and full attention, including sparse attention with token eviction, and the query-aware native sparse attention. Particularly, we propose a novel learnable token eviction approach. Combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent tokens to adaptively retain a limited set of critical KV-pairs per head, maintaining linear attention's constant time and space complexity. Efficient Triton kernels for the sparse attention mechanisms are provided. Empirical evaluations on retrieval-intensive benchmarks support the effectiveness of our approaches.
comment: 19 pages, 5 figures
♻ ☆ Planning and Learning in Average Risk-aware MDPs
For continuing tasks, average cost Markov decision processes have well-documented value and can be solved using efficient algorithms. However, it explicitly assumes that the agent is risk-neutral. In this work, we extend risk-neutral algorithms to accommodate the more general class of dynamic risk measures. Specifically, we propose a relative value iteration (RVI) algorithm for planning and design two model-free Q-learning algorithms, namely a generic algorithm based on the multi-level Monte Carlo (MLMC) method, and an off-policy algorithm dedicated to utility-based shortfall risk measures. Both the RVI and MLMC-based Q-learning algorithms are proven to converge to optimality. Numerical experiments validate our analysis, confirm empirically the convergence of the off-policy algorithm, and demonstrate that our approach enables the identification of policies that are finely tuned to the intricate risk-awareness of the agent that they serve.
♻ ☆ Combinatorial Creativity: A New Frontier in Generalization Abilities
Artificial intelligence (AI) systems, and Large Language Models (LLMs) in particular, are increasingly employed for creative tasks like scientific idea generation, constituting a form of generalization from training data unaddressed by existing conceptual frameworks. Despite its similarities to compositional generalization (CG), combinatorial creativity (CC) is an open-ended ability. Instead of evaluating for accuracy or correctness against fixed targets, which would contradict the open-ended nature of CC, we propose a theoretical framework and algorithmic task for evaluating outputs by their degrees of novelty and utility. From here, we make several important empirical contributions: (1) We obtain the first insights into the scaling behavior of creativity for LLMs. (2) We discover that, for fixed compute budgets, there exist optimal model depths and widths for creative ability. (3) We find that the ideation-execution gap, whereby LLMs excel at generating novel scientific ideas but struggle to ensure their practical feasibility, may be explained by a more fundamental novelty-utility tradeoff characteristic of creativity algorithms in general. Importantly, this tradeoff remains persistent even at scale, casting doubt on the long-term creative potential of LLMs in their current form. Together, our conceptual framework and empirical findings provide a foundation for understanding and improving creativity in modern AI models, bridging the gap between human and machine intelligence.
comment: Preprint. The first two authors contributed equally
♻ ☆ FORLA: Federated Object-centric Representation Learning with Slot Attention
Learning efficient visual representations across heterogeneous unlabeled datasets remains a central challenge in federated learning. Effective federated representations require features that are jointly informative across clients while disentangling domain-specific factors without supervision. We introduce FORLA, a novel framework for federated object-centric representation learning and feature adaptation across clients using unsupervised slot attention. At the core of our method is a shared feature adapter, trained collaboratively across clients to adapt features from foundation models, and a shared slot attention module that learns to reconstruct the adapted features. To optimize this adapter, we design a two-branch student-teacher architecture. In each client, a student decoder learns to reconstruct full features from foundation models, while a teacher decoder reconstructs their adapted, low-dimensional counterpart. The shared slot attention module bridges cross-domain learning by aligning object-level representations across clients. Experiments in multiple real-world datasets show that our framework not only outperforms centralized baselines on object discovery but also learns a compact, universal representation that generalizes well across domains. This work highlights federated slot attention as an effective tool for scalable, unsupervised visual representation learning from cross-domain data with distributed concepts.
comment: Accepted by Neurips2025
♻ ☆ The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models
Recent advances in multimodal training have significantly improved the integration of image understanding and generation within a unified model. This study investigates how vision-language models (VLMs) handle image-understanding tasks, focusing on how visual information is processed and transferred to the textual domain. We compare native multimodal VLMs, models trained from scratch on multimodal data to generate both text and images, and non-native multimodal VLMs, models adapted from pre-trained large language models or capable of generating only text, highlighting key differences in information flow. We find that in native multimodal VLMs, image and text embeddings are more separated within the residual stream. Moreover, VLMs differ in how visual information reaches text: non-native multimodal VLMs exhibit a distributed communication pattern, where information is exchanged through multiple image tokens, whereas models trained natively for joint image and text generation tend to rely on a single post-image token that acts as a narrow gate for visual information. We show that ablating this single token significantly deteriorates image-understanding performance, whereas targeted, token-level interventions reliably steer image semantics and downstream text with fine-grained control.
Human-Computer Interaction 19
☆ Group Inertial Poser: Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging ICCV 2025
Tracking human full-body motion using sparse wearable inertial measurement units (IMUs) overcomes the limitations of occlusion and instrumentation of the environment inherent in vision-based approaches. However, purely IMU-based tracking compromises translation estimates and accurate relative positioning between individuals, as inertial cues are inherently self-referential and provide no direct spatial reference for others. In this paper, we present a novel approach for robustly estimating body poses and global translation for multiple individuals by leveraging the distances between sparse wearable sensors - both on each individual and across multiple individuals. Our method Group Inertial Poser estimates these absolute distances between pairs of sensors from ultra-wideband ranging (UWB) and fuses them with inertial observations as input into structured state-space models to integrate temporal motion patterns for precise 3D pose estimation. Our novel two-step optimization further leverages the estimated distances for accurately tracking people's global trajectories through the world. We also introduce GIP-DB, the first IMU+UWB dataset for two-person tracking, which comprises 200 minutes of motion recordings from 14 participants. In our evaluation, Group Inertial Poser outperforms previous state-of-the-art methods in accuracy and robustness across synthetic and real-world data, showing the promise of IMU+UWB-based multi-human motion capture in the wild. Code, models, dataset: https://github.com/eth-siplab/GroupInertialPoser
comment: Accepted by ICCV 2025, Code: https://github.com/eth-siplab/GroupInertialPoser
☆ Human and AI Trust: Trust Attitude Measurement Instrument
With the current progress of Artificial Intelligence (AI) technology and its increasingly broader applications, trust is seen as a required criterion for AI usage, acceptance, and deployment. A robust measurement instrument is essential to correctly evaluate trust from a human-centered perspective. This paper describes the development and validation process of a trust measure instrument, which follows psychometric principles, and consists of a 16-items trust scale. The instrument was built explicitly for research in human-AI interaction to measure trust attitudes towards AI systems from layperson (non-expert) perspective. The use-case we used to develop the scale was in the context of AI medical support systems (specifically cancer/health prediction). The scale development (Measurement Item Development) and validation (Measurement Item Evaluation) involved six research stages: item development, item evaluation, survey administration, test of dimensionality, test of reliability, and test of validity. The results of the six-stages evaluation show that the proposed trust measurement instrument is empirically reliable and valid for systematically measuring and comparing non-experts' trust in AI Medical Support Systems.
☆ Actionable Cybersecurity Notifications for Smart Homes: A User Study on the Role of Length and Complexity
The proliferation of smart home devices has increased convenience but also introduced cybersecurity risks for everyday users, as many devices lack robust security features. Intrusion Detection Systems are a prominent approach to detecting cybersecurity threats. However, their alerts often use technical terms and require users to interpret them correctly, which is challenging for a typical smart home user. Large Language Models can bridge this gap by translating IDS alerts into actionable security notifications. However, it has not yet been clear what an actionable cybersecurity notification should look like. In this paper, we conduct an experimental online user study with 130 participants to examine how the length and complexity of LLM-generated notifications affect user likability, understandability, and motivation to act. Our results show that intermediate-complexity notifications are the most effective across all user groups, regardless of their technological proficiency. Across the board, users rated beginner-level messages as more effective when they were longer, while expert-level messages were rated marginally more effective when they were shorter. These findings provide insights for designing security notifications that are both actionable and broadly accessible to smart home users.
comment: This version of the article has been accepted for publication, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: https://doi.org/10.1007/978-3-032-07989-3_19
☆ Co-Designing with Multiple Stakeholders and Datasets: A Community-Centered Process to Understand Youth Deviance in the Italian City of Turin
This paper presents the co-design and design evaluation of Sbocciamo Torino civic tool, which helps understand and act upon the issues of youth deviance in the Italian city of Turin through multi-stakeholder collaboration and collaborative data analysis. Rooted in research through design and participatory design methodologies, the civic tool integrates a data dashboard, stakeholder committee, and structured co-design sessions to facilitate collaborative analysis and intervention planning. The civic tool was developed in partnership with municipal authorities, law enforcement, NGOs, and social services, and reflects their institutional priorities while centering community knowledge. We describe the iterative co-design process, including stakeholder workshops for design, validation, training, and evaluation. The civic tool's impact on stakeholder trust, collaboration, and decision-making was assessed through surveys and open-ended questionnaires. Our findings show that stakeholders valued the inclusive design approach and data-driven collaboration while revealing barriers in communication, data literacy, and operational coordination. Furthermore, political and institutional support was identified as critical to the civic tool's success. This paper contributes to research on community technologies by demonstrating how civic tools can be collaboratively developed to navigate wicked social problems through participatory design.
☆ Assessing the Real-World Utility of Explainable AI for Arousal Diagnostics: An Application-Grounded User Study
Artificial intelligence (AI) systems increasingly match or surpass human experts in biomedical signal interpretation. However, their effective integration into clinical practice requires more than high predictive accuracy. Clinicians must discern \textit{when} and \textit{why} to trust algorithmic recommendations. This work presents an application-grounded user study with eight professional sleep medicine practitioners, who score nocturnal arousal events in polysomnographic data under three conditions: (i) manual scoring, (ii) black-box (BB) AI assistance, and (iii) transparent white-box (WB) AI assistance. Assistance is provided either from the \textit{start} of scoring or as a post-hoc quality-control (\textit{QC}) review. We systematically evaluate how the type and timing of assistance influence event-level and clinically most relevant count-based performance, time requirements, and user experience. When evaluated against the clinical standard used to train the AI, both AI and human-AI teams significantly outperform unaided experts, with collaboration also reducing inter-rater variability. Notably, transparent AI assistance applied as a targeted QC step yields median event-level performance improvements of approximately 30\% over black-box assistance, and QC timing further enhances count-based outcomes. While WB and QC approaches increase the time required for scoring, start-time assistance is faster and preferred by most participants. Participants overwhelmingly favor transparency, with seven out of eight expressing willingness to adopt the system with minor or no modifications. In summary, strategically timed transparent AI assistance effectively balances accuracy and clinical efficiency, providing a promising pathway toward trustworthy AI integration and user acceptance in clinical workflows.
☆ Understanding AI Trustworthiness: A Scoping Review of AIES & FAccT Articles
Background: Trustworthy AI serves as a foundational pillar for two major AI ethics conferences: AIES and FAccT. However, current research often adopts techno-centric approaches, focusing primarily on technical attributes such as reliability, robustness, and fairness, while overlooking the sociotechnical dimensions critical to understanding AI trustworthiness in real-world contexts. Objectives: This scoping review aims to examine how the AIES and FAccT communities conceptualize, measure, and validate AI trustworthiness, identifying major gaps and opportunities for advancing a holistic understanding of trustworthy AI systems. Methods: We conduct a scoping review of AIES and FAccT conference proceedings to date, systematically analyzing how trustworthiness is defined, operationalized, and applied across different research domains. Our analysis focuses on conceptualization approaches, measurement methods, verification and validation techniques, application areas, and underlying values. Results: While significant progress has been made in defining technical attributes such as transparency, accountability, and robustness, our findings reveal critical gaps. Current research often predominantly emphasizes technical precision at the expense of social and ethical considerations. The sociotechnical nature of AI systems remains less explored and trustworthiness emerges as a contested concept shaped by those with the power to define it. Conclusions: An interdisciplinary approach combining technical rigor with social, cultural, and institutional considerations is essential for advancing trustworthy AI. We propose actionable measures for the AI ethics community to adopt holistic frameworks that genuinely address the complex interplay between AI systems and society, ultimately promoting responsible technological development that benefits all stakeholders.
comment: Submitted to Journal of Artificial Intelligence Research (JAIR)
☆ DispatchMAS: Fusing taxonomy and artificial intelligence agents for emergency medical services
Objective: Emergency medical dispatch (EMD) is a high-stakes process challenged by caller distress, ambiguity, and cognitive load. Large Language Models (LLMs) and Multi-Agent Systems (MAS) offer opportunities to augment dispatchers. This study aimed to develop and evaluate a taxonomy-grounded, LLM-powered multi-agent system for simulating realistic EMD scenarios. Methods: We constructed a clinical taxonomy (32 chief complaints, 6 caller identities from MIMIC-III) and a six-phase call protocol. Using this framework, we developed an AutoGen-based MAS with Caller and Dispatcher Agents. The system grounds interactions in a fact commons to ensure clinical plausibility and mitigate misinformation. We used a hybrid evaluation framework: four physicians assessed 100 simulated cases for "Guidance Efficacy" and "Dispatch Effectiveness," supplemented by automated linguistic analysis (sentiment, readability, politeness). Results: Human evaluation, with substantial inter-rater agreement (Gwe's AC1 > 0.70), confirmed the system's high performance. It demonstrated excellent Dispatch Effectiveness (e.g., 94 % contacting the correct potential other agents) and Guidance Efficacy (advice provided in 91 % of cases), both rated highly by physicians. Algorithmic metrics corroborated these findings, indicating a predominantly neutral affective profile (73.7 % neutral sentiment; 90.4 % neutral emotion), high readability (Flesch 80.9), and a consistently polite style (60.0 % polite; 0 % impolite). Conclusion: Our taxonomy-grounded MAS simulates diverse, clinically plausible dispatch scenarios with high fidelity. Findings support its use for dispatcher training, protocol evaluation, and as a foundation for real-time decision support. This work outlines a pathway for safely integrating advanced AI agents into emergency response workflows.
comment: 27 pages, 7 figures, 3 tables
☆ Rethinking UX for Sustainable Science Gateways: Orientations from Practice
As science gateways mature, sustainability has become a central concern for funders, developers, and institutions. Although user experience (UX) is increasingly acknowledged as vital, it is often approached narrowly--limited to interface usability or deferred until late in development. This paper argues that UX should be understood not as a discrete feature or evaluation stage but as a design-oriented perspective for reasoning about sustainability. Drawing on principles from user-centered design and systems thinking, this view recognizes that infrastructure, staffing, community engagement, and development timelines all shape how gateways are experienced and maintained over time. Based on an interview study and consulting experience with more than 65 gateway projects, the paper identifies three recurring orientations toward UX--ad hoc, project-based, and strategic--that characterize how teams engage with users and integrate design thinking into their workflows. These orientations are not a maturity model but a reflective lens for understanding how UX is positioned within gateway practice. Reframing UX as a structural dimension of sustainability highlights its role in building adaptable, community-aligned, and enduring scientific infrastructure.
comment: To be published at Gateways 2025
☆ We Need Accountability in Human-AI Agent Relationships
We argue that accountability mechanisms are needed in human-AI agent relationships to ensure alignment with user and societal interests. We propose a framework according to which AI agents' engagement is conditional on appropriate user behaviour. The framework incorporates design-strategies such as distancing, disengaging, and discouraging.
☆ Designing and Evaluating Hint Generation Systems for Science Education
Large language models are influencing the education landscape, with students relying on them in their learning process. Often implemented using general-purpose models, these systems are likely to give away the answers, which could hinder conceptual understanding and critical thinking. We study the role of automatic hint generation as a pedagogical strategy to promote active engagement with the learning content, while guiding learners toward the answers. Focusing on scientific topics at the secondary education level, we explore the potential of large language models to generate chains of hints that scaffold learners without revealing answers. We compare two distinct hinting strategies: static hints, pre-generated for each problem, and dynamic hints, adapted to learners' progress. Through a quantitative study with 41 participants, we uncover different preferences among learners with respect to hinting strategies, and identify the limitations of automatic evaluation metrics to capture them. Our findings highlight key design considerations for future research on hint generation and intelligent tutoring systems that seek to develop learner-centered educational technologies.
☆ Soppia: A Structured Prompting Framework for the Proportional Assessment of Non-Pecuniary Damages in Personal Injury Cases
Applying complex legal rules characterized by multiple, heterogeneously weighted criteria presents a fundamental challenge in judicial decision-making, often hindering the consistent realization of legislative intent. This challenge is particularly evident in the quantification of non-pecuniary damages in personal injury cases. This paper introduces Soppia, a structured prompting framework designed to assist legal professionals in navigating this complexity. By leveraging advanced AI, the system ensures a comprehensive and balanced analysis of all stipulated criteria, fulfilling the legislator's intent that compensation be determined through a holistic assessment of each case. Using the twelve criteria for non-pecuniary damages established in the Brazilian CLT (Art. 223-G) as a case study, we demonstrate how Soppia (System for Ordered Proportional and Pondered Intelligent Assessment) operationalizes nuanced legal commands into a practical, replicable, and transparent methodology. The framework enhances consistency and predictability while providing a versatile and explainable tool adaptable across multi-criteria legal contexts, bridging normative interpretation and computational reasoning toward auditable legal AI.
comment: 9 pages, 2 tables, includes GitHub link to framework implementation. Submitted to the Artificial Intelligence and Law section of arXiv
♻ ☆ CityAQVis: Integrated ML-Visualization Sandbox Tool for Pollutant Estimation in Urban Regions Using Multi-Source Data (Software Article)
Urban air pollution poses significant risks to public health, environmental sustainability, and policy planning. Effective air quality management requires predictive tools that can integrate diverse datasets and communicate complex spatial and temporal pollution patterns. There is a gap in interactive tools with seamless integration of forecasting and visualization of spatial distributions of air pollutant concentrations. We present CityAQVis, an interactive machine learning ML sandbox tool designed to predict and visualize pollutant concentrations at the ground level using multi-source data, which includes satellite observations, meteorological parameters, population density, elevation, and nighttime lights. While traditional air quality visualization tools often lack forecasting capabilities, CityAQVis enables users to build and compare predictive models, visualizing the model outputs and offering insights into pollution dynamics at the ground level. The pilot implementation of the tool is tested through case studies predicting nitrogen dioxide (NO2) concentrations in metropolitan regions, highlighting its adaptability to various pollutants. Through an intuitive graphical user interface (GUI), the user can perform comparative visualizations of the spatial distribution of surface-level pollutant concentration in two different urban scenarios. Our results highlight the potential of ML-driven visual analytics to improve situational awareness and support data-driven decision-making in air quality management.
comment: 19 pages, 10 figures, 2 tables
♻ ☆ Visualization Tasks for Unlabelled Graphs
We investigate tasks that can be accomplished with unlabelled graphs, which are graphs with nodes that do not have attached persistent or semantically meaningful labels. New visualization techniques to represent unlabelled graphs have been proposed, but more understanding of unlabelled graph tasks is required before these techniques can be adequately evaluated. Some tasks apply to both labelled and unlabelled graphs, but many do not translate between these contexts. We propose a data abstraction model that distinguishes the Unlabelled context from the increasingly semantically rich Labelled, Attributed, and Augmented contexts. We filter tasks collected and gleaned from the literature according to our data abstraction and analyze the surfaced tasks, leading to a taxonomy of abstract tasks for unlabelled graphs. Our task taxonomy is organized according to the Scope of the data at play, the Action intended by the user, and the Target data under consideration. We show the descriptive power of this task abstraction by connecting to concrete examples from previous frameworks, and connect these abstractions to real-world problems. To showcase the evaluative power of the taxonomy, we perform a preliminary assessment of 6 visualizations for each task. For each combination of task and visual encoding, we consider the effort required from viewers, the likelihood of task success, and how both factors vary between small-scale and large-scale graphs.
♻ ☆ Feed-forward active magnetic shielding
Magnetic fields from the brain are tiny relative to ambient fields which therefore need to be suppressed. The common solution of passive shielding is expensive, bulky and insufficiently effective, thus motivating research into the alternative of active shielding which comes in two flavours: feed-back and feed-forward. In feed-back designs (the most common), corrective fields are created by coils driven from sensors within the area that they correct, for example from the main sensors of an MEG device. In feed-forward designs (less common), corrective fields are driven from dedicated reference sensors outside the area they correct. Feed-forward can achieve better performance than feed-back, in principle, however its implementation is hobbled by an unavoidable coupling between coils and reference sensors, which reduces the effectiveness of the shielding and may affect stability, complicating the design. This paper suggests a solution that relies on a ``decoupling matrix," inserted in the signal pathway between sensors and corrective coils, to counteract the spurious coupling. This allows feed-forward shielding do reduce the ambient field to zero across the full frequency range, in principle, although performance may be limited by other factors such as current noise. The solution, which is fully data-driven and does not require geometric calculations, high-tolerance fabrication, or physical calibration, has been evaluated by simulation, but not implemented in hardware. It might contribute to the deployment of a new generation of measurement systems based on optically-pumped magnetometers (OPM). The lower cost and reduced constraints of those systems are a strong incentive to likewise reduce the cost and constraints of the shielding required to operate them, hence the appeal of active shielding.
comment: 27 pages, 7 figures
♻ ☆ Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmark, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.
comment: 49 pages, 13 figures
♻ ☆ Kinesthetic Weight Modulation: The Effects of Whole-Arm Tendon Vibration on the Perceived Heaviness
Kinesthetic illusions, which arise when muscle spindles are activated by vibration, provide a compact means of presenting kinesthetic sensations. Because muscle spindles contribute not only to sensing body movement but also to perceiving heaviness, vibration-induced illusions could potentially modulate weight perception. While prior studies have primarily focused on conveying virtual movement, the modulation of perceived heaviness has received little attention. Presenting a sense of heaviness is essential for enriching haptic interactions with virtual objects. This study investigates whether multi-point tendon vibration can increase or decrease perceived heaviness (Experiment 1) and how the magnitude of the effect can be systematically controlled (Experiment 2). The results show that tendon vibration significantly increases perceived heaviness but does not significantly decrease it, although a decreasing trend was observed. Moreover, the increase can be adjusted across at least three levels within the range from 350 g to 450 g. Finally, we discuss plausible mechanisms underlying this vibration-induced modulation of weight perception.
comment: 12 pages, 11 figures
♻ ☆ Using Generative Artificial Intelligence Creatively in the Classroom and Research: Examples and Lessons Learned
Although generative artificial intelligence (AI) is not new, recent technological breakthroughs have transformed its capabilities across many domains. These changes necessitate new attention from educators and specialized training within the atmospheric and related sciences. Enabling students to use generative AI effectively, responsibly, and ethically is crucial for their academic and professional development. Educators can also use generative AI to develop engaging classroom activities, such as active learning modules and games; however, they must be aware of potential pitfalls and biases. There are also ethical implications in using tools that lack transparency and have a considerable carbon footprint, as well as equity concerns for students who lack access to more sophisticated paid versions of generative AI tools and have deficiencies in prior educational training. This article is written for students and educators alike, particularly those interested in learning more about generative AI in education and research, including its use cases, ethical concerns, and a brief history of its emergence. Sample user prompts are also provided across numerous applications in education and the atmospheric and related sciences. Current solutions addressing broader ethical concerns regarding the use of generative AI in education remain limited; however, this work aims to foster a discussion that could galvanize the education community around shared goals and values.
comment: This Work has been submitted to the Bulletin of the American Meteorological Society. Copyright in this Work may be transferred without further notice; 14 pages, 2 figures, 2 tables
♻ ☆ Save It for the "Hot" Day: An LLM-Empowered Visual Analytics System for Heat Risk Management
The escalating frequency and intensity of heat-related climate events, particularly heatwaves, emphasize the pressing need for advanced heat risk management strategies. Current approaches, primarily relying on numerical models, face challenges in spatial-temporal resolution and in capturing the dynamic interplay of environmental, social, and behavioral factors affecting heat risks. This has led to difficulties in translating risk assessments into effective mitigation actions. Recognizing these problems, we introduce a novel approach leveraging the burgeoning capabilities of Large Language Models (LLMs) to extract rich and contextual insights from news reports. We hence propose an LLM-empowered visual analytics system, Havior, that integrates the precise, data-driven insights of numerical models with nuanced news report information. This hybrid approach enables a more comprehensive assessment of heat risks and better identification, assessment, and mitigation of heat-related threats. The system incorporates novel visualization designs, such as "thermoglyph" and news glyph, enhancing intuitive understanding and analysis of heat risks. The integration of LLM-based techniques also enables advanced information retrieval and semantic knowledge extraction that can be guided by experts' analytics needs. Our case studies on two cities that faced significant heatwave events and interviews with five experts have demonstrated the usefulness of our system in providing in-depth and actionable insights for heat risk management.
♻ ☆ The Risks of Industry Influence in Tech Research
Emerging information technologies like social media, search engines, and AI can have a broad impact on public health, political institutions, social dynamics, and the natural world. It is critical to develop a scientific understanding of these impacts to inform evidence-based technology policy that minimizes harm and maximizes benefits. Unlike most other global-scale scientific challenges, however, the data necessary for scientific progress are generated and controlled by the same industry that might be subject to evidence-based regulation. Moreover, technology companies historically have been, and continue to be, a major source of funding for this field. These asymmetries in information and funding raise significant concerns about the potential for undue industry influence on the scientific record. In this Perspective, we explore how technology companies can influence our scientific understanding of their products. We argue that science faces unique challenges in the context of technology research that will require strengthening existing safeguards and constructing wholly new ones.
comment: 17 pages, 2 Figures
Software Engineering 38
☆ Privacy by Design: Aligning GDPR and Software Engineering Specifications with a Requirements Engineering Approach
Context: Consistent requirements and system specifications are essential for the compliance of software systems towards the General Data Protection Regulation (GDPR). Both artefacts need to be grounded in the original text and conjointly assure the achievement of privacy by design (PbD). Objectives: There is little understanding of the perspectives of practitioners on specification objectives and goals to address PbD. Existing approaches do not account for the complex intersection between problem and solution space expressed in GDPR. In this study we explore the demand for conjoint requirements and system specification for PbD and suggest an approach to address this demand. Methods: We reviewed secondary and related primary studies and conducted interviews with practitioners to (1) investigate the state-of-practice and (2) understand the underlying specification objectives and goals (e.g., traceability). We developed and evaluated an approach for requirements and systems specification for PbD, and evaluated it against the specification objectives. Results: The relationship between problem and solution space, as expressed in GDPR, is instrumental in supporting PbD. We demonstrate how our approach, based on the modeling GDPR content with original legal concepts, contributes to specification objectives of capturing legal knowledge, supporting specification transparency, and traceability. Conclusion: GDPR demands need to be addressed throughout different levels of abstraction in the engineering lifecycle to achieve PbD. Legal knowledge specified in the GDPR text should be captured in specifications to address the demands of different stakeholders and ensure compliance. While our results confirm the suitability of our approach to address practical needs, we also revealed specific needs for the future effective operationalization of the approach.
☆ Lights-Out: An Automated Ground Segment for unstaffed Satellite Operations
We present our approach for a periodically unstaffed, fully automated ground segment. The concept is in use for the first time on the German satellite communications mission Heinrich Hertz on behalf of the German Space Agency at DLR. Heinrich Hertz was launched in July 2023 and offers access to scientific and technical experiments to its users. The mission utilizes major automation concepts for the satellite platform operations, allowing fully automated operations outside of office hours. The concept includes tracking, telemetry and commanding (TTC) of the satellite. Pre-planned and automatically executed schedules enable commanding without human interaction. The user mission schedule is planned separately from the main mission schedule and is automatically de-conflicted. The automatic monitoring concept monitors the systems of the satellite and all assets in the ground segment and triggers reactions in operator-configurable ways depending on the mission needs, for example emergency notifications or automated execution of flight operation procedures. Additionally, the concept also puts special emphasis on a self-service user portal that provides flexible access 24/7, even when the control center is not staffed. The portal allows external users of the payload to schedule pre-defined experiments, monitor the live execution of the experiment with browser-based displays and access ground station telemetry and dedicated RF test equipment during the time of their scheduled experiment. Tasks can be planned long in advance as well as with a short reaction time (less than 1 minute), which allows, for example, the reconfiguration of the payload during a running experiment.
☆ Wisdom and Delusion of LLM Ensembles for Code Generation and Repair
Today's pursuit of a single Large Language Model (LMM) for all software engineering tasks is resource-intensive and overlooks the potential benefits of complementarity, where different models contribute unique strengths. However, the degree to which coding LLMs complement each other and the best strategy for maximizing an ensemble's potential are unclear, leaving practitioners without a clear path to move beyond single-model systems. To address this gap, we empirically compare ten individual LLMs from five families, and three ensembles of these LLMs across three software engineering benchmarks covering code generation and program repair. We assess the complementarity between models and the performance gap between the best individual model and the ensembles. Next, we evaluate various selection heuristics to identify correct solutions from an ensemble's candidate pool. We find that the theoretical upperbound for an ensemble's performance can be 83% above the best single model. Our results show that consensus-based strategies for selecting solutions fall into a "popularity trap," amplifying common but incorrect outputs. In contrast, a diversity-based strategy realizes up to 95% of this theoretical potential, and proves effective even in small two-model ensembles, enabling a cost-efficient way to enhance performance by leveraging multiple LLMs.
☆ Risk Management for Mitigating Benchmark Failure Modes: BenchRisk NeurIPS 2025
Large language model (LLM) benchmarks inform LLM use decisions (e.g., "is this LLM safe to deploy for my use case and context?"). However, benchmarks may be rendered unreliable by various failure modes that impact benchmark bias, variance, coverage, or people's capacity to understand benchmark evidence. Using the National Institute of Standards and Technology's risk management process as a foundation, this research iteratively analyzed 26 popular benchmarks, identifying 57 potential failure modes and 196 corresponding mitigation strategies. The mitigations reduce failure likelihood and/or severity, providing a frame for evaluating "benchmark risk," which is scored to provide a metaevaluation benchmark: BenchRisk. Higher scores indicate that benchmark users are less likely to reach an incorrect or unsupported conclusion about an LLM. All 26 scored benchmarks present significant risk within one or more of the five scored dimensions (comprehensiveness, intelligibility, consistency, correctness, and longevity), which points to important open research directions for the field of LLM benchmarking. The BenchRisk workflow allows for comparison between benchmarks; as an open-source tool, it also facilitates the identification and sharing of risks and their mitigations.
comment: 19 pages, 7 figures, to be published in the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ Towards Socio-Technical Topology-Aware Adaptive Threat Detection in Software Supply Chains
Software supply chains (SSCs) are complex systems composed of dynamic, heterogeneous technical and social components which collectively achieve the production and maintenance of software artefacts. Attacks on SSCs are increasing, yet pervasive vulnerability analysis is challenging due to their complexity. Therefore, threat detection must be targeted, to account for the large and dynamic structure, and adaptive, to account for its change and diversity. While current work focuses on technical approaches for monitoring supply chain dependencies and establishing component controls, approaches which inform threat detection through understanding the socio-technical dynamics are lacking. We outline a position and research vision to develop and investigate the use of socio-technical models to support adaptive threat detection of SSCs. We motivate this approach through an analysis of the XZ Utils attack whereby malicious actors undermined the maintainers' trust via the project's GitHub and mailing lists. We highlight that monitoring technical and social data can identify trends which indicate suspicious behaviour to then inform targeted and intensive vulnerability assessment. We identify challenges and research directions to achieve this vision considering techniques for developer and software analysis, decentralised adaptation and the need for a test bed for software supply chain security research.
comment: to be published in: The 12th International Conference on Social Networks Analysis, Management and Security (SNAMS), IEEE
☆ Scalpel: Automotive Deep Learning Framework Testing via Assembling Model Components
Deep learning (DL) plays a key role in autonomous driving systems. DL models support perception modules, equipped with tasks such as object detection and sensor fusion. These DL models enable vehicles to process multi-sensor inputs to understand complex surroundings. Deploying DL models in autonomous driving systems faces stringent challenges, including real-time processing, limited computational resources, and strict power constraints. To address these challenges, automotive DL frameworks (e.g., PaddleInference) have emerged to optimize inference efficiency. However, these frameworks encounter unique quality issues due to their more complex deployment environments, such as crashes stemming from limited scheduled memory and incorrect memory allocation. Unfortunately, existing DL framework testing methods fail to detect these quality issues due to the failure in deploying generated test input models, as these models lack three essential capabilities: (1) multi-input/output tensor processing, (2) multi-modal data processing, and (3) multi-level data feature extraction. These capabilities necessitate specialized model components, which existing testing methods neglect during model generation. To bridge this gap, we propose Scalpel, an automotive DL frameworks testing method that generates test input models at the model component level. Scalpel generates models by assembling model components (heads, necks, backbones) to support capabilities required by autonomous driving systems. Specifically, Scalpel maintains and updates a repository of model components, generating test inputs by selecting, mutating, and assembling them. Successfully generated models are added back to enrich the repository. Newly generated models are then deployed within the autonomous driving system to test automotive DL frameworks via differential testing.
☆ Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification
[Context and motivation] Large language models (LLMs) show notable results in natural language processing (NLP) tasks for requirements engineering (RE). However, their use is compromised by high computational cost, data sharing risks, and dependence on external services. In contrast, small language models (SLMs) offer a lightweight, locally deployable alternative. [Question/problem] It remains unclear how well SLMs perform compared to LLMs in RE tasks in terms of accuracy. [Results] Our preliminary study compares eight models, including three LLMs and five SLMs, on requirements classification tasks using the PROMISE, PROMISE Reclass, and SecReq datasets. Our results show that although LLMs achieve an average F1 score of 2% higher than SLMs, this difference is not statistically significant. SLMs almost reach LLMs performance across all datasets and even outperform them in recall on the PROMISE Reclass dataset, despite being up to 300 times smaller. We also found that dataset characteristics play a more significant role in performance than model size. [Contribution] Our study contributes with evidence that SLMs are a valid alternative to LLMs for requirements classification, offering advantages in privacy, cost, and local deployability.
☆ Context Engineering for AI Agents in Open-Source Software
GenAI-based coding assistants have disrupted software development. Their next generation is agent-based, operating with more autonomy and potentially without human oversight. One challenge is to provide AI agents with sufficient context about the software projects they operate in. Like humans, AI agents require contextual information to develop solutions that are in line with the target architecture, interface specifications, coding guidelines, standard workflows, and other project-specific policies. Popular AI agents for software development (e.g., Claude Code) advocate for maintaining tool-specific version-controlled Markdown files that cover aspects such as the project structure, building and testing, or code style. The content of these files is automatically added to each prompt. AGENTS.md has emerged as a potential standard that consolidates tool-specific formats. However, little is known about whether and how developers adopt this format. Therefore, in this paper, we present the results of a preliminary study investigating the adoption of AI configuration files in 466 open-source software projects, what information developers provide in these files, how they present that information, and how they evolve over time. Our findings indicate that there is no established structure yet, and that there is a lot of variation in terms of how context is provided (descriptive, prescriptive, prohibitive, explanatory, conditional). We see great potential in studying which modifications in structure or presentation can positively affect the quality of the generated content. Finally, our analysis of commits that have modified AGENTS.md files provides first insights into how projects continuously extend and maintain these files. We conclude the paper by outlining how the adoption of AI configuration files in provides a unique opportunity to study real-world prompt and context engineering.
comment: 6 pages, 1 figure, 2 tables
☆ REvolution: An Evolutionary Framework for RTL Generation driven by Large Language Models SP
Large Language Models (LLMs) are used for Register-Transfer Level (RTL) code generation, but they face two main challenges: functional correctness and Power, Performance, and Area (PPA) optimization. Iterative, feedback-based methods partially address these, but they are limited to local search, hindering the discovery of a global optimum. This paper introduces REvolution, a framework that combines Evolutionary Computation (EC) with LLMs for automatic RTL generation and optimization. REvolution evolves a population of candidates in parallel, each defined by a design strategy, RTL implementation, and evaluation feedback. The framework includes a dual-population algorithm that divides candidates into Fail and Success groups for bug fixing and PPA optimization, respectively. An adaptive mechanism further improves search efficiency by dynamically adjusting the selection probability of each prompt strategy according to its success rate. Experiments on the VerilogEval and RTLLM benchmarks show that REvolution increased the initial pass rate of various LLMs by up to 24.0 percentage points. The DeepSeek-V3 model achieved a final pass rate of 95.5\%, comparable to state-of-the-art results, without the need for separate training or domain-specific tools. Additionally, the generated RTL designs showed significant PPA improvements over reference designs. This work introduces a new RTL design approach by combining LLMs' generative capabilities with EC's broad search power, overcoming the local-search limitations of previous methods.
comment: Accepted for publication at the 2026 Asia and South Pacific Design Automation Conference (ASP-DAC)
☆ GreenMalloc: Allocator Optimisation for Industrial Workloads
We present GreenMalloc, a multi objective search-based framework for automatically configuring memory allocators. Our approach uses NSGA II and rand_malloc as a lightweight proxy benchmarking tool. We efficiently explore allocator parameters from execution traces and transfer the best configurations to gem5, a large system simulator, in a case study on two allocators: the GNU C/CPP compiler's glibc malloc and Google's TCMalloc. Across diverse workloads, our empirical results show up to 4.1 percantage reduction in average heap usage without loss of runtime efficiency; indeed, we get a 0.25 percantage reduction.
☆ FLAMES: Fine-tuning LLMs to Synthesize Invariants for Smart Contract Security
Smart contract vulnerabilities cost billions of dollars annually, yet existing automated analysis tools fail to generate deployable defenses. We present FLAMES, a novel automated approach that synthesizes executable runtime guards as Solidity "require" statements to harden smart contracts against exploits. Unlike prior work that relies on vulnerability labels, symbolic analysis, or natural language specifications, FLAMES employs domain-adapted large language models trained through fill-in-the-middle supervised fine-tuning on real-world invariants extracted from 514,506 verified contracts. Our extensive evaluation across three dimensions demonstrates FLAMES's effectiveness: (1) Compilation: FLAMES achieves 96.7% compilability for synthesized invariant (2) Semantic Quality: on a curated test set of 5,000 challenging invariants, FLAMES produces exact or semantically equivalent matches to ground truth in 44.5% of cases; (3) Exploit Mitigation: FLAMES prevents 22 out of 108 real exploits (20.4%) while preserving contract functionality, and (4) FLAMES successfully blocks the real-world APEMAGA incident by synthesizing a pre-condition that mitigates the attack. FLAMES establishes that domain-adapted LLMs can automatically generate production-ready security defenses for smart contracts without requiring vulnerability detection, formal specifications, or human intervention. We release our code, model weights, datasets, and evaluation infrastructure to enable reproducible research in this critical domain.
☆ Efficient Exploration of Chemical Kinetics
Estimating reaction rates and chemical stability is fundamental, yet efficient methods for large-scale simulations remain out of reach despite advances in modeling and exascale computing. Direct simulation is limited by short timescales; machine-learned potentials require large data sets and struggle with transition state regions essential for reaction rates. Reaction network exploration with sufficient accuracy is hampered by the computational cost of electronic structure calculations, and even simplifications like harmonic transition state theory rely on prohibitively expensive saddle point searches. Surrogate model-based acceleration has been promising but hampered by overhead and numerical instability. This dissertation presents a holistic solution, co-designing physical representations, statistical models, and systems architecture in the Optimal Transport Gaussian Process (OT-GP) framework. Using physics-aware optimal transport metrics, OT-GP creates compact, chemically relevant surrogates of the potential energy surface, underpinned by statistically robust sampling. Alongside EON software rewrites for long timescale simulations, we introduce reinforcement learning approaches for both minimum-mode following (when the final state is unknown) and nudged elastic band methods (when endpoints are specified). Collectively, these advances establish a representation-first, modular approach to chemical kinetics simulation. Large-scale benchmarks and Bayesian hierarchical validation demonstrate state-of-the-art performance and practical exploration of chemical kinetics, transforming a longstanding theoretical promise into a working engine for discovery.
comment: Doctoral dissertation, 102 pages, ISBN pending from the University of Iceland. doctorate. By design, all text and figures within this thesis are original and do not appear in the associated papers
☆ LLM-Powered Detection of Price Manipulation in DeFi
Decentralized Finance (DeFi) smart contracts manage billions of dollars, making them a prime target for exploits. Price manipulation vulnerabilities, often via flash loans, are a devastating class of attacks causing significant financial losses. Existing detection methods are limited. Reactive approaches analyze attacks only after they occur, while proactive static analysis tools rely on rigid, predefined heuristics, limiting adaptability. Both depend on known attack patterns, failing to identify novel variants or comprehend complex economic logic. We propose PMDetector, a hybrid framework combining static analysis with Large Language Model (LLM)-based reasoning to proactively detect price manipulation vulnerabilities. Our approach uses a formal attack model and a three-stage pipeline. First, static taint analysis identifies potentially vulnerable code paths. Second, a two-stage LLM process filters paths by analyzing defenses and then simulates attacks to evaluate exploitability. Finally, a static analysis checker validates LLM results, retaining only high-risk paths and generating comprehensive vulnerability reports. To evaluate its effectiveness, we built a dataset of 73 real-world vulnerable and 288 benign DeFi protocols. Results show PMDetector achieves 88% precision and 90% recall with Gemini 2.5-flash, significantly outperforming state-of-the-art static analysis and LLM-based approaches. Auditing a vulnerability with PMDetector costs just $0.03 and takes 4.0 seconds with GPT-4.1, offering an efficient and cost-effective alternative to manual audits.
☆ Unified Implementations of Recurrent Neural Networks in Multiple Deep Learning Frameworks
Recurrent neural networks (RNNs) are a cornerstone of sequence modeling across various scientific and industrial applications. Owing to their versatility, numerous RNN variants have been proposed over the past decade, aiming to improve the modeling of long-term dependencies and to address challenges such as vanishing and exploding gradients. However, no central library is available to test these variations, and reimplementing diverse architectures can be time-consuming and error-prone, limiting reproducibility and exploration. Here, we introduce three open-source libraries in Julia and Python that centralize numerous recurrent cell implementations and higher-level recurrent architectures. torchrecurrent, RecurrentLayers.jl, and LuxRecurrentLayers.jl offer a consistent framework for constructing and extending RNN models, providing built-in mechanisms for customization and experimentation. All packages are available under the MIT license and actively maintained on GitHub.
☆ Securing AI Agent Execution
Large Language Models (LLMs) have evolved into AI agents that interact with external tools and environments to perform complex tasks. The Model Context Protocol (MCP) has become the de facto standard for connecting agents with such resources, but security has lagged behind: thousands of MCP servers execute with unrestricted access to host systems, creating a broad attack surface. In this paper, we introduce AgentBound, the first access control framework for MCP servers. AgentBound combines a declarative policy mechanism, inspired by the Android permission model, with a policy enforcement engine that contains malicious behavior without requiring MCP server modifications. We build a dataset containing the 296 most popular MCP servers, and show that access control policies can be generated automatically from source code with 80.9% accuracy. We also show that AgentBound blocks the majority of security threats in several malicious MCP servers, and that policy enforcement engine introduces negligible overhead. Our contributions provide developers and project managers with a practical foundation for securing MCP servers while maintaining productivity, enabling researchers and tool builders to explore new directions for declarative access control and MCP security.
☆ QuArch: A Benchmark for Evaluating LLM Reasoning in Computer Architecture
The field of computer architecture, which bridges high-level software abstractions and low-level hardware implementations, remains absent from current large language model (LLM) evaluations. To this end, we present QuArch (pronounced 'quark'), the first benchmark designed to facilitate the development and evaluation of LLM knowledge and reasoning capabilities specifically in computer architecture. QuArch provides a comprehensive collection of 2,671 expert-validated question-answer (QA) pairs covering various aspects of computer architecture, including processor design, memory systems, and interconnection networks. Our evaluation reveals that while frontier models possess domain-specific knowledge, they struggle with skills that require higher-order thinking in computer architecture. Frontier model accuracies vary widely (from 34% to 72%) on these advanced questions, highlighting persistent gaps in architectural reasoning across analysis, design, and implementation QAs. By holistically assessing fundamental skills, QuArch provides a foundation for building and measuring LLM capabilities that can accelerate innovation in computing systems. With over 140 contributors from 40 institutions, this benchmark represents a community effort to set the standard for architectural reasoning in LLM evaluation.
☆ HW/SW Co-design of a PCM/PWM converter: a System Level Approach based in the SpecC Methodology
We present a case study applying the SpecC methodology within a system-level hardware/software co-design flow to a PCM-to-PWM converter, the core of a Class-D audio amplifier. The converter was modeled and explored with SpecC methodology to derive an HW/SW partition. Using system-level estimates and fast functional simulation, we evaluated mappings that meet real-time constraints while reducing estimated cost of an all-hardware solution and avoiding the expense of a purely software implementation on a high-end processor. Despite the design's moderate complexity, the results underline the value of system-level co-design for early architectural insight, rapid validation, and actionable cost/performance trade-offs. [Original work from 2005; formatting revised in 2025, with no changes to the results.]
comment: 6
☆ Impact and Implications of Generative AI for Enterprise Architects in Agile Environments: A Systematic Literature Review
Generative AI (GenAI) is reshaping enterprise architecture work in agile software organizations, yet evidence on its effects remains scattered. We report a systematic literature review (SLR), following established SLR protocols of Kitchenham and PRISMA, of 1,697 records, yielding 33 studies across enterprise, solution, domain, business, and IT architect roles. GenAI most consistently supports (i) design ideation and trade-off exploration; (ii) rapid creation and refinement of artifacts (e.g., code, models, documentation); and (iii) architectural decision support and knowledge retrieval. Reported risks include opacity and bias, contextually incorrect outputs leading to rework, privacy and compliance concerns, and social loafing. We also identify emerging skills and competencies, including prompt engineering, model evaluation, and professional oversight, and organizational enablers around readiness and adaptive governance. The review contributes with (1) a mapping of GenAI use cases and risks in agile architecting, (2) implications for capability building and governance, and (3) an initial research agenda on human-AI collaboration in architecture. Overall, the findings inform responsible adoption of GenAI that accelerates digital transformation while safeguarding architectural integrity.
comment: 17 pages, 1 figure, 5 tables; to appear in Enterprise Design, Operations, and Computing. EDOC 2025 Workshops, Lecture Notes in Business Information Processing (LNBIP), Springer, 2025. Part of 29th International Conference on Enterprise Design, Operations, and Computing (EDOC)
☆ FeaGPT: an End-to-End agentic-AI for Finite Element Analysis
Large language models (LLMs) are establishing new paradigms for engineering applications by enabling natural language control of complex computational workflows. This paper introduces FeaGPT, the first framework to achieve complete geometry-mesh-simulation workflows through conversational interfaces. Unlike existing tools that automate individual FEA components, FeaGPT implements a fully integrated Geometry-Mesh-Simulation-Analysis (GMSA) pipeline that transforms engineering specifications into validated computational results without manual intervention. The system interprets engineering intent, automatically generates physics-aware adaptive meshes, configures complete FEA simulations with proper boundary condition inference, and performs multi-objective analysis through closed-loop iteration. Experimental validation confirms complete end-to-end automation capability. Industrial turbocharger cases (7-blade compressor and 12-blade turbine at \SI{110000}{rpm}) demonstrate the system successfully transforms natural language specifications into validated CalculiX simulations, producing physically realistic results for rotating machinery analysis. Additional validation through 432 NACA airfoil configurations confirms scalability for parametric design exploration. These results demonstrate that natural language interfaces can effectively democratize access to advanced computational engineering tools while preserving analytical rigor.
☆ ArchISMiner: A Framework for Automatic Mining of Architectural Issue-Solution Pairs from Online Developer Communities
Stack Overflow (SO), a leading online community forum, is a rich source of software development knowledge. However, locating architectural knowledge, such as architectural solutions remains challenging due to the overwhelming volume of unstructured content and fragmented discussions. Developers must manually sift through posts to find relevant architectural insights, which is time-consuming and error-prone. This study introduces ArchISMiner, a framework for mining architectural knowledge from SO. The framework comprises two complementary components: ArchPI and ArchISPE. ArchPI trains and evaluates multiple models, including conventional ML/DL models, Pre-trained Language Models (PLMs), and Large Language Models (LLMs), and selects the best-performing model to automatically identify Architecture-Related Posts (ARPs) among programming-related discussions. ArchISPE employs an indirect supervised approach that leverages diverse features, including BERT embeddings and local TextCNN features, to extract architectural issue-solution pairs. Our evaluation shows that the best model in ArchPI achieves an F1-score of 0.960 in ARP detection, and ArchISPE outperforms baselines in both SE and NLP fields, achieving F1-scores of 0.883 for architectural issues and 0.894 for solutions. A user study further validated the quality (e.g., relevance and usefulness) of the identified ARPs and the extracted issue-solution pairs. Moreover, we applied ArchISMiner to three additional forums, releasing a dataset of over 18K architectural issue-solution pairs. Overall, ArchISMiner can help architects and developers identify ARPs and extract succinct, relevant, and useful architectural knowledge from developer communities more accurately and efficiently. The replication package of this study has been provided at https://github.com/JeanMusenga/ArchISPE
comment: 42 pages, 14 images, 6 tables, Manuscript submitted to a Journal (2025)
☆ A Comparison of Conversational Models and Humans in Answering Technical Questions: the Firefox Case
The use of Large Language Models (LLMs) to support tasks in software development has steadily increased over recent years. From assisting developers in coding activities to providing conversational agents that answer newcomers' questions. In collaboration with the Mozilla Foundation, this study evaluates the effectiveness of Retrieval-Augmented Generation (RAG) in assisting developers within the Mozilla Firefox project. We conducted an empirical analysis comparing responses from human developers, a standard GPT model, and a GPT model enhanced with RAG, using real queries from Mozilla's developer chat rooms. To ensure a rigorous evaluation, Mozilla experts assessed the responses based on helpfulness, comprehensiveness, and conciseness. The results show that RAG-assisted responses were more comprehensive than human developers (62.50% to 54.17%) and almost as helpful (75.00% to 79.17%), suggesting RAG's potential to enhance developer assistance. However, the RAG responses were not as concise and often verbose. The results show the potential to apply RAG-based tools to Open Source Software (OSS) to minimize the load to core maintainers without losing answer quality. Toning down retrieval mechanisms and making responses even shorter in the future would enhance developer assistance in massive projects like Mozilla Firefox.
comment: 13 pages
☆ TOM-SWE: User Mental Modeling For Software Engineering Agents
Recent advances in coding agents have made them capable of planning, editing, running, and testing complex code bases. Despite their growing ability in coding tasks, these systems still struggle to infer and track user intent, especially when instructions are underspecified or context-dependent. To bridge this gap, we introduce ToM-SWE, a dual-agent architecture that pairs a primary software-engineering (SWE) agent with a lightweight theory-of-mind (ToM) partner agent dedicated to modeling the user's mental state. The ToM agent infers user goals, constraints, and preferences from instructions and interaction history, maintains a \textbf{persistent memory} of the user, and provides user-related suggestions to the SWE agent. In two software engineering benchmarks (ambiguous SWE-bench and stateful SWE-bench), ToM-SWE improves task success rates and user satisfaction. Notably, on the stateful SWE benchmark, a newly introduced evaluation that provides agents with a user simulator along with previous interaction histories, ToM-SWE achieves a substantially higher task success rate of 59.7\% compared to 18.1\% for OpenHands, a state-of-the-art SWE agent. Furthermore, in a three-week study with professional developers using ToM-SWE in their daily work, participants found it useful 86\% of the time, underscoring the value of stateful user modeling for practical coding agents.
☆ Software Engineering Agents for Embodied Controller Generation : A Study in Minigrid Environments
Software Engineering Agents (SWE-Agents) have proven effective for traditional software engineering tasks with accessible codebases, but their performance for embodied tasks requiring well-designed information discovery remains unexplored. We present the first extended evaluation of SWE-Agents on controller generation for embodied tasks, adapting Mini-SWE-Agent (MSWEA) to solve 20 diverse embodied tasks from the Minigrid environment. Our experiments compare agent performance across different information access conditions: with and without environment source code access, and with varying capabilities for interactive exploration. We quantify how different information access levels affect SWE-Agent performance for embodied tasks and analyze the relative importance of static code analysis versus dynamic exploration for task solving. This work establishes controller generation for embodied tasks as a crucial evaluation domain for SWE-Agents and provides baseline results for future research in efficient reasoning systems.
comment: 10 pages, 7 figures
☆ R2ComSync: Improving Code-Comment Synchronization with In-Context Learning and Reranking
Code-Comment Synchronization (CCS) aims to synchronize the comments with code changes in an automated fashion, thereby significantly reducing the workload of developers during software maintenance and evolution. While previous studies have proposed various solutions that have shown success, they often exhibit limitations, such as a lack of generalization ability or the need for extensive task-specific learning resources. This motivates us to investigate the potential of Large Language Models (LLMs) in this area. However, a pilot analysis proves that LLMs fall short of State-Of-The-Art (SOTA) CCS approaches because (1) they lack instructive demonstrations for In-Context Learning (ICL) and (2) many correct-prone candidates are not prioritized.To tackle the above challenges, we propose R2ComSync, an ICL-based code-Comment Synchronization approach enhanced with Retrieval and Re-ranking. Specifically, R2ComSync carries corresponding two novelties: (1) Ensemble hybrid retrieval. It equally considers the similarity in both code-comment semantics and change patterns when retrieval, thereby creating ICL prompts with effective examples. (2) Multi-turn re-ranking strategy. We derived three significant rules through large-scale CCS sample analysis. Given the inference results of LLMs, it progressively exploits three re-ranking rules to prioritize relatively correct-prone candidates. We evaluate R2ComSync using five recent LLMs on three CCS datasets covering both Java and Python programming languages, and make comparisons with five SOTA approaches. Extensive experiments demonstrate the superior performance of R2ComSync against other approaches. Moreover, both quantitative and qualitative analyses provide compelling evidence that the comments synchronized by our proposal exhibit significantly higher quality.}
☆ BDiff: Block-aware and Accurate Text-based Code Differencing
Code differencing is a fundamental technique in software engineering practice and research. While researchers have proposed text-based differencing techniques capable of identifying line changes over the past decade, existing methods exhibit a notable limitation in identifying edit actions (EAs) that operate on text blocks spanning multiple lines. Such EAs are common in developers' practice, such as moving a code block for conditional branching or duplicating a method definition block for overloading. Existing tools represent such block-level operations as discrete sequences of line-level EAs, compelling developers to manually correlate them and thereby substantially impeding the efficiency of change comprehension. To address this issue, we propose BDiff, a text-based differencing algorithm capable of identifying two types of block-level EAs and five types of line-level EAs. Building on traditional differencing algorithms, we first construct a candidate set containing all possible line mappings and block mappings. Leveraging the Kuhn-Munkres algorithm, we then compute the optimal mapping set that can minimize the size of the edit script (ES) while closely aligning with the original developer's intent. To validate the effectiveness of BDiff, we selected five state-of-the-art tools, including large language models (LLMs), as baselines and adopted a combined qualitative and quantitative approach to evaluate their performance in terms of ES size, result quality, and running time. Experimental results show that BDiff produces higher-quality differencing results than baseline tools while maintaining competitive runtime performance. Our experiments also show the unreliability of LLMs in code differencing tasks regarding result quality and their infeasibility in terms of runtime efficiency. We have implemented a web-based visual differencing tool.
☆ VisCoder2: Building Multi-Language Visualization Coding Agents
Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable execution, and lack of iterative correction mechanisms. Progress has been constrained by narrow datasets and benchmarks that emphasize single-round generation and single-language tasks. To address these challenges, we introduce three complementary resources for advancing visualization coding agents. VisCode-Multi-679K is a large-scale, supervised dataset containing 679K validated and executable visualization samples with multi-turn correction dialogues across 12 programming languages. VisPlotBench is a benchmark for systematic evaluation, featuring executable tasks, rendered outputs, and protocols for both initial generation and multi-round self-debug. Finally, we present VisCoder2, a family of multi-language visualization models trained on VisCode-Multi-679K. Experiments show that VisCoder2 significantly outperforms strong open-source baselines and approaches the performance of proprietary models like GPT-4.1, with further gains from iterative self-debug, reaching 82.4% overall execution pass rate at the 32B scale, particularly in symbolic or compiler-dependent languages.
♻ ☆ Domain Adaptation-based Edge Computing for Cross-Conditions Fault Diagnosis
Fault diagnosis of mechanical equipment provides robust support for industrial production. It is worth noting that, the operation of mechanical equipment is accompanied by changes in factors such as speed and load, leading to significant differences in data distribution, which pose challenges for fault diagnosis. Additionally, in terms of application deployment, commonly used cloud-based fault diagnosis methods often encounter issues such as time delays and data security concerns, while common fault diagnosis methods cannot be directly applied to edge computing devices. Therefore, conducting fault diagnosis under cross-operating conditions based on edge computing holds significant research value. This paper proposes a domain-adaptation-based lightweight fault diagnosis framework tailored for edge computing scenarios. Incorporating the local maximum mean discrepancy into knowledge transfer aligns the feature distributions of different domains in a high-dimensional feature space, to discover a common feature space across domains. The acquired fault diagnosis expertise from the cloud-based deep neural network model is transferred to the lightweight edge-based model (edge model) using adaptation knowledge transfer methods. It aims to achieve accurate fault diagnosis under cross-working conditions while ensuring real-time diagnosis capabilities. We utilized the NVIDIA Jetson Xavier NX kit as the edge computing platform and conducted validation experiments on two devices. In terms of diagnostic performance, the proposed method significantly improved diagnostic accuracy, with average increases of 34.44% and 17.33% compared to existing methods, respectively.
comment: 23 pages, 7 figures
♻ ☆ AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?
Despite progress in language model (LM) capabilities, evaluations have thus far focused on models' performance on tasks that humans have previously solved, including in programming (Jimenez et al., 2024) and mathematics (Glazer et al., 2024). We therefore propose testing models' ability to design and implement algorithms in an open-ended benchmark: We task LMs with writing code that efficiently solves computationally challenging problems in computer science, physics, and mathematics. Our AlgoTune benchmark consists of 154 coding tasks collected from domain experts and a framework for validating and timing LM-synthesized solution code, which is compared to reference implementations from popular open-source packages. In addition, we develop a baseline LM agent, AlgoTuner, and evaluate its performance across a suite of frontier models. AlgoTuner uses a simple, budgeted loop that edits code, compiles and runs it, profiles performance, verifies correctness on tests, and selects the fastest valid version. AlgoTuner achieves an average 1.72x speedup against our reference solvers, which use libraries such as SciPy, sk-learn and CVXPY. However, we find that current models fail to discover algorithmic innovations, instead preferring surface-level optimizations. We hope that AlgoTune catalyzes the development of LM agents exhibiting creative problem solving beyond state-of-the-art human performance.
♻ ☆ Selective Parallel Loading of Large-Scale Compressed Graphs with ParaGrapher
Comprehensive evaluation is one of the basis of experimental science. In High-Performance Graph Processing, a thorough evaluation of contributions becomes more achievable by supporting common input formats over different frameworks. However, each framework creates its specific format, which may not support reading large-scale real-world graph datasets. This shows a demand for high-performance libraries capable of loading graphs to (i) accelerate designing new graph algorithms, (ii) to evaluate the contributions on a wide range of graph algorithms, and (iii) to facilitate easy and fast comparison over different graph frameworks. To that end, we present ParaGrapher, a high-performance API and library for loading large-scale and compressed graphs. ParaGrapher supports different types of requests for accessing graphs in shared- and distributed-memory and out-of-core graph processing. We explain the design of ParaGrapher and present a performance model of graph decompression, which is used for evaluation of ParaGrapher over three storage types. Our evaluation shows that by decompressing compressed graphs in WebGraph format, ParaGrapher delivers up to 3.2 times speedup in loading and up to 5.2 times speedup in end-to-end execution in comparison to the binary and textual formats. ParaGrapher is available online on https://blogs.qub.ac.uk/DIPSA/ParaGrapher/.
♻ ☆ A General Solution for the Implementation of CI/CD in Embedded Linux Development
With the growing use of embedded systems in various industries, the need for automated platforms for the development and deployment of customized Linux-based operating systems has become more important. This research was conducted with the aim of designing and implementing an integrated and reproducible infrastructure for the development, building, and testing of a Linux-based operating system using the Yocto Project. The proposed structure was implemented based on a three-layer architecture consisting of the main Yocto repositories, a custom layer (meta-custom), and a coordinating manifest layer to ensure version synchronization, scalability, and reproducibility. Three sample projects, including libhelloworld, helloworld, and the kernel module hello mod, were developed and integrated into the build process. Continuous Integration and Continuous Deployment pipelines were implemented with GitLab CI and combined with an isolated Docker environment to automate and streamline the build and testing workflows. Using a local cache server containing hashserv, downloads and sstate cache significantly reduced the build time. The functionality and stability of the system were verified through six boot test scenarios in the QEMU simulator. The results show that the proposed design not only ensures reproducibility but also can be extended to advanced applications such as continuous deployment of real-time Linux versions. Future recommendations include expanding automated tests, implementing system monitoring with Prometheus and Grafana, using distributed builds, optimizing with Docker multi-stage builds, and enabling continuous deployment of real-time Linux changes to provide a stable and scalable model for industrial and research projects in embedded systems with a rapid and reliable development cycle.
♻ ☆ RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation
Tree search methods have demonstrated impressive performance in code generation. Previous methods combine tree search with reflection that summarizes past mistakes to achieve iterative improvement. However, these methods face significant challenges. First, they search directly within the code language space, neglecting the underlying reasoning process critical for effective code generation. Second, reflection-based approaches merely accumulate historical errors in memory without providing correct reasoning pathways, making it difficult for subsequent search iterations to identify optimal solutions, resulting in decreased search quality. In this work, we propose RethinkMCTS, a framework that systematically explores and refines the reasoning process for code generation. Specifically, we employ MCTS to search for thoughts before code generation and integrate MCTS with a refinement mechanism called rethink, which incorporates fine-grained code execution feedback to refine erroneous thoughts during the search. It ensures the search path aligns with better reasoning, improving overall search quality. Through extensive experiments, we demonstrate that RethinkMCTS outperforms previous search-based and feedback-enhanced code generation baselines.
♻ ☆ TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models
Large Language Models (LLMs) excel in code-related tasks like code generation, but benchmark evaluations often overlook task characteristics, such as difficulty. Moreover, benchmarks are usually built using tasks described with a single prompt, despite the formulation of prompts having a profound impact on the outcome. This paper introduces a generalist approach, TaskEval, a framework using diverse prompts and Item Response Theory (IRT) to efficiently assess LLMs' capabilities and benchmark task characteristics, improving the understanding of their performance. Using two code generation benchmarks, \textit{HumanEval}+ and \textit{ClassEval}, as well as 8 code generation LLMs, we show that \textit{TaskEval} is capable of characterising the properties of tasks. Using topic analysis, we identify and analyse the tasks of 17 and 21 topics within the benchmarks. We also cross-analyse tasks' characteristics with programming constructs (e.g., variable assignment, conditions, etc.) used by LLMs, emphasising some patterns with tasks' difficulty. Finally, we conduct a comparison between the difficulty assessment of tasks by human annotators and LLMs. Orthogonal to current benchmarking evaluation efforts, \textit{TaskEval} can assist researchers and practitioners in fostering better assessments of LLMs. The tasks' characteristics can be used to identify shortcomings within existing benchmarks or improve the evaluation of LLMs.
comment: Accepted ACM Transactions on Software Engineering and Methodology
♻ ☆ Can Agents Fix Agent Issues? NeurIPS 2025
LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are inevitably prone to bugs and continually evolve to meet changing external requirements. Therefore, automatically resolving agent issues (i.e., bug reports or feature requests) is a crucial and challenging task. While recent software engineering (SE) agents (e.g., SWE-agent) have shown promise in addressing issues in traditional software systems, it remains unclear how effectively they can resolve real-world issues in agent systems, which differ significantly from traditional software. To fill this gap, we first manually analyze 201 real-world agent issues and identify common categories of agent issues. We then spend 500 person-hours constructing AgentIssue-Bench, a reproducible benchmark comprising 50 agent issue resolution tasks (each with an executable environment and failure-triggering tests). We further evaluate state-of-the-art SE agents on AgentIssue-Bench and reveal their limited effectiveness (i.e., with only 0.67% - 4.67% resolution rates). These results underscore the unique challenges of maintaining agent systems compared to traditional software, highlighting the need for further research to develop advanced SE agents for resolving agent issues. Data and code are available at https://github.com/alfin06/AgentIssue-Bench.
comment: Accepted by the 39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Ensuring Functional Correctness of Large Code Models with Selective Generation
The hallucination of code generation models hinders their applicability to systems requiring higher safety standards. One critical bottleneck in addressing code hallucination is the difficulty of identifying the functional correctness of generated code, due to its unnatural form. We address this core bottleneck by automatically generating unit tests using dynamic code analysis tools, leveraging the \emph{executable nature} of code. Accordingly, we propose \emph{selective code generator} that abstains from uncertain generations -- based on the functional correctness evaluated by generated unit tests -- to theoretically control the correctness among non-abstained answers, \ie the false discovery rate. Finally, we propose to use generated unit tests in evaluation as well as in learning for precise code evaluation, calling this paradigm \emph{FuzzEval}. We demonstrate the efficacy of our method along with the controllability of code hallucination and reasonable selection efficiency.
♻ ☆ E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task
The rapid advancement in large language models (LLMs) has demonstrated significant potential in End-to-End Software Development (E2ESD). However, existing E2ESD benchmarks are limited by coarse-grained requirement specifications and unreliable evaluation protocols, hindering a true understanding of current framework capabilities. To address these limitations, we present E2EDev, a novel benchmark grounded in the principles of Behavior-Driven Development (BDD), which evaluates the capabilities of E2ESD frameworks by assessing whether the generated software meets user needs through mimicking real user interactions (Figure 1). E2EDev comprises (i) a fine-grained set of user requirements, (ii) multiple BDD test scenarios with corresponding Python step implementations for each requirement, and (iii) a fully automated testing pipeline built on the Behave framework. To ensure its quality while reducing the annotation effort, E2EDev leverages our proposed Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA). By evaluating various E2ESD frameworks and LLM backbones with E2EDev, our analysis reveals a persistent struggle to effectively solve these tasks, underscoring the critical need for more effective and cost-efficient E2ESD solutions. Our codebase and benchmark are publicly available at https://github.com/SCUNLP/E2EDev.
♻ ☆ How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models
Language is a deep-rooted means of perpetration of stereotypes and discrimination. Large Language Models (LLMs), now a pervasive technology in our everyday lives, can cause extensive harm when prone to generating toxic responses. The standard way to address this issue is to align the LLM , which, however, dampens the issue without constituting a definitive solution. Therefore, testing LLM even after alignment efforts remains crucial for detecting any residual deviations with respect to ethical standards. We present EvoTox, an automated testing framework for LLMs' inclination to toxicity, providing a way to quantitatively assess how much LLMs can be pushed towards toxic responses even in the presence of alignment. The framework adopts an iterative evolution strategy that exploits the interplay between two LLMs, the System Under Test (SUT) and the Prompt Generator steering SUT responses toward higher toxicity. The toxicity level is assessed by an automated oracle based on an existing toxicity classifier. We conduct a quantitative and qualitative empirical evaluation using five state-of-the-art LLMs as evaluation subjects having increasing complexity (7-671B parameters). Our quantitative evaluation assesses the cost-effectiveness of four alternative versions of EvoTox against existing baseline methods, based on random search, curated datasets of toxic prompts, and adversarial attacks. Our qualitative assessment engages human evaluators to rate the fluency of the generated prompts and the perceived toxicity of the responses collected during the testing sessions. Results indicate that the effectiveness, in terms of detected toxicity level, is significantly higher than the selected baseline methods (effect size up to 1.0 against random search and up to 0.99 against adversarial attacks). Furthermore, EvoTox yields a limited cost overhead (from 22% to 35% on average).
♻ ☆ deepSURF: Detecting Memory Safety Vulnerabilities in Rust Through Fuzzing LLM-Augmented Harnesses
Although Rust ensures memory safety by default, it also permits the use of unsafe code, which can introduce memory safety vulnerabilities if misused. Unfortunately, existing tools for detecting memory bugs in Rust typically exhibit limited detection capabilities, inadequately handle Rust-specific types, or rely heavily on manual intervention. To address these limitations, we present deepSURF, a tool that integrates static analysis with Large Language Model (LLM)-guided fuzzing harness generation to effectively identify memory safety vulnerabilities in Rust libraries, specifically targeting unsafe code. deepSURF introduces a novel approach for handling generics by substituting them with custom types and generating tailored implementations for the required traits, enabling the fuzzer to simulate user-defined behaviors within the fuzzed library. Additionally, deepSURF employs LLMs to augment fuzzing harnesses dynamically, facilitating exploration of complex API interactions and significantly increasing the likelihood of exposing memory safety vulnerabilities. We evaluated deepSURF on 63 real-world Rust crates, successfully rediscovering 30 known memory safety bugs and uncovering 12 previously-unknown vulnerabilities (out of which 11 have been assigned RustSec IDs and 3 have been patched), demonstrating clear improvements over state-of-the-art tools.
comment: At IEEE S&P 2026
♻ ☆ GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests to analyze repository commit histories to identify 102 challenging optimization tasks across 10 codebases, spanning diverse domains and programming languages. An agent is provided with a codebase and performance test as a precise specification, and tasked to improve the runtime efficiency, which is measured against the expert developer optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving less than 5% success rate, with limited improvements even with inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties with low-level languages, practicing lazy optimization strategies, and challenges in accurately localizing bottlenecks. We release the code and artifacts of our benchmark along with agent trajectories to enable future research.
comment: Website: https://gso-bench.github.io/
Systems and Control 34
☆ Rate-cost tradeoffs in continuous-time control with a biomolecular application
This paper focuses on rate-limited control of the generalized Ornstein-Uhlenbeck process where the control action can be either multiplicative or additive, and the noise variance can depend on the control action. We derive a lower bound on the data rate necessary to achieve the desired control cost. The lower bound is attained with equality if the control is performed via an additive white Gaussian channel. The system model approximates the dynamics of a discrete-state molecular birth-death process, and the result has direct implications on the control of a biomolecular system via chemical reactions, where the multiplicative control corresponds to the degradation rate, the additive control corresponds to the production rate, and the control objective is to decrease the fluctuations of the controlled molecular species around their desired concentration levels.
☆ System-Theoretic Analysis of Dynamic Generalized Nash Equilibrium Problems -- Turnpikes and Dissipativity
Generalized Nash equilibria are used in multi-agent control applications to model strategic interactions between agents that are coupled in the cost, dynamics, and constraints. We study the properties of open-loop GNE trajectories from a system-theoretic perspective. We show how strict dissipativity generates the turnpike phenomenon in GNE solutions. Moreover, we establish a converse turnpike result, i.e., the implication from turnpike to strict dissipativity. We derive conditions under which the steady-state GNE is the optimal operating point and, using a game value function, we give a local characterization of the geometry of storage functions. Finally, we design linear terminal penalties that ensure GNE open-loop trajectories converge to and remain at the steady-state GNE. These connections provide the foundation for future system-theoretic analysis of GNEs similar to those existing in optimal control.
☆ Auction-Based Responsibility Allocation for Scalable Decentralized Safety Filters in Cooperative Multi-Agent Collision Avoidance
This paper proposes a scalable decentralized safety filter for multi-agent systems based on high-order control barrier functions (HOCBFs) and auction-based responsibility allocation. While decentralized HOCBF formulations ensure pairwise safety under input bounds, they face feasibility and scalability challenges as the number of agents grows. Each agent must evaluate an increasing number of pairwise constraints, raising the risk of infeasibility and making it difficult to meet real-time requirements. To address this, we introduce an auction-based allocation scheme that distributes constraint enforcement asymmetrically among neighbors based on local control effort estimates. The resulting directed responsibility graph guarantees full safety coverage while reducing redundant constraints and per-agent computational load. Simulation results confirm safe and efficient coordination across a range of network sizes and interaction densities.
comment: 6 pages, 3 figures, Submitted to Control Engineering Practice and IFAC World Congress 2026
☆ Analysis and Synthesis of Switched Optimization Algorithms
Deployment of optimization algorithms on networked systems face challenges associated with time delays and corruptions. One particular instance is the presence of time-varying delays arising from factors such as packet drops and irregular sampling. Fixed time delays can destabilize gradient descent algorithms, and this degradation is exacerbated by time-varying delays. This work concentrates on the analysis and creation of discrete-time optimization algorithms with certified exponential convergence rates that are robust against switched uncertainties between the optimizer and the gradient oracle. These optimization algorithms are implemented by a switch-scheduled output feedback controllers. Rate variation and sawtooth behavior (packet drops) in time-varying delays can be imposed through constraining switching sequences. Analysis is accomplished by bisection in the convergence rate to find Zames-Falb filter coefficents. Synthesis is performed by alternating between a filter coefficient search for a fixed controller, and a controller search for fixed multipliers.
☆ Robust Regret Control with Uncertainty-Dependent Baseline
This paper proposes a robust regret control framework in which the performance baseline adapts to the realization of system uncertainty. The plant is modeled as a discrete-time, uncertain linear time-invariant system with real-parametric uncertainty. The performance baseline is the optimal non-causal controller constructed with full knowledge of the disturbance and the specific realization of the uncertain plant. We show that a controller achieves robust additive regret relative to this baseline if and only if it satisfies a related, robust $H_\infty$ performance condition on a modified plant. One technical issue is that the modified plant can, in general, have a complicated nonlinear dependence on the uncertainty. We use a linear approximation step so that the robust additive regret condition can be recast as a standard $\mu$-synthesis problem. A numerical example is used to demonstrate the proposed approach.
☆ Predictive control barrier functions for piecewise affine systems with non-smooth constraints
Obtaining control barrier functions (CBFs) with large safe sets for complex nonlinear systems and constraints is a challenging task. Predictive CBFs address this issue by using an online finite-horizon optimal control problem that implicitly defines a large safe set. The optimal control problem, also known as the predictive safety filter (PSF), involves predicting the system's flow under a given backup control policy. However, for non-smooth systems and constraints, some key elements, such as CBF gradients and the sensitivity of the flow, are not well-defined, making the current methods inadequate for ensuring safety. Additionally, for control-non-affine systems, the PSF is generally nonlinear and non-convex, posing challenges for real-time computation. This paper considers piecewise affine systems, which are usually control-non-affine, under nonlinear state and polyhedral input constraints. We solve the safety issue by incorporating set-valued generalized Clarke derivatives in the PSF design. We show that enforcing CBF constraints across all elements of the generalized Clarke derivatives suffices to guarantee safety. Moreover, to lighten the computational overhead, we propose an explicit approximation of the PSF. The resulting control methods are demonstrated through numerical examples.
☆ Data-driven Koopman MPC using Mixed Stochastic-Deterministic Tubes
This paper presents a novel data-driven stochastic MPC design for discrete-time nonlinear systems with additive disturbances by leveraging the Koopman operator and a distributionally robust optimization (DRO) framework. By lifting the dynamical system into a linear space, we achieve a finite-dimensional approximation of the Koopman operator. We explicitly account for the modeling approximation and additive disturbance error by a mixed stochastic-deterministic tube for the lifted linear model. This ensures the regulation of the original nonlinear system while complying with the prespecified constraints. Stochastic and deterministic tubes are constructed using a DRO and a hyper-cube hull, respectively. We provide finite sample error bounds for both types of tubes. The effectiveness of the proposed approach is demonstrated through numerical simulations.
comment: This is the accepted version. It will appear in Journal of Process Control, 2025
☆ The PhasorArray Toolbox for Harmonic Analysis and Control Design
We present a MATLAB package called the Pha-sorArray Toolbox that has been developed to make harmonic analysis and control methods both practical and user-friendly. The toolbox adopts an object-oriented architecture that enables intuitive manipulation of periodic matrices through overloaded operators for addition, multiplication, convolution, and automatic Toeplitz construction. Its advanced features include harmonic Sylvester, Lyapunov and Riccati equations solvers, and seamless integration with YALMIP, thereby facilitating advanced control and analysis techniques based on Linear Matrix Inequalities (LMIs) in the harmonic framework.
☆ Physics-Informed Neural Networks for MIMO Beam Map and Environment Reconstruction
As communication networks evolve towards greater complexity (e.g., 6G and beyond), a deep understanding of the wireless environment becomes increasingly crucial. When explicit knowledge of the environment is unavailable, geometry-aware feature extraction from channel state information (CSI) emerges as a pivotal methodology to bridge physical-layer measurements with network intelligence. This paper proposes to explore the received signal strength (RSS) data, without explicit 3D environment knowledge, to jointly construct the radio beam map and environmental geometry for a multiple-input multiple-output (MIMO) system. Unlike existing methods that only learn blockage structures, we propose an oriented virtual obstacle model that captures the geometric features of both blockage and reflection. Reflective zones are formulated to identify relevant reflected paths according to the geometry relation of the environment. We derive an analytical expression for the reflective zone and further analyze its geometric characteristics to develop a reformulation that is more compatible with deep learning representations. A physics-informed deep learning framework that incorporates the reflective-zone-based geometry model is proposed to learn the blockage, reflection, and scattering components, along with the beam pattern, which leverages physics prior knowledge to enhance network transferability. Numerical experiments demonstrate that, in addition to reconstructing the blockage and reflection geometry, the proposed model can construct a more accurate MIMO beam map with a 32%-48% accuracy improvement.
☆ The Role of Information Incompleteness in Defending Against Stealth Attacks
The effectiveness of Data Injections Attacks (DIAs) critically depends on the completeness of the system information accessible to adversaries. This relationship positions information incompleteness enhancement as a vital defense strategy for degrading DIA performance. In this paper, we focus on the information-theoretic stealth attacks, where the attacker encounters a fundamental tradeoff between the attack stealthiness and destructiveness. Specifically, we systematically characterize how incomplete admittance information impacts the dual objectives. In particular, we establish sufficient conditions for two distinct operational regimes: (i) stealthiness intensifies while destructive potential diminishes and (ii) destructiveness increases while stealth capability weakens. For scenarios beyond these regimes, we propose a maximal incompleteness strategy to optimally degrade stealth capability. To solve the associated optimization problem, the feasible region is reduced without excluding the optimal solution, and a heuristic algorithm is then introduced to effectively identify the near-optimal solutions within the reduced region. Numerical simulations are conducted on IEEE test systems to validate the findings.
☆ High-Performance Rotor Cooling with Ducted Liquid in Completely Cold-Formed Modular Motor Shaft
This paper suggests a novel rotor-cooling shaft concept for high-performance electric motors that increases the effectiveness of cooling and is yet simple and cost-effective to manufacture. We investigate the thermal performance of four shaft geometries for rotor cooling in automotive applications. The proposed tooth-guided liquid-cooling shaft design aims to solve the high churning loss of conventional cooled rotor shafts due to internal vortex formation and their still limited heat transfer. Therefore, we optimize heat transfer efficiency and pressure management by incorporating cold-formed internal channels that restrict vortex formation beyond a degree that improves heat transfer. We evaluated key performance metrics, including heat transfer rate, outlet temperature, pressure drop, and velocity profiles, under varying rotational speeds, inlet flow rates, and coolant temperatures. Computational fluid analysis demonstrates that the tooth-guided design outperforms conventional hollow shafts and achieves up to 110% higher cooling efficiency at low rotational speeds, while it maintains comparable pressure levels. These findings provide practical insight into geometry-driven thermal optimization and offer a path toward improving the performance and durability of electric motors.
comment: 11 pages, 21 figures
☆ Control of neural field equations with step-function inputs
Wilson-Cowan and Amari-type models capture nonlinear neural population dynamics, providing a fundamental framework for modeling how sensory and other exogenous inputs shape activity in neural tissue. We study the controllability properties of Amari-type neural fields subject to piecewise/constant-in-time inputs. The model describes the time evolution of the polarization of neural tissue within a spatial continuum, with synaptic interactions represented by a convolution kernel. We study the synthesis of piecewise/constant-in-time inputs to achieve two-point boundary-type control objectives, namely, steering neural activity from an initial state to a prescribed target state. This approach is particularly relevant for predicting the emergence of paradoxical neural representations, such as discordant visual illusions that occur in response to overt sensory stimuli. We first present a control synthesis based on the Banach fixed-point theorem, which yields an iterative construction of a constant-in-time input under minimal regularity assumptions on the kernel and transfer function; however, it exhibits practical limitations, even in the linear case. To overcome these challenges, we then develop a generic synthesis framework based on the flow of neural dynamics drift, enabling explicit piecewise constant and constant-in-time inputs. Extensive numerical results in one and two spatial dimensions confirm the effectiveness of the proposed syntheses and demonstrate their superior performance compared to inputs derived from naive linearization at the initial or target states when these states are not equilibria of the drift dynamics. By providing a mathematically rigorous framework for controlling Amari-type neural fields, this work advances our understanding of nonlinear neural population control with potential applications in computational neuroscience, psychophysics, and neurostimulation.
☆ A Hybrid GNN-LSE Method for Fast, Robust, and Physically-Consistent AC Power Flow
Conventional AC Power Flow (ACPF) solvers like Newton-Raphson (NR) face significant computational and convergence challenges in modern, large-scale power systems. This paper proposes a novel, two-stage hybrid method that integrates a Physics-Informed Graph Neural Network (GNN) with a robust, iterative Linear State Estimation (LSE) refinement step to produce fast and physically-consistent solutions. The GNN, trained with a physics-informed loss function featuring an efficient dynamic weighting scheme, rapidly predicts a high-quality initial system state. This prediction is then refined using an iterative, direct linear solver inspired by state estimation techniques. This LSE refinement step solves a series of linear equations to enforce physical laws, effectively bypassing the non-linearities and convergence issues of traditional solvers. The proposed GNN-LSE framework is comprehensively validated on systems ranging from small radial distribution networks (IEEE 33-bus, 69-bus) to a large, meshed transmission system (IEEE 118-bus). Results show that our GNN variants are up to $8.4 \times 10^3$ times faster than NR. The LSE refinement provides a fast route to a physically-consistent solution, while heavy-loading stress tests (120%-150% of nominal) and N-1 contingencies demonstrate the method's reliability and generalization. This work presents a powerful and flexible framework for bridging fast, data-driven models with the rigorous constraints of power system physics, offering a practical tool for real-time operations and analysis.
☆ Motion Planning with Precedence Specifications via Augmented Graphs of Convex Sets
We present an algorithm for planning trajectories that avoid obstacles and satisfy key-door precedence specifications expressed with a fragment of signal temporal logic. Our method includes a novel exact convex partitioning of the obstacle free space that encodes connectivity among convex free space sets, key sets, and door sets. We then construct an augmented graph of convex sets that exactly encodes the key-door precedence specifications. By solving a shortest path problem in this augmented graph of convex sets, our pipeline provides an exact solution up to a finite parameterization of the trajectory. To illustrate the effectiveness of our approach, we present a method to generate key-door mazes that provide challenging problem instances, and we perform numerical experiments to evaluate the proposed pipeline. Our pipeline is faster by several orders of magnitude than recent state-of-the art methods that use general purpose temporal logic tools.
☆ Pricing Problems in Adoption of New Technologies
We propose a generalization of the Bass diffusion model in discrete-time that explicitly models the effect of price in adoption. Our model is different from earlier price-incorporated models and fits well to adoption data for various products. We then utilize this model to study two decision-making problems. First, we provide a series of structural results on optimal pricing strategies to maximize profits from product sales by a monopolist over a finite horizon. We fully characterize the optimal pricing strategy in the single-period problem, and establish several structural properties of the same for the multi-period counterpart. Second, we study a Stackelberg game between a policy-maker and a monopolist, where the former seeks to maximize adoption through rebates, while the latter focuses on profits. For this problem, we analytically characterize crucial properties of the equilibrium path of the single-period game, and demonstrate how they carry over to the multi-period variant.
☆ Fixed Horizon Linear Quadratic Covariance Steering in Continuous Time with Hilbert-Schmidt Terminal Cost
We formulate and solve the fixed horizon linear quadratic covariance steering problem in continuous time with a terminal cost measured in Hilbert-Schmidt (i.e., Frobenius) norm error between the desired and the controlled terminal covariances. For this problem, the necessary conditions of optimality become a coupled matrix ODE two-point boundary value problem. To solve this system of equations, we design a matricial recursive algorithm and prove its convergence. The proposed algorithm and its analysis make use of the linear fractional transforms parameterized by the state transition matrix of the associated Hamiltonian matrix. To illustrate the results, we provide two numerical examples: one with a two dimensional and another with a six dimensional state space.
☆ A Perspective on the Algebra, Topology, and Logic of Electrical Networks
This paper presents a unified algebraic, topological, and logical framework for electrical one-port networks based on \v{S}are's $m$-theory. Within this formalism, networks are represented by $m$-words (jorbs) over an ordered alphabet, where series and parallel composition induce an $m$-topology on $m$-graphs with a theta mapping $\vartheta$ that preserves one-port equivalence. The study formalizes quasi-orders, shells, and cores, showing their structural correspondence to network boundary conditions and impedance behavior. The $\lambda--\Delta$ metric, together with the valuation morphism $\Phi$, provides a concise descriptor of the impedance-degree structure. In the computational domain, the framework is extended with algorithmic procedures for generating and classifying non-isomorphic series-parallel topologies, accompanied by programmatic Cauer/Foster synthesis workflows and validation against canonical examples from Ladenheim's catalogue. The resulting approach enables symbolic-to-topological translation of impedance functions, offering a constructive bridge between algebraic representation and electrical realization. Overall, the paper outlines a self-consistent theoretical and computational foundation for automated network synthesis, classification, and formal verification within the emerging field of Jorbology.
☆ Green Hydrogen under Uncertainty: Evaluating Power-to-X Strategies Using Agent-Based Simulation and Multi-Criteria Decision Framework
The transition toward net-zero energy systems requires scalable and cost-effective deployment of Power-to-X technologies, particularly green hydrogen production. Despite increasing investments, a critical research gap remains in dynamically assessing how different operational strategies affect the feasibility of hydrogen production under real-world energy market conditions. Most existing studies rely on static, techno-economic models and overlook actor interactions, infrastructure limitations, and regulatory complexity. This paper presents a novel modeling framework that integrates agent-based simulation with multi-criteria decision-making to evaluate green hydrogen production strategies using co-located wind and solar generation. Three operational strategies - grid-only, on-site-only, and hybrid - are applied across three electrolyzer capacity levels (10 MW, 50 MW, and 100 MW) within a Danish case study. Real electricity tariffs, emissions factors, and market data are used to simulate technical, economic, and environmental performance indicators. The results show that hybrid strategies consistently outperform grid-only configurations in terms of cost and emissions while maintaining stable hydrogen output. Although on-site-only strategies minimize emissions and costs, they fail to meet fixed production demands. This framework offers novel scientific contributions by modeling dynamic actor interactions and integrating system performance evaluation into strategic planning. Practically, it provides actionable insights for energy planners and policymakers designing resilient and efficient Power-to-X systems in renewable-rich contexts.
☆ The local Gaussian correlation networks among return tails in the Chinese stock market
Financial networks based on Pearson correlations have been intensively studied. However, previous studies may have led to misleading and catastrophic results because of several critical shortcomings of the Pearson correlation. The local Gaussian correlation coefficient, a new measurement of statistical dependence between variables, has unique advantages including capturing local nonlinear dependence and handling heavy-tailed distributions. This study constructs financial networks using the local Gaussian correlation coefficients between tail regions of stock returns in the Shanghai Stock Exchange. The work systematically analyzes fundamental network metrics including node centrality, average shortest path length, and entropy. Compared with the local Gaussian correlation network among positive tails and the conventional Pearson correlation network, the properties of the local Gaussian correlation network among negative tails are more sensitive to the stock market risks. This finding suggests researchers should prioritize the local Gaussian correlation network among negative tails. Future work should reevaluate existing findings using the local Gaussian correlation method.
☆ Environment-Dependent Components Identification of Behind-the-Meter Resources via Inverse Optimization
With the increasing penetration of behind-the-meter (BTM) resources, it is vital to monitor the components of these resources and deduce their response behavior to external environment. Owing to data privacy, however, the appliance-wise measurement is invisible to the power system operator, which hinders the accurate modeling of load identification. To this end, this paper proposes a hybrid physics-inspired and data-driven framework for decomposing BTM components based on external measurement of total load and environmental factors. The total load is decomposed into different environment-dependent components, namely storage-like component, PV generation component, thermostatically-controlled load component, and periodic component. The overall load identification adopts a double-layer iterative solution framework. A data-driven inverse optimization algorithm is developed to identify parameters of the energy storage-like component. The physics-inspired model is proposed to identify the capacity and response of the rest components. The modeling accuracy and robustness of the proposed method are validated by numerical tests. The application significance of the proposed BTM identification method is also validated in electricity market clearing for reducing system operation costs.
♻ ☆ Trajectory Optimization for Minimum Threat Exposure using Physics-Informed Neural Networks
We apply a physics-informed neural network (PINN) to solve the two-point boundary value problem (BVP) arising from the necessary conditions postulated by Pontryagin's Minimum Principle for optimal control. Such BVPs are known to be numerically difficult to solve by traditional shooting methods due to extremely high sensitivity to initial guesses. In the light of recent successes in applying PINNs for solving high-dimensional differential equations, we develop a PINN to solve the problem of finding trajectories with minimum exposure to a spatiotemporal threat for a vehicle kinematic model. First, we implement PINNs that are trained to solve the BVP for a given pair of initial and final states for a given threat field. Next, we implement a PINN conditioned on the initial state for a given threat field, which eliminates the need for retraining for each initial state. We demonstrate that the PINN outputs satisfy the necessary conditions with low numerical error.
comment: 2025 Indian Control Conference
♻ ☆ Operational Risks in Grid Integration of Large Data Center Loads: Characteristics, Stability Assessments, and Sensitivity Studies
This paper investigates the dynamic interactions between large-scale data centers and the power grid, focusing on reliability challenges arising from sudden fluctuations in demand. With the rapid growth of AI-driven workloads, such fluctuations, along with fast ramp patterns, are expected to exacerbate stressed grid conditions and system instabilities. We consider a few open-source AI data center consumption profiles from the MIT supercloud datasets, along with generating a few experimental HPC job-distribution-based inference profiles. Subsequently, we develop analytical methodologies for real-time assessment of grid stability, focusing on both transient and small-signal stability assessments. Energy-flow-like metrics for nonlinear transient stability, formulated by computing localized data center bus kinetic-like flows and coupling interactions with neighboring buses over varying time windows, help provide operators with real-time assessments of the regional grid stress in the data center hubs. On the other hand, small-signal stability metrics, constructed from analytical state matrices under variable operating conditions during a fast ramping period, enable snapshot-based assessments of data center load fluctuations and provide enhanced observability into evolving grid conditions. By quantifying the stability impacts of large data center clusters, studies conducted in the modified IEEE benchmark $68-$bus model support improved operator situational awareness to capture risks in reliable integration of large data center loads.
comment: 13 pages, 8 figures, 3 tables
♻ ☆ A Novel State-Centric Necessary Condition for Time-Optimal Control of Controllable Linear Systems Based on Augmented Switching Laws (Extended Version)
Most existing necessary conditions for optimal control based on adjoining methods require both state and costate information, yet the unobservability of costates for a given feasible trajectory impedes the determination of optimality in practice. This paper establishes a novel theoretical framework for time-optimal control of controllable linear systems with a single input, proposing the augmented switching law (ASL) that represents the input control and the feasibility in a compact form. Given a feasible trajectory, the perturbed trajectory under the constraints of ASL is guaranteed to be feasible, resulting in a novel state-centric necessary condition without dependence on costate information. A first-order necessary condition is proposed that the Jacobian matrix of the ASL is not of full row rank, which also results in a potential approach to optimizing a given feasible trajectory with the preservation of arc structures. The proposed necessary condition is applied to high-order chain-of-integrator systems with full box constraints, contributing to some theoretical results challenging to reason by costate-based conditions.
comment: This paper has been published in IEEE TAC
♻ ☆ Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning AAAI 2025
Designing efficient algorithms for multi-agent reinforcement learning (MARL) is fundamentally challenging because the size of the joint state and action spaces grows exponentially in the number of agents. These difficulties are exacerbated when balancing sequential global decision-making with local agent interactions. In this work, we propose a new algorithm $\texttt{SUBSAMPLE-MFQ}$ ($\textbf{Subsample}$-$\textbf{M}$ean-$\textbf{F}$ield-$\textbf{Q}$-learning) and a decentralized randomized policy for a system with $n$ agents. For any $k\leq n$, our algorithm learns a policy for the system in time polynomial in $k$. We prove that this learned policy converges to the optimal policy on the order of $\tilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases. In particular, this bound is independent of the number of agents $n$.
comment: 53 pages. AAAI 2025 MARW Best Paper Award. Accepted at NeurIPS 2025 (spotlight)
♻ ☆ Hybrid MAC Protocol with Integrated Multi-Layered Security for Resource-Constrained UAV Swarm Communications
Flying Ad Hoc Networks (FANETs) present unique challenges due to high node mobility, dynamic topologies, and strict resource constraints. Existing routing protocols often optimize for a single metric, such as path length or energy, while neglecting the complex dependencies between network performance, security, and MAC layer efficiency. This paper introduces a novel hardware software co design framework for secure and adaptive UAV swarm communications, featuring an energy aware protocol stack. The architecture employs a multicast, clustered organization where routing decisions integrate dynamic trust scores, historical link quality, and internodal distance. A hybrid MAC protocol combines contention based and scheduled channel access for optimized throughput. Security is ensured through a zero trust model that fuses cryptographic authentication with a behavioral reputation system, alongside hardware accelerated AES GCM encryption. Comparative analysis in an NS 3 simulation environment demonstrates the framework's superiority in packet delivery ratio, latency, resilience, and overhead, providing a scalable foundation for high performance swarm operations.
comment: Accepted at ISED 2025
♻ ☆ Predictability Enables Parallelization of Nonlinear State Space Models NeurIPS '25
The rise of parallel computing hardware has made it increasingly important to understand which nonlinear state space models can be efficiently parallelized. Recent advances like DEER (arXiv:2309.12252) or DeepPCR (arXiv:2309.16318) have shown that evaluating a state space model can be recast as solving a parallelizable optimization problem, and sometimes this approach can yield dramatic speed-ups in evaluation time. However, the factors that govern the difficulty of these optimization problems remain unclear, limiting the larger adoption of the technique. In this work, we establish a precise relationship between the dynamics of a nonlinear system and the conditioning of its corresponding optimization formulation. We show that the predictability of a system, defined as the degree to which small perturbations in state influence future behavior, impacts the number of optimization steps required for evaluation. In predictable systems, the state trajectory can be computed in $O((\log T)^2)$ time, where $T$ is the sequence length, a major improvement over the conventional sequential approach. In contrast, chaotic or unpredictable systems exhibit poor conditioning, with the consequence that parallel evaluation converges too slowly to be useful. Importantly, our theoretical analysis demonstrates that for predictable systems, the optimization problem is always well-conditioned, whereas for unpredictable systems, the conditioning degrades exponentially as a function of the sequence length. We validate our claims through extensive experiments, providing practical guidance on when nonlinear dynamical systems can be efficiently parallelized, and highlighting predictability as a key design principle for parallelizable models.
comment: NeurIPS '25. Code: https://github.com/lindermanlab/predictability_enables_parallelization
♻ ☆ Robust time series generation via Schrödinger Bridge: a comprehensive evaluation
We investigate the generative capabilities of the Schr\"odinger Bridge (SB) approach for time series. The SB framework formulates time series synthesis as an entropic optimal interpolation transport problem between a reference probability measure on path space and a target joint distribution. This results in a stochastic differential equation over a finite horizon that accurately captures the temporal dynamics of the target time series. While the SB approach has been largely explored in fields like image generation, there is a scarcity of studies for its application to time series. In this work, we bridge this gap by conducting a comprehensive evaluation of the SB method's robustness and generative performance. We benchmark it against state-of-the-art (SOTA) time series generation methods across diverse datasets, assessing its strengths, limitations, and capacity to model complex temporal dependencies. Our results offer valuable insights into the SB framework's potential as a versatile and robust tool for time series generation.
comment: 9 pages
♻ ☆ Optimal Rates of Convergence for Entropy Regularization in Discounted Markov Decision Processes
We study the error introduced by entropy regularization in infinite-horizon, discrete, discounted Markov decision processes. We show that this error decreases exponentially in the inverse regularization strength both in a weighted KL-divergence and in value with a problem-specific exponent. This is in contrast to previously known estimates, of the order $O(\tau)$, where $\tau$ is the regularization strength. We provide a lower bound matching our upper bound up to a polynomial term, thereby characterizing the exponential convergence rate for entropy regularization. Our proof relies on the observation that the solutions of entropy-regularized Markov decision processes solve a gradient flow of the unregularized reward with respect to a Riemannian metric common in natural policy gradient methods. This correspondence allows us to identify the limit of this gradient flow as the generalized maximum entropy optimal policy, thereby characterizing the implicit bias of this gradient flow, which corresponds to a time-continuous version of the natural policy gradient method. We use our improved error estimates to show that for entropy-regularized natural policy gradient methods, the overall error decays exponentially in the square root of the number of iterations, improving over existing sublinear guarantees. Finally, we extend our analysis to settings beyond the entropy. In particular, we characterize the implicit bias regarding general convex potentials and their resulting generalized natural policy gradients.
comment: 32 pages, 1 figure
♻ ☆ Ensemble-Based Peak Demand Probability Density Forecasting with Application to Risk-Aware Power System Scheduling
Power systems face increasing challenges in maintaining resource adequacy due to lower operating margins, rising renewable energy uncertainty, and demand variability. Forecasting the probability distribution of peak demand on shorter timescales is a critical forward-facing issue under increasing volatility. This study introduces a novel ensemble-based machine learning method for peak demand probability density forecasting that extends classical extreme value theory to model time series peaks as nonstationary statistical distributions. The approach employs an ensemble of tree-based learners that recursively partition the covariate space and estimate local generalized extreme value distributions, allowing it to automatically capture complex covariate-dependent parameter variations. Unlike existing approaches, which often suffer from convergence issues or restrictive functional forms, this framework is both flexible and robust. Validation on a case study based on the PJM interconnection demonstrates that the method achieves a 38 percent reduction in committed capacity when generation is scheduled based on a reliability criterion. These improvements provide practical value for power system operation, enabling risk-aware capacity scheduling under peak demand uncertainty and supporting reliability-driven decision making in future energy systems.
comment: Completed major revision
♻ ☆ Faster Reinforcement Learning by Freezing Slow States
We study infinite horizon Markov decision processes (MDPs) with "fast-slow" structure, where some state variables evolve rapidly ("fast states") while others change more gradually ("slow states"). This structure commonly arises in practice when decisions must be made at high frequencies over long horizons, and where slowly changing information still plays a critical role in determining optimal actions. Examples include inventory control under slowly changing demand indicators or dynamic pricing with gradually shifting consumer behavior. Modeling the problem at the natural decision frequency leads to MDPs with discount factors close to one, making them computationally challenging. We propose a novel approximation strategy that "freezes" slow states during phases of lower-level planning and subsequently applies value iteration to an auxiliary upper-level MDP that evolves on a slower timescale. Freezing states for short periods of time leads to easier-to-solve lower-level problems, while a slower upper-level timescale allows for a more favorable discount factor. On the theoretical side, we analyze the regret incurred by our frozen-state approach, which leads to simple insights on how to trade off regret versus computational cost. Empirically, we benchmark our new frozen-state methods on three domains, (i) inventory control with fixed order costs, (ii) a gridworld problem with spatial tasks, and (iii) dynamic pricing with reference-price effects. We demonstrate that the new methods produce high-quality policies with significantly less computation, and we show that simply omitting slow states is often a poor heuristic.
comment: 70 pages, 10 figures
♻ ☆ Dimensionality Reduction with Koopman Generalized Eigenfunctions
This paper presents a methodology to achieve lower-dimensional Koopman quasi-linear representations of nonlinear system dynamics using Koopman generalized eigenfunctions. The proposed approach considers the analytically derived Koopman formulation of rigid body dynamics, but it can be extended to any data-driven or analytically derived generalized eigenfunction set. It achieves a representation for which the number of Koopman observables matches the number of inputs allowing for Koopman linearization control solutions rather than resorting to the least squares approximation method adopted in high dimensional Koopman formulations. Through a linear combination of Koopman generalized eigenfunctions a new set of Koopman generalized eigenfunction is constructed so that the zero order truncation approximate a Koopman eigenfunction which can be used to design linear control strategies to steer the dynamics of the original nonlinear system. The proposed methodology is tested by designing a linear quadratic (LQ) flight controller for a quadrotor UAV. Numerical and Hardware-in-the-loop (HIL) simulations validate the applicability and real-time implementability of the proposed approach in the presence of noise and sensor delays. The main advantage of the proposed method is the realization of a fully actuated Koopman based model which, in the case of the underactuated quadrotor system, allows to achieve trajectory tracking through a single linear control loop.
♻ ☆ Bandwidth Efficient Livestreaming in Mobile Wireless Networks: A Peer-to-Peer ACIDE Solution
In mobile wireless networks, livestreaming in high user density areas presents two typical challenges: the wireless bandwidth is depleted and the number of users is limited. In this study, a media distribution model utilizing peer to peer communications, Active Control in an Intelligent and Distributed Environment, is proposed for bandwidth efficient livestreaming. The basic idea is to group users with identical livestream interest in a cluster of n peers. Instead of sending n copies of a livestream package, only one copy is sent to the cluster. A package is divided into n blocks. Each user receives one block from the base station and the remaining n-1 blocks from the other peers. Two optimization problems are addressed. The first problem is minimizing the bandwidth needed to guarantee a continuous live media play on all peers. A solution is proposed to find the optimal block sizes such that the wireless bandwidth is minimized. The second problem is maximizing the number of peers admitted to a cluster, given a fixed wireless bandwidth. This problem is NP-complete and a greedy strategy is proposed to calculate a feasible solution for peer selection. The proposed model improves the bandwidth efficiency and allows more users to be served.
comment: 16 pages, 12 figures, 4 tables, Journal submission
♻ ☆ A Traffic Prediction-Based Individualized Driver Warning System to Reduce Red Light Violations SC
Red light violation is a major cause of traffic collisions and resulting injuries and fatalities. Despite extensive prior work to reduce red light violations, they continue to be a major problem in practice, partly because existing systems suffer from the flaw of providing the same guidance to all drivers. As a result, some violations are avoided, but other drivers ignore or respond inappropriately to red light running systems, resulting in safety issues overall. We show a method of providing accurate warnings to individual drivers to avoid the broad guidance approach of most existing systems. Recognizing if a driver will run red lights is highly dependent on signal phase and timing, traffic conditions along the road, and individual driver behaviour, the proposed warning system contains three parts: a traffic prediction algorithm, an individual warning signal optimizer, and a driver warning display. The traffic prediction algorithm predicts future traffic states along the road towards the signalized intersections using the latest traffic conditions obtained through vehicle-to-vehicle and vehicle-to-infrastructure communications. Then, an optimization problem is formulated to compute the optimal warning signal based on predicted traffic states and driver reaction model. Finally, the optimal warning signal is shown on the display screen to advise driver on how much braking is needed to avoid running the red light. The system continuously updates the latest warning signal as the vehicle is approaching the intersection. Both numerical simulated driving scenarios and real-world road tests are used to demonstrate the proposed algorithm's performance under different conditions by comparing with previous work on red light running warning system. The results show that the system provides more effective and accurate warning signals to drivers, helping them avoid running red lights.
comment: accepted by ASCE's Journal of Transportation Engineering, Part A: Systems
♻ ☆ Time-causal and time-recursive wavelets
When to apply wavelet analysis to real-time temporal signals, where the future cannot be accessed, it is essential to base all the steps in the signal processing pipeline on computational mechanisms that are truly time-causal. This paper describes how a time-causal wavelet analysis can be performed based on concepts developed in the area of temporal scale-space theory, originating from a complete classification of temporal smoothing kernels that guarantee non-creation of new structures from finer to coarser temporal scale levels. By necessity, convolution with truncated exponential kernels in cascade constitutes the only permissable class of kernels, as well as their temporal derivatives as a natural complement to fulfil the admissibility conditions of wavelet representations. For a particular way of choosing the time constants in the resulting infinite convolution of truncated exponential kernels, to ensure temporal scale covariance and thus self-similarity over temporal scales, we describe how mother wavelets can be chosen as temporal derivatives of the resulting time-causal limit kernel. By developing connections between wavelet theory and scale-space theory, we characterize and quantify how the continuous scaling properties transfer to the discrete implementation, demonstrating how the proposed time-causal wavelet representation can reflect the duration of locally dominant temporal structures in the input signals. We propose that this notion of time-causal wavelet analysis could be a valuable tool for signal processing tasks, where streams of signals are to be processed in real time, specifically for signals that may contain local variations over a rich span of temporal scales, or more generally for analysing physical or biophysical temporal phenomena, where a fully time-causal analysis is called for to be physically realistic.
comment: 25 pages, 10 figures
Graphics 6
☆ WorldGrow: Generating Infinite 3D World
We tackle the challenge of generating the infinitely extendable 3D world -- large, continuous environments with coherent geometry and realistic appearance. Existing methods face key challenges: 2D-lifting approaches suffer from geometric and appearance inconsistencies across views, 3D implicit representations are hard to scale up, and current 3D foundation models are mostly object-centric, limiting their applicability to scene-level generation. Our key insight is leveraging strong generation priors from pre-trained 3D models for structured scene block generation. To this end, we propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis. Our method features three core components: (1) a data curation pipeline that extracts high-quality scene blocks for training, making the 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context-aware scene extension; and (3) a coarse-to-fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity. Evaluated on the large-scale 3D-FRONT dataset, WorldGrow achieves SOTA performance in geometry reconstruction, while uniquely supporting infinite scene generation with photorealistic and structurally consistent outputs. These results highlight its capability for constructing large-scale virtual environments and potential for building future world models.
comment: Project page: https://world-grow.github.io/ Code: https://github.com/world-grow/WorldGrow
☆ Group Inertial Poser: Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging ICCV 2025
Tracking human full-body motion using sparse wearable inertial measurement units (IMUs) overcomes the limitations of occlusion and instrumentation of the environment inherent in vision-based approaches. However, purely IMU-based tracking compromises translation estimates and accurate relative positioning between individuals, as inertial cues are inherently self-referential and provide no direct spatial reference for others. In this paper, we present a novel approach for robustly estimating body poses and global translation for multiple individuals by leveraging the distances between sparse wearable sensors - both on each individual and across multiple individuals. Our method Group Inertial Poser estimates these absolute distances between pairs of sensors from ultra-wideband ranging (UWB) and fuses them with inertial observations as input into structured state-space models to integrate temporal motion patterns for precise 3D pose estimation. Our novel two-step optimization further leverages the estimated distances for accurately tracking people's global trajectories through the world. We also introduce GIP-DB, the first IMU+UWB dataset for two-person tracking, which comprises 200 minutes of motion recordings from 14 participants. In our evaluation, Group Inertial Poser outperforms previous state-of-the-art methods in accuracy and robustness across synthetic and real-world data, showing the promise of IMU+UWB-based multi-human motion capture in the wild. Code, models, dataset: https://github.com/eth-siplab/GroupInertialPoser
comment: Accepted by ICCV 2025, Code: https://github.com/eth-siplab/GroupInertialPoser
☆ ArtiLatent: Realistic Articulated 3D Object Generation via Structured Latents SIGGRAPH
We propose ArtiLatent, a generative framework that synthesizes human-made 3D objects with fine-grained geometry, accurate articulation, and realistic appearance. Our approach jointly models part geometry and articulation dynamics by embedding sparse voxel representations and associated articulation properties, including joint type, axis, origin, range, and part category, into a unified latent space via a variational autoencoder. A latent diffusion model is then trained over this space to enable diverse yet physically plausible sampling. To reconstruct photorealistic 3D shapes, we introduce an articulation-aware Gaussian decoder that accounts for articulation-dependent visibility changes (e.g., revealing the interior of a drawer when opened). By conditioning appearance decoding on articulation state, our method assigns plausible texture features to regions that are typically occluded in static poses, significantly improving visual realism across articulation configurations. Extensive experiments on furniture-like objects from PartNet-Mobility and ACD datasets demonstrate that ArtiLatent outperforms existing approaches in geometric consistency and appearance fidelity. Our framework provides a scalable solution for articulated 3D object synthesis and manipulation.
comment: accepted to SIGGRAPH Asia; Project page: https://chenhonghua.github.io/MyProjects/ArtiLatent/
☆ PC-NCLaws: Physics-Embedded Conditional Neural Constitutive Laws for Elastoplastic Materials
While data-driven methods offer significant promise for modeling complex materials, they often face challenges in generalizing across diverse physical scenarios and maintaining physical consistency. To address these limitations, we propose a generalizable framework called Physics-Embedded Conditional Neural Constitutive Laws for Elastoplastic Materials, which combines the partial differential equations with neural networks. Specifically, the model employs two separate neural networks to model elastic and plastic constitutive laws. Simultaneously, the model incorporates physical parameters as conditional inputs and is trained on comprehensive datasets encompassing multiple scenarios with varying physical parameters, thereby enabling generalization across different properties without requiring retraining for each individual case. Furthermore, the differentiable architecture of our model, combined with its explicit parameter inputs, enables the inverse estimation of physical parameters from observed motion sequences. This capability extends our framework to objects with unknown or unmeasured properties. Experimental results demonstrate state-of-the-art performance in motion reconstruction, robust long-term prediction, geometry generalization, and precise parameters estimation for elastoplastic materials, highlighting its versatility as a unified simulator and inverse analysis tool.
comment: 11 pages
☆ The Principles of Diffusion Models
This monograph presents the core principles that have guided the development of diffusion models, tracing their origins and showing how diverse formulations arise from shared mathematical ideas. Diffusion modeling starts by defining a forward process that gradually corrupts data into noise, linking the data distribution to a simple prior through a continuum of intermediate distributions. The goal is to learn a reverse process that transforms noise back into data while recovering the same intermediates. We describe three complementary views. The variational view, inspired by variational autoencoders, sees diffusion as learning to remove noise step by step. The score-based view, rooted in energy-based modeling, learns the gradient of the evolving data distribution, indicating how to nudge samples toward more likely regions. The flow-based view, related to normalizing flows, treats generation as following a smooth path that moves samples from noise to data under a learned velocity field. These perspectives share a common backbone: a time-dependent velocity field whose flow transports a simple prior to the data. Sampling then amounts to solving a differential equation that evolves noise into data along a continuous trajectory. On this foundation, the monograph discusses guidance for controllable generation, efficient numerical solvers, and diffusion-motivated flow-map models that learn direct mappings between arbitrary times. It provides a conceptual and mathematically grounded understanding of diffusion models for readers with basic deep-learning knowledge.
♻ ☆ Approximating Signed Distance Fields of Implicit Surfaces with Sparse Ellipsoidal Radial Basis Function Networks
Accurate and compact representation of signed distance functions (SDFs) of implicit surfaces is crucial for efficient storage, computation, and downstream processing of 3D geometry. In this work, we propose a general learning method for approximating precomputed SDF fields of implicit surfaces by a relatively small number of ellipsoidal radial basis functions (ERBFs). The SDF values could be computed from various sources, including point clouds, triangle meshes, analytical expressions, pretrained neural networks, etc. Given SDF values on spatial grid points, our method approximates the SDF using as few ERBFs as possible, achieving a compact representation while preserving the geometric shape of the corresponding implicit surface. To balance sparsity and approximation precision, we introduce a dynamic multi-objective optimization strategy, which adaptively incorporates regularization to enforce sparsity and jointly optimizes the weights, centers, shapes, and orientations of the ERBFs. For computational efficiency, a nearest-neighbor-based data structure restricts computations to points near each kernel center, and CUDA-based parallelism further accelerates the optimization. Furthermore, a hierarchical refinement strategy based on SDF spatial grid points progressively incorporates coarse-to-fine samples for parameter initialization and optimization, improving convergence and training efficiency. Extensive experiments on multiple benchmark datasets demonstrate that our method can represent SDF fields with significantly fewer parameters than existing sparse implicit representation approaches, achieving better accuracy, robustness, and computational efficiency. The corresponding executable program is publicly available at https://github.com/lianbobo/SE-RBFNet.git
Computation and Language 5
☆ AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite
AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they (1) fail to provide holistic, product-informed measures of real-world use cases such as science research; (2) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (3) do not account for confounding variables such as model cost and tool access; (4) do not provide standardized interfaces for quick agent prototyping and evaluation; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.
♻ ☆ Knee-Deep in C-RASP: A Transformer Depth Hierarchy
It has been observed that transformers with greater depth (that is, more layers) have more capabilities, but can we establish formally which capabilities are gained? We answer this question with a theoretical proof followed by an empirical study. First, we consider transformers that round to fixed precision except inside attention. We show that this subclass of transformers is expressively equivalent to the programming language C-RASP and this equivalence preserves depth. Second, we prove that deeper C-RASP programs are more expressive than shallower C-RASP programs, implying that deeper transformers are more expressive than shallower transformers (within the subclass mentioned above). The same is also proven for transformers with positional encodings (like RoPE and ALiBi). These results are established by studying a temporal logic with counting operators equivalent to C-RASP. Finally, we provide empirical evidence that our theory predicts the depth required for transformers without positional encodings to length-generalize on a family of sequential dependency tasks.
comment: 35 pages, 5 figures
♻ ☆ SimuRA: A World-Model-Driven Simulative Reasoning Architecture for General Goal-Oriented Agents
AI agents built on foundation models hold enormous promise. Current practice, however, focuses on a one-task-one-agent approach, which not only falls short of scalability and generality, but also faces practical limitations from black-box autoregressive reasoning, where decisions unfold token by token without explicit simulation or counterfactual evaluation of outcomes. Humans, on the other hand, reason and plan by mentally simulating the consequences of actions within an internal model of the world -- a capability that supports flexible, goal-directed behavior across diverse contexts. Moving towards a more general and powerful AI agent, we introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning. Based on a principled formulation of an optimal agent in any general environment, SimuRA addresses the limitations of black-box autoregressive reasoning by incorporating the world model for planning via simulation. Our prototype world model is implemented using LLMs as a substrate, leveraging the natural language as a discrete, hierarchical representation grounded in concepts for planning, while remaining model-agnostic. On complex web-browsing tasks such as flight search, SimuRA improves the success rate from 0% to 32.2% compared to a representative open-web agent baseline. Across tasks, world-model-based planning achieves up to 124% higher task completion rates than a matched black-box autoregressive baseline, demonstrating the advantages of simulative reasoning. We release ReasonerAgent-Web, a web-browsing agent built on SimuRA, as an open-source research demo.
comment: This submission has been updated to adjust the scope and presentation of the work
♻ ☆ RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback
Large language models (LLMs) show the promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing works largely adopt one-shot settings, ignoring the iterative and feedback-driven nature of realistic workflows of scientific research development. To address this gap, we present RECODE-H, a benchmark of 102 tasks from research papers and repositories that evaluates LLM agents through multi-turn interactions with LLM-simulated human feedback. It includes structured instructions,unit tests, and a five-level feedback hierarchy to reflect realistic researcher-agent collaboration. We further present ReCodeAgent, a framework that integrates feedback into iterative code generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4, DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer feedback, while also highlighting ongoing challenges in the generation of complex research code. RECODE-H establishes a foundation for developing adaptive, feedback-driven LLM agents in scientific research implementation
comment: Code and dataset are available at github.com/ChunyuMiao98/RECODE
♻ ☆ Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To mitigate the issue, we explore a series of hybrid models that restore direct access to past tokens. We interleave token mixers with intermediate time and space complexity between linear and full attention, including sparse attention with token eviction, and the query-aware native sparse attention. Particularly, we propose a novel learnable token eviction approach. Combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent tokens to adaptively retain a limited set of critical KV-pairs per head, maintaining linear attention's constant time and space complexity. Efficient Triton kernels for the sparse attention mechanisms are provided. Empirical evaluations on retrieval-intensive benchmarks support the effectiveness of our approaches.
comment: 19 pages, 5 figures
Information Retrieval 19
☆ A Data-Centric Approach to Multilingual E-Commerce Product Search: Case Study on Query-Category and Query-Item Relevance
Multilingual e-commerce search suffers from severe data imbalance across languages, label noise, and limited supervision for low-resource languages--challenges that impede the cross-lingual generalization of relevance models despite the strong capabilities of large language models (LLMs). In this work, we present a practical, architecture-agnostic, data-centric framework to enhance performance on two core tasks: Query-Category (QC) relevance (matching queries to product categories) and Query-Item (QI) relevance (matching queries to product titles). Rather than altering the model, we redesign the training data through three complementary strategies: (1) translation-based augmentation to synthesize examples for languages absent in training, (2) semantic negative sampling to generate hard negatives and mitigate class imbalance, and (3) self-validation filtering to detect and remove likely mislabeled instances. Evaluated on the CIKM AnalytiCup 2025 dataset, our approach consistently yields substantial F1 score improvements over strong LLM baselines, achieving competitive results in the official competition. Our findings demonstrate that systematic data engineering can be as impactful as--and often more deployable than--complex model modifications, offering actionable guidance for building robust multilingual search systems in the real-world e-commerce settings.
☆ DeepAgent: A General Reasoning Agent with Scalable Toolsets
Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow predefined workflows, which limit autonomous and global task completion. In this paper, we introduce DeepAgent, an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single, coherent reasoning process. To address the challenges of long-horizon interactions, particularly the context length explosion from multiple tool calls and the accumulation of interaction history, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories, reducing error accumulation while preserving critical information. To teach general-purpose tool use efficiently and stably, we develop an end-to-end reinforcement learning strategy, namely ToolPO, that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to the tool invocation tokens. Extensive experiments on eight benchmarks, including general tool-use tasks (ToolBench, API-Bank, TMDB, Spotify, ToolHop) and downstream applications (ALFWorld, WebShop, GAIA, HLE), demonstrate that DeepAgent consistently outperforms baselines across both labeled-tool and open-set tool retrieval scenarios. This work takes a step toward more general and capable agents for real-world applications. The code and demo are available at https://github.com/RUC-NLPIR/DeepAgent.
☆ Doc-Researcher: A Unified System for Multimodal Document Parsing and Deep Research
Deep Research systems have revolutionized how LLMs solve complex questions through iterative reasoning and evidence gathering. However, current systems remain fundamentally constrained to textual web data, overlooking the vast knowledge embedded in multimodal documents Processing such documents demands sophisticated parsing to preserve visual semantics (figures, tables, charts, and equations), intelligent chunking to maintain structural coherence, and adaptive retrieval across modalities, which are capabilities absent in existing systems. In response, we present Doc-Researcher, a unified system that bridges this gap through three integrated components: (i) deep multimodal parsing that preserves layout structure and visual semantics while creating multi-granular representations from chunk to document level, (ii) systematic retrieval architecture supporting text-only, vision-only, and hybrid paradigms with dynamic granularity selection, and (iii) iterative multi-agent workflows that decompose complex queries, progressively accumulate evidence, and synthesize comprehensive answers across documents and modalities. To enable rigorous evaluation, we introduce M4DocBench, the first benchmark for Multi-modal, Multi-hop, Multi-document, and Multi-turn deep research. Featuring 158 expert-annotated questions with complete evidence chains across 304 documents, M4DocBench tests capabilities that existing benchmarks cannot assess. Experiments demonstrate that Doc-Researcher achieves 50.6% accuracy, 3.4xbetter than state-of-the-art baselines, validating that effective document research requires not just better retrieval, but fundamentally deep parsing that preserve multimodal integrity and support iterative research. Our work establishes a new paradigm for conducting deep research on multimodal document collections.
comment: preprint
☆ Redefining Retrieval Evaluation in the Era of LLMs
Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric using an LLM-oriented positional discount to directly optimize the correlation with the end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components
☆ SciNUP: Natural Language User Interest Profiles for Scientific Literature Recommendation
The use of natural language (NL) user profiles in recommender systems offers greater transparency and user control compared to traditional representations. However, there is scarcity of large-scale, publicly available test collections for evaluating NL profile-based recommendation. To address this gap, we introduce SciNUP, a novel synthetic dataset for scholarly recommendation that leverages authors' publication histories to generate NL profiles and corresponding ground truth items. We use this dataset to conduct a comparison of baseline methods, ranging from sparse and dense retrieval approaches to state-of-the-art LLM-based rerankers. Our results show that while baseline methods achieve comparable performance, they often retrieve different items, indicating complementary behaviors. At the same time, considerable headroom for improvement remains, highlighting the need for effective NL-based recommendation approaches. The SciNUP dataset thus serves as a valuable resource for fostering future research and development in this area.
☆ CausalRec: A CausalBoost Attention Model for Sequential Recommendation
Recent advances in correlation-based sequential recommendation systems have demonstrated substantial success. Specifically, the attention-based model outperforms other RNN-based and Markov chains-based models by capturing both short- and long-term dependencies more effectively. However, solely focusing on item co-occurrences overlooks the underlying motivations behind user behaviors, leading to spurious correlations and potentially inaccurate recommendations. To address this limitation, we present a novel framework that integrates causal attention for sequential recommendation, CausalRec. It incorporates a causal discovery block and a CausalBooster. The causal discovery block learns the causal graph in user behavior sequences, and we provide a theory to guarantee the identifiability of the learned causal graph. The CausalBooster utilizes the discovered causal graph to refine the attention mechanism, prioritizing behaviors with causal significance. Experimental evaluations on real-world datasets indicate that CausalRec outperforms several state-of-the-art methods, with average improvements of 7.21% in Hit Rate (HR) and 8.65% in Normalized Discounted Cumulative Gain (NDCG). To the best of our knowledge, this is the first model to incorporate causality through the attention mechanism in sequential recommendation, demonstrating the value of causality in generating more accurate and reliable recommendations.
comment: 11 pages, 3 figures
☆ Pctx: Tokenizing Personalized Context for Generative Recommendation
Generative recommendation (GR) models tokenize each action into a few discrete tokens (called semantic IDs) and autoregressively generate the next tokens as predictions, showing advantages such as memory efficiency, scalability, and the potential to unify retrieval and ranking. Despite these benefits, existing tokenization methods are static and non-personalized. They typically derive semantic IDs solely from item features, assuming a universal item similarity that overlooks user-specific perspectives. However, under the autoregressive paradigm, semantic IDs with the same prefixes always receive similar probabilities, so a single fixed mapping implicitly enforces a universal item similarity standard across all users. In practice, the same item may be interpreted differently depending on user intentions and preferences. To address this issue, we propose a personalized context-aware tokenizer that incorporates a user's historical interactions when generating semantic IDs. This design allows the same item to be tokenized into different semantic IDs under different user contexts, enabling GR models to capture multiple interpretive standards and produce more personalized predictions. Experiments on three public datasets demonstrate up to 11.44% improvement in NDCG@10 over non-personalized action tokenization baselines. Our code is available at https://github.com/YoungZ365/Pctx.
☆ Bi-Level Optimization for Generative Recommendation: Bridging Tokenization and Generation
Generative recommendation is emerging as a transformative paradigm by directly generating recommended items, rather than relying on matching. Building such a system typically involves two key components: (1) optimizing the tokenizer to derive suitable item identifiers, and (2) training the recommender based on those identifiers. Existing approaches often treat these components separately--either sequentially or in alternation--overlooking their interdependence. This separation can lead to misalignment: the tokenizer is trained without direct guidance from the recommendation objective, potentially yielding suboptimal identifiers that degrade recommendation performance. To address this, we propose BLOGER, a Bi-Level Optimization for GEnerative Recommendation framework, which explicitly models the interdependence between the tokenizer and the recommender in a unified optimization process. The lower level trains the recommender using tokenized sequences, while the upper level optimizes the tokenizer based on both the tokenization loss and recommendation loss. We adopt a meta-learning approach to solve this bi-level optimization efficiently, and introduce gradient surgery to mitigate gradient conflicts in the upper-level updates, thereby ensuring that item identifiers are both informative and recommendation-aligned. Extensive experiments on real-world datasets demonstrate that BLOGER consistently outperforms state-of-the-art generative recommendation methods while maintaining practical efficiency with no significant additional computational overhead, effectively bridging the gap between item tokenization and autoregressive generation.
☆ VOGUE: A Multimodal Dataset for Conversational Recommendation in Fashion
Multimodal conversational recommendation has emerged as a promising paradigm for delivering personalized experiences through natural dialogue enriched by visual and contextual grounding. Yet, current multimodal conversational recommendation datasets remain limited: existing resources either simulate conversations, omit user history, or fail to collect sufficiently detailed feedback, all of which constrain the types of research and evaluation they support. To address these gaps, we introduce VOGUE, a novel dataset of 60 humanhuman dialogues in realistic fashion shopping scenarios. Each dialogue is paired with a shared visual catalogue, item metadata, user fashion profiles and histories, and post-conversation ratings from both Seekers and Assistants. This design enables rigorous evaluation of conversational inference, including not only alignment between predicted and ground-truth preferences, but also calibration against full rating distributions and comparison with explicit and implicit user satisfaction signals. Our initial analyses of VOGUE reveal distinctive dynamics of visually grounded dialogue. For example, recommenders frequently suggest items simultaneously in feature-based groups, which creates distinct conversational phases bridged by Seeker critiques and refinements. Benchmarking multimodal large language models against human recommenders shows that while MLLMs approach human-level alignment in aggregate, they exhibit systematic distribution errors in reproducing human ratings and struggle to generalize preference inference beyond explicitly discussed items. These findings establish VOGUE as both a unique resource for studying multimodal conversational systems and as a challenge dataset beyond the current recommendation capabilities of existing top-tier multimodal foundation models such as GPT-4o-mini, GPT-5-mini, and Gemini-2.5-Flash.
☆ A Benchmark for Open-Domain Numerical Fact-Checking Enhanced by Claim Decomposition
Fact-checking numerical claims is critical as the presence of numbers provide mirage of veracity despite being fake potentially causing catastrophic impacts on society. The prior works in automatic fact verification do not primarily focus on natural numerical claims. A typical human fact-checker first retrieves relevant evidence addressing the different numerical aspects of the claim and then reasons about them to predict the veracity of the claim. Hence, the search process of a human fact-checker is a crucial skill that forms the foundation of the verification process. Emulating a real-world setting is essential to aid in the development of automated methods that encompass such skills. However, existing benchmarks employ heuristic claim decomposition approaches augmented with weakly supervised web search to collect evidences for verifying claims. This sometimes results in less relevant evidences and noisy sources with temporal leakage rendering a less realistic retrieval setting for claim verification. Hence, we introduce QuanTemp++: a dataset consisting of natural numerical claims, an open domain corpus, with the corresponding relevant evidence for each claim. The evidences are collected through a claim decomposition process approximately emulating the approach of human fact-checker and veracity labels ensuring there is no temporal leakage. Given this dataset, we also characterize the retrieval performance of key claim decomposition paradigms. Finally, we observe their effect on the outcome of the verification pipeline and draw insights. The code for data pipeline along with link to data can be found at https://github.com/VenkteshV/QuanTemp_Plus
comment: 16 pages
☆ Massive Memorization with Hundreds of Trillions of Parameters for Sequential Transducer Generative Recommenders
Modern large-scale recommendation systems rely heavily on user interaction history sequences to enhance the model performance. The advent of large language models and sequential modeling techniques, particularly transformer-like architectures, has led to significant advancements recently (e.g., HSTU, SIM, and TWIN models). While scaling to ultra-long user histories (10k to 100k items) generally improves model performance, it also creates significant challenges on latency, queries per second (QPS) and GPU cost in industry-scale recommendation systems. Existing models do not adequately address these industrial scalability issues. In this paper, we propose a novel two-stage modeling framework, namely VIrtual Sequential Target Attention (VISTA), which decomposes traditional target attention from a candidate item to user history items into two distinct stages: (1) user history summarization into a few hundred tokens; followed by (2) candidate item attention to those tokens. These summarization token embeddings are then cached in storage system and then utilized as sequence features for downstream model training and inference. This novel design for scalability enables VISTA to scale to lifelong user histories (up to one million items) while keeping downstream training and inference costs fixed, which is essential in industry. Our approach achieves significant improvements in offline and online metrics and has been successfully deployed on an industry leading recommendation platform serving billions of users.
☆ Multimodal Item Scoring for Natural Language Recommendation via Gaussian Process Regression with LLM Relevance Judgments
Natural Language Recommendation (NLRec) generates item suggestions based on the relevance between user-issued NL requests and NL item description passages. Existing NLRec approaches often use Dense Retrieval (DR) to compute item relevance scores from aggregation of inner products between user request embeddings and relevant passage embeddings. However, DR views the request as the sole relevance label, thus leading to a unimodal scoring function centered on the query embedding that is often a weak proxy for query relevance. To better capture the potential multimodal distribution of the relevance scoring function that may arise from complex NLRec data, we propose GPR-LLM that uses Gaussian Process Regression (GPR) with LLM relevance judgments for a subset of candidate passages. Experiments on four NLRec datasets and two LLM backbones demonstrate that GPR-LLM with an RBF kernel, capable of modeling multimodal relevance scoring functions, consistently outperforms simpler unimodal kernels (dot product, cosine similarity), as well as baseline methods including DR, cross-encoder, and pointwise LLM-based relevance scoring by up to 65%. Overall, GPR-LLM provides an efficient and effective approach to NLRec within a minimal LLM labeling budget.
comment: 16 pages,20 figures
☆ Temporal Graph Theoretic Analysis of Geopolitical Dynamics in the U.S. Entity List
Export controls have become one of America's most prominent tools of economic statecraft. They aim to block rival countries' access to sensitive technologies, safeguard U.S. supply chains, protect national security, and shape geopolitical competition. Among various instruments, the U.S. Entity List has emerged as the most salient, yet its dynamics remain underexplored. This paper introduces a novel temporal graph framework that transforms the Entity List documents from a static registry of foreign entities of concern into a dynamic representation of geopolitical strategy. We construct the first event-based dataset of U.S. government foreign entity designations and model them as a temporal bipartite graph. Building on this representation, we develop a multi-level analytical approach that reveals shifting roles, enforcement strategy, and broader sanction ecosystems. Applied to 25 years of data, the framework uncovers dynamic patterns of escalation, persistence, and coordination that static views cannot capture. More broadly, our study demonstrates how temporal graph analysis offers systematic computational insights into the geopolitical dynamics of export controls.
comment: 13 pages, 9 figures. Under review
♻ ☆ Faster and Memory-Efficient Training of Sequential Recommendation Models for Large Catalogs
Sequential recommendations (SR) with transformer-based architectures are widely adopted in real-world applications, where SR models require frequent retraining to adapt to ever-changing user preferences. However, training transformer-based SR models often encounters a high computational cost associated with scoring extensive item catalogs, often exceeding thousands of items. This occurs mainly due to the use of cross-entropy loss, where peak memory scales proportionally to catalog size, batch size, and sequence length. Recognizing this, practitioners in the field of recommendation systems typically address memory consumption by integrating the cross-entropy (CE) loss with negative sampling, thereby reducing the explicit memory demands of the final layer. However, a small number of negative samples would degrade model performance, and as we demonstrate in our work, increasing the number of negative samples and the batch size further improves the model's performance, but rapidly starts to exceed industrial GPUs' size (~40Gb). In this work, we introduce the CCE- method, which offers a GPU-efficient implementation of the CE loss with negative sampling. Our method accelerates training by up to two times while reducing memory consumption by more than 10 times. Leveraging the memory savings afforded by using CCE- for model training, it becomes feasible to enhance its accuracy on datasets with a large item catalog compared to those trained with original PyTorch-implemented loss functions. Finally, we perform an analysis of key memory-related hyperparameters and highlight the necessity of a delicate balance among these factors. We demonstrate that scaling both the number of negative samples and batch size leads to better results rather than maximizing only one of them. To facilitate further adoption of CCE-, we release a Triton kernel that efficiently implements the proposed method.
♻ ☆ SimLab: A Platform for Simulation-based Evaluation of Conversational Information Access Systems
Progress in conversational information access (CIA) systems has been hindered by the difficulty of evaluating such systems with reproducible experiments. While user simulation offers a promising solution, the lack of infrastructure and tooling to support this evaluation paradigm remains a significant barrier. To address this gap, we introduce SimLab, the first cloud-based platform providing a centralized solution for the community to benchmark both conversational systems and user simulators in a controlled and reproducible setting. We articulate the requirements for such a platform and propose a general infrastructure to meet them. We then present the design and implementation of an initial version of SimLab and showcase its features through an initial simulation-based evaluation task in conversational movie recommendation. Furthermore, we discuss the platform's sustainability and future opportunities for development, inviting the community to drive further progress in the fields of CIA and user simulation.
♻ ☆ AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking NeurIPS 2025
Listwise reranking with large language models (LLMs) enhances top-ranked results in retrieval-based applications. Due to the limit in context size and high inference cost of long context, reranking is typically performed over a fixed size of small subsets, with the final ranking aggregated from these partial results. This fixed computation disregards query difficulty and document distribution, leading to inefficiencies. We propose AcuRank, an adaptive reranking framework that dynamically adjusts both the amount and target of computation based on uncertainty estimates over document relevance. Using a Bayesian TrueSkill model, we iteratively refine relevance estimates until reaching sufficient confidence levels, and our explicit modeling of ranking uncertainty enables principled control over reranking behavior and avoids unnecessary updates to confident predictions. Results on the TREC-DL and BEIR benchmarks show that our method consistently achieves a superior accuracy-efficiency trade-off and scales better with compute than fixed-computation baselines. These results highlight the effectiveness and generalizability of our method across diverse retrieval tasks and LLM-based reranking models.
comment: Accepted at NeurIPS 2025. The first two authors contributed equally. Author order is randomly determined via coin toss
♻ ☆ Towards Context-aware Reasoning-enhanced Generative Searching in E-commerce
Search-based recommendation is one of the most critical application scenarios in e-commerce platforms. Users' complex search contexts--such as spatiotemporal factors, historical interactions, and current query's information--constitute an essential part of their decision-making, reflecting implicit preferences that complement explicit query terms. Modeling such rich contextual signals and their intricate associations with candidate items remains a key challenge. Although numerous efforts have been devoted to building more effective search methods, existing approaches still show limitations in integrating contextual information, which hinders their ability to fully capture user intent. To address these challenges, we propose a context-aware reasoning-enhanced generative search framework for better \textbf{understanding the complicated context}. Specifically, the framework first unifies heterogeneous user and item contexts into textual representations or text-based semantic identifiers and aligns them. To overcome the lack of explicit reasoning trajectories, we introduce a self-evolving post-training paradigm that iteratively combines supervised fine-tuning and reinforcement learning to progressively enhance the model's reasoning capability. In addition, we identify potential biases in existing RL algorithms when applied to search scenarios and present a debiased variant of GRPO to improve ranking performance. Extensive experiments on search log data collected from a real-world e-commerce platform demonstrate that our approach achieves superior performance compared with strong baselines, validating its effectiveness for search-based recommendation.
♻ ☆ Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning
Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit instead of token (too fine-grained) or sequence (too coarse), redefining rewards to remove non-causal credit assignment and introducing a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are released at https://github.com/yaochenzhu/Rank-GRPO.
♻ ☆ Engram Memory Encoding and Retrieval: A Neurocomputational Perspective
Despite substantial research into the biological basis of memory, the precise mechanisms by which experiences are encoded, stored, and retrieved in the brain remain incompletely understood. A growing body of evidence supports the engram theory, which posits that sparse populations of neurons undergo lasting physical and biochemical changes to support long-term memory. Yet, a comprehensive computational framework that integrates biological findings with mechanistic models remains elusive. This work synthesizes insights from cellular neuroscience and computational modeling to address key challenges in engram research: how engram neurons are identified and manipulated; how synaptic plasticity mechanisms contribute to stable memory traces; and how sparsity promotes efficient, interference-resistant representations. Relevant computational approaches -- such as sparse regularization, engram gating, and biologically inspired architectures like Sparse Distributed Memory and spiking neural networks -- are also examined. Together, these findings suggest that memory efficiency, capacity, and stability emerge from the interaction of plasticity and sparsity constraints. By integrating neurobiological and computational perspectives, this paper provides a comprehensive theoretical foundation for engram research and proposes a roadmap for future inquiry into the mechanisms underlying memory, with implications for the diagnosis and treatment of memory-related disorders.
comment: 18 pages, 7 figures, 3 tables
Multimedia 3
☆ YTLive: A Dataset of Real-World YouTube Live Streaming Sessions
Live streaming plays a major role in today's digital platforms, supporting entertainment, education, social media, etc. However, research in this field is limited by the lack of large, publicly available datasets that capture real-time viewer behavior at scale. To address this gap, we introduce YTLive, a public dataset focused on YouTube Live. Collected through the YouTube Researcher Program over May and June 2024, YTLive includes more than 507000 records from 12156 live streams, tracking concurrent viewer counts at five-minute intervals along with precise broadcast durations. We describe the dataset design and collection process and present an initial analysis of temporal viewing patterns. Results show that viewer counts are higher and more stable on weekends, especially during afternoon hours. Shorter streams attract larger and more consistent audiences, while longer streams tend to grow slowly and exhibit greater variability. These insights have direct implications for adaptive streaming, resource allocation, and Quality of Experience (QoE) modeling. YTLive offers a timely, open resource to support reproducible research and system-level innovation in live streaming. The dataset is publicly available at github.
♻ ☆ Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization NeurIPS 2025
Imagine hearing a dog bark and turning toward the sound only to see a parked car, while the real, silent dog sits elsewhere. Such sensory conflicts test perception, yet humans reliably resolve them by prioritizing sound over misleading visuals. Despite advances in multimodal AI integrating vision and audio, little is known about how these systems handle cross-modal conflicts or whether they favor one modality. In this study, we systematically examine modality bias and conflict resolution in AI sound localization. We assess leading multimodal models and benchmark them against human performance in psychophysics experiments across six audiovisual conditions, including congruent, conflicting, and absent cues. Humans consistently outperform AI, demonstrating superior resilience to conflicting or missing visuals by relying on auditory information. In contrast, AI models often default to visual input, degrading performance to near chance levels. To address this, we propose a neuroscience-inspired model, EchoPin, which uses a stereo audio-image dataset generated via 3D simulations. Even with limited training data, EchoPin surpasses existing benchmarks. Notably, it also mirrors human-like horizontal localization bias favoring left-right precision-likely due to the stereo audio structure reflecting human ear placement. These findings underscore how sensory input quality and system architecture shape multimodal representation accuracy.
comment: NeurIPS 2025, Spotlight
♻ ☆ Save It for the "Hot" Day: An LLM-Empowered Visual Analytics System for Heat Risk Management
The escalating frequency and intensity of heat-related climate events, particularly heatwaves, emphasize the pressing need for advanced heat risk management strategies. Current approaches, primarily relying on numerical models, face challenges in spatial-temporal resolution and in capturing the dynamic interplay of environmental, social, and behavioral factors affecting heat risks. This has led to difficulties in translating risk assessments into effective mitigation actions. Recognizing these problems, we introduce a novel approach leveraging the burgeoning capabilities of Large Language Models (LLMs) to extract rich and contextual insights from news reports. We hence propose an LLM-empowered visual analytics system, Havior, that integrates the precise, data-driven insights of numerical models with nuanced news report information. This hybrid approach enables a more comprehensive assessment of heat risks and better identification, assessment, and mitigation of heat-related threats. The system incorporates novel visualization designs, such as "thermoglyph" and news glyph, enhancing intuitive understanding and analysis of heat risks. The integration of LLM-based techniques also enables advanced information retrieval and semantic knowledge extraction that can be guided by experts' analytics needs. Our case studies on two cities that faced significant heatwave events and interviews with five experts have demonstrated the usefulness of our system in providing in-depth and actionable insights for heat risk management.
Image and Video Processing 16
☆ Anisotropic Pooling for LUT-realizable CNN Image Restoration
Table look-up realization of image restoration CNNs has the potential of achieving competitive image quality while being much faster and resource frugal than the straightforward CNN implementation. The main technical challenge facing the LUT-based CNN algorithm designers is to manage the table size without overly restricting the receptive field. The prevailing strategy is to reuse the table for small pixel patches of different orientations (apparently assuming a degree of isotropy) and then fuse the look-up results. The fusion is currently done by average pooling, which we find being ill suited to anisotropic signal structures. To alleviate the problem, we investigate and discuss anisotropic pooling methods to replace naive averaging for improving the performance of the current LUT-realizable CNN restoration methods. First, we introduce the method of generalized median pooling which leads to measurable gains over average pooling. We then extend this idea by learning data-dependent pooling coefficients for each orientation, so that they can adaptively weigh the contributions of differently oriented pixel patches. Experimental results on various restoration benchmarks show that our anisotropic pooling strategy yields both perceptually and numerically superior results compared to existing LUT-realizable CNN methods.
☆ MAGIC-Flow: Multiscale Adaptive Conditional Flows for Generation and Interpretable Classification
Generative modeling has emerged as a powerful paradigm for representation learning, but its direct applicability to challenging fields like medical imaging remains limited: mere generation, without task alignment, fails to provide a robust foundation for clinical use. We propose MAGIC-Flow, a conditional multiscale normalizing flow architecture that performs generation and classification within a single modular framework. The model is built as a hierarchy of invertible and differentiable bijections, where the Jacobian determinant factorizes across sub-transformations. We show how this ensures exact likelihood computation and stable optimization, while invertibility enables explicit visualization of sample likelihoods, providing an interpretable lens into the model's reasoning. By conditioning on class labels, MAGIC-Flow supports controllable sample synthesis and principled class-probability estimation, effectively aiding both generative and discriminative objectives. We evaluate MAGIC-Flow against top baselines using metrics for similarity, fidelity, and diversity. Across multiple datasets, it addresses generation and classification under scanner noise, and modality-specific synthesis and identification. Results show MAGIC-Flow creates realistic, diverse samples and improves classification. MAGIC-Flow is an effective strategy for generation and classification in data-limited domains, with direct benefits for privacy-preserving augmentation, robust generalization, and trustworthy medical AI.
☆ Caption-Driven Explainability: Probing CNNs for Bias via CLIP ICIP 2025
Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the models prediction. This explanation minimizes the risk of the standalone model falling for a covariate shift and contributes significantly towards developing robust ML models.
comment: Accepted and presented at the IEEE ICIP 2025 Satellite Workshop "Generative AI for World Simulations and Communications & Celebrating 40 Years of Excellence in Education: Honoring Professor Aggelos Katsaggelos", Anchorage, Alaska, United States, September 14, 2025. Camera-ready preprint. The official IEEE Xplore version will be available after proceedings processing
☆ FlowOpt: Fast Optimization Through Whole Flow Processes for Training-Free Editing
The remarkable success of diffusion and flow-matching models has ignited a surge of works on adapting them at test time for controlled generation tasks. Examples range from image editing to restoration, compression and personalization. However, due to the iterative nature of the sampling process in those models, it is computationally impractical to use gradient-based optimization to directly control the image generated at the end of the process. As a result, existing methods typically resort to manipulating each timestep separately. Here we introduce FlowOpt - a zero-order (gradient-free) optimization framework that treats the entire flow process as a black box, enabling optimization through the whole sampling path without backpropagation through the model. Our method is both highly efficient and allows users to monitor the intermediate optimization results and perform early stopping if desired. We prove a sufficient condition on FlowOpt's step-size, under which convergence to the global optimum is guaranteed. We further show how to empirically estimate this upper bound so as to choose an appropriate step-size. We demonstrate how FlowOpt can be used for image editing, showcasing two options: (i) inversion (determining the initial noise that generates a given image), and (ii) directly steering the edited image to be similar to the source image while conforming to a target text prompt. In both cases, FlowOpt achieves state-of-the-art results while using roughly the same number of neural function evaluations (NFEs) as existing methods. Code and examples are available on the project's webpage.
comment: Project's webpage at https://orronai.github.io/FlowOpt/
☆ Inverse Design of Metasurface for Spectral Imaging
Inverse design of metasurfaces for the joint optimization of optical modulation and algorithmic decoding in computational optics presents significant challenges, especially in applications such as hyperspectral imaging. We introduce a physics-data co-driven framework for designing reconfigurable metasurfaces fabricated from the phase-change material Ge2Sb2Se4Te1 to achieve compact, compressive spectral imaging in the shortwave infrared region. Central to our approach is a differentiable neural simulator, trained on over 320,000 simulated geometries, that accurately predicts spectral responses across 11 crystallization states. This differentiability enables end-to-end joint optimization of the metasurface geometry, its spectral encoding function, and a deep reconstruction network. We also propose a soft shape regularization technique that preserves manufacturability during gradient-based updates. Experiments show that our optimized system improves reconstruction fidelity by up to 7.6 dB in the peak-signal-to-noise ratio, with enhanced noise resilience and improved measurement matrix conditioning, underscoring the potential of our approach for high-performance hyperspectral imaging.
☆ Noise is All You Need: Solving Linear Inverse Problems by Noise Combination Sampling with Diffusion Models
Pretrained diffusion models have demonstrated strong capabilities in zero-shot inverse problem solving by incorporating observation information into the generation process of the diffusion models. However, this presents an inherent dilemma: excessive integration can disrupt the generative process, while insufficient integration fails to emphasize the constraints imposed by the inverse problem. To address this, we propose \emph{Noise Combination Sampling}, a novel method that synthesizes an optimal noise vector from a noise subspace to approximate the measurement score, replacing the noise term in the standard Denoising Diffusion Probabilistic Models process. This enables conditional information to be naturally embedded into the generation process without reliance on step-wise hyperparameter tuning. Our method can be applied to a wide range of inverse problem solvers, including image compression, and, particularly when the number of generation steps $T$ is small, achieves superior performance with negligible computational overhead, significantly improving robustness and stability.
comment: 9 pages
☆ Point-level Uncertainty Evaluation of Mobile Laser Scanning Point Clouds
Reliable quantification of uncertainty in Mobile Laser Scanning (MLS) point clouds is essential for ensuring the accuracy and credibility of downstream applications such as 3D mapping, modeling, and change analysis. Traditional backward uncertainty modeling heavily rely on high-precision reference data, which are often costly or infeasible to obtain at large scales. To address this issue, this study proposes a machine learning-based framework for point-level uncertainty evaluation that learns the relationship between local geometric features and point-level errors. The framework is implemented using two ensemble learning models, Random Forest (RF) and XGBoost, which are trained and validated on a spatially partitioned real-world dataset to avoid data leakage. Experimental results demonstrate that both models can effectively capture the nonlinear relationships between geometric characteristics and uncertainty, achieving mean ROC-AUC values above 0.87. The analysis further reveals that geometric features describing elevation variation, point density, and local structural complexity play a dominant role in predicting uncertainty. The proposed framework offers a data-driven perspective of uncertainty evaluation, providing a scalable and adaptable foundation for future quality control and error analysis of large-scale point clouds.
☆ DMVFC: Deep Learning Based Functionally Consistent Tractography Fiber Clustering Using Multimodal Diffusion MRI and Functional MRI
Tractography fiber clustering using diffusion MRI (dMRI) is a crucial method for white matter (WM) parcellation to enable analysis of brains structural connectivity in health and disease. Current fiber clustering strategies primarily use the fiber geometric characteristics (i.e., the spatial trajectories) to group similar fibers into clusters, while neglecting the functional and microstructural information of the fiber tracts. There is increasing evidence that neural activity in the WM can be measured using functional MRI (fMRI), providing potentially valuable multimodal information for fiber clustering to enhance its functional coherence. Furthermore, microstructural features such as fractional anisotropy (FA) can be computed from dMRI as additional information to ensure the anatomical coherence of the clusters. In this paper, we develop a novel deep learning fiber clustering framework, namely Deep Multi-view Fiber Clustering (DMVFC), which uses joint multi-modal dMRI and fMRI data to enable functionally consistent WM parcellation. DMVFC can effectively integrate the geometric and microstructural characteristics of the WM fibers with the fMRI BOLD signals along the fiber tracts. DMVFC includes two major components: (1) a multi-view pretraining module to compute embedding features from each source of information separately, including fiber geometry, microstructure measures, and functional signals, and (2) a collaborative fine-tuning module to simultaneously refine the differences of embeddings. In the experiments, we compare DMVFC with two state-of-the-art fiber clustering methods and demonstrate superior performance in achieving functionally meaningful and consistent WM parcellation results.
comment: 11 pages
♻ ☆ Guided MRI Reconstruction via Schrödinger Bridge
Magnetic Resonance Imaging (MRI) is an inherently multi-contrast modality, where cross-contrast priors can be exploited to improve image reconstruction from undersampled data. Recently, diffusion models have shown remarkable performance in MRI reconstruction. However, they still struggle to effectively utilize such priors, mainly because existing methods rely on feature-level fusion in image or latent spaces, which lacks explicit structural correspondence and thus leads to suboptimal performance. To address this issue, we propose $\mathbf{I}^2$SB-Inversion, a multi-contrast guided reconstruction framework based on the Schr\"odinger Bridge (SB). The proposed method performs pixel-wise translation between paired contrasts, providing explicit structural constraints between the guidance and target images. Furthermore, an Inversion strategy is introduced to correct inter-modality misalignment, which often occurs in guided reconstruction, thereby mitigating artifacts and improving reconstruction accuracy. Experiments on paired T1- and T2-weighted datasets demonstrate that $\mathbf{I}^2$SB-Inversion achieves a high acceleration factor of up to 14.4 and consistently outperforms existing methods in both quantitative and qualitative evaluations.
♻ ☆ Mamba Goes HoME: Hierarchical Soft Mixture-of-Experts for 3D Medical Image Segmentation NeurIPS 2025
In recent years, artificial intelligence has significantly advanced medical image segmentation. Nonetheless, challenges remain, including efficient 3D medical image processing across diverse modalities and handling data variability. In this work, we introduce Hierarchical Soft Mixture-of-Experts (HoME), a two-level token-routing layer for efficient long-context modeling, specifically designed for 3D medical image segmentation. Built on the Mamba Selective State Space Model (SSM) backbone, HoME enhances sequential modeling through adaptive expert routing. In the first level, a Soft Mixture-of-Experts (SMoE) layer partitions input sequences into local groups, routing tokens to specialized per-group experts for localized feature extraction. The second level aggregates these outputs through a global SMoE layer, enabling cross-group information fusion and global context refinement. This hierarchical design, combining local expert routing with global expert refinement, enhances generalizability and segmentation performance, surpassing state-of-the-art results across datasets from the three most widely used 3D medical imaging modalities and varying data qualities. The code is publicly available at https://github.com/gmum/MambaHoME.
comment: Accepted at NeurIPS 2025
♻ ☆ Robust Residual Finite Scalar Quantization for Neural Compression
Finite Scalar Quantization (FSQ) offers simplified training but suffers from residual magnitude decay in multi-stage settings, where subsequent stages receive exponentially weaker signals. We propose Robust Residual Finite Scalar Quantization (RFSQ), addressing this fundamental limitation through two novel conditioning strategies: learnable scaling factors and invertible layer normalization. Our experiments across audio and image modalities demonstrate RFSQ's effectiveness and generalizability. In audio reconstruction at 24 bits/frame, RFSQ-LayerNorm achieves 3.646 DNSMOS, a 3.6% improvement over state-of-the-art RVQ (3.518). On ImageNet, RFSQ achieves 0.102 L1 loss and 0.100 perceptual loss, with LayerNorm providing 9.7% L1 improvement and 17.4% perceptual improvement over unconditioned variants. The LayerNorm strategy consistently outperforms alternatives by maintaining normalized input statistics across stages, effectively preventing exponential magnitude decay that limits naive residual approaches. RFSQ combines FSQ's simplicity with multi-stage quantization's representational power, establishing a new standard for neural compression across diverse modalities.
comment: 5 pages, 2 figures
♻ ☆ Size and Smoothness Aware Adaptive Focal Loss for Small Tumor Segmentation
Deep learning has achieved remarkable accuracy in medical image segmentation, particularly for larger structures with well-defined boundaries. However, its effectiveness can be challenged by factors such as irregular object shapes and edges, non-smooth surfaces, small target areas, etc. which complicate the ability of networks to grasp the intricate and diverse nature of anatomical regions. In response to these challenges, we propose an Adaptive Focal Loss (A-FL) that takes both object boundary smoothness and size into account, with the goal to improve segmentation performance in intricate anatomical regions. The proposed A-FL dynamically adjusts itself based on an object's surface smoothness, size, and the class balancing parameter based on the ratio of targeted area and background. We evaluated the performance of the A-FL on the PICAI 2022 and BraTS 2018 datasets. In the PICAI 2022 dataset, the A-FL achieved an Intersection over Union (IoU) score of 0.696 and a Dice Similarity Coefficient (DSC) of 0.769, outperforming the regular Focal Loss (FL) by 5.5% and 5.4% respectively. It also surpassed the best baseline by 2.0% and 1.2%. In the BraTS 2018 dataset, A-FL achieved an IoU score of 0.883 and a DSC score of 0.931. Our ablation experiments also show that the proposed A-FL surpasses conventional losses (this includes Dice Loss, Focal Loss, and their hybrid variants) by large margin in IoU, DSC, and other metrics. The code is available at https://github.com/rakibuliuict/AFL-CIBM.git.
comment: 15 pages, 5 figures
♻ ☆ InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding NeurIPS 2025
Modern multimodal large language models (MLLMs) can reason over hour-long video, yet their key-value (KV) cache grows linearly with time-quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs and four long-video and streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy-even in multi-turn dialogues. By dissolving the KV cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap for on-device streaming video assistants.
comment: NeurIPS 2025
♻ ☆ Multi-Atlas Brain Network Classification through Consistency Distillation and Complementary Information Fusion
In the realm of neuroscience, identifying distinctive patterns associated with neurological disorders via brain networks is crucial. Resting-state functional magnetic resonance imaging (fMRI) serves as a primary tool for mapping these networks by correlating blood-oxygen-level-dependent (BOLD) signals across different brain regions, defined as regions of interest (ROIs). Constructing these brain networks involves using atlases to parcellate the brain into ROIs based on various hypotheses of brain division. However, there is no standard atlas for brain network classification, leading to limitations in detecting abnormalities in disorders. Some recent methods have proposed utilizing multiple atlases, but they neglect consistency across atlases and lack ROI-level information exchange. To tackle these limitations, we propose an Atlas-Integrated Distillation and Fusion network (AIDFusion) to improve brain network classification using fMRI data. AIDFusion addresses the challenge of utilizing multiple atlases by employing a disentangle Transformer to filter out inconsistent atlas-specific information and distill distinguishable connections across atlases. It also incorporates subject- and population-level consistency constraints to enhance cross-atlas consistency. Additionally, AIDFusion employs an inter-atlas message-passing mechanism to fuse complementary information across brain regions. Experimental results on four datasets of different diseases demonstrate the effectiveness and efficiency of AIDFusion compared to state-of-the-art methods. A case study illustrates AIDFusion extract patterns that are both interpretable and consistent with established neuroscience findings.
♻ ☆ Time-causal and time-recursive wavelets
When to apply wavelet analysis to real-time temporal signals, where the future cannot be accessed, it is essential to base all the steps in the signal processing pipeline on computational mechanisms that are truly time-causal. This paper describes how a time-causal wavelet analysis can be performed based on concepts developed in the area of temporal scale-space theory, originating from a complete classification of temporal smoothing kernels that guarantee non-creation of new structures from finer to coarser temporal scale levels. By necessity, convolution with truncated exponential kernels in cascade constitutes the only permissable class of kernels, as well as their temporal derivatives as a natural complement to fulfil the admissibility conditions of wavelet representations. For a particular way of choosing the time constants in the resulting infinite convolution of truncated exponential kernels, to ensure temporal scale covariance and thus self-similarity over temporal scales, we describe how mother wavelets can be chosen as temporal derivatives of the resulting time-causal limit kernel. By developing connections between wavelet theory and scale-space theory, we characterize and quantify how the continuous scaling properties transfer to the discrete implementation, demonstrating how the proposed time-causal wavelet representation can reflect the duration of locally dominant temporal structures in the input signals. We propose that this notion of time-causal wavelet analysis could be a valuable tool for signal processing tasks, where streams of signals are to be processed in real time, specifically for signals that may contain local variations over a rich span of temporal scales, or more generally for analysing physical or biophysical temporal phenomena, where a fully time-causal analysis is called for to be physically realistic.
comment: 25 pages, 10 figures
♻ ☆ Slot-BERT: Self-supervised Object Discovery in Surgical Video
Object-centric slot attention is a powerful framework for unsupervised learning of structured and explainable representations that can support reasoning about objects and actions, including in surgical videos. While conventional object-centric methods for videos leverage recurrent processing to achieve efficiency, they often struggle with maintaining long-range temporal coherence required for long videos in surgical applications. On the other hand, fully parallel processing of entire videos enhances temporal consistency but introduces significant computational overhead, making it impractical for implementation on hardware in medical facilities. We present Slot-BERT, a bidirectional long-range model that learns object-centric representations in a latent space while ensuring robust temporal coherence. Slot-BERT scales object discovery seamlessly to long videos of unconstrained lengths. A novel slot contrastive loss further reduces redundancy and improves the representation disentanglement by enhancing slot orthogonality. We evaluate Slot-BERT on real-world surgical video datasets from abdominal, cholecystectomy, and thoracic procedures. Our method surpasses state-of-the-art object-centric approaches under unsupervised training achieving superior performance across diverse domains. We also demonstrate efficient zero-shot domain adaptation to data from diverse surgical specialties and databases.
comment: 28 pages, 10 figures
Computer Vision and Pattern Recognition 150
☆ HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.
comment: Project page and code: https://holo-cine.github.io/
☆ LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas
Despite their impressive visual fidelity, existing personalized generative models lack interactive control over spatial composition and scale poorly to multiple subjects. To address these limitations, we present LayerComposer, an interactive framework for personalized, multi-subject text-to-image generation. Our approach introduces two main contributions: (1) a layered canvas, a novel representation in which each subject is placed on a distinct layer, enabling occlusion-free composition; and (2) a locking mechanism that preserves selected layers with high fidelity while allowing the remaining layers to adapt flexibly to the surrounding context. Similar to professional image-editing software, the proposed layered canvas allows users to place, resize, or lock input subjects through intuitive layer manipulation. Our versatile locking mechanism requires no architectural changes, relying instead on inherent positional embeddings combined with a new complementary data sampling strategy. Extensive experiments demonstrate that LayerComposer achieves superior spatial control and identity preservation compared to the state-of-the-art methods in multi-subject personalized image generation.
comment: 9 pages, preprint
☆ Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge
Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.
☆ GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation
This paper presents GSWorld, a robust, photo-realistic simulator for robotics manipulation that combines 3D Gaussian Splatting with physics engines. Our framework advocates "closing the loop" of developing manipulation policies with reproducible evaluation of policies learned from real-robot data and sim2real policy training without using real robots. To enable photo-realistic rendering of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian Scene Description File), that infuses Gaussian-on-Mesh representation with robot URDF and other objects. With a streamlined reconstruction pipeline, we curate a database of GSDF that contains 3 robot embodiments for single-arm and bimanual manipulation, as well as more than 40 objects. Combining GSDF with physics engines, we demonstrate several immediate interesting applications: (1) learning zero-shot sim2real pixel-to-action manipulation policy with photo-realistic rendering, (2) automated high-quality DAgger data collection for adapting policies to deployment environments, (3) reproducible benchmarking of real-robot manipulation policies in simulation, (4) simulation data collection by virtual teleoperation, and (5) zero-shot sim2real visual reinforcement learning. Website: https://3dgsworld.github.io/.
☆ SpectraMorph: Structured Latent Learning for Self-Supervised Hyperspectral Super-Resolution
Hyperspectral sensors capture dense spectra per pixel but suffer from low spatial resolution, causing blurred boundaries and mixed-pixel effects. Co-registered companion sensors such as multispectral, RGB, or panchromatic cameras provide high-resolution spatial detail, motivating hyperspectral super-resolution through the fusion of hyperspectral and multispectral images (HSI-MSI). Existing deep learning based methods achieve strong performance but rely on opaque regressors that lack interpretability and often fail when the MSI has very few bands. We propose SpectraMorph, a physics-guided self-supervised fusion framework with a structured latent space. Instead of direct regression, SpectraMorph enforces an unmixing bottleneck: endmember signatures are extracted from the low-resolution HSI, and a compact multilayer perceptron predicts abundance-like maps from the MSI. Spectra are reconstructed by linear mixing, with training performed in a self-supervised manner via the MSI sensor's spectral response function. SpectraMorph produces interpretable intermediates, trains in under a minute, and remains robust even with a single-band (pan-chromatic) MSI. Experiments on synthetic and real-world datasets show SpectraMorph consistently outperforming state-of-the-art unsupervised/self-supervised baselines while remaining very competitive against supervised baselines.
☆ Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict
☆ Real Deep Research for AI, Robotics and Beyond
With the rapid growth of research in AI and robotics now producing over 10,000 papers annually it has become increasingly difficult for researchers to stay up to date. Fast evolving trends, the rise of interdisciplinary work, and the need to explore domains beyond one's expertise all contribute to this challenge. To address these issues, we propose a generalizable pipeline capable of systematically analyzing any research area: identifying emerging trends, uncovering cross domain opportunities, and offering concrete starting points for new inquiry. In this work, we present Real Deep Research (RDR) a comprehensive framework applied to the domains of AI and robotics, with a particular focus on foundation models and robotics advancements. We also briefly extend our analysis to other areas of science. The main paper details the construction of the RDR pipeline, while the appendix provides extensive results across each analyzed topic. We hope this work sheds light for researchers working in the field of AI and beyond.
comment: website: https://realdeepresearch.github.io
☆ Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers
Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time; a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction, utilizing continuous pixel-space representations for video prediction. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time horizon for physically accurate predictions by up to 50% when compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments to identify network regions that encode information useful to perform accurate estimations of PDE simulation parameters via probing models, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter efficient, and interpretable approach.
comment: 14 pages, 14 figures
☆ ARGenSeg: Image Segmentation with Autoregressive Image Generation Model NeurIPS 2025
We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary points representation or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLM based on image generation, which naturally produces dense masks for target objects. We leverage MLLM to output visual tokens and detokenize them into images using an universal VQ-VAE, making the segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy to generate required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.
comment: Accepted to NeurIPS 2025, 18 pages
☆ Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples
Recently, Sharma et al. suggested a method called Layer-SElective-Rank reduction (LASER) which demonstrated that pruning high-order components of carefully chosen LLM's weight matrices can boost downstream accuracy -- without any gradient-based fine-tuning. Yet LASER's exhaustive, per-matrix search (each requiring full-dataset forward passes) makes it impractical for rapid deployment. We demonstrate that this overhead can be removed and find that: (i) Only a small, carefully chosen subset of matrices needs to be inspected -- eliminating the layer-by-layer sweep, (ii) The gradient of each matrix's singular values pinpoints which matrices merit reduction, (iii) Increasing the factorization search space by allowing matrices rows to cluster around multiple subspaces and then decomposing each cluster separately further reduces overfitting on the original training data and further lifts accuracy by up to 24.6 percentage points, and finally, (iv) we discover that evaluating on just 100 samples rather than the full training data -- both for computing the indicative gradients and for measuring the final accuracy -- suffices to further reduce the search time; we explain that as adaptation to downstream tasks is dominated by prompting style, not dataset size. As a result, we show that combining these findings yields a fast and robust adaptation algorithm for downstream tasks. Overall, with a single gradient step on 100 examples and a quick scan of the top candidate layers and factorization techniques, we can adapt LLMs to new datasets -- entirely without fine-tuning.
☆ Radar-Camera Fused Multi-Object Tracking: Online Calibration and Common Feature
This paper presents a Multi-Object Tracking (MOT) framework that fuses radar and camera data to enhance tracking efficiency while minimizing manual interventions. Contrary to many studies that underutilize radar and assign it a supplementary role--despite its capability to provide accurate range/depth information of targets in a world 3D coordinate system--our approach positions radar in a crucial role. Meanwhile, this paper utilizes common features to enable online calibration to autonomously associate detections from radar and camera. The main contributions of this work include: (1) the development of a radar-camera fusion MOT framework that exploits online radar-camera calibration to simplify the integration of detection results from these two sensors, (2) the utilization of common features between radar and camera data to accurately derive real-world positions of detected objects, and (3) the adoption of feature matching and category-consistency checking to surpass the limitations of mere position matching in enhancing sensor association accuracy. To the best of our knowledge, we are the first to investigate the integration of radar-camera common features and their use in online calibration for achieving MOT. The efficacy of our framework is demonstrated by its ability to streamline the radar-camera mapping process and improve tracking precision, as evidenced by real-world experiments conducted in both controlled environments and actual traffic scenarios. Code is available at https://github.com/radar-lab/Radar_Camera_MOT
comment: accepted to IEEE Transactions on Intelligent Transportation Systems (T-ITS)
☆ CUPID: Pose-Grounded Generative 3D Reconstruction from a Single Image
This work proposes a new generation-based 3D reconstruction method, named Cupid, that accurately infers the camera pose, 3D shape, and texture of an object from a single 2D image. Cupid casts 3D reconstruction as a conditional sampling process from a learned distribution of 3D objects, and it jointly generates voxels and pixel-voxel correspondences, enabling robust pose and shape estimation under a unified generative framework. By representing both input camera poses and 3D shape as a distribution in a shared 3D latent space, Cupid adopts a two-stage flow matching pipeline: (1) a coarse stage that produces initial 3D geometry with associated 2D projections for pose recovery; and (2) a refinement stage that integrates pose-aligned image features to enhance structural fidelity and appearance details. Extensive experiments demonstrate Cupid outperforms leading 3D reconstruction methods with an over 3 dB PSNR gain and an over 10% Chamfer Distance reduction, while matching monocular estimators on pose accuracy and delivering superior visual fidelity over baseline 3D generative models. For an immersive view of the 3D results generated by Cupid, please visit cupid3d.github.io.
comment: project page at https://cupid3d.github.io
☆ AlphaFlow: Understanding and Improving MeanFlow Models
MeanFlow has recently emerged as a powerful framework for few-step generative modeling trained from scratch, but its success is not yet fully understood. In this work, we show that the MeanFlow objective naturally decomposes into two parts: trajectory flow matching and trajectory consistency. Through gradient analysis, we find that these terms are strongly negatively correlated, causing optimization conflict and slow convergence. Motivated by these insights, we introduce $\alpha$-Flow, a broad family of objectives that unifies trajectory flow matching, Shortcut Model, and MeanFlow under one formulation. By adopting a curriculum strategy that smoothly anneals from trajectory flow matching to MeanFlow, $\alpha$-Flow disentangles the conflicting objectives, and achieves better convergence. When trained from scratch on class-conditional ImageNet-1K 256x256 with vanilla DiT backbones, $\alpha$-Flow consistently outperforms MeanFlow across scales and settings. Our largest $\alpha$-Flow-XL/2+ model achieves new state-of-the-art results using vanilla DiT backbones, with FID scores of 2.58 (1-NFE) and 2.15 (2-NFE).
☆ DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high-frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching their frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that exceed the training resolution dramatically, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. Project page is available at https://noamissachar.github.io/DyPE/.
☆ MEIcoder: Decoding Visual Stimuli from Neural Activity by Leveraging Most Exciting Inputs NeurIPS 2025
Decoding visual stimuli from neural population activity is crucial for understanding the brain and for applications in brain-machine interfaces. However, such biological data is often scarce, particularly in primates or humans, where high-throughput recording techniques, such as two-photon imaging, remain challenging or impossible to apply. This, in turn, poses a challenge for deep learning decoding techniques. To overcome this, we introduce MEIcoder, a biologically informed decoding method that leverages neuron-specific most exciting inputs (MEIs), a structural similarity index measure loss, and adversarial training. MEIcoder achieves state-of-the-art performance in reconstructing visual stimuli from single-cell activity in primary visual cortex (V1), especially excelling on small datasets with fewer recorded neurons. Using ablation studies, we demonstrate that MEIs are the main drivers of the performance, and in scaling experiments, we show that MEIcoder can reconstruct high-fidelity natural-looking images from as few as 1,000-2,500 neurons and less than 1,000 training data points. We also propose a unified benchmark with over 160,000 samples to foster future research. Our results demonstrate the feasibility of reliable decoding in early visual system and provide practical insights for neuroscience and neuroengineering applications.
comment: Accepted to NeurIPS 2025
☆ ACS-SegNet: An Attention-Based CNN-SegFormer Segmentation Network for Tissue Segmentation in Histopathology
Automated histopathological image analysis plays a vital role in computer-aided diagnosis of various diseases. Among developed algorithms, deep learning-based approaches have demonstrated excellent performance in multiple tasks, including semantic tissue segmentation in histological images. In this study, we propose a novel approach based on attention-driven feature fusion of convolutional neural networks (CNNs) and vision transformers (ViTs) within a unified dual-encoder model to improve semantic segmentation performance. Evaluation on two publicly available datasets showed that our model achieved {\mu}IoU/{\mu}Dice scores of 76.79%/86.87% on the GCPS dataset and 64.93%/76.60% on the PUMA dataset, outperforming state-of-the-art and baseline benchmarks. The implementation of our method is publicly available in a GitHub repository: https://github.com/NimaTorbati/ACS-SegNet
comment: 5 pages
☆ AutoScape: Geometry-Consistent Long-Horizon Scene Generation ICCV 2025
This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with a warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6\% and 43.0\%, respectively.
comment: ICCV 2025. Project page: https://auto-scape.github.io
☆ ALICE-LRI: A General Method for Lossless Range Image Generation for Spinning LiDAR Sensors without Calibration Metadata
3D LiDAR sensors are essential for autonomous navigation, environmental monitoring, and precision mapping in remote sensing applications. To efficiently process the massive point clouds generated by these sensors, LiDAR data is often projected into 2D range images that organize points by their angular positions and distances. While these range image representations enable efficient processing, conventional projection methods suffer from fundamental geometric inconsistencies that cause irreversible information loss, compromising high-fidelity applications. We present ALICE-LRI (Automatic LiDAR Intrinsic Calibration Estimation for Lossless Range Images), the first general, sensor-agnostic method that achieves lossless range image generation from spinning LiDAR point clouds without requiring manufacturer metadata or calibration files. Our algorithm automatically reverse-engineers the intrinsic geometry of any spinning LiDAR sensor by inferring critical parameters including laser beam configuration, angular distributions, and per-beam calibration corrections, enabling lossless projection and complete point cloud reconstruction with zero point loss. Comprehensive evaluation across the complete KITTI and DurLAR datasets demonstrates that ALICE-LRI achieves perfect point preservation, with zero points lost across all point clouds. Geometric accuracy is maintained well within sensor precision limits, establishing geometric losslessness with real-time performance. We also present a compression case study that validates substantial downstream benefits, demonstrating significant quality improvements in practical applications. This paradigm shift from approximate to lossless LiDAR projections opens new possibilities for high-precision remote sensing applications requiring complete geometric preservation.
☆ Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose \texttt{MixKV}, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. \texttt{MixKV} adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that \texttt{MixKV} consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), \texttt{MixKV} improves baseline methods by an average of \textbf{5.1\%} across five multi-modal understanding benchmarks and achieves remarkable gains of \textbf{8.0\%} and \textbf{9.0\%} for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, \texttt{MixKV} extends seamlessly to LLMs with comparable performance gains. Our code is available at \href{https://github.com/xuyang-liu16/MixKV}{\textcolor{citeblue}{https://github.com/xuyang-liu16/MixKV}}.
comment: Our code is available at https://github.com/xuyang-liu16/MixKV
☆ Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward
Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results highlight future visual reasoning models should focus on integrating a broader set of specialized tools for analyzing visual content. Our system achieves significant gains (+10.3 on MMMU, +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models. We will release our framework and evaluation suite to facilitate future research.
comment: 5 pages
☆ Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling
Multi-bit quantization networks enable flexible deployment of deep neural networks by supporting multiple precision levels within a single model. However, existing approaches suffer from significant training overhead as full-dataset updates are repeated for each supported bit-width, resulting in a cost that scales linearly with the number of precisions. Additionally, extra fine-tuning stages are often required to support additional or intermediate precision options, further compounding the overall training burden. To address this issue, we propose two techniques that greatly reduce the training overhead without compromising model utility: (i) Weight bias correction enables shared batch normalization and eliminates the need for fine-tuning by neutralizing quantization-induced bias across bit-widths and aligning activation distributions; and (ii) Bit-wise coreset sampling strategy allows each child model to train on a compact, informative subset selected via gradient-based importance scores by exploiting the implicit knowledge transfer phenomenon. Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with both ResNet and ViT architectures demonstrate that our method achieves competitive or superior accuracy while reducing training time up to 7.88x. Our code is released at https://github.com/a2jinhee/EMQNet_jk.
☆ HybridSOMSpikeNet: A Deep Model with Differentiable Soft Self-Organizing Maps and Spiking Dynamics for Waste Classification
Accurate waste classification is vital for achieving sustainable waste management and reducing the environmental footprint of urbanization. Misclassification of recyclable materials contributes to landfill accumulation, inefficient recycling, and increased greenhouse gas emissions. To address these issues, this study introduces HybridSOMSpikeNet, a hybrid deep learning framework that integrates convolutional feature extraction, differentiable self-organization, and spiking-inspired temporal processing to enable intelligent and energy-efficient waste classification. The proposed model employs a pre-trained ResNet-152 backbone to extract deep spatial representations, followed by a Differentiable Soft Self-Organizing Map (Soft-SOM) that enhances topological clustering and interpretability. A spiking neural head accumulates temporal activations over discrete time steps, improving robustness and generalization. Trained on a ten-class waste dataset, HybridSOMSpikeNet achieved a test accuracy of 97.39%, outperforming several state-of-the-art architectures while maintaining a lightweight computational profile suitable for real-world deployment. Beyond its technical innovations, the framework provides tangible environmental benefits. By enabling precise and automated waste segregation, it supports higher recycling efficiency, reduces contamination in recyclable streams, and minimizes the ecological and operational costs of waste processing. The approach aligns with global sustainability priorities, particularly the United Nations Sustainable Development Goals (SDG 11 and SDG 12), by contributing to cleaner cities, circular economy initiatives, and intelligent environmental management systems.
☆ UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset NeurIPS 2025
Ultra-high-resolution (UHR) text-to-image (T2I) generation has seen notable progress. However, two key challenges remain : 1) the absence of a large-scale high-quality UHR T2I dataset, and (2) the neglect of tailored training strategies for fine-grained detail synthesis in UHR scenarios. To tackle the first challenge, we introduce \textbf{UltraHR-100K}, a high-quality dataset of 100K UHR images with rich captions, offering diverse content and strong visual fidelity. Each image exceeds 3K resolution and is rigorously curated based on detail richness, content complexity, and aesthetic quality. To tackle the second challenge, we propose a frequency-aware post-training method that enhances fine-detail generation in T2I diffusion models. Specifically, we design (i) \textit{Detail-Oriented Timestep Sampling (DOTS)} to focus learning on detail-critical denoising steps, and (ii) \textit{Soft-Weighting Frequency Regularization (SWFR)}, which leverages Discrete Fourier Transform (DFT) to softly constrain frequency components, encouraging high-frequency detail preservation. Extensive experiments on our proposed UltraHR-eval4K benchmarks demonstrate that our approach significantly improves the fine-grained detail quality and overall fidelity of UHR image generation. The code is available at \href{https://github.com/NJU-PCALab/UltraHR-100k}{here}.
comment: Accepted by NeurIPS 2025
☆ Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging NeurIPS 2025
Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state-of-the-art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512*512*241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. The codebase is available at: https://github.com/ibrahimethemhamamci/BTB3D
comment: NeurIPS 2025
☆ Deep Learning in Dental Image Analysis: A Systematic Review of Datasets, Methodologies, and Emerging Challenges
Efficient analysis and processing of dental images are crucial for dentists to achieve accurate diagnosis and optimal treatment planning. However, dental imaging inherently poses several challenges, such as low contrast, metallic artifacts, and variations in projection angles. Combined with the subjectivity arising from differences in clinicians' expertise, manual interpretation often proves time-consuming and prone to inconsistency. Artificial intelligence (AI)-based automated dental image analysis (DIA) offers a promising solution to these issues and has become an integral part of computer-aided dental diagnosis and treatment. Among various AI technologies, deep learning (DL) stands out as the most widely applied and influential approach due to its superior feature extraction and representation capabilities. To comprehensively summarize recent progress in this field, we focus on the two fundamental aspects of DL research-datasets and models. In this paper, we systematically review 260 studies on DL applications in DIA, including 49 papers on publicly available dental datasets and 211 papers on DL-based algorithms. We first introduce the basic concepts of dental imaging and summarize the characteristics and acquisition methods of existing datasets. Then, we present the foundational techniques of DL and categorize relevant models and algorithms according to different DIA tasks, analyzing their network architectures, optimization strategies, training methods, and performance. Furthermore, we summarize commonly used training and evaluation metrics in the DIA domain. Finally, we discuss the current challenges of existing research and outline potential future directions. We hope that this work provides a valuable and systematic reference for researchers in this field. All supplementary materials and detailed comparison tables will be made publicly available on GitHub.
comment: 52 pages, 24 figures. Under Review
☆ SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding
Long video understanding remains challenging due to its complex, diverse, and temporally scattered content. Although video large language models (Video-LLMs) can process videos lasting tens of minutes, applying them to truly long sequences is computationally prohibitive and often leads to unfocused or inconsistent reasoning. A promising solution is to select only the most informative frames, yet existing approaches typically ignore temporal dependencies or rely on unimodal evidence, limiting their ability to provide complete and query-relevant context. We propose a Semantic-Visual Consensus Evidence Selection (SeViCES) framework for effective and reliable long video understanding. SeViCES is training-free and model-agnostic, and introduces two key components. The Semantic-Visual Consensus Frame Selection (SVCFS) module selects frames through (1) a temporal-aware semantic branch that leverages LLM reasoning over captions, and (2) a cluster-guided visual branch that aligns embeddings with semantic scores via mutual information. The Answer Consensus Refinement (ACR) module further resolves inconsistencies between semantic- and visual-based predictions by fusing evidence and constraining the answer space. Extensive experiments on long video understanding benchmarks show that SeViCES consistently outperforms state-of-the-art methods in both accuracy and robustness, demonstrating the importance of consensus-driven evidence selection for Video-LLMs.
☆ OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects NeurIPS 2025
Free-moving object reconstruction from monocular video remains challenging, particularly without reliable pose or depth cues and under arbitrary object motion. We introduce OnlineSplatter, a novel online feed-forward framework generating high-quality, object-centric 3D Gaussians directly from RGB frames without requiring camera pose, depth priors, or bundle optimization. Our approach anchors reconstruction using the first frame and progressively refines the object representation through a dense Gaussian primitive field, maintaining constant computational cost regardless of video sequence length. Our core contribution is a dual-key memory module combining latent appearance-geometry keys with explicit directional keys, robustly fusing current frame features with temporally aggregated object states. This design enables effective handling of free-moving objects via spatial-guided memory readout and an efficient sparsification mechanism, ensuring comprehensive yet compact object coverage. Evaluations on real-world datasets demonstrate that OnlineSplatter significantly outperforms state-of-the-art pose-free reconstruction baselines, consistently improving with more observations while maintaining constant memory and runtime.
comment: NeurIPS 2025 (Spotlight)
☆ Unsupervised Domain Adaptation via Similarity-based Prototypes for Cross-Modality Segmentation MICCAI 2021
Deep learning models have achieved great success on various vision challenges, but a well-trained model would face drastic performance degradation when applied to unseen data. Since the model is sensitive to domain shift, unsupervised domain adaptation attempts to reduce the domain gap and avoid costly annotation of unseen domains. This paper proposes a novel framework for cross-modality segmentation via similarity-based prototypes. In specific, we learn class-wise prototypes within an embedding space, then introduce a similarity constraint to make these prototypes representative for each semantic class while separable from different classes. Moreover, we use dictionaries to store prototypes extracted from different images, which prevents the class-missing problem and enables the contrastive learning of prototypes, and further improves performance. Extensive experiments show that our method achieves better results than other state-of-the-art methods.
comment: MICCAI 2021
☆ GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation Models
Recent years have seen impressive advances in text-to-image generation, with image generative or unified models producing high-quality images from text. Yet these models still struggle with fine-grained color controllability, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assess color precision. Color is fundamental to human visual perception and communication, critical for applications from art to design workflows requiring brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities such as interpreting RGB values or aligning with human expectations. To this end, we propose GenColorBench, the first comprehensive benchmark for text-to-image color generation, grounded in color systems like ISCC-NBS and CSS3/X11, including numerical colors which are absent elsewhere. With 44K color-focused prompts covering 400+ colors, it reveals models' true capabilities via perceptual and automated assessments. Evaluations of popular text-to-image models using GenColorBench show performance variations, highlighting which color conventions models understand best and identifying failure modes. Our GenColorBench assessments will guide improvements in precise color generation. The benchmark will be made public upon acceptance.
☆ Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% on the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
☆ EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence
The realization of Artificial General Intelligence (AGI) necessitates Embodied AI agents capable of robust spatial perception, effective task planning, and adaptive execution in physical environments. However, current large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks suffer from key limitations, including a significant gap between model design and agent requirements, an unavoidable trade-off between real-time latency and performance, and the use of unauthentic, offline evaluation metrics. To address these challenges, we propose EmbodiedBrain, a novel vision-language foundation model available in both 7B and 32B parameter sizes. Our framework features an agent-aligned data structure and employs a powerful training methodology that integrates large-scale Supervised Fine-Tuning (SFT) with Step-Augumented Group Relative Policy Optimization (Step-GRPO), which boosts long-horizon task success by integrating preceding steps as Guided Precursors. Furthermore, we incorporate a comprehensive reward system, including a Generative Reward Model (GRM) accelerated at the infrastructure level, to improve training efficiency. For enable thorough validation, we establish a three-part evaluation system encompassing General, Planning, and End-to-End Simulation Benchmarks, highlighted by the proposal and open-sourcing of a novel, challenging simulation environment. Experimental results demonstrate that EmbodiedBrain achieves superior performance across all metrics, establishing a new state-of-the-art for embodied foundation models. Towards paving the way for the next generation of generalist embodied agents, we open-source all of our data, model weight, and evaluating methods, which are available at https://zterobot.github.io/EmbodiedBrain.github.io.
☆ From Far and Near: Perceptual Evaluation of Crowd Representations Across Levels of Detail
In this paper, we investigate how users perceive the visual quality of crowd character representations at different levels of detail (LoD) and viewing distances. Each representation: geometric meshes, image-based impostors, Neural Radiance Fields (NeRFs), and 3D Gaussians, exhibits distinct trade-offs between visual fidelity and computational performance. Our qualitative and quantitative results provide insights to guide the design of perceptually optimized LoD strategies for crowd rendering.
☆ From Cheap to Pro: A Learning-based Adaptive Camera Parameter Network for Professional-Style Imaging
Consumer-grade camera systems often struggle to maintain stable image quality under complex illumination conditions such as low light, high dynamic range, and backlighting, as well as spatial color temperature variation. These issues lead to underexposure, color casts, and tonal inconsistency, which degrade the performance of downstream vision tasks. To address this, we propose ACamera-Net, a lightweight and scene-adaptive camera parameter adjustment network that directly predicts optimal exposure and white balance from RAW inputs. The framework consists of two modules: ACamera-Exposure, which estimates ISO to alleviate underexposure and contrast loss, and ACamera-Color, which predicts correlated color temperature and gain factors for improved color consistency. Optimized for real-time inference on edge devices, ACamera-Net can be seamlessly integrated into imaging pipelines. Trained on diverse real-world data with annotated references, the model generalizes well across lighting conditions. Extensive experiments demonstrate that ACamera-Net consistently enhances image quality and stabilizes perception outputs, outperforming conventional auto modes and lightweight baselines without relying on additional image enhancement modules.
comment: 13 pages. Code and project page will be released
☆ Deep Learning-Powered Visual SLAM Aimed at Assisting Visually Impaired Navigation
Despite advancements in SLAM technologies, robust operation under challenging conditions such as low-texture, motion-blur, or challenging lighting remains an open challenge. Such conditions are common in applications such as assistive navigation for the visually impaired. These challenges undermine localization accuracy and tracking stability, reducing navigation reliability and safety. To overcome these limitations, we present SELM-SLAM3, a deep learning-enhanced visual SLAM framework that integrates SuperPoint and LightGlue for robust feature extraction and matching. We evaluated our framework using TUM RGB-D, ICL-NUIM, and TartanAir datasets, which feature diverse and challenging scenarios. SELM-SLAM3 outperforms conventional ORB-SLAM3 by an average of 87.84% and exceeds state-of-the-art RGB-D SLAM systems by 36.77%. Our framework demonstrates enhanced performance under challenging conditions, such as low-texture scenes and fast motion, providing a reliable platform for developing navigation aids for the visually impaired.
comment: 8 pages, 7 figures, 4 tables
☆ Blur2seq: Blind Deblurring and Camera Trajectory Estimation from a Single Camera Motion-blurred Image
Motion blur caused by camera shake, particularly under large or rotational movements, remains a major challenge in image restoration. We propose a deep learning framework that jointly estimates the latent sharp image and the underlying camera motion trajectory from a single blurry image. Our method leverages the Projective Motion Blur Model (PMBM), implemented efficiently using a differentiable blur creation module compatible with modern networks. A neural network predicts a full 3D rotation trajectory, which guides a model-based restoration network trained end-to-end. This modular architecture provides interpretability by revealing the camera motion that produced the blur. Moreover, this trajectory enables the reconstruction of the sequence of sharp images that generated the observed blurry image. To further refine results, we optimize the trajectory post-inference via a reblur loss, improving consistency between the blurry input and the restored output. Extensive experiments show that our method achieves state-of-the-art performance on both synthetic and real datasets, particularly in cases with severe or spatially variant blur, where end-to-end deblurring networks struggle. Code and trained models are available at https://github.com/GuillermoCarbajal/Blur2Seq/
☆ Fake-in-Facext: Towards Fine-Grained Explainable DeepFake Analysis
The advancement of Multimodal Large Language Models (MLLMs) has bridged the gap between vision and language tasks, enabling the implementation of Explainable DeepFake Analysis (XDFA). However, current methods suffer from a lack of fine-grained awareness: the description of artifacts in data annotation is unreliable and coarse-grained, and the models fail to support the output of connections between textual forgery explanations and the visual evidence of artifacts, as well as the input of queries for arbitrary facial regions. As a result, their responses are not sufficiently grounded in Face Visual Context (Facext). To address this limitation, we propose the Fake-in-Facext (FiFa) framework, with contributions focusing on data annotation and model construction. We first define a Facial Image Concept Tree (FICT) to divide facial images into fine-grained regional concepts, thereby obtaining a more reliable data annotation pipeline, FiFa-Annotator, for forgery explanation. Based on this dedicated data annotation, we introduce a novel Artifact-Grounding Explanation (AGE) task, which generates textual forgery explanations interleaved with segmentation masks of manipulated artifacts. We propose a unified multi-task learning architecture, FiFa-MLLM, to simultaneously support abundant multimodal inputs and outputs for fine-grained Explainable DeepFake Analysis. With multiple auxiliary supervision tasks, FiFa-MLLM can outperform strong baselines on the AGE task and achieve SOTA performance on existing XDFA datasets. The code and data will be made open-source at https://github.com/lxq1000/Fake-in-Facext.
comment: 25 pages, 9 figures, 17 tables
☆ Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning
Inspired by recent advancements in LLM reasoning, the field of multimodal reasoning has seen remarkable progress, achieving significant performance gains on intricate tasks such as mathematical problem-solving. Despite this progress, current multimodal large reasoning models exhibit two key limitations. They tend to employ computationally expensive reasoning even for simple queries, leading to inefficiency. Furthermore, this focus on specialized reasoning often impairs their broader, more general understanding capabilities. In this paper, we propose Metis-HOME: a Hybrid Optimized Mixture-of-Experts framework designed to address this trade-off. Metis-HOME enables a ''Hybrid Thinking'' paradigm by structuring the original dense model into two distinct expert branches: a thinking branch tailored for complex, multi-step reasoning, and a non-thinking branch optimized for rapid, direct inference on tasks like general VQA and OCR. A lightweight, trainable router dynamically allocates queries to the most suitable expert. We instantiate Metis-HOME by adapting the Qwen2.5-VL-7B into an MoE architecture. Comprehensive evaluations reveal that our approach not only substantially enhances complex reasoning abilities but also improves the model's general capabilities, reversing the degradation trend observed in other reasoning-specialized models. Our work establishes a new paradigm for building powerful and versatile MLLMs, effectively resolving the prevalent reasoning-vs-generalization dilemma.
☆ EchoDistill: Bidirectional Concept Distillation for One-Step Diffusion Personalization
Recent advances in accelerating text-to-image (T2I) diffusion models have enabled the synthesis of high-fidelity images even in a single step. However, personalizing these models to incorporate novel concepts remains a challenge due to the limited capacity of one-step models to capture new concept distributions effectively. We propose a bidirectional concept distillation framework, EchoDistill, to enable one-step diffusion personalization (1-SDP). Our approach involves an end-to-end training process where a multi-step diffusion model (teacher) and a one-step diffusion model (student) are trained simultaneously. The concept is first distilled from the teacher model to the student, and then echoed back from the student to the teacher. During the EchoDistill, we share the text encoder between the two models to ensure consistent semantic understanding. Following this, the student model is optimized with adversarial losses to align with the real image distribution and with alignment losses to maintain consistency with the teacher's output. Furthermore, we introduce the bidirectional echoing refinement strategy, wherein the student model leverages its faster generation capability to feedback to the teacher model. This bidirectional concept distillation mechanism not only enhances the student ability to personalize novel concepts but also improves the generative quality of the teacher model. Our experiments demonstrate that this collaborative framework significantly outperforms existing personalization methods over the 1-SDP setup, establishing a novel paradigm for rapid and effective personalization in T2I diffusion models.
comment: Project page available at https://liulisixin.github.io/EchoDistill-page/
☆ Reliable and Reproducible Demographic Inference for Fairness in Face Analysis
Fairness evaluation in face analysis systems (FAS) typically depends on automatic demographic attribute inference (DAI), which itself relies on predefined demographic segmentation. However, the validity of fairness auditing hinges on the reliability of the DAI process. We begin by providing a theoretical motivation for this dependency, showing that improved DAI reliability leads to less biased and lower-variance estimates of FAS fairness. To address this, we propose a fully reproducible DAI pipeline that replaces conventional end-to-end training with a modular transfer learning approach. Our design integrates pretrained face recognition encoders with non-linear classification heads. We audit this pipeline across three dimensions: accuracy, fairness, and a newly introduced notion of robustness, defined via intra-identity consistency. The proposed robustness metric is applicable to any demographic segmentation scheme. We benchmark the pipeline on gender and ethnicity inference across multiple datasets and training setups. Our results show that the proposed method outperforms strong baselines, particularly on ethnicity, which is the more challenging attribute. To promote transparency and reproducibility, we will publicly release the training dataset metadata, full codebase, pretrained models, and evaluation toolkit. This work contributes a reliable foundation for demographic inference in fairness auditing.
☆ Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that includes frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.
☆ Transferable Black-Box One-Shot Forging of Watermarks via Image Preference Models NeurIPS 2025
Recent years have seen a surge in interest in digital content watermarking techniques, driven by the proliferation of generative models and increased legal pressure. With an ever-growing percentage of AI-generated content available online, watermarking plays an increasingly important role in ensuring content authenticity and attribution at scale. There have been many works assessing the robustness of watermarking to removal attacks, yet, watermark forging, the scenario when a watermark is stolen from genuine content and applied to malicious content, remains underexplored. In this work, we investigate watermark forging in the context of widely used post-hoc image watermarking. Our contributions are as follows. First, we introduce a preference model to assess whether an image is watermarked. The model is trained using a ranking loss on purely procedurally generated images without any need for real watermarks. Second, we demonstrate the model's capability to remove and forge watermarks by optimizing the input image through backpropagation. This technique requires only a single watermarked image and works without knowledge of the watermarking model, making our attack much simpler and more practical than attacks introduced in related work. Third, we evaluate our proposed method on a variety of post-hoc image watermarking models, demonstrating that our approach can effectively forge watermarks, questioning the security of current watermarking approaches. Our code and further resources are publicly available.
comment: NeurIPS 2025
☆ Dynamic Weight Adjustment for Knowledge Distillation: Leveraging Vision Transformer for High-Accuracy Lung Cancer Detection and Real-Time Deployment
This paper presents the FuzzyDistillViT-MobileNet model, a novel approach for lung cancer (LC) classification, leveraging dynamic fuzzy logic-driven knowledge distillation (KD) to address uncertainty and complexity in disease diagnosis. Unlike traditional models that rely on static KD with fixed weights, our method dynamically adjusts the distillation weight using fuzzy logic, enabling the student model to focus on high-confidence regions while reducing attention to ambiguous areas. This dynamic adjustment improves the model ability to handle varying uncertainty levels across different regions of LC images. We employ the Vision Transformer (ViT-B32) as the instructor model, which effectively transfers knowledge to the student model, MobileNet, enhancing the student generalization capabilities. The training process is further optimized using a dynamic wait adjustment mechanism that adapts the training procedure for improved convergence and performance. To enhance image quality, we introduce pixel-level image fusion improvement techniques such as Gamma correction and Histogram Equalization. The processed images (Pix1 and Pix2) are fused using a wavelet-based fusion method to improve image resolution and feature preservation. This fusion method uses the wavedec2 function to standardize images to a 224x224 resolution, decompose them into multi-scale frequency components, and recursively average coefficients at each level for better feature representation. To address computational efficiency, Genetic Algorithm (GA) is used to select the most suitable pre-trained student model from a pool of 12 candidates, balancing model performance with computational cost. The model is evaluated on two datasets, including LC25000 histopathological images (99.16% accuracy) and IQOTH/NCCD CT-scan images (99.54% accuracy), demonstrating robustness across different imaging domains.
☆ Mitigating Cross-modal Representation Bias for Multicultural Image-to-Recipe Retrieval
Existing approaches for image-to-recipe retrieval have the implicit assumption that a food image can fully capture the details textually documented in its recipe. However, a food image only reflects the visual outcome of a cooked dish and not the underlying cooking process. Consequently, learning cross-modal representations to bridge the modality gap between images and recipes tends to ignore subtle, recipe-specific details that are not visually apparent but are crucial for recipe retrieval. Specifically, the representations are biased to capture the dominant visual elements, resulting in difficulty in ranking similar recipes with subtle differences in use of ingredients and cooking methods. The bias in representation learning is expected to be more severe when the training data is mixed of images and recipes sourced from different cuisines. This paper proposes a novel causal approach that predicts the culinary elements potentially overlooked in images, while explicitly injecting these elements into cross-modal representation learning to mitigate biases. Experiments are conducted on the standard monolingual Recipe1M dataset and a newly curated multilingual multicultural cuisine dataset. The results indicate that the proposed causal representation learning is capable of uncovering subtle ingredients and cooking actions and achieves impressive retrieval performance on both monolingual and multilingual multicultural datasets.
comment: ACM Multimedia 2025
☆ Positional Encoding Field
Diffusion Transformers (DiTs) have emerged as the dominant architecture for visual generation, powering state-of-the-art image and video models. By representing images as patch tokens with positional encodings (PEs), DiTs combine Transformer scalability with spatial and temporal inductive biases. In this work, we revisit how DiTs organize visual content and discover that patch tokens exhibit a surprising degree of independence: even when PEs are perturbed, DiTs still produce globally coherent outputs, indicating that spatial coherence is primarily governed by PEs. Motivated by this finding, we introduce the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a structured 3D field. PE-Field incorporates depth-aware encodings for volumetric reasoning and hierarchical encodings for fine-grained sub-patch control, enabling DiTs to model geometry directly in 3D space. Our PE-Field-augmented DiT achieves state-of-the-art performance on single-image novel view synthesis and generalizes to controllable spatial image editing.
comment: 8 pages, 9 figures
☆ Synthetic Data for Robust Runway Detection
Deep vision models are now mature enough to be integrated in industrial and possibly critical applications such as autonomous navigation. Yet, data collection and labeling to train such models requires too much efforts and costs for a single company or product. This drawback is more significant in critical applications, where training data must include all possible conditions including rare scenarios. In this perspective, generating synthetic images is an appealing solution, since it allows a cheap yet reliable covering of all the conditions and environments, if the impact of the synthetic-to-real distribution shift is mitigated. In this article, we consider the case of runway detection that is a critical part in autonomous landing systems developed by aircraft manufacturers. We propose an image generation approach based on a commercial flight simulator that complements a few annotated real images. By controlling the image generation and the integration of real and synthetic data, we show that standard object detection models can achieve accurate prediction. We also evaluate their robustness with respect to adverse conditions, in our case nighttime images, that were not represented in the real data, and show the interest of using a customized domain adaptation strategy.
☆ AccuQuant: Simulating Multiple Denoising Steps for Quantizing Diffusion Models NeurIPS 2025
We present in this paper a novel post-training quantization (PTQ) method, dubbed AccuQuant, for diffusion models. We show analytically and empirically that quantization errors for diffusion models are accumulated over denoising steps in a sampling process. To alleviate the error accumulation problem, AccuQuant minimizes the discrepancies between outputs of a full-precision diffusion model and its quantized version within a couple of denoising steps. That is, it simulates multiple denoising steps of a diffusion sampling process explicitly for quantization, accounting the accumulated errors over multiple denoising steps, which is in contrast to previous approaches to imitating a training process of diffusion models, namely, minimizing the discrepancies independently for each step. We also present an efficient implementation technique for AccuQuant, together with a novel objective, which reduces a memory complexity significantly from $\mathcal{O}(n)$ to $\mathcal{O}(1)$, where $n$ is the number of denoising steps. We demonstrate the efficacy and efficiency of AccuQuant across various tasks and diffusion models on standard benchmarks.
comment: Accepted to NeurIPS 2025
☆ Dino-Diffusion Modular Designs Bridge the Cross-Domain Gap in Autonomous Parking
Parking is a critical pillar of driving safety. While recent end-to-end (E2E) approaches have achieved promising in-domain results, robustness under domain shifts (e.g., weather and lighting changes) remains a key challenge. Rather than relying on additional data, in this paper, we propose Dino-Diffusion Parking (DDP), a domain-agnostic autonomous parking pipeline that integrates visual foundation models with diffusion-based planning to enable generalized perception and robust motion planning under distribution shifts. We train our pipeline in CARLA at regular setting and transfer it to more adversarial settings in a zero-shot fashion. Our model consistently achieves a parking success rate above 90% across all tested out-of-distribution (OOD) scenarios, with ablation studies confirming that both the network architecture and algorithmic design significantly enhance cross-domain performance over existing baselines. Furthermore, testing in a 3D Gaussian splatting (3DGS) environment reconstructed from a real-world parking lot demonstrates promising sim-to-real transfer.
comment: Code is at https://github.com/ChampagneAndfragrance/Dino_Diffusion_Parking_Official
☆ AnyPcc: Compressing Any Point Cloud with a Single Universal Model
Generalization remains a critical challenge for deep learning-based point cloud geometry compression. We argue this stems from two key limitations: the lack of robust context models and the inefficient handling of out-of-distribution (OOD) data. To address both, we introduce AnyPcc, a universal point cloud compression framework. AnyPcc first employs a Universal Context Model that leverages priors from both spatial and channel-wise grouping to capture robust contextual dependencies. Second, our novel Instance-Adaptive Fine-Tuning (IAFT) strategy tackles OOD data by synergizing explicit and implicit compression paradigms. It fine-tunes a small subset of network weights for each instance and incorporates them into the bitstream, where the marginal bit cost of the weights is dwarfed by the resulting savings in geometry compression. Extensive experiments on a benchmark of 15 diverse datasets confirm that AnyPcc sets a new state-of-the-art in point cloud compression. Our code and datasets will be released to encourage reproducible research.
comment: 11 pages, 5 figures
☆ HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models NeurIPS2025
Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they widely equip with, e.g., CLIP and SAM, which lack the alignment with language at multi-granularity levels. To address this issue, in this paper, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed as HyperET, which can optimize visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with M\"{o}bius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET consistently improves both existing pre-training and fine-tuning MLLMs clearly with less than 1\% additional parameters.
comment: Accepted by NeurIPS2025
☆ A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization
We present a winning solution to RoboSense 2025 Track 4: Cross-Modal Drone Navigation. The task retrieves the most relevant geo-referenced image from a large multi-platform corpus (satellite/drone/ground) given a natural-language query. Two obstacles are severe inter-platform heterogeneity and a domain gap between generic training descriptions and platform-specific test queries. We mitigate these with a domain-aligned preprocessing pipeline and a Mixture-of-Experts (MoE) framework: (i) platform-wise partitioning, satellite augmentation, and removal of orientation words; (ii) an LLM-based caption refinement pipeline to align textual semantics with the distinct visual characteristics of each platform. Using BGE-M3 (text) and EVA-CLIP (image), we train three platform experts using a progressive two-stage, hard-negative mining strategy to enhance discriminative power, and fuse their scores at inference. The system tops the official leaderboard, demonstrating robust cross-modal geo-localization under heterogeneous viewpoints.
☆ Breakdance Video classification in the age of Generative AI
Large Vision Language models have seen huge application in several sports use-cases recently. Most of these works have been targeted towards a limited subset of popular sports like soccer, cricket, basketball etc; focusing on generative tasks like visual question answering, highlight generation. This work analyzes the applicability of the modern video foundation models (both encoder and decoder) for a very niche but hugely popular dance sports - breakdance. Our results show that Video Encoder models continue to outperform state-of-the-art Video Language Models for prediction tasks. We provide insights on how to choose the encoder model and provide a thorough analysis into the workings of a finetuned decoder model for breakdance video classification.
comment: 11 pages
☆ UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning
GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior works largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a substantial 76% relative performance improvement. In this paper, we introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives and enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released in https://github.com/alibaba/UI-Ins.
☆ DMC$^3$: Dual-Modal Counterfactual Contrastive Construction for Egocentric Video Question Answering
Egocentric Video Question Answering (Egocentric VideoQA) plays an important role in egocentric video understanding, which refers to answering questions based on first-person videos. Although existing methods have made progress through the paradigm of pre-training and fine-tuning, they ignore the unique challenges posed by the first-person perspective, such as understanding multiple events and recognizing hand-object interactions. To deal with these challenges, we propose a Dual-Modal Counterfactual Contrastive Construction (DMC$^3$) framework, which contains an egocentric videoqa baseline, a counterfactual sample construction module and a counterfactual sample-involved contrastive optimization. Specifically, We first develop a counterfactual sample construction module to generate positive and negative samples for textual and visual modalities through event description paraphrasing and core interaction mining, respectively. Then, We feed these samples together with the original samples into the baseline. Finally, in the counterfactual sample-involved contrastive optimization module, we apply contrastive loss to minimize the distance between the original sample features and the positive sample features, while maximizing the distance from the negative samples. Experiments show that our method achieve 52.51\% and 46.04\% on the \textit{normal} and \textit{indirect} splits of EgoTaskQA, and 13.2\% on QAEGO4D, both reaching the state-of-the-art performance.
☆ Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition
Deep learning models for complex-valued Synthetic Aperture Radar (CV-SAR) image recognition are fundamentally constrained by a representation trilemma under data-limited and domain-shift scenarios: the concurrent, yet conflicting, optimization of generalization, interpretability, and efficiency. Our work is motivated by the premise that the rich electromagnetic scattering features inherent in CV-SAR data hold the key to resolving this trilemma, yet they are insufficiently harnessed by conventional data-driven models. To this end, we introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel "compression-aggregation-compression" architecture. The first stage performs a physics-guided compression, wherein a novel dictionary processor adaptively embeds physical priors, enabling a compact unfolding network to efficiently extract sparse, physically-grounded signatures. A subsequent aggregation module enriches these representations, followed by a final semantic compression stage that utilizes a compact classification head with self-distillation to learn maximally task-relevant and discriminative embeddings. We instantiate KINN in both CNN (0.7M) and Vision Transformer (0.95M) variants. Extensive evaluations on five SAR benchmarks confirm that KINN establishes a state-of-the-art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios and tangible interpretability, thereby providing an effective solution to the representation trilemma and offering a new path for trustworthy AI in SAR image analysis.
☆ Causal Debiasing for Visual Commonsense Reasoning
Visual Commonsense Reasoning (VCR) refers to answering questions and providing explanations based on images. While existing methods achieve high prediction accuracy, they often overlook bias in datasets and lack debiasing strategies. In this paper, our analysis reveals co-occurrence and statistical biases in both textual and visual data. We introduce the VCR-OOD datasets, comprising VCR-OOD-QA and VCR-OOD-VA subsets, which are designed to evaluate the generalization capabilities of models across two modalities. Furthermore, we analyze the causal graphs and prediction shortcuts in VCR and adopt a backdoor adjustment method to remove bias. Specifically, we create a dictionary based on the set of correct answers to eliminate prediction shortcuts. Experiments demonstrate the effectiveness of our debiasing method across different datasets.
☆ GMFVAD: Using Grained Multi-modal Feature to Improve Video Anomaly Detection
Video anomaly detection (VAD) is a challenging task that detects anomalous frames in continuous surveillance videos. Most previous work utilizes the spatio-temporal correlation of visual features to distinguish whether there are abnormalities in video snippets. Recently, some works attempt to introduce multi-modal information, like text feature, to enhance the results of video anomaly detection. However, these works merely incorporate text features into video snippets in a coarse manner, overlooking the significant amount of redundant information that may exist within the video snippets. Therefore, we propose to leverage the diversity among multi-modal information to further refine the extracted features, reducing the redundancy in visual features, and we propose Grained Multi-modal Feature for Video Anomaly Detection (GMFVAD). Specifically, we generate more grained multi-modal feature based on the video snippet, which summarizes the main content, and text features based on the captions of original video will be introduced to further enhance the visual features of highlighted portions. Experiments show that the proposed GMFVAD achieves state-of-the-art performance on four mainly datasets. Ablation experiments also validate that the improvement of GMFVAD is due to the reduction of redundant information.
☆ Real-Time Currency Detection and Voice Feedback for Visually Impaired Individuals
Technologies like smartphones have become an essential in our daily lives. It has made accessible to everyone including visually impaired individuals. With the use of smartphone cameras, image capturing and processing have become more convenient. With the use of smartphones and machine learning, the life of visually impaired can be made a little easier. Daily tasks such as handling money without relying on someone can be troublesome for them. For that purpose this paper presents a real-time currency detection system designed to assist visually impaired individuals. The proposed model is trained on a dataset containing 30 classes of notes and coins, representing 3 types of currency: US dollar (USD), Euro (EUR), and Bangladeshi taka (BDT). Our approach uses a YOLOv8 nano model with a custom detection head featuring deep convolutional layers and Squeeze-and-Excitation blocks to enhance feature extraction and detection accuracy. Our model has achieved a higher accuracy of 97.73%, recall of 95.23%, f1-score of 95.85% and a mean Average Precision at IoU=0.5 (mAP50(B)) of 97.21\%. Using the voice feedback after the detection would help the visually impaired to identify the currency. This paper aims to create a practical and efficient currency detection system to empower visually impaired individuals independent in handling money.
comment: 20 pages, 5 tables, 8 figues
☆ GUSL-Dehaze: A Green U-Shaped Learning Approach to Image Dehazing
Image dehazing is a restoration task that aims to recover a clear image from a single hazy input. Traditional approaches rely on statistical priors and the physics-based atmospheric scattering model to reconstruct the haze-free image. While recent state-of-the-art methods are predominantly based on deep learning architectures, these models often involve high computational costs and large parameter sizes, making them unsuitable for resource-constrained devices. In this work, we propose GUSL-Dehaze, a Green U-Shaped Learning approach to image dehazing. Our method integrates a physics-based model with a green learning (GL) framework, offering a lightweight, transparent alternative to conventional deep learning techniques. Unlike neural network-based solutions, GUSL-Dehaze completely avoids deep learning. Instead, we begin with an initial dehazing step using a modified Dark Channel Prior (DCP), which is followed by a green learning pipeline implemented through a U-shaped architecture. This architecture employs unsupervised representation learning for effective feature extraction, together with feature-engineering techniques such as the Relevant Feature Test (RFT) and the Least-Squares Normal Transform (LNT) to maintain a compact model size. Finally, the dehazed image is obtained via a transparent supervised learning strategy. GUSL-Dehaze significantly reduces parameter count while ensuring mathematical interpretability and achieving performance on par with state-of-the-art deep learning models.
☆ Kinaema: a recurrent sequence model for memory and pose in motion
One key aspect of spatially aware robots is the ability to "find their bearings", ie. to correctly situate themselves in previously seen spaces. In this work, we focus on this particular scenario of continuous robotics operations, where information observed before an actual episode start is exploited to optimize efficiency. We introduce a new model, Kinaema, and agent, capable of integrating a stream of visual observations while moving in a potentially large scene, and upon request, processing a query image and predicting the relative position of the shown space with respect to its current position. Our model does not explicitly store an observation history, therefore does not have hard constraints on context length. It maintains an implicit latent memory, which is updated by a transformer in a recurrent way, compressing the history of sensor readings into a compact representation. We evaluate the impact of this model in a new downstream task we call "Mem-Nav". We show that our large-capacity recurrent model maintains a useful representation of the scene, navigates to goals observed before the actual episode start, and is computationally efficient, in particular compared to classical transformers with attention over an observation history.
comment: 10 pages + references + checklist + appendix, 29 pages total
☆ Calibrating Multimodal Consensus for Emotion Recognition
In recent years, Multimodal Emotion Recognition (MER) has made substantial progress. Nevertheless, most existing approaches neglect the semantic inconsistencies that may arise across modalities, such as conflicting emotional cues between text and visual inputs. Besides, current methods are often dominated by the text modality due to its strong representational capacity, which can compromise recognition accuracy. To address these challenges, we propose a model termed Calibrated Multimodal Consensus (CMC). CMC introduces a Pseudo Label Generation Module (PLGM) to produce pseudo unimodal labels, enabling unimodal pretraining in a self-supervised fashion. It then employs a Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for multimodal finetuning, thereby mitigating text dominance and guiding the fusion process toward a more reliable consensus. Experimental results demonstrate that CMC achieves performance on par with or superior to state-of-the-art methods across four datasets, CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI, and exhibits notable advantages in scenarios with semantic inconsistencies on CH-SIMS and CH-SIMS v2. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CMC.
☆ Seeing the Unseen: Mask-Driven Positional Encoding and Strip-Convolution Context Modeling for Cross-View Object Geo-Localization
Cross-view object geo-localization enables high-precision object localization through cross-view matching, with critical applications in autonomous driving, urban management, and disaster response. However, existing methods rely on keypoint-based positional encoding, which captures only 2D coordinates while neglecting object shape information, resulting in sensitivity to annotation shifts and limited cross-view matching capability. To address these limitations, we propose a mask-based positional encoding scheme that leverages segmentation masks to capture both spatial coordinates and object silhouettes, thereby upgrading the model from "location-aware" to "object-aware." Furthermore, to tackle the challenge of large-span objects (e.g., elongated buildings) in satellite imagery, we design a context enhancement module. This module employs horizontal and vertical strip convolutional kernels to extract long-range contextual features, enhancing feature discrimination among strip-like objects. Integrating MPE and CEM, we present EDGeo, an end-to-end framework for robust cross-view object geo-localization. Extensive experiments on two public datasets (CVOGL and VIGOR-Building) demonstrate that our method achieves state-of-the-art performance, with a 3.39% improvement in localization accuracy under challenging ground-to-satellite scenarios. This work provides a robust positional encoding paradigm and a contextual modeling framework for advancing cross-view geo-localization research.
☆ Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding NeurIPS 2025
Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. This task typically comprises two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). While recent advances have been progressed by powerful pretrained vision-language models such as CLIP and InternVideo2, existing approaches commonly treat all text tokens uniformly during crossmodal attention, disregarding their distinct semantic roles. To validate the limitations of this approach, we conduct controlled experiments demonstrating that VTG models overly rely on [EOS]-driven global semantics while failing to effectively utilize word-level signals, which limits their ability to achieve fine-grained temporal alignment. Motivated by this limitation, we propose DualGround, a dual-branch architecture that explicitly separates global and local semantics by routing the [EOS] token through a sentence-level path and clustering word tokens into phrase-level units for localized grounding. Our method introduces (1) tokenrole- aware cross modal interaction strategies that align video features with sentence-level and phrase-level semantics in a structurally disentangled manner, and (2) a joint modeling framework that not only improves global sentence-level alignment but also enhances finegrained temporal grounding by leveraging structured phrase-aware context. This design allows the model to capture both coarse and localized semantics, enabling more expressive and context-aware video grounding. DualGround achieves state-of-the-art performance on both Moment Retrieval and Highlight Detection tasks across QVHighlights and Charades- STA benchmarks, demonstrating the effectiveness of disentangled semantic modeling in video-language alignment.
comment: Comments: 28 pages, including appendix. 5 figures. Full version of the NeurIPS 2025 paper
☆ COS3D: Collaborative Open-Vocabulary 3D Segmentation NeurIPS 2025
Open-vocabulary 3D segmentation is a fundamental yet challenging task, requiring a mutual understanding of both segmentation and language. However, existing Gaussian-splatting-based methods rely either on a single 3D language field, leading to inferior segmentation, or on pre-computed class-agnostic segmentations, suffering from error accumulation. To address these limitations, we present COS3D, a new collaborative prompt-segmentation framework that contributes to effectively integrating complementary language and segmentation cues throughout its entire pipeline. We first introduce the new concept of collaborative field, comprising an instance field and a language field, as the cornerstone for collaboration. During training, to effectively construct the collaborative field, our key idea is to capture the intrinsic relationship between the instance field and language field, through a novel instance-to-language feature mapping and designing an efficient two-stage training strategy. During inference, to bridge distinct characteristics of the two fields, we further design an adaptive language-to-instance prompt refinement, promoting high-quality prompt-segmentation inference. Extensive experiments not only demonstrate COS3D's leading performance over existing methods on two widely-used benchmarks but also show its high potential to various applications,~\ie, novel image-based 3D segmentation, hierarchical segmentation, and robotics. The code is publicly available at \href{https://github.com/Runsong123/COS3D}{https://github.com/Runsong123/COS3D}.
comment: NeurIPS 2025. The code is publicly available at \href{https://github.com/Runsong123/COS3D}{https://github.com/Runsong123/COS3D}
☆ Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context
Large Vision-Language Models (LVLMs) have made significant progress in recent years but are also prone to hallucination issues. They exhibit more hallucinations in longer, free-form responses, often attributed to accumulated uncertainties. In this paper, we ask: Does increased hallucination result solely from length-induced errors, or is there a deeper underlying mechanism? After a series of preliminary experiments and findings, we suggest that the risk of hallucinations is not caused by length itself but by the increased reliance on context for coherence and completeness in longer responses. Building on these insights, we propose a novel "induce-detect-suppress" framework that actively induces hallucinations through deliberately designed contexts, leverages induced instances for early detection of high-risk cases, and ultimately suppresses potential object-level hallucinations during actual decoding. Our approach achieves consistent, significant improvements across all benchmarks, demonstrating its efficacy. The strong detection and improved hallucination mitigation not only validate our framework but, more importantly, re-validate our hypothesis on context. Rather than solely pursuing performance gains, this study aims to provide new insights and serves as a first step toward a deeper exploration of hallucinations in LVLMs' longer responses.
☆ EditInfinity: Image Editing with Binary-Quantized Generative Models NeurIPS 2025
Adapting pretrained diffusion-based generative models for text-driven image editing with negligible tuning overhead has demonstrated remarkable potential. A classical adaptation paradigm, as followed by these methods, first infers the generative trajectory inversely for a given source image by image inversion, then performs image editing along the inferred trajectory guided by the target text prompts. However, the performance of image editing is heavily limited by the approximation errors introduced during image inversion by diffusion models, which arise from the absence of exact supervision in the intermediate generative steps. To circumvent this issue, we investigate the parameter-efficient adaptation of VQ-based generative models for image editing, and leverage their inherent characteristic that the exact intermediate quantized representations of a source image are attainable, enabling more effective supervision for precise image inversion. Specifically, we propose \emph{EditInfinity}, which adapts \emph{Infinity}, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation, enabling precise image inversion. Furthermore, we devise a holistic smoothing strategy which allows our \emph{EditInfinity} to perform image editing with high fidelity to source images and precise semantic alignment to the text prompts. Extensive experiments on the PIE-Bench benchmark across "add", "change", and "delete" editing operations, demonstrate the superior performance of our model compared to state-of-the-art diffusion-based baselines. Code available at: https://github.com/yx-chen-ust/EditInfinity.
comment: 28 pages, 13 figures, accepted by The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ Towards Objective Obstetric Ultrasound Assessment: Contrastive Representation Learning for Fetal Movement Detection
Accurate fetal movement (FM) detection is essential for assessing prenatal health, as abnormal movement patterns can indicate underlying complications such as placental dysfunction or fetal distress. Traditional methods, including maternal perception and cardiotocography (CTG), suffer from subjectivity and limited accuracy. To address these challenges, we propose Contrastive Ultrasound Video Representation Learning (CURL), a novel self-supervised learning framework for FM detection from extended fetal ultrasound video recordings. Our approach leverages a dual-contrastive loss, incorporating both spatial and temporal contrastive learning, to learn robust motion representations. Additionally, we introduce a task-specific sampling strategy, ensuring the effective separation of movement and non-movement segments during self-supervised training, while enabling flexible inference on arbitrarily long ultrasound recordings through a probabilistic fine-tuning approach. Evaluated on an in-house dataset of 92 subjects, each with 30-minute ultrasound sessions, CURL achieves a sensitivity of 78.01% and an AUROC of 81.60%, demonstrating its potential for reliable and objective FM analysis. These results highlight the potential of self-supervised contrastive learning for fetal movement analysis, paving the way for improved prenatal monitoring and clinical decision-making.
comment: This is the preprint version of the manuscript submitted to IEEE Journal of Biomedical and Health Informatics (JBHI) for review
☆ FlowCycle: Pursuing Cycle-Consistent Flows for Text-based Editing
Recent advances in pre-trained text-to-image flow models have enabled remarkable progress in text-based image editing. Mainstream approaches always adopt a corruption-then-restoration paradigm, where the source image is first corrupted into an ``intermediate state'' and then restored to the target image under the prompt guidance. However, current methods construct this intermediate state in a target-agnostic manner, i.e., they primarily focus on realizing source image reconstruction while neglecting the semantic gaps towards the specific editing target. This design inherently results in limited editability or inconsistency when the desired modifications substantially deviate from the source. In this paper, we argue that the intermediate state should be target-aware, i.e., selectively corrupting editing-relevant contents while preserving editing-irrelevant ones. To this end, we propose FlowCycle, a novel inversion-free and flow-based editing framework that parameterizes corruption with learnable noises and optimizes them through a cycle-consistent process. By iteratively editing the source to the target and recovering back to the source with dual consistency constraints, FlowCycle learns to produce a target-aware intermediate state, enabling faithful modifications while preserving source consistency. Extensive ablations have demonstrated that FlowCycle achieves superior editing quality and consistency over state-of-the-art methods.
☆ RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present \textbf{RAPO++}, a cross-stage prompt optimization framework that unifies training-data--aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In \textbf{Stage 1}, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. \textbf{Stage 2} introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback -- including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow -- yielding progressively improved video generation quality. \textbf{Stage 3} leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.
☆ A Structured Review and Quantitative Profiling of Public Brain MRI Datasets for Foundation Model Development
The development of foundation models for brain MRI depends critically on the scale, diversity, and consistency of available data, yet systematic assessments of these factors remain scarce. In this study, we analyze 54 publicly accessible brain MRI datasets encompassing over 538,031 to provide a structured, multi-level overview tailored to foundation model development. At the dataset level, we characterize modality composition, disease coverage, and dataset scale, revealing strong imbalances between large healthy cohorts and smaller clinical populations. At the image level, we quantify voxel spacing, orientation, and intensity distributions across 15 representative datasets, demonstrating substantial heterogeneity that can influence representation learning. We then perform a quantitative evaluation of preprocessing variability, examining how intensity normalization, bias field correction, skull stripping, spatial registration, and interpolation alter voxel statistics and geometry. While these steps improve within-dataset consistency, residual differences persist between datasets. Finally, feature-space case study using a 3D DenseNet121 shows measurable residual covariate shift after standardized preprocessing, confirming that harmonization alone cannot eliminate inter-dataset bias. Together, these analyses provide a unified characterization of variability in public brain MRI resources and emphasize the need for preprocessing-aware and domain-adaptive strategies in the design of generalizable brain MRI foundation models.
☆ Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advancements in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval methods, fusion techniques, and answer generation strategies, and analyze benchmark datasets, evaluation protocols, and performance tradeoffs. Furthermore, we highlight key challenges such as cross-modal alignment, latency-accuracy tradeoffs, and semantic grounding, and outline open problems and future research directions for building more robust and context-aware QA systems leveraging multimedia data.
comment: In Proceedings of the 2nd ACM Workshop in AI-powered Question and Answering Systems (AIQAM '25), October 27-28, 2025, Dublin, Ireland. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3746274.3760393
☆ SPAN: Continuous Modeling of Suspicion Progression for Temporal Intention Localization
Temporal Intention Localization (TIL) is crucial for video surveillance, focusing on identifying varying levels of suspicious intentions to improve security monitoring. However, existing discrete classification methods fail to capture the continuous nature of suspicious intentions, limiting early intervention and explainability. In this paper, we propose the Suspicion Progression Analysis Network (SPAN), which shifts from discrete classification to continuous regression, enabling the capture of fluctuating and evolving suspicious intentions. We reveal that suspicion exhibits long-term dependencies and cumulative effects, similar to Temporal Point Process (TPP) theory. Based on these insights, we define a suspicion score formula that models continuous changes while accounting for temporal characteristics. We also introduce Suspicion Coefficient Modulation, which adjusts suspicion coefficients using multimodal information to reflect the varying impacts of suspicious actions. Additionally, the Concept-Anchored Mapping method is proposed to link suspicious actions to predefined intention concepts, offering insights into both the actions and their potential underlying intentions. Extensive experiments on the HAI dataset show that SPAN significantly outperforms existing methods, reducing MSE by 19.8% and improving average mAP by 1.78%. Notably, SPAN achieves a 2.74% mAP gain in low-frequency cases, demonstrating its superior ability to capture subtle behavioral changes. Compared to discrete classification systems, our continuous suspicion modeling approach enables earlier detection and proactive intervention, greatly enhancing system explainability and practical utility in security applications.
☆ Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories
Large-scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general-purpose world simulators. Existing benchmarks focus on individual subjects rather than scenes with multiple interacting people. However, the plausibility of multi-agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable comparison with a ground truth video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird's-eye view trajectories from pixel-space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi-agent behavior. However, failure modes like merging and disappearing people highlight areas for future improvement.
comment: Preprint, under review
☆ PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching
Temporally consistent depth estimation from stereo video is critical for real-world applications such as augmented reality, where inconsistent depth estimation disrupts the immersion of users. Despite its importance, this task remains challenging due to the difficulty in modeling long-term temporal consistency in a computationally efficient manner. Previous methods attempt to address this by aggregating spatio-temporal information but face a fundamental trade-off: limited temporal modeling provides only modest gains, whereas capturing long-range dependencies significantly increases computational cost. To address this limitation, we introduce a memory buffer for modeling long-range spatio-temporal consistency while achieving efficient dynamic stereo matching. Inspired by the two-stage decision-making process in humans, we propose a \textbf{P}ick-and-\textbf{P}lay \textbf{M}emory (PPM) construction module for dynamic \textbf{Stereo} matching, dubbed as \textbf{PPMStereo}. PPM consists of a `pick' process that identifies the most relevant frames and a `play' process that weights the selected frames adaptively for spatio-temporal aggregation. This two-stage collaborative process maintains a compact yet highly informative memory buffer while achieving temporally consistent information aggregation. Extensive experiments validate the effectiveness of PPMStereo, demonstrating state-of-the-art performance in both accuracy and temporal consistency. % Notably, PPMStereo achieves 0.62/1.11 TEPE on the Sintel clean/final (17.3\% \& 9.02\% improvements over BiDAStereo) with fewer computational costs. Codes are available at \textcolor{blue}{https://github.com/cocowy1/PPMStereo}.
☆ IB-GAN: Disentangled Representation Learning with Information Bottleneck Generative Adversarial Networks AAAI
We propose a new GAN-based unsupervised model for disentangled representation learning. The new model is discovered in an attempt to utilize the Information Bottleneck (IB) framework to the optimization of GAN, thereby named IB-GAN. The architecture of IB-GAN is partially similar to that of InfoGAN but has a critical difference; an intermediate layer of the generator is leveraged to constrain the mutual information between the input and the generated output. The intermediate stochastic layer can serve as a learnable latent distribution that is trained with the generator jointly in an end-to-end fashion. As a result, the generator of IB-GAN can harness the latent space in a disentangled and interpretable manner. With the experiments on dSprites and Color-dSprites dataset, we demonstrate that IB-GAN achieves competitive disentanglement scores to those of state-of-the-art \b{eta}-VAEs and outperforms InfoGAN. Moreover, the visual quality and the diversity of samples generated by IB-GAN are often better than those by \b{eta}-VAEs and Info-GAN in terms of FID score on CelebA and 3D Chairs dataset.
comment: Published in the Proceedings of the Thirty Fifth AAAI Conference on Artificial Intelligence (AAAI 2021), paper number 7926
☆ TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning NeurIPS 2025
Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual knowledge from historical images for inference. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. Code will be available at https://github.com/xud-yan/TOMCAT .
comment: Accepted to NeurIPS 2025
☆ Monocular Visual 8D Pose Estimation for Articulated Bicycles and Cyclists
In Autonomous Driving, cyclists belong to the safety-critical class of Vulnerable Road Users (VRU), and accurate estimation of their pose is critical for cyclist crossing intention classification, behavior prediction, and collision avoidance. Unlike rigid objects, articulated bicycles are composed of movable rigid parts linked by joints and constrained by a kinematic structure. 6D pose methods can estimate the 3D rotation and translation of rigid bicycles, but 6D becomes insufficient when the steering/pedals angles of the bicycle vary. That is because: 1) varying the articulated pose of the bicycle causes its 3D bounding box to vary as well, and 2) the 3D box orientation is not necessarily aligned to the orientation of the steering which determines the actual intended travel direction. In this work, we introduce a method for category-level 8D pose estimation for articulated bicycles and cyclists from a single RGB image. Besides being able to estimate the 3D translation and rotation of a bicycle from a single image, our method also estimates the rotations of its steering handles and pedals with respect to the bicycle body frame. These two new parameters enable the estimation of a more fine-grained bicycle pose state and travel direction. Our proposed model jointly estimates the 8D pose and the 3D Keypoints of articulated bicycles, and trains with a mix of synthetic and real image data to generalize on real images. We include an evaluation section where we evaluate the accuracy of our estimated 8D pose parameters, and our method shows promising results by achieving competitive scores when compared against state-of-the-art category-level 6D pose estimators that use rigid canonical object templates for matching.
☆ PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding NeurIPS 2025
Understanding objects at the level of their constituent parts is fundamental to advancing computer vision, graphics, and robotics. While datasets like PartNet have driven progress in 3D part understanding, their reliance on untextured geometries and expert-dependent annotation limits scalability and usability. We introduce PartNeXt, a next-generation dataset addressing these gaps with over 23,000 high-quality, textured 3D models annotated with fine-grained, hierarchical part labels across 50 categories. We benchmark PartNeXt on two tasks: (1) class-agnostic part segmentation, where state-of-the-art methods (e.g., PartField, SAMPart3D) struggle with fine-grained and leaf-level parts, and (2) 3D part-centric question answering, a new benchmark for 3D-LLMs that reveals significant gaps in open-vocabulary part grounding. Additionally, training Point-SAM on PartNeXt yields substantial gains over PartNet, underscoring the dataset's superior quality and diversity. By combining scalable annotation, texture-aware labels, and multi-task evaluation, PartNeXt opens new avenues for research in structured 3D understanding.
comment: NeurIPS 2025 DB Track. Project page: https://authoritywang.github.io/partnext
♻ ☆ DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing
Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models, Stable Diffusion, are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX's rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.
comment: Preprint
♻ ☆ Watermarking Autoregressive Image Generation NeurIPS 2025
Watermarking the outputs of generative models has emerged as a promising approach for tracking their provenance. Despite significant interest in autoregressive image generation models and their potential for misuse, no prior work has attempted to watermark their outputs at the token level. In this work, we present the first such approach by adapting language model watermarking techniques to this setting. We identify a key challenge: the lack of reverse cycle-consistency (RCC), wherein re-tokenizing generated image tokens significantly alters the token sequence, effectively erasing the watermark. To address this and to make our method robust to common image transformations, neural compression, and removal attacks, we introduce (i) a custom tokenizer-detokenizer finetuning procedure that improves RCC, and (ii) a complementary watermark synchronization layer. As our experiments demonstrate, our approach enables reliable and robust watermark detection with theoretically grounded p-values. Code and models are available at https://github.com/facebookresearch/wmar.
comment: NeurIPS 2025
♻ ☆ Tex-ViT: A Generalizable, Robust, Texture-based dual-branch cross-attention deepfake detector
Deepfakes, which employ GAN to produce highly realistic facial modification, are widely regarded as the prevailing method. Traditional CNN have been able to identify bogus media, but they struggle to perform well on different datasets and are vulnerable to adversarial attacks due to their lack of robustness. Vision transformers have demonstrated potential in the realm of image classification problems, but they require enough training data. Motivated by these limitations, this publication introduces Tex-ViT (Texture-Vision Transformer), which enhances CNN features by combining ResNet with a vision transformer. The model combines traditional ResNet features with a texture module that operates in parallel on sections of ResNet before each down-sampling operation. The texture module then serves as an input to the dual branch of the cross-attention vision transformer. It specifically focuses on improving the global texture module, which extracts feature map correlation. Empirical analysis reveals that fake images exhibit smooth textures that do not remain consistent over long distances in manipulations. Experiments were performed on different categories of FF++, such as DF, f2f, FS, and NT, together with other types of GAN datasets in cross-domain scenarios. Furthermore, experiments also conducted on FF++, DFDCPreview, and Celeb-DF dataset underwent several post-processing situations, such as blurring, compression, and noise. The model surpassed the most advanced models in terms of generalization, achieving a 98% accuracy in cross-domain scenarios. This demonstrates its ability to learn the shared distinguishing textural characteristics in the manipulated samples. These experiments provide evidence that the proposed model is capable of being applied to various situations and is resistant to many post-processing procedures.
♻ ☆ GenLit: Reformulating Single-Image Relighting as Video Generation
Manipulating the illumination of a 3D scene within a single image represents a fundamental challenge in computer vision and graphics. This problem has traditionally been addressed using inverse rendering techniques, which involve explicit 3D asset reconstruction and costly ray-tracing simulations. Meanwhile, recent advancements in visual foundation models suggest that a new paradigm could soon be possible -- one that replaces explicit physical models with networks that are trained on large amounts of image and video data. In this paper, we exploit the implicit scene understanding of a video diffusion model, particularly Stable Video Diffusion, to relight a single image. We introduce GenLit, a framework that distills the ability of a graphics engine to perform light manipulation into a video-generation model, enabling users to directly insert and manipulate a point light in the 3D world within a given image and generate results directly as a video sequence. We find that a model fine-tuned on only a small synthetic dataset generalizes to real-world scenes, enabling single-image relighting with plausible and convincing shadows and inter-reflections. Our results highlight the ability of video foundation models to capture rich information about lighting, material, and shape, and our findings indicate that such models, with minimal training, can be used to perform relighting without explicit asset reconstruction or ray-tracing. . Project page: https://genlit.is.tue.mpg.de/.
♻ ☆ mmWalk: Towards Multi-modal Multi-view Walking Assistance NeurIPS 2025
Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for outdoor safe navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) using zero- and few-shot settings and found they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.
comment: Accepted by NeurIPS 2025 Datasets and Benchmarks Track. Data and Code: https://github.com/KediYing/mmWalk
♻ ☆ Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning
When applying reinforcement learning--typically through GRPO--to large vision-language model reasoning struggles to effectively scale reasoning length or generates verbose outputs across all tasks with only marginal gains in accuracy. To address this issue, we present FAST-GRPO, a variant of GRPO that dynamically adapts reasoning depth based on question characteristics. Through empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs by investigating how response length and data distribution affect performance. Inspired by these observations, we introduce two complementary metrics to estimate the difficulty of the questions, guiding the model to determine when fast or slow thinking is more appropriate. Next, we incorporate adaptive length-based rewards and difficulty-aware KL divergence into the GRPO algorithm. Experiments across seven reasoning benchmarks demonstrate that FAST achieves state-of-the-art accuracy with over 10\% relative improvement compared to the base model, while reducing token usage by 32.7-67.3\% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.
♻ ☆ Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans
With the growing volume of CT examinations, there is an increasing demand for automated tools such as organ segmentation, abnormality detection, and report generation to support radiologists in managing their clinical workload. Multi-label classification of 3D Chest CT scans remains a critical yet challenging problem due to the complex spatial relationships inherent in volumetric data and the wide variability of abnormalities. Existing methods based on 3D convolutional neural networks struggle to capture long-range dependencies, while Vision Transformers often require extensive pre-training on large-scale, domain-specific datasets to perform competitively. In this work of academic research, we propose a 2.5D alternative by introducing a new graph-based framework that represents 3D CT volumes as structured graphs, where axial slice triplets serve as nodes processed through spectral graph convolution, enabling the model to reason over inter-slice dependencies while maintaining complexity compatible with clinical deployment. Our method, trained and evaluated on 3 datasets from independent institutions, achieves strong cross-dataset generalization, and shows competitive performance compared to state-of-the-art visual encoders. We further conduct comprehensive ablation studies to evaluate the impact of various aggregation strategies, edge-weighting schemes, and graph connectivity patterns. Additionally, we demonstrate the broader applicability of our approach through transfer experiments on automated radiology report generation and abdominal CT data.
comment: 24 pages, 15 figures
♻ ☆ FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation
Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance. However, existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive, subject-specific optimization, while zero-shot methods often fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor leverages semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated images. Additionally, our framework introduces a novel noise initialization strategy to preserve the geometry priors of reference subjects, facilitating robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.
comment: Code: https://github.com/Nihukat/FreeGraftor
♻ ☆ Uncovering Anomalous Events for Marine Environmental Monitoring via Visual Anomaly Detection
Underwater video monitoring is a promising strategy for assessing marine biodiversity, but the vast volume of uneventful footage makes manual inspection highly impractical. In this work, we explore the use of visual anomaly detection (VAD) based on deep neural networks to automatically identify interesting or anomalous events. We introduce AURA, the first multi-annotator benchmark dataset for underwater VAD, and evaluate four VAD models across two marine scenes. We demonstrate the importance of robust frame selection strategies to extract meaningful video segments. Our comparison against multiple annotators reveals that VAD performance of current models varies dramatically and is highly sensitive to both the amount of training data and the variability in visual content that defines "normal" scenes. Our results highlight the value of soft and consensus labels and offer a practical approach for supporting scientific exploration and scalable biodiversity monitoring.
♻ ☆ X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation
Large Language Models (LLMs) have been shown to enhance the effectiveness of enriching item descriptions, thereby improving the accuracy of recommendation systems. However, most existing approaches either rely on text-only prompting or employ basic multimodal strategies that do not fully exploit the complementary information available from both textual and visual modalities. This paper introduces a novel framework, Cross-Reflection Prompting, termed X-Reflect, designed to address these limitations by prompting Multimodal Large Language Models (MLLMs) to explicitly identify and reconcile supportive and conflicting information between text and images. By capturing nuanced insights from both modalities, this approach generates more comprehensive and contextually rich item representations. Extensive experiments conducted on two widely used benchmarks demonstrate that our method outperforms existing prompting baselines in downstream recommendation accuracy. Furthermore, we identify a U-shaped relationship between text-image dissimilarity and recommendation performance, suggesting the benefit of applying multimodal prompting selectively. To support efficient real-time inference, we also introduce X-Reflect-keyword, a lightweight variant that summarizes image content using keywords and replaces the base model with a smaller backbone, achieving nearly 50% reduction in input length while maintaining competitive performance. This work underscores the importance of integrating multimodal information and presents an effective solution for improving item understanding in multimodal recommendation systems.
♻ ☆ CALM-PDE: Continuous and Adaptive Convolutions for Latent Space Modeling of Time-dependent PDEs NeurIPS
Solving time-dependent Partial Differential Equations (PDEs) using a densely discretized spatial domain is a fundamental problem in various scientific and engineering disciplines, including modeling climate phenomena and fluid dynamics. However, performing these computations directly in the physical space often incurs significant computational costs. To address this issue, several neural surrogate models have been developed that operate in a compressed latent space to solve the PDE. While these approaches reduce computational complexity, they often use Transformer-based attention mechanisms to handle irregularly sampled domains, resulting in increased memory consumption. In contrast, convolutional neural networks allow memory-efficient encoding and decoding but are limited to regular discretizations. Motivated by these considerations, we propose CALM-PDE, a model class that efficiently solves arbitrarily discretized PDEs in a compressed latent space. We introduce a novel continuous convolution-based encoder-decoder architecture that uses an epsilon-neighborhood-constrained kernel and learns to apply the convolution operator to adaptive and optimized query points. We demonstrate the effectiveness of CALM-PDE on a diverse set of PDEs with both regularly and irregularly sampled spatial domains. CALM-PDE is competitive with or outperforms existing baseline methods while offering significant improvements in memory and inference time efficiency compared to Transformer-based methods.
comment: Accepted for publication at the 39th Conference on Neural Information Processing Systems (NeurIPS) 2025, San Diego, California, USA
♻ ☆ REOBench: Benchmarking Robustness of Earth Observation Foundation Models
Earth observation foundation models have shown strong generalization across multiple Earth observation tasks, but their robustness under real-world perturbations remains underexplored. To bridge this gap, we introduce REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions, including both appearance-based and geometric perturbations. To ensure realistic and fine-grained evaluation, our benchmark focuses on high-resolution optical remote sensing images, which are widely used in critical applications such as urban planning and disaster response. We conduct a systematic evaluation of a broad range of models trained using masked image modeling, contrastive learning, and vision-language pre-training paradigms. Our results reveal that (1) existing Earth observation foundation models experience significant performance degradation when exposed to input corruptions. (2) The severity of degradation varies across tasks, model architectures, backbone sizes, and types of corruption, with performance drop varying from less than 1% to over 20%. (3) Vision-language models show enhanced robustness, particularly in multimodal tasks. REOBench underscores the vulnerability of current Earth observation foundation models to real-world corruptions and provides actionable insights for developing more robust and reliable models. Code and data are publicly available at https://github.com/lx709/REOBench.
comment: Accepted to NeruIPS 2025 D&B Track
♻ ☆ BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning NeurIPS 2025
Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives. We find such emergent behaviors in biological vision models via large-scale contrastive vision-language training. To achieve this, we first curate TreeOfLife-200M, comprising 214 million images of living organisms, the largest and most diverse biological organism image dataset to date. We then train BioCLIP 2 on TreeOfLife-200M to distinguish different species. Despite the narrow training objective, BioCLIP 2 yields extraordinary accuracy when applied to various biological visual tasks such as habitat classification and trait prediction. We identify emergent properties in the learned embedding space of BioCLIP 2. At the inter-species level, the embedding distribution of different species aligns closely with functional and ecological meanings (e.g., beak sizes and habitats). At the intra-species level, instead of being diminished, the intra-species variations (e.g., life stages and sexes) are preserved and better separated in subspaces orthogonal to inter-species distinctions. We provide formal proof and analyses to explain why hierarchical supervision and contrastive objectives encourage these emergent properties. Crucially, our results reveal that these properties become increasingly significant with larger-scale training data, leading to a biologically meaningful embedding space.
comment: NeurIPS 2025 Spotlight; Project page: https://imageomics.github.io/bioclip-2/
♻ ☆ Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.
comment: 30 pages, 27 figures
♻ ☆ Residual Kolmogorov-Arnold Network for Enhanced Deep Learning
Despite their immense success, deep convolutional neural networks (CNNs) can be difficult to optimize and costly to train due to hundreds of layers within the network depth. Conventional convolutional operations are fundamentally limited by their linear nature along with fixed activations, where many layers are needed to learn meaningful patterns in data. Because of the sheer size of these networks, this approach is simply computationally inefficient, and poses overfitting or gradient explosion risks, especially in small datasets. As a result, we introduce a "plug-in" module, called Residual Kolmogorov-Arnold Network (RKAN). Our module is highly compact, so it can be easily added into any stage (level) of traditional deep networks, where it learns to integrate supportive polynomial feature transformations to existing convolutional frameworks. RKAN offers consistent improvements over baseline models in different vision tasks and widely tested benchmarks, accomplishing cutting-edge performance on them.
comment: Code is available at https://github.com/withray/residualKAN.git
♻ ☆ A novel attention mechanism for noise-adaptive and robust segmentation of microtubules in microscopy images
Segmenting cytoskeletal filaments in microscopy images is essential for understanding their cellular roles but remains challenging, especially in dense, complex networks and under noisy or low-contrast image conditions. While deep learning has advanced image segmentation, performance often degrades in these adverse scenarios. Additional challenges include the difficulty of obtaining accurate annotations and managing severe class imbalance. We proposed a novel noise-adaptive attention mechanism, extending the Squeeze-and-Excitation (SE) module, to dynamically adjust to varying noise levels. This Adaptive SE (ASE) mechanism is integrated into a U-Net decoder, with residual encoder blocks, forming a lightweight yet powerful model: ASE_Res_U-Net. We also developed a synthetic-dataset strategy and employed tailored loss functions and evaluation metrics to mitigate class imbalance and ensure fair assessment. ASE_Res_U-Net effectively segmented microtubules in both synthetic and real noisy images, outperforming its ablated variants and state-of-the-art curvilinear-structure segmentation methods. It achieved this while using fewer parameters, making it suitable for resource-constrained environments. Importantly, ASE_Res_U-Net generalised well to other curvilinear structures (blood vessels and nerves) under diverse imaging conditions. Availability and implementation: Original microtubule datasets (synthetic and real noisy images) are available on Zenodo (DOIs: 10.5281/zenodo.14696279 and 10.5281/zenodo.15852660). ASE_Res_UNet model will be shared upon publication.
♻ ☆ Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models
Spatial reasoning ability is crucial for Vision Language Models (VLMs) to support real-world applications in diverse domains including robotics, augmented reality, and autonomous navigation. Unfortunately, existing benchmarks are inadequate in assessing spatial reasoning ability, especially the \emph{intrinsic-dynamic} spatial reasoning which is a fundamental aspect of human spatial cognition. In this paper, we propose a unified benchmark, \textbf{Spatial-DISE}, based on a cognitively grounded taxonomy that categorizes tasks into four fundamental quadrants: \textbf{I}ntrinsic-\textbf{S}tatic, Intrinsic-\textbf{D}ynamic, \textbf{E}xtrinsic-Static, and Extrinsic-Dynamic spatial reasoning. Moreover, to address the issue of data scarcity, we develop a scalable and automated pipeline to generate diverse and verifiable spatial reasoning questions, resulting in a new \textbf{Spatial-DISE} dataset that includes Spatial-DISE Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA pairs). Our comprehensive evaluation across 28 state-of-the-art VLMs reveals that, current VLMs have a large and consistent gap to human competence, especially on multi-step multi-view spatial reasoning. Spatial-DISE offers a robust framework, valuable dataset, and clear direction for future research toward human-like spatial intelligence. Benchmark, dataset, and code will be publicly released.
comment: Project Page: https://shinmohuang.github.io/spatialdise_page/
♻ ☆ BevSplat: Resolving Height Ambiguity via Feature-Based Gaussian Primitives for Weakly-Supervised Cross-View Localization
This paper addresses the problem of weakly supervised cross-view localization, where the goal is to estimate the pose of a ground camera relative to a satellite image with noisy ground truth annotations. A common approach to bridge the cross-view domain gap for pose estimation is Bird's-Eye View (BEV) synthesis. However, existing methods struggle with height ambiguity due to the lack of depth information in ground images and satellite height maps. Previous solutions either assume a flat ground plane or rely on complex models, such as cross-view transformers. We propose BevSplat, a novel method that resolves height ambiguity by using feature-based Gaussian primitives. Each pixel in the ground image is represented by a 3D Gaussian with semantic and spatial features, which are synthesized into a BEV feature map for relative pose estimation. Additionally, to address challenges with panoramic query images, we introduce an icosphere-based supervision strategy for the Gaussian primitives. We validate our method on the widely used KITTI and VIGOR datasets, which include both pinhole and panoramic query images. Experimental results show that BevSplat significantly improves localization accuracy over prior approaches.
♻ ☆ PolyPose: Deformable 2D/3D Registration via Polyrigid Transformations NeurIPS 2025
Determining the 3D pose of a patient from a limited set of 2D X-ray images is a critical task in interventional settings. While preoperative volumetric imaging (e.g., CT and MRI) provides precise 3D localization and visualization of anatomical targets, these modalities cannot be acquired during procedures, where fast 2D imaging (X-ray) is used instead. To integrate volumetric guidance into intraoperative procedures, we present PolyPose, a simple and robust method for deformable 2D/3D registration. PolyPose parameterizes complex 3D deformation fields as a composition of rigid transforms, leveraging the biological constraint that individual bones do not bend in typical motion. Unlike existing methods that either assume no inter-joint movement or fail outright in this under-determined setting, our polyrigid formulation enforces anatomically plausible priors that respect the piecewise-rigid nature of human movement. This approach eliminates the need for expensive deformation regularizers that require patient- and procedure-specific hyperparameter optimization. Across extensive experiments on diverse datasets from orthopedic surgery and radiotherapy, we show that this strong inductive bias enables PolyPose to successfully align the patient's preoperative volume to as few as two X-rays, thereby providing crucial 3D guidance in challenging sparse-view and limited-angle settings where current registration methods fail. Additional visualizations, tutorials, and code are available at https://polypose.csail.mit.edu.
comment: NeurIPS 2025. Code available at https://github.com/eigenvivek/polypose
♻ ☆ Frequency-Dynamic Attention Modulation for Dense Prediction ICCV 2025
Vision Transformers (ViTs) have significantly advanced computer vision, demonstrating strong performance across various tasks. However, the attention mechanism in ViTs makes each layer function as a low-pass filter, and the stacked-layer architecture in existing transformers suffers from frequency vanishing. This leads to the loss of critical details and textures. We propose a novel, circuit-theory-inspired strategy called Frequency-Dynamic Attention Modulation (FDAM), which can be easily plugged into ViTs. FDAM directly modulates the overall frequency response of ViTs and consists of two techniques: Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale). Since circuit theory uses low-pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix, and dynamically combining the two. We further design FreqScale to weight different frequency components for fine-grained adjustments to the target response function. Through feature similarity analysis and effective rank evaluation, we demonstrate that our approach avoids representation collapse, leading to consistent performance improvements across various models, including SegFormer, DeiT, and MaskDINO. These improvements are evident in tasks such as semantic segmentation, object detection, and instance segmentation. Additionally, we apply our method to remote sensing detection, achieving state-of-the-art results in single-scale settings. The code is available at https://github.com/Linwei-Chen/FDAM.
comment: Accepted by ICCV 2025
♻ ☆ A primal-dual algorithm for image reconstruction with input-convex neural network regularizers
We address the optimization problem in a data-driven variational reconstruction framework, where the regularizer is parameterized by an input-convex neural network (ICNN). While gradient-based methods are commonly used to solve such problems, they struggle to effectively handle non-smooth problems which often leads to slow convergence. Moreover, the nested structure of the neural network complicates the application of standard non-smooth optimization techniques, such as proximal algorithms. To overcome these challenges, we reformulate the problem and eliminate the network's nested structure. By relating this reformulation to epigraphical projections of the activation functions, we transform the problem into a convex optimization problem that can be efficiently solved using a primal-dual algorithm. We also prove that this reformulation is equivalent to the original variational problem. Through experiments on several imaging tasks, we show that the proposed approach not only outperforms subgradient methods and even accelerated methods in the smooth setting, but also facilitates the training of the regularizer itself.
♻ ☆ MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on one single modality at a time, rely on short-form contexts, or lack human annotations -- hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks that is designed to evaluate instruction-following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities -- speech, vision, and text -- and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs' abilities to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLMs development.
comment: Data available at https://huggingface.co/datasets/FBK-MT/MCIF | Evaluation and baselines available at https://github.com/hlt-mt/mcif
♻ ☆ Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants NeurIPS 2025
Faces and humans are crucial elements in social interaction and are widely included in everyday photos and videos. Therefore, a deep understanding of faces and humans will enable multi-modal assistants to achieve improved response quality and broadened application scope. Currently, the multi-modal assistant community lacks a comprehensive and scientific evaluation of face and human understanding abilities. In this paper, we first propose a hierarchical ability taxonomy that includes three levels of abilities. Then, based on this taxonomy, we collect images and annotations from publicly available datasets in the face and human community and build a semi-automatic data pipeline to produce problems for the new benchmark. Finally, the obtained Face-Human-Bench includes a development set and a test set, each with 1800 problems, supporting both English and Chinese. We conduct evaluations over 25 mainstream multi-modal large language models (MLLMs) with our Face-Human-Bench, focusing on the correlation between abilities, the impact of the relative position of targets on performance, and the impact of Chain of Thought (CoT) prompting on performance. We also explore which abilities of MLLMs need to be supplemented by specialist models. The dataset and evaluation code have been made publicly available at https://face-human-bench.github.io.
comment: 50 pages, 14 figures, 42 tables. NeurIPS 2025 Datasets and Benchmarks Track
♻ ☆ Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens remaining maximally relevant to the given image, while simultaneously refining image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
♻ ☆ Learning Dense Hand Contact Estimation from Imbalanced Data NeurIPS 2025
Hands are essential to human interaction, and exploring contact between hands and the world can promote comprehensive understanding of their function. Recently, there have been growing number of hand interaction datasets that cover interaction with object, other hand, scene, and body. Despite the significance of the task and increasing high-quality data, how to effectively learn dense hand contact estimation remains largely underexplored. There are two major challenges for learning dense hand contact estimation. First, there exists class imbalance issue from hand contact datasets where majority of regions are not in contact. Second, hand contact datasets contain spatial imbalance issue with most of hand contact exhibited in finger tips, resulting in challenges for generalization towards contacts in other hand regions. To tackle these issues, we present a framework that learns dense HAnd COntact estimation (HACO) from imbalanced data. To resolve the class imbalance issue, we introduce balanced contact sampling, which builds and samples from multiple sampling groups that fairly represent diverse contact statistics for both contact and non-contact vertices. Moreover, to address the spatial imbalance issue, we propose vertex-level class-balanced (VCB) loss, which incorporates spatially varying contact distribution by separately reweighting loss contribution of each vertex based on its contact frequency across dataset. As a result, we effectively learn to predict dense hand contact estimation with large-scale hand contact data without suffering from class and spatial imbalance issue. The codes are available at https://github.com/dqj5182/HACO_RELEASE.
comment: Accepted at NeurIPS 2025. Project page: http://haco-release.github.io
♻ ☆ HumanCM: One Step Human Motion Prediction
We present HumanCM, a one-step human motion prediction framework built upon consistency models. Instead of relying on multi-step denoising as in diffusion-based methods, HumanCM performs efficient single-step generation by learning a self-consistent mapping between noisy and clean motion states. The framework adopts a Transformer-based spatiotemporal architecture with temporal embeddings to model long-range dependencies and preserve motion coherence. Experiments on Human3.6M and HumanEva-I demonstrate that HumanCM achieves comparable or superior accuracy to state-of-the-art diffusion models while reducing inference steps by up to two orders of magnitude.
comment: 6 pages, 3 figures, 2 tables
♻ ☆ Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization
Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the use of dynamic high-resolution inputs further increases this burden. Previous approaches have attempted to reduce the number of image tokens through token pruning, typically by selecting tokens based on attention scores or image token diversity. Through empirical studies, we observe that existing methods often overlook the joint impact of pruning on both the current layer's output (local) and the outputs of subsequent layers (global), leading to suboptimal pruning decisions. To address this challenge, we propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens. Specifically, our method utilizes a small calibration set to divide the pruning process into multiple stages. In the early stages, our method emphasizes the impact of pruning on subsequent layers, whereas in the deeper stages, the focus shifts toward preserving the consistency of local outputs. Extensive experiments across various LVLMs demonstrate the broad effectiveness of our approach on multiple benchmarks. Our method achieves a 78% compression rate while preserving 96.7% of the original models' performance on average. Our code is available at https://github.com/EmbodiedCity/NeurIPS2025-Balanced-Token-Pruning.
comment: Accepted by Neurips 2025
♻ ☆ Frequency Cam: Imaging Periodic Signals in Real-Time
Due to their high temporal resolution and large dynamic range, event cameras are uniquely suited for the analysis of time-periodic signals in an image. In this work we present an efficient and fully asynchronous event camera algorithm for detecting the fundamental frequency at which image pixels flicker. The algorithm employs a second-order digital infinite impulse response (IIR) filter to perform an approximate per-pixel brightness reconstruction and is more robust to high-frequency noise than the baseline method we compare to. We further demonstrate that using the falling edge of the signal leads to more accurate period estimates than the rising edge, and that for certain signals interpolating the zero-level crossings can further increase accuracy. Our experiments find that the outstanding capabilities of the camera in detecting frequencies up to 64kHz for a single pixel do not carry over to full sensor imaging as readout bandwidth limitations become a serious obstacle. This suggests that a hardware implementation closer to the sensor will allow for greatly improved frequency imaging. We discuss the important design parameters for fullsensor frequency imaging and present Frequency Cam, an open-source implementation as a ROS node that can run on a single core of a laptop CPU at more than 50 million events per second. It produces results that are qualitatively very similar to those obtained from the closed source vibration analysis module in Prophesee's Metavision Toolkit. The code for Frequency Cam and a demonstration video can be found at https://github.com/ros-event-camera/frequency_cam
comment: 13 pages, 16 figures, one table
♻ ☆ Mesh-RFT: Enhancing Mesh Generation via Fine-grained Reinforcement Fine-Tuning NeurIPS 2025
Existing pretrained models for 3D mesh generation often suffer from data biases and produce low-quality results, while global reinforcement learning (RL) methods rely on object-level rewards that struggle to capture local structure details. To address these challenges, we present Mesh-RFT, a novel fine-grained reinforcement fine-tuning framework that employs Masked Direct Preference Optimization (M-DPO) to enable localized refinement via quality-aware face masking. To facilitate efficient quality evaluation, we introduce an objective topology-aware scoring system to evaluate geometric integrity and topological regularity at both object and face levels through two metrics: Boundary Edge Ratio (BER) and Topology Score (TS). By integrating these metrics into a fine-grained RL strategy, Mesh-RFT becomes the first method to optimize mesh quality at the granularity of individual faces, resolving localized errors while preserving global coherence. Experiment results show that our M-DPO approach reduces Hausdorff Distance (HD) by 24.6% and improves Topology Score (TS) by 3.8% over pre-trained models, while outperforming global DPO methods with a 17.4% HD reduction and 4.9% TS gain. These results demonstrate Mesh-RFT's ability to improve geometric integrity and topological regularity, achieving new state-of-the-art performance in production-ready mesh generation. Project Page: https://hitcslj.github.io/mesh-rft/.
comment: NeurIPS 2025, Spotlight
♻ ☆ Occluded nuScenes: A Multi-Sensor Dataset for Evaluating Perception Robustness in Automated Driving
Robust perception in automated driving requires reliable performance under adverse conditions, where sensors may be affected by partial failures or environmental occlusions. Although existing autonomous driving datasets inherently contain sensor noise and environmental variability, very few enable controlled, parameterised, and reproducible degradations across multiple sensing modalities. This gap limits the ability to systematically evaluate how perception and fusion architectures perform under well-defined adverse conditions. To address this limitation, we introduce the Occluded nuScenes Dataset, a novel extension of the widely used nuScenes benchmark. For the camera modality, we release both the full and mini versions with four types of occlusions, two adapted from public implementations and two newly designed. For radar and LiDAR, we provide parameterised occlusion scripts that implement three types of degradations each, enabling flexible and repeatable generation of occluded data. This resource supports consistent, reproducible evaluation of perception models under partial sensor failures and environmental interference. By releasing the first multi-sensor occlusion dataset with controlled and reproducible degradations, we aim to advance research on robust sensor fusion, resilience analysis, and safety-critical perception in automated driving.
♻ ☆ Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation includes a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at https://ipro-alimama.github.io/.
♻ ☆ OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection
The growing reliance on Artificial Intelligence (AI) in critical domains such as healthcare demands robust mechanisms to ensure the trustworthiness of these systems, especially when faced with unexpected or anomalous inputs. This paper introduces the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD), a comprehensive framework for evaluating out-of-distribution (OOD) detection methods specifically in medical imaging contexts. OpenMIBOOD includes three benchmarks from diverse medical domains, encompassing 14 datasets divided into covariate-shifted in-distribution, near-OOD, and far-OOD categories. We evaluate 24 post-hoc methods across these benchmarks, providing a standardized reference to advance the development and fair comparison of OOD detection methods. Results reveal that findings from broad-scale OOD benchmarks in natural image domains do not translate to medical applications, underscoring the critical need for such benchmarks in the medical field. By mitigating the risk of exposing AI models to inputs outside their training distribution, OpenMIBOOD aims to support the advancement of reliable and trustworthy AI systems in healthcare. The repository is available at https://github.com/remic-othr/OpenMIBOOD.
comment: Updated results for NNGuide and ViM
♻ ☆ ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding NeurIPS 2025
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding. Code is available at https://github.com/KangJialiang/ViSpec.
comment: NeurIPS 2025
♻ ☆ MODEM: A Morton-Order Degradation Estimation Mechanism for Adverse Weather Image Recovery NeurIPS 2025
Restoring images degraded by adverse weather remains a significant challenge due to the highly non-uniform and spatially heterogeneous nature of weather-induced artifacts, e.g., fine-grained rain streaks versus widespread haze. Accurately estimating the underlying degradation can intuitively provide restoration models with more targeted and effective guidance, enabling adaptive processing strategies. To this end, we propose a Morton-Order Degradation Estimation Mechanism (MODEM) for adverse weather image restoration. Central to MODEM is the Morton-Order 2D-Selective-Scan Module (MOS2D), which integrates Morton-coded spatial ordering with selective state-space models to capture long-range dependencies while preserving local structural coherence. Complementing MOS2D, we introduce a Dual Degradation Estimation Module (DDEM) that disentangles and estimates both global and local degradation priors. These priors dynamically condition the MOS2D modules, facilitating adaptive and context-aware restoration. Extensive experiments and ablation studies demonstrate that MODEM achieves state-of-the-art results across multiple benchmarks and weather types, highlighting its effectiveness in modeling complex degradation dynamics. Our code will be released at https://github.com/hainuo-wang/MODEM.git.
comment: Accepted by NeurIPS 2025
♻ ☆ FairGen: Enhancing Fairness in Text-to-Image Diffusion Models via Self-Discovering Latent Directions
While Diffusion Models (DM) exhibit remarkable performance across various image generative tasks, they nonetheless reflect the inherent bias presented in the training set. As DMs are now widely used in real-world applications, these biases could perpetuate a distorted worldview and hinder opportunities for minority groups. Existing methods on debiasing DMs usually requires model retraining with a human-crafted reference dataset or additional classifiers, which suffer from two major limitations: (1) collecting reference datasets causes expensive annotation cost; (2) the debiasing performance is heavily constrained by the quality of the reference dataset or the additional classifier. To address the above limitations, we propose FairGen, a plug-and-play method that learns attribute latent directions in a self-discovering manner, thus eliminating the reliance on such reference dataset. Specifically, FairGen consists of two parts: a set of attribute adapters and a distribution indicator. Each adapter in the set aims to learn an attribute latent direction, and is optimized via noise composition through a self-discovering process. Then, the distribution indicator is multiplied by the set of adapters to guide the generation process towards the prescribed distribution. Our method enables debiasing multiple attributes in DMs simultaneously, while remaining lightweight and easily integrable with other DMs, eliminating the need for retraining. Extensive experiments on debiasing gender, racial, and their intersectional biases show that our method outperforms previous SOTA by a large margin.
♻ ☆ Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval
Recent progress in text-video retrieval has been largely driven by contrastive learning. However, existing methods often overlook the effect of the modality gap, which causes anchor representations to undergo in-place optimization (i.e., optimization tension) that limits their alignment capacity. Moreover, noisy hard negatives further distort the semantics of anchors. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment $\Delta_{ij}$ between text $t_i$ and video $v_j$, redistributing gradients to relieve optimization tension and absorb noise. We derive $\Delta_{ij}$ via a multivariate first-order Taylor expansion of the InfoNCE loss under a trust-region constraint, showing that it guides updates along locally consistent descent directions. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. Furthermore, we regularize $\Delta$ through a variational information bottleneck with relaxed compression, enhancing stability and semantic consistency. Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness, validating the effectiveness of gap-aware tension mitigation. Code is available at https://github.com/musicman217/GARE-text-video-retrieval.
♻ ☆ Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation NeurIPS 2025
Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence (AI) by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, thus limiting their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset consisting of 2,757 whole-body PET/CT volumes from independent patients and their corresponding full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLMs training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly the Vietnamese language, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs' learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, especially for low-resource languages and clinical use in Vietnamese healthcare. The source code is available at https://github.com/AIoT-Lab-BKAI/ViPET-ReportGen.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ ControlFusion: A Controllable Image Fusion Framework with Language-Vision Degradation Prompts NeurIPS 2025
Current image fusion methods struggle to address the composite degradations encountered in real-world imaging scenarios and lack the flexibility to accommodate user-specific requirements. In response to these challenges, we propose a controllable image fusion framework with language-vision prompts, termed ControlFusion, which adaptively neutralizes composite degradations. On the one hand, we develop a degraded imaging model that integrates physical imaging mechanisms, including the Retinex theory and atmospheric scattering principle, to simulate composite degradations, thereby providing potential for addressing real-world complex degradations from the data level. On the other hand, we devise a prompt-modulated restoration and fusion network that dynamically enhances features with degradation prompts, enabling our method to accommodate composite degradation of varying levels. Specifically, considering individual variations in quality perception of users, we incorporate a text encoder to embed user-specified degradation types and severity levels as degradation prompts. We also design a spatial-frequency collaborative visual adapter that autonomously perceives degradations in source images, thus eliminating the complete dependence on user instructions. Extensive experiments demonstrate that ControlFusion outperforms SOTA fusion methods in fusion quality and degradation handling, particularly in countering real-world and compound degradations with various levels. The source code is publicly available at https://github.com/Linfeng-Tang/ControlFusion.
comment: Accepted to NeurIPS 2025. The code are available at https://github.com/Linfeng-Tang/ControlFusion
♻ ☆ VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation
In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point clouds feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6% on par with DP3 64.0% and far higher than DP 34.8%, while in real-world tasks, it reaches 87.9%, outperforming both DP3 67.5% and DP 11.2% by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.
♻ ☆ SPLite Hand: Sparsity-Aware Lightweight 3D Hand Pose Estimation
With the increasing ubiquity of AR/VR devices, the deployment of deep learning models on edge devices has become a critical challenge. These devices require real-time inference, low power consumption, and minimal latency. Many framework designers face the conundrum of balancing efficiency and performance. We design a light framework that adopts an encoder-decoder architecture and introduces several key contributions aimed at improving both efficiency and accuracy. We apply sparse convolution on a ResNet-18 backbone to exploit the inherent sparsity in hand pose images, achieving a 42% end-to-end efficiency improvement. Moreover, we propose our SPLite decoder. This new architecture significantly boosts the decoding process's frame rate by 3.1x on the Raspberry Pi 5, while maintaining accuracy on par. To further optimize performance, we apply quantization-aware training, reducing memory usage while preserving accuracy (PA-MPJPE increases only marginally from 9.0 mm to 9.1 mm on FreiHAND). Overall, our system achieves a 2.98x speed-up on a Raspberry Pi 5 CPU (BCM2712 quad-core Arm A76 processor). Our method is also evaluated on compound benchmark datasets, demonstrating comparable accuracy to state-of-the-art approaches while significantly enhancing computational efficiency.
comment: Accepted to AICCC 2025
♻ ☆ Rebellious Student: A Complementary Learning Framework for Background Feature Enhancement in Hyperspectral Anomaly Detection
A recent class of hyperspectral anomaly detection methods that can be trained once on background datasets and then universally deployed -- without per-scene retraining or parameter tuning -- has demonstrated remarkable efficiency and robustness. Building upon this paradigm, we focus on the integration of spectral and spatial cues and introduce a novel "Rebellious Student" framework for complementary feature learning. Unlike conventional teacher-student paradigms driven by imitation, our method intentionally trains the spatial branch to diverge from the spectral teacher, thereby learning complementary spatial patterns that the teacher fails to capture. A two-stage learning strategy is adopted: (1) a spectral enhancement network is first trained via reverse distillation to obtain robust background spectral representations; and (2) a spatial network -- the rebellious student -- is subsequently optimized using decorrelation losses that enforce feature orthogonality while maintaining reconstruction fidelity to avoid irrelevant noise. Once trained, the framework enhances both spectral and spatial background features, enabling parameter-free and training-free anomaly detection when paired with conventional detectors. Experiments on the HAD100 benchmark show substantial improvements over several established baselines with modest computational overhead, confirming the effectiveness of the proposed complementary learning paradigm. Our code is publicly available at https://github.com/xjpp2016/FERS.
♻ ☆ SnapMoGen: Human Motion Generation from Expressive Texts
Text-to-motion generation has experienced remarkable progress in recent years. However, current approaches remain limited to synthesizing motion from short or general text prompts, primarily due to dataset constraints. This limitation undermines fine-grained controllability and generalization to unseen prompts. In this paper, we introduce SnapMoGen, a new text-motion dataset featuring high-quality motion capture data paired with accurate, expressive textual annotations. The dataset comprises 20K motion clips totaling 44 hours, accompanied by 122K detailed textual descriptions averaging 48 words per description (vs. 12 words of HumanML3D). Importantly, these motion clips preserve original temporal continuity as they were in long sequences, facilitating research in long-term motion generation and blending. We also improve upon previous generative masked modeling approaches. Our model, MoMask++, transforms motion into multi-scale token sequences that better exploit the token capacity, and learns to generate all tokens using a single generative masked transformer. MoMask++ achieves state-of-the-art performance on both HumanML3D and SnapMoGen benchmarks. Additionally, we demonstrate the ability to process casual user prompts by employing an LLM to reformat inputs to align with the expressivity and narration style of SnapMoGen. Project webpage: https://snap-research.github.io/SnapMoGen/
comment: Project Webpage: https://snap-research.github.io/SnapMoGen/
♻ ☆ The Faiss library
Vector databases typically manage large collections of embedding vectors. Currently, AI applications are growing rapidly, and so is the number of embeddings that need to be stored and indexed. The Faiss library is dedicated to vector similarity search, a core functionality of vector databases. Faiss is a toolkit of indexing methods and related primitives used to search, cluster, compress and transform vectors. This paper describes the trade-off space of vector search and the design principles of Faiss in terms of structure, approach to optimization and interfacing. We benchmark key features of the library and discuss a few selected applications to highlight its broad applicability.
♻ ☆ Sign-In to the Lottery: Reparameterizing Sparse Training From Scratch NeurIPS 2025
The performance gap between training sparse neural networks from scratch (PaI) and dense-to-sparse training presents a major roadblock for efficient deep learning. According to the Lottery Ticket Hypothesis, PaI hinges on finding a problem specific parameter initialization. As we show, to this end, determining correct parameter signs is sufficient. Yet, they remain elusive to PaI. To address this issue, we propose Sign-In, which employs a dynamic reparameterization that provably induces sign flips. Such sign flips are complementary to the ones that dense-to-sparse training can accomplish, rendering Sign-In as an orthogonal method. While our experiments and theory suggest performance improvements of PaI, they also carve out the main open challenge to close the gap between PaI and dense-to-sparse training.
comment: Accepted at NeurIPS 2025
♻ ☆ A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation
Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent \textit{multi-step iterations} and \textit{complex backbone networks} lead to prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, \textbf{Diffusion Caching} offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process. By enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching evolves from \textit{static reuse} to \textit{dynamic prediction}. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time and efficient generative AI, injecting new vitality into both theory and practice of \textit{Efficient Generative Intelligence}.
comment: 22 pages,2 figures
♻ ☆ Quantization-Aware Neuromorphic Architecture for Efficient Skin Disease Classification on Resource-Constrained Devices
Accurate and efficient skin lesion classification on edge devices is critical for accessible dermatological care but remains challenging due to computational, energy, and privacy constraints. We introduce QANA, a novel quantization-aware neuromorphic architecture for incremental skin lesion classification on resource-limited hardware. QANA effectively integrates ghost modules, efficient channel attention, and squeeze-and-excitation blocks for robust feature representation with low-latency and energy-efficient inference. Its quantization-aware head and spike-compatible transformations enable seamless conversion to spiking neural networks (SNNs) and deployment on neuromorphic platforms. Evaluation on the large-scale HAM10000 benchmark and a real-world clinical dataset shows that QANA achieves 91.6% Top-1 accuracy and 82.4% macro F1 on HAM10000, and 90.8%/81.7% on the clinical dataset, significantly outperforming state-of-the-art CNN-to-SNN models under fair comparison. Deployed on BrainChip Akida hardware, QANA achieves 1.5 ms inference latency and 1.7,mJ energy per image, reducing inference latency and energy use by over 94.6%/98.6% compared to GPU-based CNNs surpassing state-of-the-art CNN-to-SNN conversion baselines. These results demonstrate the effectiveness of QANA for accurate, real-time, and privacy-sensitive medical analysis in edge environments.
♻ ☆ A Style-Based Profiling Framework for Quantifying the Synthetic-to-Real Gap in Autonomous Driving Datasets
Ensuring the reliability of autonomous driving perception systems requires extensive environment-based testing, yet real-world execution is often impractical. Synthetic datasets have therefore emerged as a promising alternative, offering advantages such as cost-effectiveness, bias free labeling, and controllable scenarios. However, the domain gap between synthetic and real-world datasets remains a major obstacle to model generalization. To address this challenge from a data-centric perspective, this paper introduces a profile extraction and discovery framework for characterizing the style profiles underlying both synthetic and real image datasets. We propose Style Embedding Distribution Discrepancy (SEDD) as a novel evaluation metric. Our framework combines Gram matrix-based style extraction with metric learning optimized for intra-class compactness and inter-class separation to extract style embeddings. Furthermore, we establish a benchmark using publicly available datasets. Experiments are conducted on a variety of datasets and sim-to-real methods, and the results show that our method is capable of quantifying the synthetic-to-real gap. This work provides a standardized profiling-based quality control paradigm that enables systematic diagnosis and targeted enhancement of synthetic datasets, advancing future development of data-driven autonomous driving systems.
comment: 7 pages, 4 figures
♻ ☆ Learning Contrastive Feature Representations for Facial Action Unit Detection
For the Facial Action Unit (AU) detection task, accurately capturing the subtle facial differences between distinct AUs is essential for reliable detection. Additionally, AU detection faces challenges from class imbalance and the presence of noisy or false labels, which undermine detection accuracy. In this paper, we introduce a novel contrastive learning framework aimed for AU detection that incorporates both self-supervised and supervised signals, thereby enhancing the learning of discriminative features for accurate AU detection. To tackle the class imbalance issue, we employ a negative sample re-weighting strategy that adjusts the step size of updating parameters for minority and majority class samples. Moreover, to address the challenges posed by noisy and false AU labels, we employ a sampling technique that encompasses three distinct types of positive sample pairs. This enables us to inject self-supervised signals into the supervised signal, effectively mitigating the adverse effects of noisy labels. Our experimental assessments, conducted on five widely-utilized benchmark datasets (BP4D, DISFA, BP4D+, GFT and Aff-Wild2), underscore the superior performance of our approach compared to state-of-the-art methods of AU detection. Our code is available at https://github.com/Ziqiao-Shang/AUNCE.
♻ ☆ EasyOcc: 3D Pseudo-Label Supervision for Fully Self-Supervised Semantic Occupancy Prediction Models
Self-supervised models have recently achieved notable advancements, particularly in the domain of semantic occupancy prediction. These models utilize sophisticated loss computation strategies to compensate for the absence of ground-truth labels. For instance, techniques such as novel view synthesis, cross-view rendering, and depth estimation have been explored to address the issue of semantic and depth ambiguity. However, such techniques typically incur high computational costs and memory usage during the training stage, especially in the case of novel view synthesis. To mitigate these issues, we propose 3D pseudo-ground-truth labels generated by the foundation models Grounded-SAM and Metric3Dv2, and harness temporal information for label densification. Our 3D pseudo-labels can be easily integrated into existing models, which yields substantial performance improvements, with mIoU increasing by 45\%, from 9.73 to 14.09, when implemented into the OccNeRF model. This stands in contrast to earlier advancements in the field, which are often not readily transferable to other architectures. Additionally, we propose a streamlined model, EasyOcc, achieving 13.86 mIoU. This model conducts learning solely from our labels, avoiding complex rendering strategies mentioned previously. Furthermore, our method enables models to attain state-of-the-art performance when evaluated on the full scene without applying the camera mask, with EasyOcc achieving 7.71 mIoU, outperforming the previous best model by 31\%. These findings highlight the critical importance of foundation models, temporal context, and the choice of loss computation space in self-supervised learning for comprehensive scene understanding.
♻ ☆ PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling NeurIPS 2025
Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with huge model sizes, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task necessitates models with two key capabilities: (1) Accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and (2) Real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the Predictive Future Modeling (PreFM) framework featured by (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding and (b) modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization. Extensive experiments on the UnAV-100 and LLP datasets show PreFM significantly outperforms state-of-the-art methods by a large margin with significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding. Code is available at https://github.com/XiaoYu-1123/PreFM.
comment: This paper is accepted by 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions NeurIPS 2025
Despite the success of Vision-Language Models (VLMs) like CLIP in aligning vision and language, their proficiency in detailed, fine-grained visual comprehension remains a key challenge. We present CLIP-IN, a novel framework that bolsters CLIP's fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Coupled with a symmetric hard negative contrastive loss, this enables the model to effectively distinguish subtle visual-semantic differences. Secondly, CLIP-IN incorporates long descriptive captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP. Our experiments demonstrate that CLIP-IN achieves substantial gains on the MMVP benchmark and various fine-grained visual recognition tasks, without compromising robust zero-shot performance on broader classification and retrieval tasks. Critically, integrating CLIP-IN's visual representations into Multimodal Large Language Models significantly reduces visual hallucinations and enhances reasoning abilities. This work underscores the considerable potential of synergizing targeted, instruction-based contrastive learning with comprehensive descriptive information to elevate the fine-grained understanding of VLMs.
comment: Accepted to NeurIPS 2025
♻ ☆ Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning
Realistic 3D indoor scene synthesis is vital for embodied AI and digital content creation. It can be naturally divided into two subtasks: object generation and layout generation. While recent generative models have significantly advanced object-level quality and controllability, layout generation remains challenging due to limited datasets. Existing methods either overfit to these datasets or rely on predefined constraints to optimize numerical layout that sacrifice flexibility. As a result, they fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce DirectLayout, a framework that directly generates numerical 3D layouts from text descriptions using generalizable spatial reasoning of large language models (LLMs). DirectLayout decomposes the generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting it into 3D space, and refining object placements. To enable explicit spatial reasoning and help the model grasp basic principles of object placement, we employ Chain-of-Thought (CoT) Activation based on the 3D-Front dataset. Additionally, we design CoT-Grounded Generative Layout Reward to enhance generalization and spatial planning. During inference, DirectLayout addresses asset-layout mismatches via Iterative Asset-Layout Alignment through in-context learning. Extensive experiments demonstrate that DirectLayout achieves impressive semantic consistency, generalization and physical plausibility.
comment: Project Page: https://directlayout.github.io/
♻ ☆ Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology NeurIPS 2025
Pre-trained encoders for offline feature extraction followed by multiple instance learning (MIL) aggregators have become the dominant paradigm in computational pathology (CPath), benefiting cancer diagnosis and prognosis. However, performance limitations arise from the absence of encoder fine-tuning for downstream tasks and disjoint optimization with MIL. While slide-level supervised end-to-end (E2E) learning is an intuitive solution to this issue, it faces challenges such as high computational demands and suboptimal results. These limitations motivate us to revisit E2E learning. We argue that prior work neglects inherent E2E optimization challenges, leading to performance disparities compared to traditional two-stage methods. In this paper, we pioneer the elucidation of optimization challenge caused by sparse-attention MIL and propose a novel MIL called ABMILX. It mitigates this problem through global correlation-based attention refinement and multi-head mechanisms. With the efficient multi-scale random patch sampling strategy, an E2E trained ResNet with ABMILX surpasses SOTA foundation models under the two-stage paradigm across multiple challenging benchmarks, while remaining computationally efficient (<10 RTX3090 hours). We show the potential of E2E learning in CPath and calls for greater research focus in this area. The code is https://github.com/DearCaat/E2E-WSI-ABMILX.
comment: published on NeurIPS 2025
♻ ☆ Vision-Centric Activation and Coordination for Multimodal Large Language Models
Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities. To track this dilemma, we introduce VaCo, which optimizes MLLM representations through Vision-Centric activation and Coordination from multiple vision foundation models (VFMs). VaCo introduces visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs. Specifically, we incorporate the learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs. To coordinate representation conflicts across VFMs, the crafted Token Gateway Mask (TGM) restricts the information flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension.
♻ ☆ MARIS: Marine Open-Vocabulary Instance Segmentation with Geometric Enhancement and Semantic Alignment
Most existing underwater instance segmentation approaches are constrained by close-vocabulary prediction, limiting their ability to recognize novel marine categories. To support evaluation, we introduce \textbf{MARIS} (\underline{Mar}ine Open-Vocabulary \underline{I}nstance \underline{S}egmentation), the first large-scale fine-grained benchmark for underwater Open-Vocabulary (OV) segmentation, featuring a limited set of seen categories and diverse unseen categories. Although OV segmentation has shown promise on natural images, our analysis reveals that transfer to underwater scenes suffers from severe visual degradation (e.g., color attenuation) and semantic misalignment caused by lack underwater class definitions. To address these issues, we propose a unified framework with two complementary components. The Geometric Prior Enhancement Module (\textbf{GPEM}) leverages stable part-level and structural cues to maintain object consistency under degraded visual conditions. The Semantic Alignment Injection Mechanism (\textbf{SAIM}) enriches language embeddings with domain-specific priors, mitigating semantic ambiguity and improving recognition of unseen categories. Experiments show that our framework consistently outperforms existing OV baselines both In-Domain and Cross-Domain setting on MARIS, establishing a strong foundation for future underwater perception research.
♻ ☆ VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning NeurIPS 2025
Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.
comment: Accepted by NeurIPS 2025
♻ ☆ Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning
Reward-based fine-tuning of video diffusion models is an effective approach to improve the quality of generated videos, as it can fine-tune models without requiring real-world video datasets. However, it can sometimes be limited to specific performances because conventional reward functions are mainly aimed at enhancing the quality across the whole generated video sequence, such as aesthetic appeal and overall consistency. Notably, the temporal consistency of the generated video often suffers when applying previous approaches to image-to-video (I2V) generation tasks. To address this limitation, we propose Video Consistency Distance (VCD), a novel metric designed to enhance temporal consistency, and fine-tune a model with the reward-based fine-tuning framework. To achieve coherent temporal consistency relative to a conditioning image, VCD is defined in the frequency space of video frame features to capture frame information effectively through frequency-domain analysis. Experimental results across multiple I2V datasets demonstrate that fine-tuning a video generation model with VCD significantly enhances temporal consistency without degrading other performance compared to the previous method.
comment: 17 pages
♻ ☆ Text-conditioned State Space Model For Domain-generalized Change Detection Visual Question Answering
The Earth's surface is constantly changing, and detecting these changes provides valuable insights that benefit various aspects of human society. While traditional change detection methods have been employed to detect changes from bi-temporal images, these approaches typically require expert knowledge for accurate interpretation. To enable broader and more flexible access to change information by non-expert users, the task of Change Detection Visual Question Answering (CDVQA) has been introduced. However, existing CDVQA methods have been developed under the assumption that training and testing datasets share similar distributions. This assumption does not hold in real-world applications, where domain shifts often occur. In this paper, the CDVQA task is revisited with a focus on addressing domain shift. To this end, a new multi-modal and multi-domain dataset, BrightVQA, is introduced to facilitate domain generalization research in CDVQA. Furthermore, a novel state space model, termed Text-Conditioned State Space Model (TCSSM), is proposed. The TCSSM framework is designed to leverage both bi-temporal imagery and geo-disaster-related textual information in an unified manner to extract domain-invariant features across domains. Input-dependent parameters existing in TCSSM are dynamically predicted by using both bi-temporal images and geo-disaster-related description, thereby facilitating the alignment between bi-temporal visual data and the associated textual descriptions. Extensive experiments are conducted to evaluate the proposed method against state-of-the-art models, and superior performance is consistently demonstrated. The code and dataset will be made publicly available upon acceptance at https://github.com/Elman295/TCSSM.
♻ ☆ Comprehensive Evaluation and Analysis for NSFW Concept Erasure in Text-to-Image Diffusion Models
Text-to-image diffusion models have gained widespread application across various domains, demonstrating remarkable creative potential. However, the strong generalization capabilities of diffusion models can inadvertently lead to the generation of not-safe-for-work (NSFW) content, posing significant risks to their safe deployment. While several concept erasure methods have been proposed to mitigate the issue associated with NSFW content, a comprehensive evaluation of their effectiveness across various scenarios remains absent. To bridge this gap, we introduce a full-pipeline toolkit specifically designed for concept erasure and conduct the first systematic study of NSFW concept erasure methods. By examining the interplay between the underlying mechanisms and empirical observations, we provide in-depth insights and practical guidance for the effective application of concept erasure methods in various real-world scenarios, with the aim of advancing the understanding of content safety in diffusion models and establishing a solid foundation for future research and development in this critical area.
♻ ☆ LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer
Universal image restoration (UIR) aims to recover images degraded by unknown mixtures while preserving semantics -- conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free UIR framework that adapts a large diffusion transformer (Flux.1) without image captions. LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. Then, a timestep- and layer-adaptive modulation schedule is designed to route these cues across the backbone's hierarchy, in order to yield coarse-to-fine and context-aware updates that protect the global structure while recovering texture. After that, to avoid the latency and instability of text prompts or MLLM captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on -- rather than adding parameters or relying on text prompts -- is the governing lever for robust and caption-free universal image restoration in the wild.
comment: Project Page: https://w2genai-lab.github.io/LucidFlux
♻ ☆ CBDiff:Conditional Bernoulli Diffusion Models for Image Forgery Localization
Image Forgery Localization (IFL) is a crucial task in image forensics, aimed at accurately identifying manipulated or tampered regions within an image at the pixel level. Existing methods typically generate a single deterministic localization map, which often lacks the precision and reliability required for high-stakes applications such as forensic analysis and security surveillance. To enhance the credibility of predictions and mitigate the risk of errors, we introduce an advanced Conditional Bernoulli Diffusion Model (CBDiff). Given a forged image, CBDiff generates multiple diverse and plausible localization maps, thereby offering a richer and more comprehensive representation of the forgery distribution. This approach addresses the uncertainty and variability inherent in tampered regions. Furthermore, CBDiff innovatively incorporates Bernoulli noise into the diffusion process to more faithfully reflect the inherent binary and sparse properties of forgery masks. Additionally, CBDiff introduces a Time-Step Cross-Attention (TSCAttention), which is specifically designed to leverage semantic feature guidance with temporal steps to improve manipulation detection. Extensive experiments on eight publicly benchmark datasets demonstrate that CBDiff significantly outperforms existing state-of-the-art methods, highlighting its strong potential for real-world deployment.
♻ ☆ PlantSegNeRF: A few-shot, cross-species method for plant 3D instance point cloud reconstruction via joint-channel NeRF with multi-view image instance matching
Organ segmentation of plant point clouds is a prerequisite for the high-resolution and accurate extraction of organ-level phenotypic traits. Although the fast development of deep learning has boosted much research on segmentation of plant point clouds, the existing techniques for organ segmentation still face limitations in resolution, segmentation accuracy, and generalizability across various plant species. In this study, we proposed a novel approach called plant segmentation neural radiance fields (PlantSegNeRF), aiming to directly generate high-precision instance point clouds from multi-view RGB image sequences for a wide range of plant species. PlantSegNeRF performed 2D instance segmentation on the multi-view images to generate instance masks for each organ with a corresponding ID. The multi-view instance IDs corresponding to the same plant organ were then matched and refined using a specially designed instance matching module. The instance NeRF was developed to render an implicit scene, containing color, density, semantic and instance information. The implicit scene was ultimately converted into high-precision plant instance point clouds based on the volume density. The results proved that in semantic segmentation of point clouds, PlantSegNeRF outperformed the commonly used methods, demonstrating an average improvement of 16.1%, 18.3%, 17.8%, and 24.2% in precision, recall, F1-score, and IoU compared to the second-best results on structurally complex species. More importantly, PlantSegNeRF exhibited significant advantages in plant point cloud instance segmentation tasks. Across all plant species, it achieved average improvements of 11.7%, 38.2%, 32.2% and 25.3% in mPrec, mRec, mCov, mWCov, respectively. This study extends the organ-level plant phenotyping and provides a high-throughput way to supply high-quality 3D data for the development of large-scale models in plant science.
♻ ☆ FerretNet: Efficient Synthetic Image Detection via Local Pixel Dependencies NeurIPS 2025
The increasing realism of synthetic images generated by advanced models such as VAEs, GANs, and LDMs poses significant challenges for synthetic image detection. To address this issue, we explore two artifact types introduced during the generation process: (1) latent distribution deviations and (2) decoding-induced smoothing effects, which manifest as inconsistencies in local textures, edges, and color transitions. Leveraging local pixel dependencies (LPD) properties rooted in Markov Random Fields, we reconstruct synthetic images using neighboring pixel information to expose disruptions in texture continuity and edge coherence. Building upon LPD, we propose FerretNet, a lightweight neural network with only 1.1M parameters that delivers efficient and robust synthetic image detection. Extensive experiments demonstrate that FerretNet, trained exclusively on the 4-class ProGAN dataset, achieves an average accuracy of 97.1% on an open-world benchmark comprising 22 generative models. Our code and datasets are publicly available at https://github.com/xigua7105/FerretNet.
comment: 9 pages, 4 figures, 8 tables, accepted at NeurIPS 2025
♻ ☆ OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts
The ability to segment objects based on open-ended language prompts remains a critical challenge, requiring models to ground textual semantics into precise spatial masks while handling diverse and unseen categories. We present OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model v2 (SAM2) to open-vocabulary scenarios by integrating multi-modal embeddings extracted from a lightweight vision-language model (VLM). Our approach is guided by four key principles: i) Unified prompting: OpenWorldSAM supports a diverse range of prompts, including category-level and sentence-level language descriptions, providing a flexible interface for various segmentation tasks. ii) Efficiency: By freezing the pre-trained components of SAM2 and the VLM, we train only 4.5 million parameters on the COCO-stuff dataset, achieving remarkable resource efficiency. iii) Instance Awareness: We enhance the model's spatial understanding through novel positional tie-breaker embeddings and cross-attention layers, enabling effective segmentation of multiple instances. iv) Generalization: OpenWorldSAM exhibits strong zero-shot capabilities, generalizing well on unseen categories and an open vocabulary of concepts without additional training. Extensive experiments demonstrate that OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks. Code is available at https://github.com/GinnyXiao/OpenWorldSAM.
♻ ☆ Generative diffusion model surrogates for mechanistic agent-based biological models
Mechanistic, multicellular, agent-based models are commonly used to investigate tissue, organ, and organism-scale biology at single-cell resolution. The Cellular-Potts Model (CPM) is a powerful and popular framework for developing and interrogating these models. CPMs become computationally expensive at large space- and time- scales making application and investigation of developed models difficult. Surrogate models may allow for the accelerated evaluation of CPMs of complex biological systems. However, the stochastic nature of these models means each set of parameters may give rise to different model configurations, complicating surrogate model development. In this work, we leverage denoising diffusion probabilistic models to train a generative AI surrogate of a CPM used to investigate in vitro vasculogenesis. We describe the use of an image classifier to learn the characteristics that define unique areas of a 2-dimensional parameter space. We then apply this classifier to aid in surrogate model selection and verification. Our CPM model surrogate generates model configurations 20,000 timesteps ahead of a reference configuration and demonstrates approximately a 22x reduction in computational time as compared to native code execution. Our work represents a step towards the implementation of DDPMs to develop digital twins of stochastic biological systems.
♻ ☆ Sherlock: Self-Correcting Reasoning in Vision-Language Models NeurIPS 2025
Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.
comment: Published at NeurIPS 2025, 27 pages
♻ ☆ REOrdering Patches Improves Vision Models NeurIPS 2025
Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and Functional Map of the World by 13.35%.
comment: Accepted to the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Spiking Neural Networks Need High Frequency Information
Spiking Neural Networks promise brain-inspired and energy-efficient computation by transmitting information through binary (0/1) spikes. Yet, their performance still lags behind that of artificial neural networks, often assumed to result from information loss caused by sparse and binary activations. In this work, we challenge this long-standing assumption and reveal a previously overlooked frequency bias: spiking neurons inherently suppress high-frequency components and preferentially propagate low-frequency information. This frequency-domain imbalance, we argue, is the root cause of degraded feature representation in SNNs. Empirically, on Spiking Transformers, adopting Avg-Pooling (low-pass) for token mixing lowers performance to 76.73% on Cifar-100, whereas replacing it with Max-Pool (high-pass) pushes the top-1 accuracy to 79.12%. Accordingly, we introduce Max-Former that restores high-frequency signals through two frequency-enhancing operators: (1) extra Max-Pool in patch embedding, and (2) Depth-Wise Convolution in place of self-attention. Notably, Max-Former attains 82.39% top-1 accuracy on ImageNet using only 63.99M parameters, surpassing Spikformer (74.81%, 66.34M) by +7.58%. Extending our insight beyond transformers, our Max-ResNet-18 achieves state-of-the-art performance on convolution-based benchmarks: 97.17% on CIFAR-10 and 83.06% on CIFAR-100. We hope this simple yet effective solution inspires future research to explore the distinctive nature of spiking neural networks. Code is available: https://github.com/bic-L/MaxFormer.
♻ ☆ Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation
Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model's inference with obstetric practice. To train FetalMind at scale, we curate FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: https://hexiao0275.github.io/FetalMind.
comment: This paper contains fundamental errors and will not be replaced
♻ ☆ Panoptic-CUDAL: Rural Australia Point Cloud Dataset in Rainy Conditions
Existing autonomous driving datasets are predominantly oriented towards well-structured urban settings and favourable weather conditions, leaving the complexities of rural environments and adverse weather conditions largely unaddressed. Although some datasets encompass variations in weather and lighting, bad weather scenarios do not appear often. Rainfall can significantly impair sensor functionality, introducing noise and reflections in LiDAR and camera data and reducing the system's capabilities for reliable environmental perception and safe navigation. This paper introduces the Panoptic-CUDAL dataset, a novel dataset purpose-built for panoptic segmentation in rural areas subject to rain. By recording high-resolution LiDAR, camera, and pose data, Panoptic-CUDAL offers a diverse, information-rich dataset in a challenging scenario. We present the analysis of the recorded data and provide baseline results for panoptic, semantic segmentation, and 3D occupancy prediction methods on LiDAR point clouds. The dataset can be found here: https://robotics.sydney.edu.au/our-research/intelligent-transportation-systems, https://vision.rwth-aachen.de/panoptic-cudal
♻ ☆ SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model
High-resolution (HR) remote sensing imagery plays a vital role in a wide range of applications, including urban planning and environmental monitoring. However, due to limitations in sensors and data transmission links, the images acquired in practice often suffer from resolution degradation. Remote Sensing Image Super-Resolution (RSISR) aims to reconstruct HR images from low-resolution (LR) inputs, providing a cost-effective and efficient alternative to direct HR image acquisition. Existing RSISR methods primarily focus on low-level characteristics in pixel space, while neglecting the high-level understanding of remote sensing scenes. This may lead to semantically inconsistent artifacts in the reconstructed results. Motivated by this observation, our work aims to explore the role of high-level semantic knowledge in improving RSISR performance. We propose a Semantic-Guided Super-Resolution framework, SeG-SR, which leverages Vision-Language Models (VLMs) to extract semantic knowledge from input images and uses it to guide the super resolution (SR) process. Specifically, we first design a Semantic Feature Extraction Module (SFEM) that utilizes a pretrained VLM to extract semantic knowledge from remote sensing images. Next, we propose a Semantic Localization Module (SLM), which derives a series of semantic guidance from the extracted semantic knowledge. Finally, we develop a Learnable Modulation Module (LMM) that uses semantic guidance to modulate the features extracted by the SR network, effectively incorporating high-level scene understanding into the SR pipeline. We validate the effectiveness and generalizability of SeG-SR through extensive experiments: SeG-SR achieves state-of-the-art performance on three datasets, and consistently improves performance across various SR architectures. Notably, for the x4 SR task on UCMerced dataset, it attained a PSNR of 29.3042 dB and an SSIM of 0.7961.
♻ ☆ RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration
Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations, including evaluator bias and detection failures arising from model homogeneity, which collectively undermine the robustness of risk evaluation processes. This paper seeks to re-examine the risk evaluation paradigm by introducing a theoretical framework that reconstructs the underlying risk concept space. Specifically, we decompose the latent risk concept space into three mutually exclusive subspaces: the explicit risk subspace (encompassing direct violations of safety guidelines), the implicit risk subspace (capturing potential malicious content that requires contextual reasoning for identification), and the non-risk subspace. Furthermore, we propose RADAR, a multi-agent collaborative evaluation framework that leverages multi-round debate mechanisms through four specialized complementary roles and employs dynamic update mechanisms to achieve self-evolution of risk concept distributions. This approach enables comprehensive coverage of both explicit and implicit risks while mitigating evaluator bias. To validate the effectiveness of our framework, we construct an evaluation dataset comprising 800 challenging cases. Extensive experiments on our challenging testset and public benchmarks demonstrate that RADAR significantly outperforms baseline evaluation methods across multiple dimensions, including accuracy, stability, and self-evaluation risk sensitivity. Notably, RADAR achieves a 28.87% improvement in risk identification accuracy compared to the strongest baseline evaluation method.
♻ ☆ FuseUNet: A Multi-Scale Feature Fusion Method for U-like Networks
Medical image segmentation is a critical task in computer vision, with UNet serving as a milestone architecture. The typical component of UNet family is the skip connection, however, their skip connections face two significant limitations: (1) they lack effective interaction between features at different scales, and (2) they rely on simple concatenation or addition operations, which constrain efficient information integration. While recent improvements to UNet have focused on enhancing encoder and decoder capabilities, these limitations remain overlooked. To overcome these challenges, we propose a novel multi-scale feature fusion method that reimagines the UNet decoding process as solving an initial value problem (IVP), treating skip connections as discrete nodes. By leveraging principles from the linear multistep method, we propose an adaptive ordinary differential equation method to enable effective multi-scale feature fusion. Our approach is independent of the encoder and decoder architectures, making it adaptable to various U-Net-like networks. Experiments on ACDC, KiTS2023, MSD brain tumor, and ISIC2017/2018 skin lesion segmentation datasets demonstrate improved feature utilization, reduced network parameters, and maintained high performance. The code is available at https://github.com/nayutayuki/FuseUNet.
comment: Updated author information to clarify institutional affiliation. The research was conducted prior to the author joining the University of Maryland
Machine Learning 150
☆ Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge
Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.
☆ VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation
A fundamental challenge in robot navigation lies in learning policies that generalize across diverse environments while conforming to the unique physical constraints and capabilities of a specific embodiment (e.g., quadrupeds can walk up stairs, but rovers cannot). We propose VAMOS, a hierarchical VLA that decouples semantic planning from embodiment grounding: a generalist planner learns from diverse, open-world data, while a specialist affordance model learns the robot's physical constraints and capabilities in safe, low-cost simulation. We enabled this separation by carefully designing an interface that lets a high-level planner propose candidate paths directly in image space that the affordance model then evaluates and re-ranks. Our real-world experiments show that VAMOS achieves higher success rates in both indoor and complex outdoor navigation than state-of-the-art model-based and end-to-end learning methods. We also show that our hierarchical design enables cross-embodied navigation across legged and wheeled robots and is easily steerable using natural language. Real-world ablations confirm that the specialist model is key to embodiment grounding, enabling a single high-level planner to be deployed across physically distinct wheeled and legged robots. Finally, this model significantly enhances single-robot reliability, achieving 3X higher success rates by rejecting physically infeasible plans. Website: https://vamos-vla.github.io/
☆ KL-Regularized Reinforcement Learning is Designed to Mode Collapse
It is commonly believed that optimizing the reverse KL divergence results in "mode seeking", while optimizing forward KL results in "mass covering", with the latter being preferred if the goal is to sample from multiple diverse modes. We show -- mathematically and empirically -- that this intuition does not necessarily transfer well to doing reinforcement learning with reverse/forward KL regularization (e.g. as commonly used with language models). Instead, the choice of reverse/forward KL determines the family of optimal target distributions, parameterized by the regularization coefficient. Mode coverage depends primarily on other factors, such as regularization strength, and relative scales between rewards and reference probabilities. Further, we show commonly used settings such as low regularization strength and equal verifiable rewards tend to specify unimodal target distributions, meaning the optimization objective is, by construction, non-diverse. We leverage these insights to construct a simple, scalable, and theoretically justified algorithm. It makes minimal changes to reward magnitudes, yet optimizes for a target distribution which puts high probability over all high-quality sampling modes. In experiments, this simple modification works to post-train both Large Language Models and Chemical Language Models to have higher solution quality and diversity, without any external signals of diversity, and works with both forward and reverse KL when using either naively fails.
☆ On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text?
With the widespread use of large language models (LLMs), many researchers have turned their attention to detecting text generated by them. However, there is no consistent or precise definition of their target, namely "LLM-generated text". Differences in usage scenarios and the diversity of LLMs further increase the difficulty of detection. What is commonly regarded as the detecting target usually represents only a subset of the text that LLMs can potentially produce. Human edits to LLM outputs, together with the subtle influences that LLMs exert on their users, are blurring the line between LLM-generated and human-written text. Existing benchmarks and evaluation approaches do not adequately address the various conditions in real-world detector applications. Hence, the numerical results of detectors are often misunderstood, and their significance is diminishing. Therefore, detectors remain useful under specific conditions, but their results should be interpreted only as references rather than decisive indicators.
☆ Real Deep Research for AI, Robotics and Beyond
With the rapid growth of research in AI and robotics now producing over 10,000 papers annually it has become increasingly difficult for researchers to stay up to date. Fast evolving trends, the rise of interdisciplinary work, and the need to explore domains beyond one's expertise all contribute to this challenge. To address these issues, we propose a generalizable pipeline capable of systematically analyzing any research area: identifying emerging trends, uncovering cross domain opportunities, and offering concrete starting points for new inquiry. In this work, we present Real Deep Research (RDR) a comprehensive framework applied to the domains of AI and robotics, with a particular focus on foundation models and robotics advancements. We also briefly extend our analysis to other areas of science. The main paper details the construction of the RDR pipeline, while the appendix provides extensive results across each analyzed topic. We hope this work sheds light for researchers working in the field of AI and beyond.
comment: website: https://realdeepresearch.github.io
☆ The Reality Gap in Robotics: Challenges, Solutions, and Best Practices
Machine learning has facilitated significant advancements across various robotics domains, including navigation, locomotion, and manipulation. Many such achievements have been driven by the extensive use of simulation as a critical tool for training and testing robotic systems prior to their deployment in real-world environments. However, simulations consist of abstractions and approximations that inevitably introduce discrepancies between simulated and real environments, known as the reality gap. These discrepancies significantly hinder the successful transfer of systems from simulation to the real world. Closing this gap remains one of the most pressing challenges in robotics. Recent advances in sim-to-real transfer have demonstrated promising results across various platforms, including locomotion, navigation, and manipulation. By leveraging techniques such as domain randomization, real-to-sim transfer, state and action abstractions, and sim-real co-training, many works have overcome the reality gap. However, challenges persist, and a deeper understanding of the reality gap's root causes and solutions is necessary. In this survey, we present a comprehensive overview of the sim-to-real landscape, highlighting the causes, solutions, and evaluation metrics for the reality gap and sim-to-real transfer.
comment: Accepted for Publication as part of the Annual Review of Control, Robotics, and Autonomous Systems 2026
☆ Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers
Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time; a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction, utilizing continuous pixel-space representations for video prediction. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time horizon for physically accurate predictions by up to 50% when compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments to identify network regions that encode information useful to perform accurate estimations of PDE simulation parameters via probing models, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter efficient, and interpretable approach.
comment: 14 pages, 14 figures
☆ Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples
Recently, Sharma et al. suggested a method called Layer-SElective-Rank reduction (LASER) which demonstrated that pruning high-order components of carefully chosen LLM's weight matrices can boost downstream accuracy -- without any gradient-based fine-tuning. Yet LASER's exhaustive, per-matrix search (each requiring full-dataset forward passes) makes it impractical for rapid deployment. We demonstrate that this overhead can be removed and find that: (i) Only a small, carefully chosen subset of matrices needs to be inspected -- eliminating the layer-by-layer sweep, (ii) The gradient of each matrix's singular values pinpoints which matrices merit reduction, (iii) Increasing the factorization search space by allowing matrices rows to cluster around multiple subspaces and then decomposing each cluster separately further reduces overfitting on the original training data and further lifts accuracy by up to 24.6 percentage points, and finally, (iv) we discover that evaluating on just 100 samples rather than the full training data -- both for computing the indicative gradients and for measuring the final accuracy -- suffices to further reduce the search time; we explain that as adaptation to downstream tasks is dominated by prompting style, not dataset size. As a result, we show that combining these findings yields a fast and robust adaptation algorithm for downstream tasks. Overall, with a single gradient step on 100 examples and a quick scan of the top candidate layers and factorization techniques, we can adapt LLMs to new datasets -- entirely without fine-tuning.
☆ Simple Context Compression: Mean-Pooling and Multi-Ratio Training
A common strategy to reduce the computational costs of using long contexts in retrieval-augmented generation (RAG) with large language models (LLMs) is soft context compression, where the input sequence is transformed into a shorter continuous representation. We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture, and study training the same compressor to output multiple compression ratios. We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios. Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios. More broadly though, across architectures and training regimes the trade-offs are more nuanced, illustrating the complex landscape of compression methods.
comment: Code available at https://github.com/lil-lab/simple-context-compression
☆ Bayesian Inference of Primordial Magnetic Field Parameters from CMB with Spherical Graph Neural Networks
Deep learning has emerged as a transformative methodology in modern cosmology, providing powerful tools to extract meaningful physical information from complex astronomical datasets. This paper implements a novel Bayesian graph deep learning framework for estimating key cosmological parameters in a primordial magnetic field (PMF) cosmology directly from simulated Cosmic Microwave Background (CMB) maps. Our methodology utilizes DeepSphere, a spherical convolutional neural network architecture specifically designed to respect the spherical geometry of CMB data through HEALPix pixelization. To advance beyond deterministic point estimates and enable robust uncertainty quantification, we integrate Bayesian Neural Networks (BNNs) into the framework, capturing aleatoric and epistemic uncertainties that reflect the model confidence in its predictions. The proposed approach demonstrates exceptional performance, achieving $R^{2}$ scores exceeding 0.89 for the magnetic parameter estimation. We further obtain well-calibrated uncertainty estimates through post-hoc training techniques including Variance Scaling and GPNormal. This integrated DeepSphere-BNNs framework not only delivers accurate parameter estimation from CMB maps with PMF contributions but also provides reliable uncertainty quantification, providing the necessary tools for robust cosmological inference in the era of precision cosmology.
comment: 16 pages, 6 figures, 4 tables
☆ BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation
The rapid progress of graph generation has raised new security concerns, particularly regarding backdoor vulnerabilities. While prior work has explored backdoor attacks in image diffusion and unconditional graph generation, conditional, especially text-guided graph generation remains largely unexamined. This paper proposes BadGraph, a backdoor attack method targeting latent diffusion models for text-guided graph generation. BadGraph leverages textual triggers to poison training data, covertly implanting backdoors that induce attacker-specified subgraphs during inference when triggers appear, while preserving normal performance on clean inputs. Extensive experiments on four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the effectiveness and stealth of the attack: less than 10% poisoning rate can achieves 50% attack success rate, while 24% suffices for over 80% success rate, with negligible performance degradation on benign samples. Ablation studies further reveal that the backdoor is implanted during VAE and diffusion training rather than pretraining. These findings reveal the security vulnerabilities in latent diffusion models of text-guided graph generation, highlight the serious risks in models' applications such as drug discovery and underscore the need for robust defenses against the backdoor attack in such diffusion models.
☆ Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To mitigate the issue, we explore a series of hybrid models that restore direct access to past tokens. We interleave token mixers with intermediate time and space complexity between linear and full attention, including sparse attention with token eviction, and the query-aware native sparse attention. Particularly, we propose a novel learnable token eviction approach. Combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent tokens to adaptively retain a limited set of critical KV-pairs per head, maintaining linear attention's constant time and space complexity. Efficient Triton kernels for the sparse attention mechanisms are provided. Empirical evaluations on retrieval-intensive benchmarks support the effectiveness of our approaches.
comment: 19 pages, 5 figures
☆ A Coherence-Based Measure of AGI
Recent work by \citet{hendrycks2025agidefinition} formalized \textit{Artificial General Intelligence} (AGI) as the arithmetic mean of proficiencies across cognitive domains derived from the Cattell--Horn--Carroll (CHC) model of human cognition. While elegant, this definition assumes \textit{compensability} -- that exceptional ability in some domains can offset failure in others. True general intelligence, however, should reflect \textit{coherent sufficiency}: balanced competence across all essential domains. We propose a coherence-aware measure of AGI based on the integral of generalized means over a continuum of compensability exponents. This formulation spans arithmetic, geometric, and harmonic regimes, and the resulting \textit{area under the curve} (AUC) quantifies robustness under varying compensability assumptions. Unlike the arithmetic mean, which rewards specialization, the AUC penalizes imbalance and captures inter-domain dependency. Applied to published CHC-based domain scores for GPT-4 and GPT-5, the coherence-adjusted AUC reveals that both systems remain far from general competence despite high arithmetic scores (e.g., GPT-5 at~24\%). Integrating the generalized mean thus yields a principled, interpretable, and stricter foundation for measuring genuine progress toward AGI.
comment: 13 pages, 1 figure, 12 tables
☆ Out-of-distribution Tests Reveal Compositionality in Chess Transformers
Chess is a canonical example of a task that requires rigorous reasoning and long-term planning. Modern decision Transformers - trained similarly to LLMs - are able to learn competent gameplay, but it is unclear to what extent they truly capture the rules of chess. To investigate this, we train a 270M parameter chess Transformer and test it on out-of-distribution scenarios, designed to reveal failures of systematic generalization. Our analysis shows that Transformers exhibit compositional generalization, as evidenced by strong rule extrapolation: they adhere to fundamental syntactic rules of the game by consistently choosing valid moves even in situations very different from the training data. Moreover, they also generate high-quality moves for OOD puzzles. In a more challenging test, we evaluate the models on variants including Chess960 (Fischer Random Chess) - a variant of chess where starting positions of pieces are randomized. We found that while the model exhibits basic strategy adaptation, they are inferior to symbolic AI algorithms that perform explicit search, but gap is smaller when playing against users on Lichess. Moreover, the training dynamics revealed that the model initially learns to move only its own pieces, suggesting an emergent compositional understanding of the game.
☆ AlphaFlow: Understanding and Improving MeanFlow Models
MeanFlow has recently emerged as a powerful framework for few-step generative modeling trained from scratch, but its success is not yet fully understood. In this work, we show that the MeanFlow objective naturally decomposes into two parts: trajectory flow matching and trajectory consistency. Through gradient analysis, we find that these terms are strongly negatively correlated, causing optimization conflict and slow convergence. Motivated by these insights, we introduce $\alpha$-Flow, a broad family of objectives that unifies trajectory flow matching, Shortcut Model, and MeanFlow under one formulation. By adopting a curriculum strategy that smoothly anneals from trajectory flow matching to MeanFlow, $\alpha$-Flow disentangles the conflicting objectives, and achieves better convergence. When trained from scratch on class-conditional ImageNet-1K 256x256 with vanilla DiT backbones, $\alpha$-Flow consistently outperforms MeanFlow across scales and settings. Our largest $\alpha$-Flow-XL/2+ model achieves new state-of-the-art results using vanilla DiT backbones, with FID scores of 2.58 (1-NFE) and 2.15 (2-NFE).
☆ CSU-PCAST: A Dual-Branch Transformer Framework for medium-range ensemble Precipitation Forecasting
Accurate medium-range precipitation forecasting is crucial for hydrometeorological risk management and disaster mitigation, yet remains challenging for current numerical weather prediction (NWP) systems. Traditional ensemble systems such as the Global Ensemble Forecast System (GEFS) struggle to maintain high skill, especially for moderate and heavy rainfall at extended lead times. This study develops a deep learning-based ensemble framework for multi-step precipitation prediction through joint modeling of a comprehensive set of atmospheric variables. The model is trained on ERA5 reanalysis data at 0.25$^{\circ}$ spatial resolution, with precipitation labels from NASA's Integrated Multi-satellite Retrievals for Global Precipitation Measurement (GPM) constellation (IMERG), incorporating 57 input variables, including upper-air and surface predictors. The architecture employs a patch-based Swin Transformer backbone with periodic convolutions to handle longitudinal continuity and integrates time and noise embeddings through conditional layer normalization. A dual-branch decoder predicts total precipitation and other variables, with targeted freezing of encoder-decoder pathways for specialized training. Training minimizes a hybrid loss combining the Continuous Ranked Probability Score (CRPS) and weighted log1p mean squared error (log1pMSE), balancing probabilistic accuracy and magnitude fidelity. During inference, the model ingests real-time Global Forecast System (GFS) initial conditions to generate 15-day forecasts autoregressively. Evaluation against GEFS using IMERG data demonstrates higher Critical Success Index (CSI) scores at precipitation thresholds of 0.1 mm, 1 mm, 10 mm, and 20 mm, highlighting improved performance for moderate to heavy rainfall.
comment: 20 pages, 12 figures, submitted to arXiv under Atmospheric and Oceanic Physics (physics.ao-ph) and Machine Learning (cs.LG)
☆ MEIcoder: Decoding Visual Stimuli from Neural Activity by Leveraging Most Exciting Inputs NeurIPS 2025
Decoding visual stimuli from neural population activity is crucial for understanding the brain and for applications in brain-machine interfaces. However, such biological data is often scarce, particularly in primates or humans, where high-throughput recording techniques, such as two-photon imaging, remain challenging or impossible to apply. This, in turn, poses a challenge for deep learning decoding techniques. To overcome this, we introduce MEIcoder, a biologically informed decoding method that leverages neuron-specific most exciting inputs (MEIs), a structural similarity index measure loss, and adversarial training. MEIcoder achieves state-of-the-art performance in reconstructing visual stimuli from single-cell activity in primary visual cortex (V1), especially excelling on small datasets with fewer recorded neurons. Using ablation studies, we demonstrate that MEIs are the main drivers of the performance, and in scaling experiments, we show that MEIcoder can reconstruct high-fidelity natural-looking images from as few as 1,000-2,500 neurons and less than 1,000 training data points. We also propose a unified benchmark with over 160,000 samples to foster future research. Our results demonstrate the feasibility of reliable decoding in early visual system and provide practical insights for neuroscience and neuroengineering applications.
comment: Accepted to NeurIPS 2025
☆ Reinforcement Learning and Consumption-Savings Behavior
This paper demonstrates how reinforcement learning can explain two puzzling empirical patterns in household consumption behavior during economic downturns. I develop a model where agents use Q-learning with neural network approximation to make consumption-savings decisions under income uncertainty, departing from standard rational expectations assumptions. The model replicates two key findings from recent literature: (1) unemployed households with previously low liquid assets exhibit substantially higher marginal propensities to consume (MPCs) out of stimulus transfers compared to high-asset households (0.50 vs 0.34), even when neither group faces borrowing constraints, consistent with Ganong et al. (2024); and (2) households with more past unemployment experiences maintain persistently lower consumption levels after controlling for current economic conditions, a "scarring" effect documented by Malmendier and Shen (2024). Unlike existing explanations based on belief updating about income risk or ex-ante heterogeneity, the reinforcement learning mechanism generates both higher MPCs and lower consumption levels simultaneously through value function approximation errors that evolve with experience. Simulation results closely match the empirical estimates, suggesting that adaptive learning through reinforcement learning provides a unifying framework for understanding how past experiences shape current consumption behavior beyond what current economic conditions would predict.
comment: 41 pages, 10 figures
☆ Learning to Triage Taint Flows Reported by Dynamic Program Analysis in Node.js Packages
Program analysis tools often produce large volumes of candidate vulnerability reports that require costly manual review, creating a practical challenge: how can security analysts prioritize the reports most likely to be true vulnerabilities? This paper investigates whether machine learning can be applied to prioritizing vulnerabilities reported by program analysis tools. We focus on Node.js packages and collect a benchmark of 1,883 Node.js packages, each containing one reported ACE or ACI vulnerability. We evaluate a variety of machine learning approaches, including classical models, graph neural networks (GNNs), large language models (LLMs), and hybrid models that combine GNN and LLMs, trained on data based on a dynamic program analysis tool's output. The top LLM achieves $F_{1} {=} 0.915$, while the best GNN and classical ML models reaching $F_{1} {=} 0.904$. At a less than 7% false-negative rate, the leading model eliminates 66.9% of benign packages from manual review, taking around 60 ms per package. If the best model is tuned to operate at a precision level of 0.8 (i.e., allowing 20% false positives amongst all warnings), our approach can detect 99.2% of exploitable taint flows while missing only 0.8%, demonstrating strong potential for real-world vulnerability triage.
☆ Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process
Developing effective multimodal fusion approaches has become increasingly essential in many real-world scenarios, such as health care and finance. The key challenge is how to preserve the feature expressiveness in each modality while learning cross-modal interactions. Previous approaches primarily focus on the cross-modal alignment, while over-emphasis on the alignment of marginal distributions of modalities may impose excess regularization and obstruct meaningful representations within each modality. The Dirichlet process (DP) mixture model is a powerful Bayesian non-parametric method that can amplify the most prominent features by its richer-gets-richer property, which allocates increasing weights to them. Inspired by this unique characteristic of DP, we propose a new DP-driven multimodal learning framework that automatically achieves an optimal balance between prominent intra-modal representation learning and cross-modal alignment. Specifically, we assume that each modality follows a mixture of multivariate Gaussian distributions and further adopt DP to calculate the mixture weights for all the components. This paradigm allows DP to dynamically allocate the contributions of features and select the most prominent ones, leveraging its richer-gets-richer property, thus facilitating multimodal feature fusion. Extensive experiments on several multimodal datasets demonstrate the superior performance of our model over other competitors. Ablation analysis further validates the effectiveness of DP in aligning modality distributions and its robustness to changes in key hyperparameters. Code is anonymously available at https://github.com/HKU-MedAI/DPMM.git
comment: Accepted by NeruIPS 2025
☆ Thought Communication in Multiagent Collaboration NeurIPS 2025
Natural language has long enabled human cooperation, but its lossy, ambiguous, and indirect nature limits the potential of collective intelligence. While machines are not subject to these constraints, most LLM-based multi-agent systems still rely solely on natural language, exchanging tokens or their embeddings. To go beyond language, we introduce a new paradigm, thought communication, which enables agents to interact directly mind-to-mind, akin to telepathy. To uncover these latent thoughts in a principled way, we formalize the process as a general latent variable model, where agent states are generated by an unknown function of underlying thoughts. We prove that, in a nonparametric setting without auxiliary information, both shared and private latent thoughts between any pair of agents can be identified. Moreover, the global structure of thought sharing, including which agents share which thoughts and how these relationships are structured, can also be recovered with theoretical guarantees. Guided by the established theory, we develop a framework that extracts latent thoughts from all agents prior to communication and assigns each agent the relevant thoughts, along with their sharing patterns. This paradigm naturally extends beyond LLMs to all modalities, as most observational data arise from hidden generative processes. Experiments on both synthetic and real-world benchmarks validate the theory and demonstrate the collaborative advantages of thought communication. We hope this work illuminates the potential of leveraging the hidden world, as many challenges remain unsolvable through surface-level observation alone, regardless of compute or data scale.
comment: NeurIPS 2025 Spotlight
☆ No-Regret Thompson Sampling for Finite-Horizon Markov Decision Processes with Gaussian Processes NeurIPS
Thompson sampling (TS) is a powerful and widely used strategy for sequential decision-making, with applications ranging from Bayesian optimization to reinforcement learning (RL). Despite its success, the theoretical foundations of TS remain limited, particularly in settings with complex temporal structure such as RL. We address this gap by establishing no-regret guarantees for TS using models with Gaussian marginal distributions. Specifically, we consider TS in episodic RL with joint Gaussian process (GP) priors over rewards and transitions. We prove a regret bound of $\mathcal{\tilde{O}}(\sqrt{KH\Gamma(KH)})$ over $K$ episodes of horizon $H$, where $\Gamma(\cdot)$ captures the complexity of the GP model. Our analysis addresses several challenges, including the non-Gaussian nature of value functions and the recursive structure of Bellman updates, and extends classical tools such as the elliptical potential lemma to multi-output settings. This work advances the understanding of TS in RL and highlights how structural assumptions and model uncertainty shape its performance in finite-horizon Markov Decision Processes.
comment: Appearing in NeurIPS, 2025
☆ Unsupervised Anomaly Prediction with N-BEATS and Graph Neural Network in Multi-variate Semiconductor Process Time Series
Semiconductor manufacturing is an extremely complex and precision-driven process, characterized by thousands of interdependent parameters collected across diverse tools and process steps. Multi-variate time-series analysis has emerged as a critical field for real-time monitoring and fault detection in such environments. However, anomaly prediction in semiconductor fabrication presents several critical challenges, including high dimensionality of sensor data and severe class imbalance due to the rarity of true faults. Furthermore, the complex interdependencies between variables complicate both anomaly prediction and root-cause-analysis. This paper proposes two novel approaches to advance the field from anomaly detection to anomaly prediction, an essential step toward enabling real-time process correction and proactive fault prevention. The proposed anomaly prediction framework contains two main stages: (a) training a forecasting model on a dataset assumed to contain no anomalies, and (b) performing forecast on unseen time series data. The forecast is compared with the forecast of the trained signal. Deviations beyond a predefined threshold are flagged as anomalies. The two approaches differ in the forecasting model employed. The first assumes independence between variables by utilizing the N-BEATS model for univariate time series forecasting. The second lifts this assumption by utilizing a Graph Neural Network (GNN) to capture inter-variable relationships. Both models demonstrate strong forecasting performance up to a horizon of 20 time points and maintain stable anomaly prediction up to 50 time points. The GNN consistently outperforms the N-BEATS model while requiring significantly fewer trainable parameters and lower computational cost. These results position the GNN as promising solution for online anomaly forecasting to be deployed in manufacturing environments.
comment: 17 pages, 27 figures
☆ Optimizing Clinical Fall Risk Prediction: A Data-Driven Integration of EHR Variables with the Johns Hopkins Fall Risk Assessment Tool
In this study we aim to better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with additional clinically meaningful measures via a data-driven modelling approach. We conducted a retrospective analysis of 54,209 inpatient admissions from three Johns Hopkins Health System hospitals between March 2022 and October 2023. A total of 20,208 admissions were included as high fall risk encounters, and 13,941 were included as low fall risk encounters. To incorporate clinical knowledge and maintain interpretability, we employed constrained score optimization (CSO) models on JHFRAT assessment data and additional electronic health record (EHR) variables. The model demonstrated significant improvements in predictive performance over the current JHFRAT (CSO AUC-ROC=0.91, JHFRAT AUC-ROC=0.86). The constrained score optimization models performed similarly with and without the EHR variables. Although the benchmark black-box model (XGBoost), improves upon the performance metrics of the knowledge-based constrained logistic regression (AUC-ROC=0.94), the CSO demonstrates more robustness to variations in risk labelling. This evidence-based approach provides a robust foundation for health systems to systematically enhance inpatient fall prevention protocols and patient safety using data-driven optimization techniques, contributing to improved risk assessment and resource allocation in healthcare settings.
comment: 19 pages, 7 figures, 4 tables
☆ Separating the what and how of compositional computation to enable reuse and continual learning
The ability to continually learn, retain and deploy skills to accomplish goals is a key feature of intelligent and efficient behavior. However, the neural mechanisms facilitating the continual learning and flexible (re-)composition of skills remain elusive. Here, we study continual learning and the compositional reuse of learned computations in recurrent neural network (RNN) models using a novel two-system approach: one system that infers what computation to perform, and one that implements how to perform it. We focus on a set of compositional cognitive tasks commonly studied in neuroscience. To construct the what system, we first show that a large family of tasks can be systematically described by a probabilistic generative model, where compositionality stems from a shared underlying vocabulary of discrete task epochs. The shared epoch structure makes these tasks inherently compositional. We first show that this compositionality can be systematically described by a probabilistic generative model. Furthermore, We develop an unsupervised online learning approach that can learn this model on a single-trial basis, building its vocabulary incrementally as it is exposed to new tasks, and inferring the latent epoch structure as a time-varying computational context within a trial. We implement the how system as an RNN whose low-rank components are composed according to the context inferred by the what system. Contextual inference facilitates the creation, learning, and reuse of low-rank RNN components as new tasks are introduced sequentially, enabling continual learning without catastrophic forgetting. Using an example task set, we demonstrate the efficacy and competitive performance of this two-system learning framework, its potential for forward and backward transfer, as well as fast compositional generalization to unseen tasks.
☆ Neural Diversity Regularizes Hallucinations in Small Models
Language models continue to hallucinate despite increases in parameters, compute, and data. We propose neural diversity -- decorrelated parallel representations -- as a principled mechanism that reduces hallucination rates at fixed parameter and data budgets. Inspired by portfolio theory, where uncorrelated assets reduce risk by $\sqrt{P}$, we prove hallucination probability is bounded by representational correlation: $P(H) \leq f(\sigma^2((1-\rho(P))/P + \rho(P)), \mu^2)$, which predicts that language models need an optimal amount of neurodiversity. To validate this, we introduce ND-LoRA (Neural Diversity Low-Rank Adaptation), combining parallel LoRA adapters with Barlow Twins regularization, and demonstrate that ND-LoRA reduces hallucinations by up to 25.6% (and 14.6% on average) without degrading general accuracy. Ablations show LoRA adapters and regularization act synergistically, causal interventions prove neurodiversity as the mediating factor and correlational analyses indicate scale: a 0.1% neural correlation increase is associated with a 3.8% hallucination increase. Finally, task-dependent optimality emerges: different tasks require different amounts of optimal neurodiversity. Together, our results highlight neural diversity as a third axis of scaling -- orthogonal to parameters and data -- to improve the reliability of language models at fixed budgets.
☆ A Scalable, Causal, and Energy Efficient Framework for Neural Decoding with Spiking Neural Networks
Brain-computer interfaces (BCIs) promise to enable vital functions, such as speech and prosthetic control, for individuals with neuromotor impairments. Central to their success are neural decoders, models that map neural activity to intended behavior. Current learning-based decoding approaches fall into two classes: simple, causal models that lack generalization, or complex, non-causal models that generalize and scale offline but struggle in real-time settings. Both face a common challenge, their reliance on power-hungry artificial neural network backbones, which makes integration into real-world, resource-limited systems difficult. Spiking neural networks (SNNs) offer a promising alternative. Because they operate causally these models are suitable for real-time use, and their low energy demands make them ideal for battery-constrained environments. To this end, we introduce Spikachu: a scalable, causal, and energy-efficient neural decoding framework based on SNNs. Our approach processes binned spikes directly by projecting them into a shared latent space, where spiking modules, adapted to the timing of the input, extract relevant features; these latent representations are then integrated and decoded to generate behavioral predictions. We evaluate our approach on 113 recording sessions from 6 non-human primates, totaling 43 hours of recordings. Our method outperforms causal baselines when trained on single sessions using between 2.26 and 418.81 times less energy. Furthermore, we demonstrate that scaling up training to multiple sessions and subjects improves performance and enables few-shot transfer to unseen sessions, subjects, and tasks. Overall, Spikachu introduces a scalable, online-compatible neural decoding framework based on SNNs, whose performance is competitive relative to state-of-the-art models while consuming orders of magnitude less energy.
☆ Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling
Multi-bit quantization networks enable flexible deployment of deep neural networks by supporting multiple precision levels within a single model. However, existing approaches suffer from significant training overhead as full-dataset updates are repeated for each supported bit-width, resulting in a cost that scales linearly with the number of precisions. Additionally, extra fine-tuning stages are often required to support additional or intermediate precision options, further compounding the overall training burden. To address this issue, we propose two techniques that greatly reduce the training overhead without compromising model utility: (i) Weight bias correction enables shared batch normalization and eliminates the need for fine-tuning by neutralizing quantization-induced bias across bit-widths and aligning activation distributions; and (ii) Bit-wise coreset sampling strategy allows each child model to train on a compact, informative subset selected via gradient-based importance scores by exploiting the implicit knowledge transfer phenomenon. Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with both ResNet and ViT architectures demonstrate that our method achieves competitive or superior accuracy while reducing training time up to 7.88x. Our code is released at https://github.com/a2jinhee/EMQNet_jk.
☆ GRACE: GRaph-based Addiction Care prEdiction
Determining the appropriate locus of care for addiction patients is one of the most critical clinical decisions that affects patient treatment outcomes and effective use of resources. With a lack of sufficient specialized treatment resources, such as inpatient beds or staff, there is an unmet need to develop an automated framework for the same. Current decision-making approaches suffer from severe class imbalances in addiction datasets. To address this limitation, we propose a novel graph neural network (GRACE) framework that formalizes locus of care prediction as a structured learning problem. Further, we perform extensive feature engineering and propose a new approach of obtaining an unbiased meta-graph to train a GNN to overcome the class imbalance problem. Experimental results in real-world data show an improvement of 11-35% in terms of the F1 score of the minority class over competitive baselines. The codes and note embeddings are available at https://anonymous.4open.science/r/GRACE-F8E1/.
☆ From Masks to Worlds: A Hitchhiker's Guide to World Models
This is not a typical survey of world models; it is a guide for those who want to build worlds. We do not aim to catalog every paper that has ever mentioned a ``world model". Instead, we follow one clear road: from early masked models that unified representation learning across modalities, to unified architectures that share a single paradigm, then to interactive generative models that close the action-perception loop, and finally to memory-augmented systems that sustain consistent worlds over time. We bypass loosely related branches to focus on the core: the generative heart, the interactive loop, and the memory system. We show that this is the most promising path towards true world models.
comment: Github: https://github.com/M-E-AGI-Lab/Awesome-World-Models
☆ Bayesian Jammer Localization with a Hybrid CNN and Path-Loss Mixture of Experts ICASSP
Global Navigation Satellite System (GNSS) signals are vulnerable to jamming, particularly in urban areas where multipath and shadowing distort received power. Previous data-driven approaches achieved reasonable localization but poorly reconstructed the received signal strength (RSS) field due to limited spatial context. We propose a hybrid Bayesian mixture-of-experts framework that fuses a physical path-loss (PL) model and a convolutional neural network (CNN) through log-linear pooling. The PL expert ensures physical consistency, while the CNN leverages building-height maps to capture urban propagation effects. Bayesian inference with Laplace approximation provides posterior uncertainty over both the jammer position and RSS field. Experiments on urban ray-tracing data show that localization accuracy improves and uncertainty decreases with more training points, while uncertainty concentrates near the jammer and along urban canyons where propagation is most sensitive.
comment: 5 pages, 4 figures, Submitted to ICASSPW 2026
☆ Finding the Sweet Spot: Trading Quality, Cost, and Speed During Inference-Time LLM Reflection
As Large Language Models (LLMs) continue to evolve, practitioners face increasing options for enhancing inference-time performance without model retraining, including budget tuning and multi-step techniques like self-reflection. While these methods improve output quality, they create complex trade-offs among accuracy, cost, and latency that remain poorly understood across different domains. This paper systematically compares self-reflection and budget tuning across mathematical reasoning and translation tasks. We evaluate prominent LLMs, including Anthropic Claude, Amazon Nova, and Mistral families, along with other models under varying reflection depths and compute budgets to derive Pareto optimal performance frontiers. Our analysis reveals substantial domain dependent variation in self-reflection effectiveness, with performance gains up to 220\% in mathematical reasoning. We further investigate how reflection round depth and feedback mechanism quality influence performance across model families. To validate our findings in a real-world setting, we deploy a self-reflection enhanced marketing content localisation system at Lounge by Zalando, where it shows market-dependent effectiveness, reinforcing the importance of domain specific evaluation when deploying these techniques. Our results provide actionable guidance for selecting optimal inference strategies given specific domains and resource constraints. We open source our self-reflection implementation for reproducibility at https://github.com/aws-samples/sample-genai-reflection-for-bedrock.
☆ xTime: Extreme Event Prediction with Hierarchical Knowledge Distillation and Expert Fusion
Extreme events frequently occur in real-world time series and often carry significant practical implications. In domains such as climate and healthcare, these events, such as floods, heatwaves, or acute medical episodes, can lead to serious consequences. Accurate forecasting of such events is therefore of substantial importance. Most existing time series forecasting models are optimized for overall performance within the prediction window, but often struggle to accurately predict extreme events, such as high temperatures or heart rate spikes. The main challenges are data imbalance and the neglect of valuable information contained in intermediate events that precede extreme events. In this paper, we propose xTime, a novel framework for extreme event forecasting in time series. xTime leverages knowledge distillation to transfer information from models trained on lower-rarity events, thereby improving prediction performance on rarer ones. In addition, we introduce a mixture of experts (MoE) mechanism that dynamically selects and fuses outputs from expert models across different rarity levels, which further improves the forecasting performance for extreme events. Experiments on multiple datasets show that xTime achieves consistent improvements, with forecasting accuracy on extreme events improving from 3% to 78%.
☆ Connecting Jensen-Shannon and Kullback-Leibler Divergences: A New Bound for Representation Learning NeurIPS 2025
Mutual Information (MI) is a fundamental measure of statistical dependence widely used in representation learning. While direct optimization of MI via its definition as a Kullback-Leibler divergence (KLD) is often intractable, many recent methods have instead maximized alternative dependence measures, most notably, the Jensen-Shannon divergence (JSD) between joint and product of marginal distributions via discriminative losses. However, the connection between these surrogate objectives and MI remains poorly understood. In this work, we bridge this gap by deriving a new, tight, and tractable lower bound on KLD as a function of JSD in the general case. By specializing this bound to joint and marginal distributions, we demonstrate that maximizing the JSD-based information increases a guaranteed lower bound on mutual information. Furthermore, we revisit the practical implementation of JSD-based objectives and observe that minimizing the cross-entropy loss of a binary classifier trained to distinguish joint from marginal pairs recovers a known variational lower bound on the JSD. Extensive experiments demonstrate that our lower bound is tight when applied to MI estimation. We compared our lower bound to state-of-the-art neural estimators of variational lower bound across a range of established reference scenarios. Our lower bound estimator consistently provides a stable, low-variance estimate of a tight lower bound on MI. We also demonstrate its practical usefulness in the context of the Information Bottleneck framework. Taken together, our results provide new theoretical justifications and strong empirical evidence for using discriminative learning in MI-based representation learning.
comment: Accepted at NeurIPS 2025. Code available at https://github.com/ReubenDo/JSDlowerbound/
☆ Attention Enhanced Entity Recommendation for Intelligent Monitoring in Cloud Systems
In this paper, we present DiRecGNN, an attention-enhanced entity recommendation framework for monitoring cloud services at Microsoft. We provide insights on the usefulness of this feature as perceived by the cloud service owners and lessons learned from deployment. Specifically, we introduce the problem of recommending the optimal subset of attributes (dimensions) that should be tracked by an automated watchdog (monitor) for cloud services. To begin, we construct the monitor heterogeneous graph at production-scale. The interaction dynamics of these entities are often characterized by limited structural and engagement information, resulting in inferior performance of state-of-the-art approaches. Moreover, traditional methods fail to capture the dependencies between entities spanning a long range due to their homophilic nature. Therefore, we propose an attention-enhanced entity ranking model inspired by transformer architectures. Our model utilizes a multi-head attention mechanism to focus on heterogeneous neighbors and their attributes, and further attends to paths sampled using random walks to capture long-range dependencies. We also employ multi-faceted loss functions to optimize for relevant recommendations while respecting the inherent sparsity of the data. Empirical evaluations demonstrate significant improvements over existing methods, with our model achieving a 43.1% increase in MRR. Furthermore, product teams who consumed these features perceive the feature as useful and rated it 4.5 out of 5.
☆ Large Multimodal Models-Empowered Task-Oriented Autonomous Communications: Design Methodology and Implementation Challenges
Large language models (LLMs) and large multimodal models (LMMs) have achieved unprecedented breakthrough, showcasing remarkable capabilities in natural language understanding, generation, and complex reasoning. This transformative potential has positioned them as key enablers for 6G autonomous communications among machines, vehicles, and humanoids. In this article, we provide an overview of task-oriented autonomous communications with LLMs/LMMs, focusing on multimodal sensing integration, adaptive reconfiguration, and prompt/fine-tuning strategies for wireless tasks. We demonstrate the framework through three case studies: LMM-based traffic control, LLM-based robot scheduling, and LMM-based environment-aware channel estimation. From experimental results, we show that the proposed LLM/LMM-aided autonomous systems significantly outperform conventional and discriminative deep learning (DL) model-based techniques, maintaining robustness under dynamic objectives, varying input parameters, and heterogeneous multimodal conditions where conventional static optimization degrades.
☆ Equitable Survival Prediction: A Fairness-Aware Survival Modeling (FASM) Approach
As machine learning models become increasingly integrated into healthcare, structural inequities and social biases embedded in clinical data can be perpetuated or even amplified by data-driven models. In survival analysis, censoring and time dynamics can further add complexity to fair model development. Additionally, algorithmic fairness approaches often overlook disparities in cross-group rankings, e.g., high-risk Black patients may be ranked below lower-risk White patients who do not experience the event of mortality. Such misranking can reinforce biological essentialism and undermine equitable care. We propose a Fairness-Aware Survival Modeling (FASM), designed to mitigate algorithmic bias regarding both intra-group and cross-group risk rankings over time. Using breast cancer prognosis as a representative case and applying FASM to SEER breast cancer data, we show that FASM substantially improves fairness while preserving discrimination performance comparable to fairness-unaware survival models. Time-stratified evaluations show that FASM maintains stable fairness over a 10-year horizon, with the greatest improvements observed during the mid-term of follow-up. Our approach enables the development of survival models that prioritize both accuracy and equity in clinical decision-making, advancing fairness as a core principle in clinical care.
☆ H-SPLID: HSIC-based Saliency Preserving Latent Information Decomposition NeurIPS 2025
We introduce H-SPLID, a novel algorithm for learning salient feature representations through the explicit decomposition of salient and non-salient features into separate spaces. We show that H-SPLID promotes learning low-dimensional, task-relevant features. We prove that the expected prediction deviation under input perturbations is upper-bounded by the dimension of the salient subspace and the Hilbert-Schmidt Independence Criterion (HSIC) between inputs and representations. This establishes a link between robustness and latent representation compression in terms of the dimensionality and information preserved. Empirical evaluations on image classification tasks show that models trained with H-SPLID primarily rely on salient input components, as indicated by reduced sensitivity to perturbations affecting non-salient features, such as image backgrounds. Our code is available at https://github.com/neu-spiral/H-SPLID.
comment: Accepted at NeurIPS 2025
☆ On Optimal Hyperparameters for Differentially Private Deep Transfer Learning
Differentially private (DP) transfer learning, i.e., fine-tuning a pretrained model on private data, is the current state-of-the-art approach for training large models under privacy constraints. We focus on two key hyperparameters in this setting: the clipping bound $C$ and batch size $B$. We show a clear mismatch between the current theoretical understanding of how to choose an optimal $C$ (stronger privacy requires smaller $C$) and empirical outcomes (larger $C$ performs better under strong privacy), caused by changes in the gradient distributions. Assuming a limited compute budget (fixed epochs), we demonstrate that the existing heuristics for tuning $B$ do not work, while cumulative DP noise better explains whether smaller or larger batches perform better. We also highlight how the common practice of using a single $(C,B)$ setting across tasks can lead to suboptimal performance. We find that performance drops especially when moving between loose and tight privacy and between plentiful and limited compute, which we explain by analyzing clipping as a form of gradient re-weighting and examining cumulative DP noise.
comment: 25 pages, 30 figures
☆ MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation NeurIPS 2025
Mass spectrometry (MS) plays a critical role in molecular identification, significantly advancing scientific discovery. However, structure elucidation from MS data remains challenging due to the scarcity of annotated spectra. While large-scale pretraining has proven effective in addressing data scarcity in other domains, applying this paradigm to mass spectrometry is hindered by the complexity and heterogeneity of raw spectral signals. To address this, we propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary, enabling cross-modal learning through large-scale pretraining on reliably computed fingerprint-molecule datasets. Multi-task pretraining objectives further enhance MS-BART's generalization by jointly optimizing denoising and translation task. The pretrained model is subsequently transferred to experimental spectra through finetuning on fingerprint predictions generated with MIST, a pre-trained spectral inference model, thereby enhancing robustness to real-world spectral variability. While finetuning alleviates the distributional difference, MS-BART still suffers molecular hallucination and requires further alignment. We therefore introduce a chemical feedback mechanism that guides the model toward generating molecules closer to the reference structure. Extensive evaluations demonstrate that MS-BART achieves SOTA performance across 5/12 key metrics on MassSpecGym and NPLIB1 and is faster by one order of magnitude than competing diffusion-based methods, while comprehensive ablation studies systematically validate the model's effectiveness and robustness.
comment: NeurIPS 2025, We provide the data and code at https://github.com/OpenDFM/MS-BART
☆ Black Box Absorption: LLMs Undermining Innovative Ideas
Large Language Models are increasingly adopted as critical tools for accelerating innovation. This paper identifies and formalizes a systemic risk inherent in this paradigm: \textbf{Black Box Absorption}. We define this as the process by which the opaque internal architectures of LLM platforms, often operated by large-scale service providers, can internalize, generalize, and repurpose novel concepts contributed by users during interaction. This mechanism threatens to undermine the foundational principles of innovation economics by creating severe informational and structural asymmetries between individual creators and platform operators, thereby jeopardizing the long-term sustainability of the innovation ecosystem. To analyze this challenge, we introduce two core concepts: the idea unit, representing the transportable functional logic of an innovation, and idea safety, a multidimensional standard for its protection. This paper analyzes the mechanisms of absorption and proposes a concrete governance and engineering agenda to mitigate these risks, ensuring that creator contributions remain traceable, controllable, and equitable.
☆ PSO-XAI: A PSO-Enhanced Explainable AI Framework for Reliable Breast Cancer Detection
Breast cancer is considered the most critical and frequently diagnosed cancer in women worldwide, leading to an increase in cancer-related mortality. Early and accurate detection is crucial as it can help mitigate possible threats while improving survival rates. In terms of prediction, conventional diagnostic methods are often limited by variability, cost, and, most importantly, risk of misdiagnosis. To address these challenges, machine learning (ML) has emerged as a powerful tool for computer-aided diagnosis, with feature selection playing a vital role in improving model performance and interpretability. This research study proposes an integrated framework that incorporates customized Particle Swarm Optimization (PSO) for feature selection. This framework has been evaluated on a comprehensive set of 29 different models, spanning classical classifiers, ensemble techniques, neural networks, probabilistic algorithms, and instance-based algorithms. To ensure interpretability and clinical relevance, the study uses cross-validation in conjunction with explainable AI methods. Experimental evaluation showed that the proposed approach achieved a superior score of 99.1\% across all performance metrics, including accuracy and precision, while effectively reducing dimensionality and providing transparent, model-agnostic explanations. The results highlight the potential of combining swarm intelligence with explainable ML for robust, trustworthy, and clinically meaningful breast cancer diagnosis.
☆ Practical Code RAG at Scale: Task-Aware Retrieval Design Choices under Compute Budgets
We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena -- code completion and bug localization -- we systematically compare retrieval configurations across various context window sizes along three axes: (i) chunking strategy, (ii) similarity scoring, and (iii) splitting granularity. (1) For PL-PL, sparse BM25 with word-level splitting is the most effective and practical, significantly outperforming dense alternatives while being an order of magnitude faster. (2) For NL-PL, proprietary dense encoders (Voyager-3 family) consistently beat sparse retrievers, however requiring 100x larger latency. (3) Optimal chunk size scales with available context: 32-64 line chunks work best at small budgets, and whole-file retrieval becomes competitive at 16000 tokens. (4) Simple line-based chunking matches syntax-aware splitting across budgets. (5) Retrieval latency varies by up to 200x across configurations; BPE-based splitting is needlessly slow, and BM25 + word splitting offers the best quality-latency trade-off. Thus, we provide evidence-based recommendations for implementing effective code-oriented RAG systems based on task requirements, model constraints, and computational efficiency.
☆ Convergence Analysis of SGD under Expected Smoothness AISTATS 2026
Stochastic gradient descent (SGD) is the workhorse of large-scale learning, yet classical analyses rely on assumptions that can be either too strong (bounded variance) or too coarse (uniform noise). The expected smoothness (ES) condition has emerged as a flexible alternative that ties the second moment of stochastic gradients to the objective value and the full gradient. This paper presents a self-contained convergence analysis of SGD under ES. We (i) refine ES with interpretations and sampling-dependent constants; (ii) derive bounds of the expectation of squared full gradient norm; and (iii) prove $O(1/K)$ rates with explicit residual errors for various step-size schedules. All proofs are given in full detail in the appendix. Our treatment unifies and extends recent threads (Khaled and Richt\'arik, 2020; Umeda and Iiduka, 2025).
comment: 23 pages, 11 figures, AISTATS 2026
☆ Generalizable Reasoning through Compositional Energy Minimization
Generalization is a key challenge in machine learning, specifically in reasoning tasks, where models are expected to solve problems more complex than those encountered during training. Existing approaches typically train reasoning models in an end-to-end fashion, directly mapping input instances to solutions. While this allows models to learn useful heuristics from data, it often results in limited generalization beyond the training distribution. In this work, we propose a novel approach to reasoning generalization by learning energy landscapes over the solution spaces of smaller, more tractable subproblems. At test time, we construct a global energy landscape for a given problem by combining the energy functions of multiple subproblems. This compositional approach enables the incorporation of additional constraints during inference, allowing the construction of energy landscapes for problems of increasing difficulty. To improve the sample quality from this newly constructed energy landscape, we introduce Parallel Energy Minimization (PEM). We evaluate our approach on a wide set of reasoning problems. Our method outperforms existing state-of-the-art methods, demonstrating its ability to generalize to larger and more complex problems. Project website can be found at: https://alexoarga.github.io/compositional_reasoning/
☆ Strategic Costs of Perceived Bias in Fair Selection NeurIPS 2025
Meritocratic systems, from admissions to hiring, aim to impartially reward skill and effort. Yet persistent disparities across race, gender, and class challenge this ideal. Some attribute these gaps to structural inequality; others to individual choice. We develop a game-theoretic model in which candidates from different socioeconomic groups differ in their perceived post-selection value--shaped by social context and, increasingly, by AI-powered tools offering personalized career or salary guidance. Each candidate strategically chooses effort, balancing its cost against expected reward; effort translates into observable merit, and selection is based solely on merit. We characterize the unique Nash equilibrium in the large-agent limit and derive explicit formulas showing how valuation disparities and institutional selectivity jointly determine effort, representation, social welfare, and utility. We further propose a cost-sensitive optimization framework that quantifies how modifying selectivity or perceived value can reduce disparities without compromising institutional goals. Our analysis reveals a perception-driven bias: when perceptions of post-selection value differ across groups, these differences translate into rational differences in effort, propagating disparities backward through otherwise "fair" selection processes. While the model is static, it captures one stage of a broader feedback cycle linking perceptions, incentives, and outcome--bridging rational-choice and structural explanations of inequality by showing how techno-social environments shape individual incentives in meritocratic systems.
comment: The paper has been accepted by NeurIPS 2025
☆ Diffusion Autoencoders with Perceivers for Long, Irregular and Multimodal Astronomical Sequences
Self-supervised learning has become a central strategy for representation learning, but the majority of architectures used for encoding data have only been validated on regularly-sampled inputs such as images, audios. and videos. In many scientific domains, data instead arrive as long, irregular, and multimodal sequences. To extract semantic information from these data, we introduce the Diffusion Autoencoder with Perceivers (daep). daep tokenizes heterogeneous measurements, compresses them with a Perceiver encoder, and reconstructs them with a Perceiver-IO diffusion decoder, enabling scalable learning in diverse data settings. To benchmark the daep architecture, we adapt the masked autoencoder to a Perceiver encoder/decoder design, and establish a strong baseline (maep) in the same architectural family as daep. Across diverse spectroscopic and photometric astronomical datasets, daep achieves lower reconstruction errors, produces more discriminative latent spaces, and better preserves fine-scale structure than both VAE and maep baselines. These results establish daep as an effective framework for scientific domains where data arrives as irregular, heterogeneous sequences.
☆ Embedding the MLOps Lifecycle into OT Reference Models
Machine Learning Operations (MLOps) practices are increas- ingly adopted in industrial settings, yet their integration with Opera- tional Technology (OT) systems presents significant challenges. This pa- per analyzes the fundamental obstacles in combining MLOps with OT en- vironments and proposes a systematic approach to embed MLOps prac- tices into established OT reference models. We evaluate the suitability of the Reference Architectural Model for Industry 4.0 (RAMI 4.0) and the International Society of Automation Standard 95 (ISA-95) for MLOps integration and present a detailed mapping of MLOps lifecycle compo- nents to RAMI 4.0 exemplified by a real-world use case. Our findings demonstrate that while standard MLOps practices cannot be directly transplanted to OT environments, structured adaptation using existing reference models can provide a pathway for successful integration.
☆ Structural Invariance Matters: Rethinking Graph Rewiring through Graph Metrics
Graph rewiring has emerged as a key technique to alleviate over-squashing in Graph Neural Networks (GNNs) and Graph Transformers by modifying the graph topology to improve information flow. While effective, rewiring inherently alters the graph's structure, raising the risk of distorting important topology-dependent signals. Yet, despite the growing use of rewiring, little is known about which structural properties must be preserved to ensure both performance gains and structural fidelity. In this work, we provide the first systematic analysis of how rewiring affects a range of graph structural metrics, and how these changes relate to downstream task performance. We study seven diverse rewiring strategies and correlate changes in local and global graph properties with node classification accuracy. Our results reveal a consistent pattern: successful rewiring methods tend to preserve local structure while allowing for flexibility in global connectivity. These findings offer new insights into the design of effective rewiring strategies, bridging the gap between graph theory and practical GNN optimization.
comment: 21 pages, 5 figures, conference
☆ A Unified Framework for Zero-Shot Reinforcement Learning
Zero-shot reinforcement learning (RL) has emerged as a setting for developing general agents in an unsupervised manner, capable of solving downstream tasks without additional training or planning at test-time. Unlike conventional RL, which optimizes policies for a fixed reward, zero-shot RL requires agents to encode representations rich enough to support immediate adaptation to any objective, drawing parallels to vision and language foundation models. Despite growing interest, the field lacks a common analytical lens. We present the first unified framework for zero-shot RL. Our formulation introduces a consistent notation and taxonomy that organizes existing approaches and allows direct comparison between them. Central to our framework is the classification of algorithms into two families: direct representations, which learn end-to-end mappings from rewards to policies, and compositional representations, which decompose the representation leveraging the substructure of the value function. Within this framework, we highlight shared principles and key differences across methods, and we derive an extended bound for successor-feature methods, offering a new perspective on their performance in the zero-shot regime. By consolidating existing work under a common lens, our framework provides a principled foundation for future research in zero-shot RL and outlines a clear path toward developing more general agents.
☆ SheafAlign: A Sheaf-theoretic Framework for Decentralized Multimodal Alignment
Conventional multimodal alignment methods assume mutual redundancy across all modalities, an assumption that fails in real-world distributed scenarios. We propose SheafAlign, a sheaf-theoretic framework for decentralized multimodal alignment that replaces single-space alignment with multiple comparison spaces. This approach models pairwise modality relations through sheaf structures and leverages decentralized contrastive learning-based objectives for training. SheafAlign overcomes the limitations of prior methods by not requiring mutual redundancy among all modalities, preserving both shared and unique information. Experiments on multimodal sensing datasets show superior zero-shot generalization, cross-modal alignment, and robustness to missing modalities, with 50\% lower communication cost than state-of-the-art baselines.
comment: 5 pages, 3 figures, 1 table
☆ Blur2seq: Blind Deblurring and Camera Trajectory Estimation from a Single Camera Motion-blurred Image
Motion blur caused by camera shake, particularly under large or rotational movements, remains a major challenge in image restoration. We propose a deep learning framework that jointly estimates the latent sharp image and the underlying camera motion trajectory from a single blurry image. Our method leverages the Projective Motion Blur Model (PMBM), implemented efficiently using a differentiable blur creation module compatible with modern networks. A neural network predicts a full 3D rotation trajectory, which guides a model-based restoration network trained end-to-end. This modular architecture provides interpretability by revealing the camera motion that produced the blur. Moreover, this trajectory enables the reconstruction of the sequence of sharp images that generated the observed blurry image. To further refine results, we optimize the trajectory post-inference via a reblur loss, improving consistency between the blurry input and the restored output. Extensive experiments show that our method achieves state-of-the-art performance on both synthetic and real datasets, particularly in cases with severe or spatially variant blur, where end-to-end deblurring networks struggle. Code and trained models are available at https://github.com/GuillermoCarbajal/Blur2Seq/
☆ Adversary-Aware Private Inference over Wireless Channels
AI-based sensing at wireless edge devices has the potential to significantly enhance Artificial Intelligence (AI) applications, particularly for vision and perception tasks such as in autonomous driving and environmental monitoring. AI systems rely both on efficient model learning and inference. In the inference phase, features extracted from sensing data are utilized for prediction tasks (e.g., classification or regression). In edge networks, sensors and model servers are often not co-located, which requires communication of features. As sensitive personal data can be reconstructed by an adversary, transformation of the features are required to reduce the risk of privacy violations. While differential privacy mechanisms provide a means of protecting finite datasets, protection of individual features has not been addressed. In this paper, we propose a novel framework for privacy-preserving AI-based sensing, where devices apply transformations of extracted features before transmission to a model server.
☆ Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment ICASSP 2026
Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition are costly, limited, or incomplete. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a framework that converts human preference for speech expressiveness into an objective score. Grounded in phonetics and psychology, DeEAR evaluates speech across three dimensions: Emotion, Prosody, and Spontaneity, achieving strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples. Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. It not only distinguishes expressiveness gaps across S2S models but also selects 14K expressive utterances to form ExpressiveSpeech, which improves the expressive score (from 2.0 to 23.4 on a 100-point scale) of S2S models. Demos and codes are available at https://github.com/FreedomIntelligence/ExpressiveSpeech
comment: Submitted to ICASSP 2026. Demos and codes are available at https://github.com/FreedomIntelligence/ExpressiveSpeech
☆ Hurdle-IMDL: An Imbalanced Learning Framework for Infrared Rainfall Retrieval
Artificial intelligence has advanced quantitative remote sensing, yet its effectiveness is constrained by imbalanced label distribution. This imbalance leads conventionally trained models to favor common samples, which in turn degrades retrieval performance for rare ones. Rainfall retrieval exemplifies this issue, with performance particularly compromised for heavy rain. This study proposes Hurdle-Inversion Model Debiasing Learning (IMDL) framework. Following a divide-and-conquer strategy, imbalance in the rain distribution is decomposed into two components: zero inflation, defined by the predominance of non-rain samples; and long tail, defined by the disproportionate abundance of light-rain samples relative to heavy-rain samples. A hurdle model is adopted to handle the zero inflation, while IMDL is proposed to address the long tail by transforming the learning object into an unbiased ideal inverse model. Comprehensive evaluation via statistical metrics and case studies investigating rainy weather in eastern China confirms Hurdle-IMDL's superiority over conventional, cost-sensitive, generative, and multi-task learning methods. Its key advancements include effective mitigation of systematic underestimation and a marked improvement in the retrieval of heavy-to-extreme rain. IMDL offers a generalizable approach for addressing imbalance in distributions of environmental variables, enabling enhanced retrieval of rare yet high-impact events.
comment: 26 pages
☆ Bi-CoG: Bi-Consistency-Guided Self-Training for Vision-Language Models
Exploiting unlabeled data through semi-supervised learning (SSL) or leveraging pre-trained models via fine-tuning are two prevailing paradigms for addressing label-scarce scenarios. Recently, growing attention has been given to combining fine-tuning of pre-trained vision-language models (VLMs) with SSL, forming the emerging paradigm of semi-supervised fine-tuning. However, existing methods often suffer from model bias and hyperparameter sensitivity, due to reliance on prediction consistency or pre-defined confidence thresholds. To address these limitations, we propose a simple yet effective plug-and-play methodology named $\underline{\textbf{Bi-Co}}$nsistency-$\underline{\textbf{G}}$uided Self-Training (Bi-CoG), which assigns high-quality and low-bias pseudo-labels, by simultaneously exploiting inter-model and intra-model consistency, along with an error-aware dynamic pseudo-label assignment strategy. Both theoretical analysis and extensive experiments over 14 datasets demonstrate the effectiveness of Bi-CoG, which consistently and significantly improves the performance of existing methods.
☆ Concentration and excess risk bounds for imbalanced classification with synthetic oversampling
Synthetic oversampling of minority examples using SMOTE and its variants is a leading strategy for addressing imbalanced classification problems. Despite the success of this approach in practice, its theoretical foundations remain underexplored. We develop a theoretical framework to analyze the behavior of SMOTE and related methods when classifiers are trained on synthetic data. We first derive a uniform concentration bound on the discrepancy between the empirical risk over synthetic minority samples and the population risk on the true minority distribution. We then provide a nonparametric excess risk guarantee for kernel-based classifiers trained using such synthetic data. These results lead to practical guidelines for better parameter tuning of both SMOTE and the downstream learning algorithm. Numerical experiments are provided to illustrate and support the theoretical findings
comment: Page 35, including appendix, Figures 12, including appendix
☆ Transferable Black-Box One-Shot Forging of Watermarks via Image Preference Models NeurIPS 2025
Recent years have seen a surge in interest in digital content watermarking techniques, driven by the proliferation of generative models and increased legal pressure. With an ever-growing percentage of AI-generated content available online, watermarking plays an increasingly important role in ensuring content authenticity and attribution at scale. There have been many works assessing the robustness of watermarking to removal attacks, yet, watermark forging, the scenario when a watermark is stolen from genuine content and applied to malicious content, remains underexplored. In this work, we investigate watermark forging in the context of widely used post-hoc image watermarking. Our contributions are as follows. First, we introduce a preference model to assess whether an image is watermarked. The model is trained using a ranking loss on purely procedurally generated images without any need for real watermarks. Second, we demonstrate the model's capability to remove and forge watermarks by optimizing the input image through backpropagation. This technique requires only a single watermarked image and works without knowledge of the watermarking model, making our attack much simpler and more practical than attacks introduced in related work. Third, we evaluate our proposed method on a variety of post-hoc image watermarking models, demonstrating that our approach can effectively forge watermarks, questioning the security of current watermarking approaches. Our code and further resources are publicly available.
comment: NeurIPS 2025
☆ Neural Reasoning for Robust Instance Retrieval in $\mathcal{SHOIQ}$
Concept learning exploits background knowledge in the form of description logic axioms to learn explainable classification models from knowledge bases. Despite recent breakthroughs in neuro-symbolic concept learning, most approaches still cannot be deployed on real-world knowledge bases. This is due to their use of description logic reasoners, which are not robust against inconsistencies nor erroneous data. We address this challenge by presenting a novel neural reasoner dubbed EBR. Our reasoner relies on embeddings to approximate the results of a symbolic reasoner. We show that EBR solely requires retrieving instances for atomic concepts and existential restrictions to retrieve or approximate the set of instances of any concept in the description logic $\mathcal{SHOIQ}$. In our experiments, we compare EBR with state-of-the-art reasoners. Our results suggest that EBR is robust against missing and erroneous data in contrast to existing reasoners.
comment: Accepted as a full research paper at K-CAP 2025
☆ Intransitive Player Dominance and Market Inefficiency in Tennis Forecasting: A Graph Neural Network Approach
Intransitive player dominance, where player A beats B, B beats C, but C beats A, is common in competitive tennis. Yet, there are few known attempts to incorporate it within forecasting methods. We address this problem with a graph neural network approach that explicitly models these intransitive relationships through temporal directed graphs, with players as nodes and their historical match outcomes as directed edges. We find the bookmaker Pinnacle Sports poorly handles matches with high intransitive complexity and posit that our graph-based approach is uniquely positioned to capture relational dynamics in these scenarios. When selectively betting on higher intransitivity matchups with our model (65.7% accuracy, 0.215 Brier Score), we achieve significant positive returns of 3.26% ROI with Kelly staking over 1903 bets, suggesting a market inefficiency in handling intransitive matchups that our approach successfully exploits.
comment: 39 pages, 8 figures
☆ Symbolic Regression and Differentiable Fits in Beyond the Standard Model Physics
We demonstrate the efficacy of symbolic regression (SR) to probe models of particle physics Beyond the Standard Model (BSM), by considering the so-called Constrained Minimal Supersymmetric Standard Model (CMSSM). Like many incarnations of BSM physics this model has a number (four) of arbitrary parameters, which determine the experimental signals, and cosmological observables such as the dark matter relic density. We show that analysis of the phenomenology can be greatly accelerated by using symbolic expressions derived for the observables in terms of the input parameters. Here we focus on the Higgs mass, the cold dark matter relic density, and the contribution to the anomalous magnetic moment of the muon. We find that SR can produce remarkably accurate expressions. Using them we make global fits to derive the posterior probability densities of the CMSSM input parameters which are in good agreement with those performed using conventional methods. Moreover, we demonstrate a major advantage of SR which is the ability to make fits using differentiable methods rather than sampling methods. We also compare the method with neural network (NN) regression. SR produces more globally robust results, while NNs require data that is focussed on the promising regions in order to be equally performant.
comment: 18 pages, 4 figures
☆ MolBridge: Atom-Level Joint Graph Refinement for Robust Drug-Drug Interaction Event Prediction
Drug combinations offer therapeutic benefits but also carry the risk of adverse drug-drug interactions (DDIs), especially under complex molecular structures. Accurate DDI event prediction requires capturing fine-grained inter-drug relationships, which are critical for modeling metabolic mechanisms such as enzyme-mediated competition. However, existing approaches typically rely on isolated drug representations and fail to explicitly model atom-level cross-molecular interactions, limiting their effectiveness across diverse molecular complexities and DDI type distributions. To address these limitations, we propose MolBridge, a novel atom-level joint graph refinement framework for robust DDI event prediction. MolBridge constructs a joint graph that integrates atomic structures of drug pairs, enabling direct modeling of inter-drug associations. A central challenge in such joint graph settings is the potential loss of information caused by over-smoothing when modeling long-range atomic dependencies. To overcome this, we introduce a structure consistency module that iteratively refines node features while preserving the global structural context. This joint design allows MolBridge to effectively learn both local and global interaction outperforms state-of-the-art baselines, achieving superior performance across long-tail and inductive scenarios. patterns, yielding robust representations across both frequent and rare DDI types. Extensive experiments on two benchmark datasets show that MolBridge consistently. These results demonstrate the advantages of fine-grained graph refinement in improving the accuracy, robustness, and mechanistic interpretability of DDI event prediction.This work contributes to Web Mining and Content Analysis by developing graph-based methods for mining and analyzing drug-drug interaction networks.
☆ Explainable Benchmarking through the Lense of Concept Learning
Evaluating competing systems in a comparable way, i.e., benchmarking them, is an undeniable pillar of the scientific method. However, system performance is often summarized via a small number of metrics. The analysis of the evaluation details and the derivation of insights for further development or use remains a tedious manual task with often biased results. Thus, this paper argues for a new type of benchmarking, which is dubbed explainable benchmarking. The aim of explainable benchmarking approaches is to automatically generate explanations for the performance of systems in a benchmark. We provide a first instantiation of this paradigm for knowledge-graph-based question answering systems. We compute explanations by using a novel concept learning approach developed for large knowledge graphs called PruneCEL. Our evaluation shows that PruneCEL outperforms state-of-the-art concept learners on the task of explainable benchmarking by up to 0.55 points F1 measure. A task-driven user study with 41 participants shows that in 80\% of the cases, the majority of participants can accurately predict the behavior of a system based on our explanations. Our code and data are available at https://github.com/dice-group/PruneCEL/tree/K-cap2025
comment: Accepted as full research paper at K-CAP 2025
☆ Learning Decentralized Routing Policies via Graph Attention-based Multi-Agent Reinforcement Learning in Lunar Delay-Tolerant Networks
We present a fully decentralized routing framework for multi-robot exploration missions operating under the constraints of a Lunar Delay-Tolerant Network (LDTN). In this setting, autonomous rovers must relay collected data to a lander under intermittent connectivity and unknown mobility patterns. We formulate the problem as a Partially Observable Markov Decision Problem (POMDP) and propose a Graph Attention-based Multi-Agent Reinforcement Learning (GAT-MARL) policy that performs Centralized Training, Decentralized Execution (CTDE). Our method relies only on local observations and does not require global topology updates or packet replication, unlike classical approaches such as shortest path and controlled flooding-based algorithms. Through Monte Carlo simulations in randomized exploration environments, GAT-MARL provides higher delivery rates, no duplications, and fewer packet losses, and is able to leverage short-term mobility forecasts; offering a scalable solution for future space robotic systems for planetary exploration, as demonstrated by successful generalization to larger rover teams.
☆ Partial Optimality in Cubic Correlation Clustering for General Graphs
The higher-order correlation clustering problem for a graph $G$ and costs associated with cliques of $G$ consists in finding a clustering of $G$ so as to minimize the sum of the costs of those cliques whose nodes all belong to the same cluster. To tackle this NP-hard problem in practice, local search heuristics have been proposed and studied in the context of applications. Here, we establish partial optimality conditions for cubic correlation clustering, i.e., for the special case of at most 3-cliques. We define and implement algorithms for deciding these conditions and examine their effectiveness numerically, on two data sets.
comment: 35 pages
☆ An Empirical Study of Sample Selection Strategies for Large Language Model Repair
Large language models (LLMs) are increasingly deployed in real-world systems, yet they can produce toxic or biased outputs that undermine safety and trust. Post-hoc model repair provides a practical remedy, but the high cost of parameter updates motivates selective use of repair data. Despite extensive prior work on data selection for model training, it remains unclear which sampling criteria are most effective and efficient when applied specifically to behavioral repair of large generative models. Our study presents a systematic analysis of sample prioritization strategies for LLM repair. We evaluate five representative selection methods, including random sampling, K-Center, gradient-norm-based selection(GraNd), stratified coverage (CCS), and a Semantic-Aware Prioritized Sampling (SAPS) approach we proposed. Repair effectiveness and trade-offs are assessed through toxicity reduction, perplexity on WikiText-2 and LAMBADA, and three composite metrics: the Repair Proximity Score (RPS), the Overall Performance Score (OPS), and the Repair Efficiency Score (RES). Experimental results show that SAPS achieves the best balance between detoxification, utility preservation, and efficiency, delivering comparable or superior repair outcomes with substantially less data. Random sampling remains effective for large or robust models, while high-overhead methods such as CCS and GraNd provide limited benefit. The optimal data proportion depends on model scale and repair method, indicating that sample selection should be regarded as a tunable component of repair pipelines. Overall, these findings establish selection-based repair as an efficient and scalable paradigm for maintaining LLM reliability.
☆ Learning Coupled Earth System Dynamics with GraphDOP
Interactions between different components of the Earth System (e.g. ocean, atmosphere, land and cryosphere) are a crucial driver of global weather patterns. Modern Numerical Weather Prediction (NWP) systems typically run separate models of the different components, explicitly coupled across their interfaces to additionally model exchanges between the different components. Accurately representing these coupled interactions remains a major scientific and technical challenge of weather forecasting. GraphDOP is a graph-based machine learning model that learns to forecast weather directly from raw satellite and in-situ observations, without reliance on reanalysis products or traditional physics-based NWP models. GraphDOP simultaneously embeds information from diverse observation sources spanning the full Earth system into a shared latent space. This enables predictions that implicitly capture cross-domain interactions in a single model without the need for any explicit coupling. Here we present a selection of case studies which illustrate the capability of GraphDOP to forecast events where coupled processes play a particularly key role. These include rapid sea-ice freezing in the Arctic, mixing-induced ocean surface cooling during Hurricane Ian and the severe European heat wave of 2022. The results suggest that learning directly from Earth System observations can successfully characterise and propagate cross-component interactions, offering a promising path towards physically consistent end-to-end data-driven Earth System prediction with a single model.
☆ Addressing Mark Imbalance in Integration-free Neural Marked Temporal Point Processes NeurIPS 2025
Marked Temporal Point Process (MTPP) has been well studied to model the event distribution in marked event streams, which can be used to predict the mark and arrival time of the next event. However, existing studies overlook that the distribution of event marks is highly imbalanced in many real-world applications, with some marks being frequent but others rare. The imbalance poses a significant challenge to the performance of the next event prediction, especially for events of rare marks. To address this issue, we propose a thresholding method, which learns thresholds to tune the mark probability normalized by the mark's prior probability to optimize mark prediction, rather than predicting the mark directly based on the mark probability as in existing studies. In conjunction with this method, we predict the mark first and then the time. In particular, we develop a novel neural MTPP model to support effective time sampling and estimation of mark probability without computationally expensive numerical improper integration. Extensive experiments on real-world datasets demonstrate the superior performance of our solution against various baselines for the next event mark and time prediction. The code is available at https://github.com/undes1red/IFNMTPP.
comment: NeurIPS 2025 poster
☆ Why DPO is a Misspecified Estimator and How to Fix It
Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models based on preference data, using only supervised learning instead of two-stage reinforcement learning with human feedback (RLHF). We show that DPO encodes a statistical estimation problem over reward functions induced by a parametric policy class. When the true reward function that generates preferences cannot be realized via the policy class, DPO becomes misspecified, resulting in failure modes such as preference order reversal, worsening of policy reward, and high sensitivity to the input preference data distribution. On the other hand, we study the local behavior of two-stage RLHF for a parametric class and relate it to a natural gradient step in policy space. Our fine-grained geometric characterization allows us to propose AuxDPO, which introduces additional auxiliary variables in the DPO loss function to help move towards the RLHF solution in a principled manner and mitigate the misspecification in DPO. We empirically demonstrate the superior performance of AuxDPO on didactic bandit settings as well as LLM alignment tasks.
☆ Balancing Specialization and Centralization: A Multi-Agent Reinforcement Learning Benchmark for Sequential Industrial Control
Autonomous control of multi-stage industrial processes requires both local specialization and global coordination. Reinforcement learning (RL) offers a promising approach, but its industrial adoption remains limited due to challenges such as reward design, modularity, and action space management. Many academic benchmarks differ markedly from industrial control problems, limiting their transferability to real-world applications. This study introduces an enhanced industry-inspired benchmark environment that combines tasks from two existing benchmarks, SortingEnv and ContainerGym, into a sequential recycling scenario with sorting and pressing operations. We evaluate two control strategies: a modular architecture with specialized agents and a monolithic agent governing the full system, while also analyzing the impact of action masking. Our experiments show that without action masking, agents struggle to learn effective policies, with the modular architecture performing better. When action masking is applied, both architectures improve substantially, and the performance gap narrows considerably. These results highlight the decisive role of action space constraints and suggest that the advantages of specialization diminish as action complexity is reduced. The proposed benchmark thus provides a valuable testbed for exploring practical and robust multi-agent RL solutions in industrial automation, while contributing to the ongoing debate on centralization versus specialization.
comment: Preprint (submitted version) to be presented at the 13th International Conference on Industrial Engineering and Applications (ICIEA-EU), Milan, 2026. The final Version of Record will appear in the official conference proceedings
☆ PointMapPolicy: Structured Point Cloud Processing for Multi-Modal Imitation Learning
Robotic manipulation systems benefit from complementary sensing modalities, where each provides unique environmental information. Point clouds capture detailed geometric structure, while RGB images provide rich semantic context. Current point cloud methods struggle to capture fine-grained detail, especially for complex tasks, which RGB methods lack geometric awareness, which hinders their precision and generalization. We introduce PointMapPolicy, a novel approach that conditions diffusion policies on structured grids of points without downsampling. The resulting data type makes it easier to extract shape and spatial relationships from observations, and can be transformed between reference frames. Yet due to their structure in a regular grid, we enable the use of established computer vision techniques directly to 3D data. Using xLSTM as a backbone, our model efficiently fuses the point maps with RGB data for enhanced multi-modal perception. Through extensive experiments on the RoboCasa and CALVIN benchmarks and real robot evaluations, we demonstrate that our method achieves state-of-the-art performance across diverse manipulation tasks. The overview and demos are available on our project page: https://point-map.github.io/Point-Map/
☆ Relative-Based Scaling Law for Neural Language Models
Scaling laws aim to accurately predict model performance across different scales. Existing scaling-law studies almost exclusively rely on cross-entropy as the evaluation metric. However, cross-entropy provides only a partial view of performance: it measures the absolute probability assigned to the correct token, but ignores the relative ordering between correct and incorrect tokens. Yet, relative ordering is crucial for language models, such as in greedy-sampling scenario. To address this limitation, we investigate scaling from the perspective of relative ordering. We first propose the Relative-Based Probability (RBP) metric, which quantifies the probability that the correct token is ranked among the top predictions. Building on this metric, we establish the Relative-Based Scaling Law, which characterizes how RBP improves with increasing model size. Through extensive experiments on four datasets and four model families spanning five orders of magnitude, we demonstrate the robustness and accuracy of this law. Finally, we illustrate the broad application of this law with two examples, namely providing a deeper explanation of emergence phenomena and facilitating finding fundamental theories of scaling laws. In summary, the Relative-Based Scaling Law complements the cross-entropy perspective and contributes to a more complete understanding of scaling large language models. Thus, it offers valuable insights for both practical development and theoretical exploration.
☆ Hierarchical Time Series Forecasting with Robust Reconciliation
This paper focuses on forecasting hierarchical time-series data, where each higher-level observation equals the sum of its corresponding lower-level time series. In such contexts, the forecast values should be coherent, meaning that the forecast value of each parent series exactly matches the sum of the forecast values of its child series. Existing hierarchical forecasting methods typically generate base forecasts independently for each series and then apply a reconciliation procedure to adjust them so that the resulting forecast values are coherent across the hierarchy. These methods generally derive an optimal reconciliation, using a covariance matrix of the forecast error. In practice, however, the true covariance matrix is unknown and has to be estimated from finite samples in advance. This gap between the true and estimated covariance matrix may degrade forecast performance. To address this issue, we propose a robust optimization framework for hierarchical reconciliation that accounts for uncertainty in the estimated covariance matrix. We first introduce an uncertainty set for the estimated covariance matrix and formulate a reconciliation problem that minimizes the worst-case expected squared error over this uncertainty set. We show that our problem can be cast as a semidefinite optimization problem. Numerical experiments demonstrate that the proposed robust reconciliation method achieved better forecast performance than existing hierarchical forecasting methods, which indicates the effectiveness of integrating uncertainty into the reconciliation process.
☆ Testing Most Influential Sets ICLR
Small subsets of data with disproportionate influence on model outcomes can have dramatic impacts on conclusions, with a few data points sometimes overturning key findings. While recent work has developed methods to identify these \emph{most influential sets}, no formal theory exists to determine when their influence reflects genuine problems rather than natural sampling variation. We address this gap by developing a principled framework for assessing the statistical significance of most influential sets. Our theoretical results characterize the extreme value distributions of maximal influence and enable rigorous hypothesis tests for excessive influence, replacing current ad-hoc sensitivity checks. We demonstrate the practical value of our approach through applications across economics, biology, and machine learning benchmarks.
comment: 9 pages, 1 figure, submitted to ICLR
♻ ☆ One-Step Offline Distillation of Diffusion-based Models via Koopman Modeling
Diffusion-based generative models have demonstrated exceptional performance, yet their iterative sampling procedures remain computationally expensive. A prominent strategy to mitigate this cost is distillation, with offline distillation offering particular advantages in terms of efficiency, modularity, and flexibility. In this work, we identify two key observations that motivate a principled distillation framework: (1) while diffusion models have been viewed through the lens of dynamical systems theory, powerful and underexplored tools can be further leveraged; and (2) diffusion models inherently impose structured, semantically coherent trajectories in latent space. Building on these observations, we introduce the Koopman Distillation Model (KDM), a novel offline distillation approach grounded in Koopman theory - a classical framework for representing nonlinear dynamics linearly in a transformed space. KDM encodes noisy inputs into an embedded space where a learned linear operator propagates them forward, followed by a decoder that reconstructs clean samples. This enables single-step generation while preserving semantic fidelity. We provide theoretical justification for our approach: (1) under mild assumptions, the learned diffusion dynamics admit a finite-dimensional Koopman representation; and (2) proximity in the Koopman latent space correlates with semantic similarity in the generated outputs, allowing for effective trajectory alignment. KDM achieves highly competitive performance across standard offline distillation benchmarks.
♻ ☆ DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing
Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models, Stable Diffusion, are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX's rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.
comment: Preprint
♻ ☆ Autoencoding Random Forests NeurIPS 2025
We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally represents relationships in the data. We provide exact and approximate solutions to the decoding problem via constrained optimization, split relabeling, and nearest neighbors regression. These methods effectively invert the compression pipeline, establishing a map from the embedding space back to the input space using splits learned by the ensemble's constituent trees. The resulting decoders are universally consistent under common regularity assumptions. The procedure works with supervised or unsupervised models, providing a window into conditional or joint distributions. We demonstrate various applications of this autoencoder, including powerful new tools for visualization, compression, clustering, and denoising. Experiments illustrate the ease and utility of our method in a wide range of settings, including tabular, image, and genomic data.
comment: 10 pages main text, 34 pages total (including checklist). 9 figures, 4 tables. To be published in proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Watermarking Autoregressive Image Generation NeurIPS 2025
Watermarking the outputs of generative models has emerged as a promising approach for tracking their provenance. Despite significant interest in autoregressive image generation models and their potential for misuse, no prior work has attempted to watermark their outputs at the token level. In this work, we present the first such approach by adapting language model watermarking techniques to this setting. We identify a key challenge: the lack of reverse cycle-consistency (RCC), wherein re-tokenizing generated image tokens significantly alters the token sequence, effectively erasing the watermark. To address this and to make our method robust to common image transformations, neural compression, and removal attacks, we introduce (i) a custom tokenizer-detokenizer finetuning procedure that improves RCC, and (ii) a complementary watermark synchronization layer. As our experiments demonstrate, our approach enables reliable and robust watermark detection with theoretically grounded p-values. Code and models are available at https://github.com/facebookresearch/wmar.
comment: NeurIPS 2025
♻ ☆ Learning Modular Exponentiation with Transformers NeurIPS'25
Modular exponentiation is crucial to number theory and cryptography, yet remains largely unexplored from a mechanistic interpretability standpoint. We train a 4-layer encoder-decoder Transformer model to perform this operation and investigate the emergence of numerical reasoning during training. Utilizing principled sampling strategies, PCA-based embedding analysis, and activation patching, we examine how number-theoretic properties are encoded within the model. We find that reciprocal operand training leads to strong performance gains, with sudden generalization across related moduli. These synchronized accuracy surges reflect grokking-like dynamics, suggesting the model internalizes shared arithmetic structure. We also find a subgraph consisting entirely of attention heads in the final layer sufficient to achieve full performance on the task of regular exponentiation. These results suggest that transformer models learn modular arithmetic through specialized computational circuits, paving the way for more interpretable and efficient neural approaches to modular exponentiation.
comment: Accepted at the 5th MATH-AI Workshop, NeurIPS'25
♻ ☆ Tex-ViT: A Generalizable, Robust, Texture-based dual-branch cross-attention deepfake detector
Deepfakes, which employ GAN to produce highly realistic facial modification, are widely regarded as the prevailing method. Traditional CNN have been able to identify bogus media, but they struggle to perform well on different datasets and are vulnerable to adversarial attacks due to their lack of robustness. Vision transformers have demonstrated potential in the realm of image classification problems, but they require enough training data. Motivated by these limitations, this publication introduces Tex-ViT (Texture-Vision Transformer), which enhances CNN features by combining ResNet with a vision transformer. The model combines traditional ResNet features with a texture module that operates in parallel on sections of ResNet before each down-sampling operation. The texture module then serves as an input to the dual branch of the cross-attention vision transformer. It specifically focuses on improving the global texture module, which extracts feature map correlation. Empirical analysis reveals that fake images exhibit smooth textures that do not remain consistent over long distances in manipulations. Experiments were performed on different categories of FF++, such as DF, f2f, FS, and NT, together with other types of GAN datasets in cross-domain scenarios. Furthermore, experiments also conducted on FF++, DFDCPreview, and Celeb-DF dataset underwent several post-processing situations, such as blurring, compression, and noise. The model surpassed the most advanced models in terms of generalization, achieving a 98% accuracy in cross-domain scenarios. This demonstrates its ability to learn the shared distinguishing textural characteristics in the manipulated samples. These experiments provide evidence that the proposed model is capable of being applied to various situations and is resistant to many post-processing procedures.
♻ ☆ FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts NeurIPS 2025
Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for foundation models, but it suffers from parameter interference, resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based LoRA variants show promise in mitigating intra-task correlations in single-task instruction tuning, they introduce additional router parameters and remain ineffective in multi-task model merging where inter-task interference arises. Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the up-projection matrix, and (2) an implicit router that unifies expert routing and down-projection, where a frozen sparse random projection matrix replaces the traditional dense trainable version. This design resolves the trade-off between intra-task decorrelation and computational efficiency by eliminating the need for an explicit router, while inherently mitigating inter-task interference due to the orthogonality property of random matrices. Extensive experiments across four domains -- general knowledge understanding, scientific question answering, mathematical reasoning, and code generation -- demonstrate consistent performance improvements over existing methods. Beyond empirical gains, FlyLoRA highlights how biological structures can inspire innovations in AI technologies. Code is available at https://github.com/gfyddha/FlyLoRA.
comment: NeurIPS 2025 accepted paper
♻ ☆ Position: Many generalization measures for deep learning are fragile
A wide variety of generalization measures have been applied to deep neural networks (DNNs). Although obtaining tight bounds remains challenging, such measures are often assumed to reproduce qualitative generalization trends. In this position paper, we argue that many post-mortem generalization measures -- those computed on trained networks -- are \textbf{fragile}: small training modifications that barely affect the underlying DNN can substantially change a measure's value, trend, or scaling behavior. For example, minor hyperparameter changes, such as learning rate adjustments or switching between SGD variants can reverse the slope of a learning curve in widely used generalization measures like the path norm. We also identify subtler forms of fragility. For instance, the PAC-Bayes origin measure is regarded as one of the most reliable, and is indeed less sensitive to hyperparameter tweaks than many other measures. However, it completely fails to capture differences in data complexity across learning curves. This data fragility contrasts with the function-based marginal-likelihood PAC-Bayes bound, which does capture differences in data-complexity, including scaling behavior, in learning curves, but which is not a post-mortem measure. Beyond demonstrating that many bounds -- such as path, spectral and Frobenius norms, flatness proxies, and deterministic PAC-Bayes surrogates -- are fragile, this position paper also argues that developers of new measures should explicitly audit them for fragility.
♻ ☆ Prover Agent: An Agent-Based Framework for Formal Mathematical Proofs
We present Prover Agent, a novel AI agent for automated theorem proving that integrates large language models (LLMs) with a formal proof assistant, Lean. Prover Agent coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean while also generating auxiliary lemmas. These auxiliary lemmas are not limited to subgoals in the formal proof but can also include special cases or potentially useful facts derived from the assumptions, which help in discovering a viable proof strategy. It achieves an 88.1% success rate on the MiniF2F benchmark, establishing a new state-of-the-art among methods using small language models (SLMs) with a much lower sample budget than previous approaches. We also present theoretical analyses and case studies that illustrate how these generated lemmas contribute to solving challenging problems. Our code is publicly available at: https://github.com/kAIto47802/Prover-Agent.
comment: 36 pages, 3 figures
♻ ☆ xRFM: Accurate, scalable, and interpretable feature learning models for tabular data
Inference from tabular data, collections of continuous and categorical variables organized into matrices, is a foundation for modern technology and science. Yet, in contrast to the explosive changes in the rest of AI, the best practice for these predictive tasks has been relatively unchanged and is still primarily based on variations of Gradient Boosted Decision Trees (GBDTs). Very recently, there has been renewed interest in developing state-of-the-art methods for tabular data based on recent developments in neural networks and feature learning methods. In this work, we introduce xRFM, an algorithm that combines feature learning kernel machines with a tree structure to both adapt to the local structure of the data and scale to essentially unlimited amounts of training data. We show that compared to $31$ other methods, including recently introduced tabular foundation models (TabPFNv2) and GBDTs, xRFM achieves best performance across $100$ regression datasets and is competitive to the best methods across $200$ classification datasets outperforming GBDTs. Additionally, xRFM provides interpretability natively through the Average Gradient Outer Product.
♻ ☆ CLEVER: A Curated Benchmark for Formally Verified Code Generation
We introduce ${\rm C{\small LEVER}}$, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, ${\rm C{\small LEVER}}$ avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use ${\rm C{\small LEVER}}$ to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub(https://github.com/trishullab/clever) as well as HuggingFace(https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online(https://github.com/trishullab/clever-prover).
♻ ☆ Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning NeurIPS 2025
Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as cumulative gamma-decayed future rewards, easily induces LLMs to hack steps with high rewards. To address this, we propose PURE: Process sUpervised Reinforcement lEarning. The key innovation of PURE is a min-form credit assignment that formulates the value function as the minimum of future rewards. This method significantly alleviates reward hacking by limiting the value function range and distributing advantages more reasonably. Through extensive experiments on 3 base models, we show that PRM-based approaches enabling min-form credit assignment achieve comparable reasoning performance to verifiable reward-based methods within only 30% steps. In contrast, the canonical sum-form credit assignment collapses training even at the beginning! Additionally, when we supplement PRM-based fine-tuning with just 10% verifiable rewards, we further alleviate reward hacking and produce the best fine-tuned model based on Qwen2.5-Math-7B in our experiments, achieving 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Moreover, we summarize the observed reward hacking cases and analyze the causes of training collapse. We release our code and model weights at https://github.com/CJReinforce/PURE.
comment: Accepted by NeurIPS 2025
♻ ☆ TabR1: Taming GRPO for tabular reasoning LLMs
Tabular prediction has traditionally relied on gradient-boosted decision trees and specialized deep learning models, which excel within tasks but provide limited interpretability and weak transfer across tables. Reasoning large language models (LLMs) promise cross-task adaptability with trans- parent reasoning traces, yet their potential has not been fully realized for tabular data. This paper presents TabR1, the first reasoning LLM for tabular prediction with multi-step reasoning. At its core is Permutation Relative Policy Optimization (PRPO), a simple yet efficient reinforcement learning method that encodes column-permutation invariance as a structural prior. By construct- ing multiple label-preserving permutations per sample and estimating advantages both within and across permutations, PRPO transforms sparse rewards into dense learning signals and improves generalization. With limited supervision, PRPO activates the reasoning ability of LLMs for tabular prediction, enhancing few-shot and zero-shot performance as well as interpretability. Comprehensive experiments demonstrate that TabR1 achieves performance comparable to strong baselines under full-supervision fine-tuning. In the zero-shot setting, TabR1 approaches the performance of strong baselines under the 32-shot setting. Moreover, TabR1 (8B) substantially outperforms much larger LLMs across various tasks, achieving up to 53.17% improvement over DeepSeek-R1 (685B).
♻ ☆ Sampling from multi-modal distributions with polynomial query complexity in fixed dimension via reverse diffusion
Even in low dimensions, sampling from multi-modal distributions is challenging. We provide the first sampling algorithm for a broad class of distributions -- including all Gaussian mixtures -- with a query complexity that is polynomial in the parameters governing multi-modality, assuming fixed dimension. Our sampling algorithm simulates a time-reversed diffusion process, using a self-normalized Monte Carlo estimator of the intermediate score functions. Unlike previous works, it avoids metastability, requires no prior knowledge of the mode locations, and relaxes the well-known log-smoothness assumption which excluded general Gaussian mixtures so far.
♻ ☆ On the Emergence of Linear Analogies in Word Embeddings NeurIPS 2025
Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure -- for example, $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ -- whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix $M(i,j) = P(i,j)/P(i)P(j)$, (ii) strengthens and then saturates as more eigenvectors of $M (i, j)$, which controls the dimension of the embeddings, are included, (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$, and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)-(iv). It can be viewed as giving fine-grained resolution into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and the analogy benchmark introduced by Mikolov et al.
comment: Main: 10 pages, 3 figures. Appendices: 11 pages, 7 figures. Accepted at NeurIPS 2025 as a poster
♻ ☆ Stochastic gradient descent in high dimensions for multi-spiked tensor PCA
We study the high-dimensional dynamics of online stochastic gradient descent (SGD) for the multi-spiked tensor model. This multi-index model arises from the tensor principal component analysis (PCA) problem with multiple spikes, where the goal is to estimate $r$ unknown signal vectors within the $N$-dimensional unit sphere through maximum likelihood estimation from noisy observations of a $p$-tensor. We determine the number of samples and the conditions on the signal-to-noise ratios (SNRs) required to efficiently recover the unknown spikes from natural random initializations. We show that full recovery of all spikes is possible provided a number of sample scaling as $N^{p-2}$, matching the algorithmic threshold identified in the rank-one case [Ben Arous, Gheissari, Jagannath 2020, 2021]. Our results are obtained through a detailed analysis of a low-dimensional system that describes the evolution of the correlations between the estimators and the spikes, while controlling the noise in the dynamics. We find that the spikes are recovered sequentially in a process we term "sequential elimination": once a correlation exceeds a critical threshold, all correlations sharing a row or column index become sufficiently small, allowing the next correlation to grow and become macroscopic. The order in which correlations become macroscopic depends on their initial values and the corresponding SNRs, leading to either exact recovery or recovery of a permutation of the spikes. In the matrix case, when $p=2$, if the SNRs are sufficiently separated, we achieve exact recovery of the spikes, whereas equal SNRs lead to recovery of the subspace spanned by them.
comment: 68 pages, 8 figures
♻ ☆ Superposition Yields Robust Neural Scaling NeurIPS 2025
The success of today's large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law, that loss decreases as a power law with model size, remains unclear. We propose that representation superposition, meaning that LLMs represent more features than they have dimensions, can be a key contributor to loss and cause neural scaling. Based on Anthropic's toy model, we use weight decay to control the degree of superposition, allowing us to systematically study how loss scales with model size. When superposition is weak, the loss follows a power law only if data feature frequencies are power-law distributed. In contrast, under strong superposition, the loss generically scales inversely with model dimension across a broad class of frequency distributions, due to geometric overlaps between representation vectors. We confirmed that open-sourced LLMs operate in the strong superposition regime and have loss scaling like one over the model dimension, and that the Chinchilla scaling laws are also consistent with this behavior. Our results identify representation superposition as a central driver of neural scaling laws, providing insights into questions like when neural scaling laws can be improved and when they will break down.
comment: Accepted at NeurIPS 2025
♻ ☆ Flow based approach for Dynamic Temporal Causal models with non-Gaussian or Heteroscedastic Noises
Understanding causal relationships in multivariate time series is crucial in many scenarios, such as those dealing with financial or neurological data. Many such time series exhibit multiple regimes, i.e., consecutive temporal segments with a priori unknown boundaries, with each regime having its own causal structure. Inferring causal dependencies and regime shifts is critical for analyzing the underlying processes. However, causal structure learning in this setting is challenging due to (1) non-stationarity, i.e., each regime can have its own causal graph and mixing function, and (2) complex noise distributions, which may be nonGaussian or heteroscedastic. Existing causal discovery approaches cannot address these challenges, since generally assume stationarity or Gaussian noise with constant variance. Hence, we introduce FANTOM, a unified framework for causal discovery that handles non-stationary processes along with non-Gaussian and heteroscedastic noises. FANTOM simultaneously infers the number of regimes and their corresponding indices and learns each regime's Directed Acyclic Graph. It uses a Bayesian Expectation Maximization algorithm that maximizes the evidence lower bound of the data log-likelihood. On the theoretical side, we prove, under mild assumptions, that temporal heteroscedastic causal models, introduced in FANTOM's formulation, are identifiable in both stationary and non-stationary settings. In addition, extensive experiments on synthetic and real data show that FANTOM outperforms existing methods.
♻ ☆ CTSketch: Compositional Tensor Sketching for Scalable Neurosymbolic Learning
Many computational tasks benefit from being formulated as the composition of neural networks followed by a discrete symbolic program. The goal of neurosymbolic learning is to train the neural networks using end-to-end input-output labels of the composite. We introduce CTSketch, a novel, scalable neurosymbolic learning algorithm. CTSketch uses two techniques to improve the scalability of neurosymbolic inference: decompose the symbolic program into sub-programs and summarize each sub-program with a sketched tensor. This strategy allows us to approximate the output distribution of the program with simple tensor operations over the input distributions and the sketches. We provide theoretical insight into the maximum approximation error. Furthermore, we evaluate CTSketch on benchmarks from the neurosymbolic learning literature, including some designed for evaluating scalability. Our results show that CTSketch pushes neurosymbolic learning to new scales that were previously unattainable, with neural predictors obtaining high accuracy on tasks with one thousand inputs, despite supervision only on the final output.
comment: 18 pages, 6 figures
♻ ☆ From Counterfactuals to Trees: Competitive Analysis of Model Extraction Attacks
The advent of Machine Learning as a Service (MLaaS) has heightened the trade-off between model explainability and security. In particular, explainability techniques, such as counterfactual explanations, inadvertently increase the risk of model extraction attacks, enabling unauthorized replication of proprietary models. In this paper, we formalize and characterize the risks and inherent complexity of model reconstruction, focusing on the "oracle'' queries required for faithfully inferring the underlying prediction function. We present the first formal analysis of model extraction attacks through the lens of competitive analysis, establishing a foundational framework to evaluate their efficiency. Focusing on models based on additive decision trees (e.g., decision trees, gradient boosting, and random forests), we introduce novel reconstruction algorithms that achieve provably perfect fidelity while demonstrating strong anytime performance. Our framework provides theoretical bounds on the query complexity for extracting tree-based model, offering new insights into the security vulnerabilities of their deployment.
♻ ☆ Temporal-Difference Variational Continual Learning NeurIPS 2025
Machine Learning models in real-world applications must continuously learn new tasks to adapt to shifts in the data-generating distribution. Yet, for Continual Learning (CL), models often struggle to balance learning new tasks (plasticity) with retaining previous knowledge (memory stability). Consequently, they are susceptible to Catastrophic Forgetting, which degrades performance and undermines the reliability of deployed systems. In the Bayesian CL literature, variational methods tackle this challenge by employing a learning objective that recursively updates the posterior distribution while constraining it to stay close to its previous estimate. Nonetheless, we argue that these methods may be ineffective due to compounding approximation errors over successive recursions. To mitigate this, we propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations, preventing individual errors from dominating future posterior updates and compounding over time. We reveal insightful connections between these objectives and Temporal-Difference methods, a popular learning mechanism in Reinforcement Learning and Neuroscience. Experiments on challenging CL benchmarks show that our approach effectively mitigates Catastrophic Forgetting, outperforming strong Variational CL methods.
comment: Published at NeurIPS 2025
♻ ☆ ReDit: Reward Dithering for Improved LLM Policy Optimization
DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning capabilities through its rule-based reward system. While it's a ''perfect'' reward system that effectively mitigates reward hacking, such reward functions are often discrete. Our experimental observations suggest that discrete rewards can lead to gradient anomaly, unstable optimization, and slow convergence. To address this issue, we propose ReDit (Reward Dithering), a method that dithers the discrete reward signal by adding simple random noise. With this perturbed reward, exploratory gradients are continuously provided throughout the learning process, enabling smoother gradient updates and accelerating convergence. The injected noise also introduces stochasticity into flat reward regions, encouraging the model to explore novel policies and escape local optima. Experiments across diverse tasks demonstrate the effectiveness and efficiency of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO with only approximately 10% the training steps, and furthermore, still exhibits a 4% performance improvement over vanilla GRPO when trained for a similar duration. Visualizations confirm significant mitigation of gradient issues with ReDit. Moreover, theoretical analyses are provided to further validate these advantages.
comment: 34 pages, 19 figures
♻ ☆ CALM-PDE: Continuous and Adaptive Convolutions for Latent Space Modeling of Time-dependent PDEs NeurIPS
Solving time-dependent Partial Differential Equations (PDEs) using a densely discretized spatial domain is a fundamental problem in various scientific and engineering disciplines, including modeling climate phenomena and fluid dynamics. However, performing these computations directly in the physical space often incurs significant computational costs. To address this issue, several neural surrogate models have been developed that operate in a compressed latent space to solve the PDE. While these approaches reduce computational complexity, they often use Transformer-based attention mechanisms to handle irregularly sampled domains, resulting in increased memory consumption. In contrast, convolutional neural networks allow memory-efficient encoding and decoding but are limited to regular discretizations. Motivated by these considerations, we propose CALM-PDE, a model class that efficiently solves arbitrarily discretized PDEs in a compressed latent space. We introduce a novel continuous convolution-based encoder-decoder architecture that uses an epsilon-neighborhood-constrained kernel and learns to apply the convolution operator to adaptive and optimized query points. We demonstrate the effectiveness of CALM-PDE on a diverse set of PDEs with both regularly and irregularly sampled spatial domains. CALM-PDE is competitive with or outperforms existing baseline methods while offering significant improvements in memory and inference time efficiency compared to Transformer-based methods.
comment: Accepted for publication at the 39th Conference on Neural Information Processing Systems (NeurIPS) 2025, San Diego, California, USA
♻ ☆ Lessons Learned: A Multi-Agent Framework for Code LLMs to Learn and Improve NeurIPS 2025
Recent studies show that LLMs possess different skills and specialize in different tasks. In fact, we observe that their varied performance occur in several levels of granularity. For example, in the code optimization task, code LLMs excel at different optimization categories and no one dominates others. This observation prompts the question of how one leverages multiple LLM agents to solve a coding problem without knowing their complementary strengths a priori. We argue that a team of agents can learn from each other's successes and failures so as to improve their own performance. Thus, a lesson is the knowledge produced by an agent and passed on to other agents in the collective solution process. We propose a lesson-based collaboration framework, design the lesson solicitation--banking--selection mechanism, and demonstrate that a team of small LLMs with lessons learned can outperform a much larger LLM and other multi-LLM collaboration methods.
comment: NeurIPS 2025. Code is available at https://github.com/MITIBM-FastCoder/LessonL
♻ ☆ Which Is Better For Reducing Outdated and Vulnerable Dependencies: Pinning or Floating?
Developers consistently use version constraints to specify acceptable versions of the dependencies for their project. \emph{Pinning} dependencies can reduce the likelihood of breaking changes, but comes with a cost of manually managing the replacement of outdated and vulnerable dependencies. On the other hand, \emph{floating} can be used to automatically get bug fixes and security fixes, but comes with the risk of breaking changes. Security practitioners advocate \emph{pinning} dependencies to prevent against software supply chain attacks, e.g., malicious package updates. However, since \emph{pinning} is the tightest version constraint, \emph{pinning} is the most likely to result in outdated dependencies. Nevertheless, how the likelihood of becoming outdated or vulnerable dependencies changes across version constraint types is unknown. The goal of this study is to aid developers in making an informed dependency version constraint choice by empirically evaluating the likelihood of dependencies becoming outdated or vulnerable across version constraint types at scale. In this study, we first identify the trends in dependency version constraint usage and the patterns of version constraint type changes made by developers in the npm, PyPI, and Cargo ecosystems. We then modeled the dependency state transitions using survival analysis and estimated how the likelihood of becoming outdated or vulnerable changes when using \emph{pinning} as opposed to the rest of the version constraint types. We observe that among outdated and vulnerable dependencies, the most commonly used version constraint type is \emph{floating-minor}, with \emph{pinning} being the next most common. We also find that \emph{floating-major} is the least likely to result in outdated and \emph{floating-minor} is the least likely to result in vulnerable dependencies.
comment: Accepted to ASE 2025
♻ ☆ Embedding principle of homogeneous neural network for classification problem
In this paper, we study the Karush-Kuhn-Tucker (KKT) points of the associated maximum-margin problem in homogeneous neural networks, including fully-connected and convolutional neural networks. In particular, We investigates the relationship between such KKT points across networks of different widths generated. We introduce and formalize the \textbf{KKT point embedding principle}, establishing that KKT points of a homogeneous network's max-margin problem ($P_{\Phi}$) can be embedded into the KKT points of a larger network's problem ($P_{\tilde{\Phi}}$) via specific linear isometric transformations. We rigorously prove this principle holds for neuron splitting in fully-connected networks and channel splitting in convolutional neural networks. Furthermore, we connect this static embedding to the dynamics of gradient flow training with smooth losses. We demonstrate that trajectories initiated from appropriately mapped points remain mapped throughout training and that the resulting $\omega$-limit sets of directions are correspondingly mapped, thereby preserving the alignment with KKT directions dynamically when directional convergence occurs. We conduct several experiments to justify that trajectories are preserved. Our findings offer insights into the effects of network width, parameter redundancy, and the structural connections between solutions found via optimization in homogeneous networks of varying sizes.
♻ ☆ Making Classic GNNs Strong Baselines Across Varying Homophily: A Smoothness-Generalization Perspective NeurIPS 2025
Graph Neural Networks (GNNs) have achieved great success but are often considered to be challenged by varying levels of homophily in graphs. Recent empirical studies have surprisingly shown that homophilic GNNs can perform well across datasets of different homophily levels with proper hyperparameter tuning, but the underlying theory and effective architectures remain unclear. To advance GNN universality across varying homophily, we theoretically revisit GNN message passing and uncover a novel smoothness-generalization dilemma, where increasing hops inevitably enhances smoothness at the cost of generalization. This dilemma hinders learning in higher-order homophilic neighborhoods and all heterophilic ones, where generalization is critical due to complex neighborhood class distributions that are sensitive to shifts induced by noise and sparsity. To address this, we introduce the Inceptive Graph Neural Network (IGNN) built on three simple yet effective design principles, which alleviate the dilemma by enabling distinct hop-wise generalization alongside improved overall generalization with adaptive smoothness. Benchmarking against 30 baselines demonstrates IGNN's superiority and reveals notable universality in certain homophilic GNN variants. Our code and datasets are available at https://github.com/galogm/IGNN.
comment: 36 pages. Accepted by NeurIPS 2025
♻ ☆ BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning NeurIPS 2025
Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives. We find such emergent behaviors in biological vision models via large-scale contrastive vision-language training. To achieve this, we first curate TreeOfLife-200M, comprising 214 million images of living organisms, the largest and most diverse biological organism image dataset to date. We then train BioCLIP 2 on TreeOfLife-200M to distinguish different species. Despite the narrow training objective, BioCLIP 2 yields extraordinary accuracy when applied to various biological visual tasks such as habitat classification and trait prediction. We identify emergent properties in the learned embedding space of BioCLIP 2. At the inter-species level, the embedding distribution of different species aligns closely with functional and ecological meanings (e.g., beak sizes and habitats). At the intra-species level, instead of being diminished, the intra-species variations (e.g., life stages and sexes) are preserved and better separated in subspaces orthogonal to inter-species distinctions. We provide formal proof and analyses to explain why hierarchical supervision and contrastive objectives encourage these emergent properties. Crucially, our results reveal that these properties become increasingly significant with larger-scale training data, leading to a biologically meaningful embedding space.
comment: NeurIPS 2025 Spotlight; Project page: https://imageomics.github.io/bioclip-2/
♻ ☆ Teaming LLMs to Detect and Mitigate Hallucinations NeurIPS 2025
Recent work has demonstrated state-of-the-art results in large language model (LLM) hallucination detection and mitigation through consistency-based approaches which involve aggregating multiple responses sampled from a single LLM for a given prompt. These approaches help offset limitations stemming from the imperfect data on which LLMs are trained, which includes biases and under-representation of information required at deployment time among other limitations which can lead to hallucinations. We show that extending these single-model consistency methods to combine responses from multiple LLMs with different training data, training schemes and model architectures can result in substantial further improvements in hallucination detection and mitigation capabilities beyond their single-model consistency counterparts. We evaluate this "consortium consistency" approach across many model teams from a pool of 15 LLMs and explore under what conditions it is beneficial to team together different LLMs in this manner. Further, we show that these performance improvements often come with reduced inference costs, offsetting a significant drawback with single-model consistency methods.
comment: Accepted to NeurIPS 2025 workshop on Reliable ML from Unreliable Data
♻ ☆ Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons NeurIPS 2025
Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment through the lens of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose inference-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects on model safety. Experiments on multiple prevalent LLMs demonstrate that we can consistently identify about $5\%$ safety neurons, and by only patching their activations we can restore over $90\%$ of the safety performance across various red-teaming benchmarks without influencing general ability. The finding of safety neurons also helps explain the ''alignment tax'' phenomenon by revealing that the key neurons for model safety and helpfulness significantly overlap, yet they require different activation patterns for the same neurons. Furthermore, we demonstrate an application of our findings in safeguarding LLMs by detecting unsafe outputs before generation. The source code is available at https://github.com/THU-KEG/SafetyNeuron.
comment: NeurIPS 2025
♻ ☆ Edit Flows: Flow Matching with Edit Operations
Autoregressive generative models naturally generate variable-length sequences, while non-autoregressive models struggle, often imposing rigid, token-wise structures. We propose Edit Flows, a non-autoregressive model that overcomes these limitations by defining a discrete flow over sequences through edit operations$\unicode{x2013}$insertions, deletions, and substitutions. By modeling these operations within a Continuous-time Markov Chain over the sequence space, Edit Flows enable flexible, position-relative generation that aligns more closely with the structure of sequence data. Our training method leverages an expanded state space with auxiliary variables, making the learning process efficient and tractable. Empirical results show that Edit Flows outperforms both autoregressive and mask models on image captioning and significantly outperforms the mask construction in text and code generation.
♻ ☆ Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey
Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due to their massive computational and memory demands, which conflict with the constraints of edge platforms such as on-board mobile manipulators that require real-time performance. Addressing this tension has become a central focus of recent research. In light of the growing efforts toward more efficient and scalable VLA systems, this survey provides a systematic review of approaches for improving VLA efficiency, with an emphasis on reducing latency, memory footprint, and training and inference costs. We categorize existing solutions into four dimensions: model architecture, perception feature, action generation, and training/inference strategies, summarizing representative techniques within each category. Finally, we discuss future trends and open challenges, highlighting directions for advancing efficient embodied intelligence.
♻ ☆ Streaming Federated Learning with Markovian Data
Federated learning (FL) is now recognized as a key framework for communication-efficient collaborative learning. Most theoretical and empirical studies, however, rely on the assumption that clients have access to pre-collected data sets, with limited investigation into scenarios where clients continuously collect data. In many real-world applications, particularly when data is generated by physical or biological processes, client data streams are often modeled by non-stationary Markov processes. Unlike standard i.i.d. sampling, the performance of FL with Markovian data streams remains poorly understood due to the statistical dependencies between client samples over time. In this paper, we investigate whether FL can still support collaborative learning with Markovian data streams. Specifically, we analyze the performance of Minibatch SGD, Local SGD, and a variant of Local SGD with momentum. We answer affirmatively under standard assumptions and smooth non-convex client objectives: the sample complexity is proportional to the inverse of the number of clients with a communication complexity comparable to the i.i.d. scenario. However, the sample complexity for Markovian data streams remains higher than for i.i.d. sampling.
comment: Neurips 2025 camera-ready version
♻ ☆ Residual Kolmogorov-Arnold Network for Enhanced Deep Learning
Despite their immense success, deep convolutional neural networks (CNNs) can be difficult to optimize and costly to train due to hundreds of layers within the network depth. Conventional convolutional operations are fundamentally limited by their linear nature along with fixed activations, where many layers are needed to learn meaningful patterns in data. Because of the sheer size of these networks, this approach is simply computationally inefficient, and poses overfitting or gradient explosion risks, especially in small datasets. As a result, we introduce a "plug-in" module, called Residual Kolmogorov-Arnold Network (RKAN). Our module is highly compact, so it can be easily added into any stage (level) of traditional deep networks, where it learns to integrate supportive polynomial feature transformations to existing convolutional frameworks. RKAN offers consistent improvements over baseline models in different vision tasks and widely tested benchmarks, accomplishing cutting-edge performance on them.
comment: Code is available at https://github.com/withray/residualKAN.git
♻ ☆ floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL
A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token in language models or denoising step-by-step in diffusion models. This enables models to learn complex functions in a generalizable manner. Motivated by this observation, we investigate the benefits of iterative computation for temporal difference (TD) methods in reinforcement learning (RL). Typically they represent value functions in a monolithic fashion, without iterative compute. We introduce floq (flow-matching Q-functions), an approach that parameterizes the Q-function using a velocity field and trains it using techniques from flow-matching, typically used in generative modeling. This velocity field underneath the flow is trained using a TD-learning objective, which bootstraps from values produced by a target velocity field, computed by running multiple steps of numerical integration. Crucially, floq allows for more fine-grained control and scaling of the Q-function capacity than monolithic architectures, by appropriately setting the number of integration steps. Across a suite of challenging offline RL benchmarks and online fine-tuning tasks, floq improves performance by nearly 1.8x. floq scales capacity far better than standard TD-learning architectures, highlighting the potential of iterative computation for value learning.
comment: Added new experiments, fixed typos. Code -- https://github.com/CMU-AIRe/floq
♻ ☆ Prognostic Framework for Robotic Manipulators Operating Under Dynamic Task Severities
Robotic manipulators are critical in many applications but are known to degrade over time. This degradation is influenced by the nature of the tasks performed by the robot. Tasks with higher severity, such as handling heavy payloads, can accelerate the degradation process. One way this degradation is reflected is in the position accuracy of the robot's end-effector. In this paper, we present a prognostic modeling framework that predicts a robotic manipulator's Remaining Useful Life (RUL) while accounting for the effects of task severity. Our framework represents the robot's position accuracy as a Brownian motion process with a random drift parameter that is influenced by task severity. The dynamic nature of task severity is modeled using a continuous-time Markov chain (CTMC). To evaluate RUL, we discuss two approaches -- (1) a novel closed-form expression for Remaining Lifetime Distribution (RLD), and (2) Monte Carlo simulations, commonly used in prognostics literature. Theoretical results establish the equivalence between these RUL computation approaches. We validate our framework through experiments using two distinct physics-based simulators for planar and spatial robot fleets. Our findings show that robots in both fleets experience shorter RUL when handling a higher proportion of high-severity tasks.
comment: Accepted for Publication in IEEE Transactions on Systems, Man, and Cybernetics: Systems
♻ ☆ Optimizing Time Series Forecasting Architectures: A Hierarchical Neural Architecture Search Approach
The rapid development of time series forecasting research has brought many deep learning-based modules in this field. However, despite the increasing amount of new forecasting architectures, it is still unclear if we have leveraged the full potential of these existing modules within a properly designed architecture. In this work, we propose a novel hierarchical neural architecture search approach for time series forecasting tasks. With the design of a hierarchical search space, we incorporate many architecture types designed for forecasting tasks and allow for the efficient combination of different forecasting architecture modules. Results on long-term-time-series-forecasting tasks show that our approach can search for lightweight high-performing forecasting architectures across different forecasting tasks.
♻ ☆ Deep Learning for Continuous-time Stochastic Control with Jumps
In this paper, we introduce a model-based deep-learning approach to solve finite-horizon continuous-time stochastic control problems with jumps. We iteratively train two neural networks: one to represent the optimal policy and the other to approximate the value function. Leveraging a continuous-time version of the dynamic programming principle, we derive two different training objectives based on the Hamilton-Jacobi-Bellman equation, ensuring that the networks capture the underlying stochastic dynamics. Empirical evaluations on different problems illustrate the accuracy and scalability of our approach, demonstrating its effectiveness in solving complex, high-dimensional stochastic control tasks.
♻ ☆ How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning
In the zero-shot policy transfer setting in reinforcement learning, the goal is to train an agent on a fixed set of training environments so that it can generalise to similar, but unseen, testing environments. Previous work has shown that policy distillation after training can sometimes produce a policy that outperforms the original in the testing environments. However, it is not yet entirely clear why that is, or what data should be used to distil the policy. In this paper, we prove, under certain assumptions, a generalisation bound for policy distillation after training. The theory provides two practical insights: for improved generalisation, you should 1) train an ensemble of distilled policies, and 2) distil it on as much data from the training environments as possible. We empirically verify that these insights hold in more general settings, when the assumptions required for the theory no longer hold. Finally, we demonstrate that an ensemble of policies distilled on a diverse dataset can generalise significantly better than the original agent.
♻ ☆ KOALA++: Efficient Kalman-Based Optimization of Neural Networks with Gradient-Covariance Products
We propose KOALA++, a scalable Kalman-based optimization algorithm that explicitly models structured gradient uncertainty in neural network training. Unlike second-order methods, which rely on expensive second order gradient calculation, our method directly estimates the parameter covariance matrix by recursively updating compact gradient covariance products. This design improves upon the original KOALA framework that assumed diagonal covariance by implicitly capturing richer uncertainty structure without storing the full covariance matrix and avoiding large matrix inversions. Across diverse tasks, including image classification and language modeling, KOALA++ achieves accuracy on par or better than state-of-the-art first- and second-order optimizers while maintaining the efficiency of first-order methods.
♻ ☆ Fast Inference via Hierarchical Speculative Decoding
Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality, by leveraging a small draft model to propose tokens that the larger target model verifies in parallel. In practice, however, there may exist a set of potential draft models- ranging from faster but less inaccurate, to slower yet more reliable. We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks these draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass, until finally the target model verifies tokens. We derive an expression for the expected latency of any such hierarchy and show that selecting the latency-optimal hierarchy can be done in polynomial time. Empirically, HSD gives up to 1.2x speed-up over the best single-draft baseline, demonstrating the practicality of our algorithm in reducing generation latency beyond previous techniques.
♻ ☆ Execution Guided Line-by-Line Code Generation NeurIPS 2026
We present a novel approach to neural code generation that incorporates real-time execution signals into the language model generation process. While large language models (LLMs) have demonstrated impressive code generation capabilities, they typically do not utilize execution feedback during inference, a critical signal that human programmers regularly leverage. Our method, Execution-Guided Classifier-Free Guidance (EG-CFG), dynamically incorporates execution signals as the model generates code, providing line-by-line feedback that guides the generation process toward executable solutions. EG-CFG employs a multi-stage process: first, we conduct beam search to sample candidate program completions for each line; second, we extract execution signals by executing these candidates against test cases; and finally, we incorporate these signals into the prompt during generation. By maintaining consistent signals across tokens within the same line and refreshing signals at line boundaries, our approach provides coherent guidance while preserving syntactic structure. Moreover, the method naturally supports native parallelism at the task level in which multiple agents operate in parallel, exploring diverse reasoning paths and collectively generating a broad set of candidate solutions. Our experiments across diverse coding tasks demonstrate that EG-CFG significantly improves code generation performance compared to standard approaches, achieving state-of-the-art results across various levels of complexity, from foundational problems to challenging competitive programming and data science tasks. Our code is available at: https://github.com/boazlavon/eg_cfg
comment: Accepted to NeurIPS 2026
♻ ☆ Solving 0-1 Integer Programs with Unknown Knapsack Constraints Using Membership Oracles
We consider solving a combinatorial optimization problem with unknown knapsack constraints using a membership oracle for each unknown constraint such that, given a solution, the oracle determines whether the constraint is satisfied or not with absolute certainty. The goal of the decision maker is to find the best possible solution subject to a budget on the number of oracle calls. Inspired by active learning for binary classification based on Support Vector Machines (SVMs), we devise a framework to solve the problem by learning and exploiting surrogate linear constraints. The framework includes training linear separators on the labeled points and selecting new points to be labeled, which is achieved by applying a sampling strategy and solving a 0-1 integer linear program. Following the active learning literature, a natural choice would be SVM as a linear classifier and the information-based sampling strategy known as simple margin, for each unknown constraint. We improve on both sides: we propose an alternative sampling strategy based on mixed-integer quadratic programming and a linear separation method inspired by an algorithm for convex optimization in the oracle model. We conduct experiments on classical problems and variants inspired by realistic applications to show how different linear separation methods and sampling strategies influence the quality of the results in terms of several metrics including objective value, dual bound and running time.
♻ ☆ Certified Self-Consistency: Statistical Guarantees and Test-Time Training for Reliable Reasoning in LLMs
Recent advances such as self-consistency and test-time reinforcement learning (TTRL) improve the reliability of large language models (LLMs) without additional supervision, yet their underlying mechanisms and statistical guarantees remain poorly understood. We present a unified framework for certifiable inference in LLMs, showing that majority voting provides a statistical certificate of self-consistency: under mild assumptions, the aggregated answer coincides with the mode of the model's terminal distribution with high probability. We derive finite-sample and anytime-valid concentration bounds that quantify this confidence, and introduce the Martingale Majority Certificate (MMC), a sequential stopping rule that adaptively determines when sufficient samples have been drawn. We further prove that label-free post-training methods such as TTRL implicitly sharpen the answer distribution by exponentially tilting it toward its mode, thereby reducing the number of samples required for certification. Building on this insight, we propose new post-training objectives that explicitly optimise this trade-off between sharpness and bias. Together, these results explain and connect two central test-time scaling strategies, self-consistency and TTRL, within a single statistical framework for label-free, certifiable reliability in reasoning LLMs.
♻ ☆ On the Fairness of Privacy Protection: Measuring and Mitigating the Disparity of Group Privacy Risks for Differentially Private Machine Learning
While significant progress has been made in conventional fairness-aware machine learning (ML) and differentially private ML (DPML), the fairness of privacy protection across groups remains underexplored. Existing studies have proposed methods to assess group privacy risks, but these are based on the average-case privacy risks of data records. Such approaches may underestimate the group privacy risks, thereby potentially underestimating the disparity across group privacy risks. Moreover, the current method for assessing the worst-case privacy risks of data records is time-consuming, limiting their practical applicability. To address these limitations, we introduce a novel membership inference game that can efficiently audit the approximate worst-case privacy risks of data records. Experimental results demonstrate that our method provides a more stringent measurement of group privacy risks, yielding a reliable assessment of the disparity in group privacy risks. Furthermore, to promote privacy protection fairness in DPML, we enhance the standard DP-SGD algorithm with an adaptive group-specific gradient clipping strategy, inspired by the design of canaries in differential privacy auditing studies. Extensive experiments confirm that our algorithm effectively reduces the disparity in group privacy risks, thereby enhancing the fairness of privacy protection in DPML.
♻ ☆ SMRS: advocating a unified reporting standard for surrogate models in the artificial intelligence era NeurIPS 2025
Surrogate models are widely used to approximate complex systems across science and engineering to reduce computational costs. Despite their widespread adoption, the field lacks standardisation across key stages of the modelling pipeline, including data sampling, model selection, evaluation, and downstream analysis. This fragmentation limits reproducibility and cross-domain utility -- a challenge further exacerbated by the rapid proliferation of AI-driven surrogate models. We argue for the urgent need to establish a structured reporting standard, the Surrogate Model Reporting Standard (SMRS), that systematically captures essential design and evaluation choices while remaining agnostic to implementation specifics. By promoting a standardised yet flexible framework, we aim to improve the reliability of surrogate modelling, foster interdisciplinary knowledge transfer, and, as a result, accelerate scientific progress in the AI era.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), Position Track
♻ ☆ Channel Balance Interpolation in the Lightning Network via Machine Learning
The Bitcoin Lightning Network is a Layer 2 payment protocol that addresses Bitcoin's scalability by facilitating quick and cost effective transactions through payment channels. This research explores the feasibility of using machine learning models to interpolate channel balances within the network, which can be used for optimizing the network's pathfinding algorithms. While there has been much exploration in balance probing and multipath payment protocols, predicting channel balances using solely node and channel features remains an uncharted area. This paper evaluates the performance of several machine learning models against two heuristic baselines and investigates the predictive capabilities of various features. Our model performs favorably in experimental evaluation, outperforming by 10% against an equal split baseline where both edges are assigned half of the channel capacity.
♻ ☆ Geometry Aware Operator Transformer as an Efficient and Accurate Neural Surrogate for PDEs on Arbitrary Domains
The very challenging task of learning solution operators of PDEs on arbitrary domains accurately and efficiently is of vital importance to engineering and industrial simulations. Despite the existence of many operator learning algorithms to approximate such PDEs, we find that accurate models are not necessarily computationally efficient and vice versa. We address this issue by proposing a geometry aware operator transformer (GAOT) for learning PDEs on arbitrary domains. GAOT combines novel multiscale attentional graph neural operator encoders and decoders, together with geometry embeddings and (vision) transformer processors to accurately map information about the domain and the inputs into a robust approximation of the PDE solution. Multiple innovations in the implementation of GAOT also ensure computational efficiency and scalability. We demonstrate this significant gain in both accuracy and efficiency of GAOT over several baselines on a large number of learning tasks from a diverse set of PDEs, including achieving state of the art performance on three large scale three-dimensional industrial CFD datasets.
♻ ☆ Log Neural Controlled Differential Equations: The Lie Brackets Make a Difference
The vector field of a controlled differential equation (CDE) describes the relationship between a control path and the evolution of a solution path. Neural CDEs (NCDEs) treat time series data as observations from a control path, parameterise a CDE's vector field using a neural network, and use the solution path as a continuously evolving hidden state. As their formulation makes them robust to irregular sampling rates, NCDEs are a powerful approach for modelling real-world data. Building on neural rough differential equations (NRDEs), we introduce Log-NCDEs, a novel, effective, and efficient method for training NCDEs. The core component of Log-NCDEs is the Log-ODE method, a tool from the study of rough paths for approximating a CDE's solution. Log-NCDEs are shown to outperform NCDEs, NRDEs, the linear recurrent unit, S5, and MAMBA on a range of multivariate time series datasets with up to $50{,}000$ observations.
comment: 23 pages, 5 figures
♻ ☆ Equivariance Everywhere All At Once: A Recipe for Graph Foundation Models
Graph machine learning architectures are typically tailored to specific tasks on specific datasets, which hinders their broader applicability. This has led to a new quest in graph machine learning: how to build graph foundation models capable of generalizing across arbitrary graphs and features? In this work, we present a recipe for designing graph foundation models for node-level tasks from first principles. The key ingredient underpinning our study is a systematic investigation of the symmetries that a graph foundation model must respect. In a nutshell, we argue that label permutation-equivariance alongside feature permutation-invariance are necessary in addition to the common node permutation-equivariance on each local neighborhood of the graph. To this end, we first characterize the space of linear transformations that are equivariant to permutations of nodes and labels, and invariant to permutations of features. We then prove that the resulting network is a universal approximator on multisets that respect the aforementioned symmetries. Our recipe uses such layers on the multiset of features induced by the local neighborhood of the graph to obtain a class of graph foundation models for node property prediction. We validate our approach through extensive experiments on 29 real-world node classification datasets, demonstrating both strong zero-shot empirical performance and consistent improvement as the number of training graphs increases.
♻ ☆ Continuous Uniqueness and Novelty Metrics for Generative Modeling of Inorganic Crystals NeurIPS 2025
To address pressing scientific challenges such as climate change, increasingly sophisticated generative artificial intelligence models are being developed that can efficiently sample the large chemical space of possible functional materials. These models can quickly sample new chemical compositions paired with crystal structures. They are typically evaluated using uniqueness and novelty metrics, which depend on a chosen crystal distance function. However, the most prevalent distance function has four limitations: it fails to quantify the degree of similarity between compounds, cannot distinguish compositional difference and structural difference, lacks Lipschitz continuity against shifts in atomic coordinates, and results in a uniqueness metric that is not invariant against the permutation of generated samples. In this work, we propose using two continuous distance functions to evaluate uniqueness and novelty, which theoretically overcome these limitations. Our experiments show that these distances reveal insights missed by traditional distance functions, providing a more reliable basis for evaluating and comparing generative models for inorganic crystals.
comment: 13 pages (5 pages of main text), accepted to the AI4Mat workshop at NeurIPS 2025. See https://github.com/WMD-group/xtalmet for the code. Added references and a footnote in Section 3
♻ ☆ SafeDiver: Cooperative AUV-USV Assisted Diver Communication via Multi-agent Reinforcement Learning Approach
As underwater human activities are increasing, the demand for underwater communication service presents a significant challenge. Existing underwater diver communication methods face hurdles due to inherent disadvantages and complex underwater environments. To address this issue, we propose a scheme that utilizes maritime unmanned systems to assist divers with reliable and high-speed communication. Multiple AUVs are equipped with optical and acoustic multimodal communication devices as relay nodes, providing adaptive communication services based on changes in the diver's activity area. By using a multi-agent reinforcement learning (MARL) approach to control the cooperative movement of AUVs, high-speed and reliable data transmission between divers can be achieved. At the same time, utilizing the advantages of on-demand deployment and wide coverage of unmanned surface vehicles (USVs) as surface relay nodes to coordinate and forward information from AUVs, and controlling AUVs to adaptively select relay USV nodes for data transmission, high-quality communication between divers and surface platform can be achieved. Through simulation verification, the proposed scheme can effectively achieve reliable and high-speed communication for divers.
comment: Withdrawn to reorganize and extend the current findings in a future version
♻ ☆ Adaptive PCA-Based Outlier Detection for Multi-Feature Time Series in Space Missions CCS 2025
Analyzing multi-featured time series data is critical for space missions making efficient event detection, potentially onboard, essential for automatic analysis. However, limited onboard computational resources and data downlink constraints necessitate robust methods for identifying regions of interest in real time. This work presents an adaptive outlier detection algorithm based on the reconstruction error of Principal Component Analysis (PCA) for feature reduction, designed explicitly for space mission applications. The algorithm adapts dynamically to evolving data distributions by using Incremental PCA, enabling deployment without a predefined model for all possible conditions. A pre-scaling process normalizes each feature's magnitude while preserving relative variance within feature types. We demonstrate the algorithm's effectiveness in detecting space plasma events, such as distinct space environments, dayside and nightside transients phenomena, and transition layers through NASA's MMS mission observations. Additionally, we apply the method to NASA's THEMIS data, successfully identifying a dayside transient using onboard-available measurements.
comment: Accepted to ICCS 2025
♻ ☆ Benchmarking World-Model Learning
Model-learning agents should gather information to learn world models that support many downstream tasks and inferences, such as predicting unobserved states, estimating near- and far-term consequences of actions, planning action sequences, and detecting changes in dynamics. Current methods for learning and evaluating world models diverge from this goal: training and evaluation are anchored to next-frame prediction, and success is scored by reward maximization in the same environment. We propose WorldTest, a protocol to evaluate model-learning agents that separates reward-free interaction from a scored test phase in a different but related environment. WorldTest is open-ended$\unicode{x2014}$models should support many different tasks unknown ahead of time$\unicode{x2014}$and agnostic to model representation, allowing comparison across approaches. We instantiated WorldTest with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks across three families: masked-frame prediction, planning, and predicting changes to the causal dynamics. We compared 517 human participants and three frontier models on AutumnBench. We found that humans outperform the models, and scaling compute improves performance only in some environments but not others. WorldTest provides a novel template$\unicode{x2014}$reward-free exploration, derived tests, and behavior-based scoring$\unicode{x2014}$to evaluate what agents learn about environment dynamics, and AutumnBench exposes significant headroom in world-model learning.
comment: 30 pages, 10 figures
♻ ☆ Field theory for optimal signal propagation in ResNets
Residual networks have significantly better trainability and thus performance than feed-forward networks at large depth. Introducing skip connections facilitates signal propagation to deeper layers. In addition, previous works found that adding a scaling parameter for the residual branch further improves generalization performance. While they empirically identified a particularly beneficial range of values for this scaling parameter, the associated performance improvement and its universality across network hyperparameters yet need to be understood. For feed-forward networks, finite-size theories have led to important insights with regard to signal propagation and hyperparameter tuning. We here derive a systematic finite-size field theory for residual networks to study signal propagation and its dependence on the scaling for the residual branch. We derive analytical expressions for the response function, a measure for the network's sensitivity to inputs, and show that for deep networks the empirically found values for the scaling parameter lie within the range of maximal sensitivity. Furthermore, we obtain an analytical expression for the optimal scaling parameter that depends only weakly on other network hyperparameters, such as the weight variance, thereby explaining its universality across hyperparameters. Overall, this work provides a theoretical framework to study ResNets at finite size.
comment: 25 pages, 9 figures, accepted at Physical Review E
♻ ☆ Continuous Diffusion Model for Language Modeling NeurIPS 2025
Diffusion models have emerged as a promising alternative to autoregressive models in modeling discrete categorical data. However, diffusion models that directly work on discrete data space fail to fully exploit the power of iterative refinement, as the signals are lost during transitions between discrete states. Existing continuous diffusion models for discrete data underperform compared to discrete methods, and the lack of a clear connection between the two approaches hinders the development of effective diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between the discrete diffusion and continuous flow on the statistical manifold, and building on this analogy, introduce a simple diffusion process that generalizes existing discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry, along with a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models. The code is available at https://github.com/harryjo97/RDLM.
comment: NeurIPS 2025
♻ ☆ MIRA: Medical Time Series Foundation Model for Real-World Health Data NeurIPS 2025
A unified foundation model for medical time series -- pretrained on open access and ethics board-approved medical corpora -- offers the potential to reduce annotation burdens, minimize model customization, and enable robust transfer across clinical institutions, modalities, and tasks, particularly in data-scarce or privacy-constrained environments. However, existing generalist time series foundation models struggle to handle medical time series data due to their inherent challenges, including irregular intervals, heterogeneous sampling rates, and frequent missing values. To address these challenges, we introduce MIRA, a unified foundation model specifically designed for medical time series forecasting. MIRA incorporates a Continuous-Time Rotary Positional Encoding that enables fine-grained modeling of variable time intervals, a frequency-specific mixture-of-experts layer that routes computation across latent frequency regimes to further promote temporal specialization, and a Continuous Dynamics Extrapolation Block based on Neural ODE that models the continuous trajectory of latent states, enabling accurate forecasting at arbitrary target timestamps. Pretrained on a large-scale and diverse medical corpus comprising over 454 billion time points collect from publicly available datasets, MIRA achieves reductions in forecasting errors by an average of 10% and 7% in out-of-distribution and in-distribution scenarios, respectively, when compared to other zero-shot and fine-tuned baselines. We also introduce a comprehensive benchmark spanning multiple downstream clinical tasks, establishing a foundation for future research in medical time series modeling.
comment: NeurIPS 2025 Main Conference
♻ ☆ Bayes or Heisenberg: Who(se) Rules?
Although quantum systems are generally described by quantum state vectors, we show that in certain cases their measurement processes can be reformulated as probabilistic equations expressed in terms of probabilistic state vectors. These probabilistic representations can, in turn, be approximated by the neural network dynamics of the Tensor Brain (TB) model. The Tensor Brain is a recently proposed framework for modeling perception and memory in the brain, providing a biologically inspired mechanism for efficiently integrating generated symbolic representations into reasoning processes.
♻ ☆ Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning NeurIPS 2025
Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing their accuracy. Empirical results show that Twilight can adaptively prune at most 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per token latency in long context LLM decoding.
comment: To appear on NeurIPS 2025 (spotlight)
♻ ☆ PRUNE: A Patching Based Repair Framework for Certifiable Unlearning of Neural Networks
It is often desirable to remove (a.k.a. unlearn) a specific part of the training data from a trained neural network model. A typical application scenario is to protect the data holder's right to be forgotten, which has been promoted by many recent regulation rules. Existing unlearning methods involve training alternative models with remaining data, which may be costly and challenging to verify from the data holder or a thirdparty auditor's perspective. In this work, we provide a new angle and propose a novel unlearning approach by imposing carefully crafted "patch" on the original neural network to achieve targeted "forgetting" of the requested data to delete. Specifically, inspired by the research line of neural network repair, we propose to strategically seek a lightweight minimum "patch" for unlearning a given data point with certifiable guarantee. Furthermore, to unlearn a considerable amount of data points (or an entire class), we propose to iteratively select a small subset of representative data points to unlearn, which achieves the effect of unlearning the whole set. Extensive experiments on multiple categorical datasets demonstrates our approach's effectiveness, achieving measurable unlearning while preserving the model's performance and being competitive in efficiency and memory consumption compared to various baseline methods.
♻ ☆ OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection
The growing reliance on Artificial Intelligence (AI) in critical domains such as healthcare demands robust mechanisms to ensure the trustworthiness of these systems, especially when faced with unexpected or anomalous inputs. This paper introduces the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD), a comprehensive framework for evaluating out-of-distribution (OOD) detection methods specifically in medical imaging contexts. OpenMIBOOD includes three benchmarks from diverse medical domains, encompassing 14 datasets divided into covariate-shifted in-distribution, near-OOD, and far-OOD categories. We evaluate 24 post-hoc methods across these benchmarks, providing a standardized reference to advance the development and fair comparison of OOD detection methods. Results reveal that findings from broad-scale OOD benchmarks in natural image domains do not translate to medical applications, underscoring the critical need for such benchmarks in the medical field. By mitigating the risk of exposing AI models to inputs outside their training distribution, OpenMIBOOD aims to support the advancement of reliable and trustworthy AI systems in healthcare. The repository is available at https://github.com/remic-othr/OpenMIBOOD.
comment: Updated results for NNGuide and ViM
♻ ☆ The Parameterized Complexity of Computing the VC-Dimension NeurIPS 2025
The VC-dimension is a well-studied and fundamental complexity measure of a set system (or hypergraph) that is central to many areas of machine learning. We establish several new results on the complexity of computing the VC-dimension. In particular, given a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$, we prove that the naive $2^{\mathcal{O}(|\mathcal{V}|)}$-time algorithm is asymptotically tight under the Exponential Time Hypothesis (ETH). We then prove that the problem admits a $1$-additive fixed-parameter approximation algorithm when parameterized by the maximum degree of $\mathcal{H}$ and a fixed-parameter algorithm when parameterized by its dimension, and that these are essentially the only such exploitable structural parameters. Lastly, we consider a generalization of the problem, formulated using graphs, which captures the VC-dimension of both set systems and graphs. We design a $2^{\mathcal{O}(\rm{tw}\cdot \log \rm{tw})}\cdot |V|$-time algorithm for any graph $G=(V,E)$ of treewidth $\rm{tw}$ (which, for a set system, applies to the treewidth of its incidence graph). This is in contrast with closely related problems that require a double-exponential dependency on the treewidth (assuming the ETH).
comment: To appear in the proceedings of NeurIPS 2025
♻ ☆ A Neural Difference-of-Entropies Estimator for Mutual Information
Estimating Mutual Information (MI), a key measure of dependence of random quantities without specific modelling assumptions, is a challenging problem in high dimensions. We propose a novel mutual information estimator based on parametrizing conditional densities using normalizing flows, a deep generative model that has gained popularity in recent years. This estimator leverages a block autoregressive structure to achieve improved bias-variance trade-offs on standard benchmark tasks.
comment: 23 pages, 17 figures
♻ ☆ CONFEX: Uncertainty-Aware Counterfactual Explanations with Conformal Guarantees
Counterfactual explanations (CFXs) provide human-understandable justifications for model predictions, enabling actionable recourse and enhancing interpretability. To be reliable, CFXs must avoid regions of high predictive uncertainty, where explanations may be misleading or inapplicable. However, existing methods often neglect uncertainty or lack principled mechanisms for incorporating it with formal guarantees. We propose CONFEX, a novel method for generating uncertainty-aware counterfactual explanations using Conformal Prediction (CP) and Mixed-Integer Linear Programming (MILP). CONFEX explanations are designed to provide local coverage guarantees, addressing the issue that CFX generation violates exchangeability. To do so, we develop a novel localised CP procedure that enjoys an efficient MILP encoding by leveraging an offline tree-based partitioning of the input space. This way, CONFEX generates CFXs with rigorous guarantees on both predictive uncertainty and optimality. We evaluate CONFEX against state-of-the-art methods across diverse benchmarks and metrics, demonstrating that our uncertainty-aware approach yields robust and plausible explanations.
comment: 35 pages [11+24 Appendix]. Metadata revised
♻ ☆ Multi-Agent Reinforcement Learning for Task Offloading in Wireless Edge Networks NeurIPS'25
In edge computing systems, autonomous agents must make fast local decisions while competing for shared resources. Existing MARL methods often resume to centralized critics or frequent communication, which fail under limited observability and communication constraints. We propose a decentralized framework in which each agent solves a constrained Markov decision process (CMDP), coordinating implicitly through a shared constraint vector. For the specific case of offloading, e.g., constraints prevent overloading shared server resources. Coordination constraints are updated infrequently and act as a lightweight coordination mechanism. They enable agents to align with global resource usage objectives but require little direct communication. Using safe reinforcement learning, agents learn policies that meet both local and global goals. We establish theoretical guarantees under mild assumptions and validate our approach experimentally, showing improved performance over centralized and independent baselines, especially in large-scale settings.
comment: Oral presentation at AI4NextG @ NeurIPS'25 Workshop
♻ ☆ Quantitative convergence of trained single layer neural networks to Gaussian processes NeurIPS 2025
In this paper, we study the quantitative convergence of shallow neural networks trained via gradient descent to their associated Gaussian processes in the infinite-width limit. While previous work has established qualitative convergence under broad settings, precise, finite-width estimates remain limited, particularly during training. We provide explicit upper bounds on the quadratic Wasserstein distance between the network output and its Gaussian approximation at any training time $t \ge 0$, demonstrating polynomial decay with network width. Our results quantify how architectural parameters, such as width and input dimension, influence convergence, and how training dynamics affect the approximation error.
comment: Submitted and accepted at NeurIPS 2025, main body of 10 pages, 3 figures, 28 pages of supplementary material
♻ ☆ WENDy for Nonlinear-in-Parameters ODEs
The Weak-form Estimation of Non-linear Dynamics (WENDy) framework is a recently developed approach for parameter estimation and inference of systems of ordinary differential equations (ODEs). Prior work demonstrated WENDy to be robust, computationally efficient, and accurate, but only works for ODEs which are linear-in-parameters. In this work, we derive a novel extension to accommodate systems of a more general class of ODEs that are nonlinear-in-parameters. Our new WENDy-MLE algorithm approximates a maximum likelihood estimator via local non-convex optimization methods. This is made possible by the availability of analytic expressions for the likelihood function and its first and second order derivatives. WENDy-MLE has better accuracy, a substantially larger domain of convergence, and is often faster than other weak form methods and the conventional output error least squares method. Moreover, we extend the framework to accommodate data corrupted by multiplicative log-normal noise. The WENDy.jl algorithm is efficiently implemented in Julia. In order to demonstrate the practical benefits of our approach, we present extensive numerical results comparing our method, other weak form methods, and output error least squares on a suite of benchmark systems of ODEs in terms of accuracy, precision, bias, and coverage.
♻ ☆ Train with Perturbation, Infer after Merging: A Two-Stage Framework for Continual Learning NeurIPS 2025
Continual Learning (CL) aims to enable models to continuously acquire new knowledge from a sequence of tasks with avoiding the forgetting of learned information. However, existing CL methods only rely on the parameters of the most recent task for inference, which makes them susceptible to catastrophic forgetting. Inspired by the recent success of model merging techniques, we propose \textbf{Perturb-and-Merge (P\&M)}, a novel continual learning framework that integrates model merging into the CL paradigm to mitigate forgetting. Specifically, after training on each task, P\&M constructs a new model by forming a convex combination of the previous model and the newly trained task-specific model. Through theoretical analysis, We minimize the total loss increase across all tasks and derive a closed-form solution for the merging coefficient under mild assumptions. To further improve the performance of the merged model, we observe that the degradation introduced during merging can be alleviated by a regularization term composed of the task vector and the Hessian matrix of the loss function. Interestingly, we show that this term can be efficiently approximated using second-order symmetric finite differences, and a stochastic perturbation strategy along the task vector direction is accordingly devised which incurs no additional forward or backward passes while providing an effective approximation of the regularization term. Finally, we combine P\&M with LoRA, a parameter-efficient fine-tuning method, to reduce memory overhead. Our proposed approach achieves state-of-the-art performance on several continual learning benchmark datasets. The code is available at https://github.com/qhmiao/P-M-for-Continual-Learning.
comment: Accepted by NeurIPS 2025
♻ ☆ Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Fine-tuning
Reinforcement fine-tuning (RFT) is essential for enhancing the reasoning capabilities of large language models (LLM), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, where entropy monotonically decreases, exploration vanishes, and policies converge prematurely. Existing entropy-regularized methods only partially alleviate this issue while introducing bias and instability, leaving entropy control unresolved and the connection between entropy, exploration, and performance unclear. We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with REINFORCE policy gradient on temperature-adjusted distributions and stabilizing entropy through temperature regulation. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization. Experiments demonstrate three major contributions: AEPO (1) stabilizes entropy at arbitrary target levels, effectively removing collapse in GRPO; (2) reveals a non-monotonic relation where performance first improves then declines with increasing entropy, clarifying the link between entropy, exploration, and reasoning; and (3) generalizes beyond entropy, providing a broader RFT paradigm where superior target distributions can serve as REINFORCE regularizers.
♻ ☆ VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation
In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point clouds feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6% on par with DP3 64.0% and far higher than DP 34.8%, while in real-world tasks, it reaches 87.9%, outperforming both DP3 67.5% and DP 11.2% by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.
♻ ☆ UMoE: Unifying Attention and FFN with Shared Experts NeurIPS 2025
Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending the MoE paradigm to attention layers to enhance model performance. However, existing attention-based MoE layers require specialized implementations and demonstrate suboptimal performance compared to their FFN-based counterparts. In this paper, we aim to unify MoE designs in attention and FFN layers by introducing a novel reformulation of the attention mechanism, that reveals an underlying FFN-like structure within attention modules. Our proposed architecture, UMoE, achieves superior performance through attention-based MoE layers while enabling efficient parameter sharing between FFN and attention components.
comment: NeurIPS 2025 Spotlight
♻ ☆ Distributional Adversarial Attacks and Training in Deep Hedging NeurIPS 2025
In this paper, we study the robustness of classical deep hedging strategies under distributional shifts by leveraging the concept of adversarial attacks. We first demonstrate that standard deep hedging models are highly vulnerable to small perturbations in the input distribution, resulting in significant performance degradation. Motivated by this, we propose an adversarial training framework tailored to increase the robustness of deep hedging strategies. Our approach extends pointwise adversarial attacks to the distributional setting and introduces a computationally tractable reformulation of the adversarial optimization problem over a Wasserstein ball. This enables the efficient training of hedging strategies that are resilient to distributional perturbations. Through extensive numerical experiments, we show that adversarially trained deep hedging strategies consistently outperform their classical counterparts in terms of out-of-sample performance and resilience to model misspecification. Additional results indicate that the robust strategies maintain reliable performance on real market data and remain effective during periods of market change. Our findings establish a practical and effective framework for robust deep hedging under realistic market uncertainties.
comment: Camera-ready version (accepted at NeurIPS 2025 https://neurips.cc/virtual/2025/poster/115434)
♻ ☆ Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning
Large Language Model (LLM) conditioning refers to instructing an LLM to generate content in accordance with the norms and values of a specific culture, beliefs of a particular political orientation, or any desired text-specified semantic conditioning. Unfortunately, prompt engineering does not ensure that LLMs behave in accordance with a desired conditioning due to the inductive bias of the pre-training and alignment datasets. Prior works have focused on fine-tuning LLMs by directly conditioning the LoRA weights; however, such methods introduce a large number of parameters. As a remedy, we propose Zhyper, a parameter-efficient factorized hypernetwork framework that generates context-aware LoRA adapters from textual descriptions. Experiments on multiple benchmarks show that Zhyper achieves competitive performance with up to 26x fewer parameters than the state-of-the-art baselines. Furthermore, we extend Zhyper to cultural alignment, demonstrating improved generalization to out-of-domain settings and a better capturing of fine-grained contextual values.
♻ ☆ Sample-efficient Learning of Concepts with Theoretical Guarantees: from Data to Concepts without Interventions
Machine learning is a vital part of many real-world systems, but several concerns remain about the lack of interpretability, explainability and robustness of black-box AI systems. Concept Bottleneck Models (CBM) address some of these challenges by learning interpretable concepts from high-dimensional data, e.g. images, which are used to predict labels. An important issue in CBMs are spurious correlation between concepts, which effectively lead to learning "wrong" concepts. Current mitigating strategies have strong assumptions, e.g., they assume that the concepts are statistically independent of each other, or require substantial interaction in terms of both interventions and labels provided by annotators. In this paper, we describe a framework that provides theoretical guarantees on the correctness of the learned concepts and on the number of required labels, without requiring any interventions. Our framework leverages causal representation learning (CRL) methods to learn latent causal variables from high-dimensional observations in a unsupervised way, and then learns to align these variables with interpretable concepts with few concept labels. We propose a linear and a non-parametric estimator for this mapping, providing a finite-sample high probability result in the linear case and an asymptotic consistency result for the non-parametric estimator. We evaluate our framework in synthetic and image benchmarks, showing that the learned concepts have less impurities and are often more accurate than other CBMs, even in settings with strong correlations between concepts.
comment: 58 pages, 23 figures, 12 Tables, Published
♻ ☆ The Faiss library
Vector databases typically manage large collections of embedding vectors. Currently, AI applications are growing rapidly, and so is the number of embeddings that need to be stored and indexed. The Faiss library is dedicated to vector similarity search, a core functionality of vector databases. Faiss is a toolkit of indexing methods and related primitives used to search, cluster, compress and transform vectors. This paper describes the trade-off space of vector search and the design principles of Faiss in terms of structure, approach to optimization and interfacing. We benchmark key features of the library and discuss a few selected applications to highlight its broad applicability.
♻ ☆ Sign-In to the Lottery: Reparameterizing Sparse Training From Scratch NeurIPS 2025
The performance gap between training sparse neural networks from scratch (PaI) and dense-to-sparse training presents a major roadblock for efficient deep learning. According to the Lottery Ticket Hypothesis, PaI hinges on finding a problem specific parameter initialization. As we show, to this end, determining correct parameter signs is sufficient. Yet, they remain elusive to PaI. To address this issue, we propose Sign-In, which employs a dynamic reparameterization that provably induces sign flips. Such sign flips are complementary to the ones that dense-to-sparse training can accomplish, rendering Sign-In as an orthogonal method. While our experiments and theory suggest performance improvements of PaI, they also carve out the main open challenge to close the gap between PaI and dense-to-sparse training.
comment: Accepted at NeurIPS 2025
Human-Computer Interaction 18
☆ FieldGen: From Teleoperated Pre-Manipulation Trajectories to Field-Guided Data Generation
Large-scale and diverse datasets are vital for training robust robotic manipulation policies, yet existing data collection methods struggle to balance scale, diversity, and quality. Simulation offers scalability but suffers from sim-to-real gaps, while teleoperation yields high-quality demonstrations with limited diversity and high labor cost. We introduce FieldGen, a field-guided data generation framework that enables scalable, diverse, and high-quality real-world data collection with minimal human supervision. FieldGen decomposes manipulation into two stages: a pre-manipulation phase, allowing trajectory diversity, and a fine manipulation phase requiring expert precision. Human demonstrations capture key contact and pose information, after which an attraction field automatically generates diverse trajectories converging to successful configurations. This decoupled design combines scalable trajectory diversity with precise supervision. Moreover, FieldGen-Reward augments generated data with reward annotations to further enhance policy learning. Experiments demonstrate that policies trained with FieldGen achieve higher success rates and improved stability compared to teleoperation-based baselines, while significantly reducing human effort in long-term real-world data collection. Webpage is available at https://fieldgen.github.io/.
comment: Webpage: https://fieldgen.github.io/
☆ Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations
We present Empathic Prompting, a novel framework for multimodal human-AI interaction that enriches Large Language Model (LLM) conversations with implicit non-verbal context. The system integrates a commercial facial expression recognition service to capture users' emotional cues and embeds them as contextual signals during prompting. Unlike traditional multimodal interfaces, empathic prompting requires no explicit user control; instead, it unobtrusively augments textual input with affective information for conversational and smoothness alignment. The architecture is modular and scalable, allowing integration of additional non-verbal modules. We describe the system design, implemented through a locally deployed DeepSeek instance, and report a preliminary service and usability evaluation (N=5). Results show consistent integration of non-verbal input into coherent LLM outputs, with participants highlighting conversational fluidity. Beyond this proof of concept, empathic prompting points to applications in chatbot-mediated communication, particularly in domains like healthcare or education, where users' emotional signals are critical yet often opaque in verbal exchanges.
☆ Optimizing Feature Ordering in Radar Charts for Multi-Profile Comparison
Radar charts are widely used to visualize multivariate data and compare multiple profiles across features. However, the visual clarity of radar charts can be severely compromised when feature values alternate drastically in magnitude around the circle, causing areas to collapse, which misrepresents relative differences. In the present work we introduce a permutation optimization strategy that reorders features to minimize polygon ``spikiness'' across multiple profiles simultaneously. The method is combinatorial (exhaustive search) for moderate numbers of features and uses a lexicographic minimax criterion that first considers overall smoothness (mean jump) and then the largest single jump as a tie-breaker. This preserves more global information and produces visually balanced arrangements. We discuss complexity, practical bounds, and relations to existing approaches that either change the visualization (e.g., OrigamiPlot) or learn orderings (e.g., Versatile Ordering Network). An example with two profiles and $p=6$ features (before/after ordering) illustrates the qualitative improvement. Keywords: data visualization, radar charts, combinatorial optimization, minimax optimization, feature ordering
☆ User Perceptions of Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios
Large language models (LLMs) have seen rapid adoption for tasks such as drafting emails, summarizing meetings, and answering health questions. In such uses, users may need to share private information (e.g., health records, contact details). To evaluate LLMs' ability to identify and redact such private information, prior work developed benchmarks (e.g., ConfAIde, PrivacyLens) with real-life scenarios. Using these benchmarks, researchers have found that LLMs sometimes fail to keep secrets private when responding to complex tasks (e.g., leaking employee salaries in meeting summaries). However, these evaluations rely on LLMs (proxy LLMs) to gauge compliance with privacy norms, overlooking real users' perceptions. Moreover, prior work primarily focused on the privacy-preservation quality of responses, without investigating nuanced differences in helpfulness. To understand how users perceive the privacy-preservation quality and helpfulness of LLM responses to privacy-sensitive scenarios, we conducted a user study with 94 participants using 90 scenarios from PrivacyLens. We found that, when evaluating identical responses to the same scenario, users showed low agreement with each other on the privacy-preservation quality and helpfulness of the LLM response. Further, we found high agreement among five proxy LLMs, while each individual LLM had low correlation with users' evaluations. These results indicate that the privacy and helpfulness of LLM responses are often specific to individuals, and proxy LLMs are poor estimates of how real users would perceive these responses in privacy-sensitive scenarios. Our results suggest the need to conduct user-centered studies on measuring LLMs' ability to help users while preserving privacy. Additionally, future research could investigate ways to improve the alignment between proxy LLMs and users for better estimation of users' perceived privacy and utility.
☆ Risk Psychology & Cyber-Attack Tactics
We examine whether measured cognitive processes predict cyber-attack behavior. We analyzed data that included psychometric scale responses and labeled attack behaviors from cybersecurity professionals who conducted red-team operations against a simulated enterprise network. We employed multilevel mixed-effects Poisson regression with technique counts nested within participants to test whether cognitive processes predicted technique-specific usage. The scales significantly predicted technique use, but effects varied by technique rather than operating uniformly. Neither expertise level nor experimental treatment condition significantly predicted technique patterns, indicating that cognitive processes may be stronger drivers of technique selection than training or experience. These findings demonstrate that individual cognitive differences shape cyber-attack behavior and support the development of psychology-informed defense strategies.
comment: Submitted and presented at AHFE Hawaii 2025. 2 tables, 2 figures
☆ From Far and Near: Perceptual Evaluation of Crowd Representations Across Levels of Detail
In this paper, we investigate how users perceive the visual quality of crowd character representations at different levels of detail (LoD) and viewing distances. Each representation: geometric meshes, image-based impostors, Neural Radiance Fields (NeRFs), and 3D Gaussians, exhibits distinct trade-offs between visual fidelity and computational performance. Our qualitative and quantitative results provide insights to guide the design of perceptually optimized LoD strategies for crowd rendering.
☆ Designing Intent Communication for Agent-Human Collaboration
As autonomous agents, from self-driving cars to virtual assistants, become increasingly present in everyday life, safe and effective collaboration depends on human understanding of agents' intentions. Current intent communication approaches are often rigid, agent-specific, and narrowly scoped, limiting their adaptability across tasks, environments, and user preferences. A key gap remains: existing models of what to communicate are rarely linked to systematic choices of how and when to communicate, preventing the development of generalizable, multi-modal strategies. In this paper, we introduce a multidimensional design space for intent communication structured along three dimensions: Transparency (what is communicated), Abstraction (when), and Modality (how). We apply this design space to three distinct human-agent collaboration scenarios: (a) bystander interaction, (b) cooperative tasks, and (c) shared control, demonstrating its capacity to generate adaptable, scalable, and cross-domain communication strategies. By bridging the gap between intent content and communication implementation, our design space provides a foundation for designing safer, more intuitive, and more transferable agent-human interactions.
☆ From Generation to Attribution: Music AI Agent Architectures for the Post-Streaming Era NeurIPS 2025
Generative AI is reshaping music creation, but its rapid growth exposes structural gaps in attribution, rights management, and economic models. Unlike past media shifts, from live performance to recordings, downloads, and streaming, AI transforms the entire lifecycle of music, collapsing boundaries between creation, distribution, and monetization. However, existing streaming systems, with opaque and concentrated royalty flows, are ill-equipped to handle the scale and complexity of AI-driven production. We propose a content-based Music AI Agent architecture that embeds attribution directly into the creative workflow through block-level retrieval and agentic orchestration. Designed for iterative, session-based interaction, the system organizes music into granular components (Blocks) stored in BlockDB; each use triggers an Attribution Layer event for transparent provenance and real-time settlement. This framework reframes AI from a generative tool into infrastructure for a Fair AI Media Platform. By enabling fine-grained attribution, equitable compensation, and participatory engagement, it points toward a post-streaming paradigm where music functions not as a static catalog but as a collaborative and adaptive ecosystem.
comment: Accepted to the NeurIPS 2025 AI4Music Workshop
☆ Towards AI Agents for Course Instruction in Higher Education: Early Experiences from the Field
This article presents early findings from designing, deploying and evaluating an AI-based educational agent deployed as the primary instructor in a graduate-level Cloud Computing course at IISc. We detail the design of a Large Language Model (LLM)-driven Instructor Agent, and introduce a pedagogical framework that integrates the Instructor Agent into the course workflow for actively interacting with the students for content delivery, supplemented by the human instructor to offer the course structure and undertake question--answer sessions. We also propose an analytical framework that evaluates the Agent--Student interaction transcripts using interpretable engagement metrics of topic coverage, topic depth and turn-level elaboration. We report early experiences on how students interact with the Agent to explore concepts, clarify doubts and sustain inquiry-driven dialogue during live classroom sessions. We also report preliminary analysis on our evaluation metrics applied across two successive instructional modules that reveals patterns of engagement evolution, transitioning from broad conceptual exploration to deeper, focused inquiry. These demonstrate how structured integration of conversational AI agents can foster reflective learning, offer a reproducible methodology for studying engagement in authentic classroom settings, and support scalable, high-quality higher education.
☆ "Learning Together": AI-Mediated Support for Parental Involvement in Everyday Learning
Family learning takes place in everyday routines where children and caregivers read, practice, and develop new skills together. Although AI is increasingly present in learning environments, most systems remain child-centered and overlook the collaborative, distributed nature of family education. This paper investigates how AI can mediate family collaboration by addressing tensions of coordination, uneven workloads, and parental mediation. From a formative study with families using AI in daily learning, we identified challenges in responsibility sharing and recognition of contributions. Building on these insights, we designed FamLearn, an LLM-powered prototype that distributes tasks, visualizes contributions, and provides individualized support. A one-week field study with 11 families shows how this prototype can ease caregiving burdens, foster recognition, and enrich shared learning experiences. Our findings suggest that LLMs can move beyond the role of tutor to act as family mediators - balancing responsibilities, scaffolding intergenerational participation, and strengthening the relational fabric of family learning.
☆ Race and Gender in LLM-Generated Personas: A Large-Scale Audit of 41 Occupations
Generative AI tools are increasingly used to create portrayals of people in occupations, raising concerns about how race and gender are represented. We conducted a large-scale audit of over 1.5 million occupational personas across 41 U.S. occupations, generated by four large language models with different AI safety commitments and countries of origin (U.S., China, France). Compared with Bureau of Labor Statistics data, we find two recurring patterns: systematic shifts, where some groups are consistently under- or overrepresented, and stereotype exaggeration, where existing demographic skews are amplified. On average, White (--31pp) and Black (--9pp) workers are underrepresented, while Hispanic (+17pp) and Asian (+12pp) workers are overrepresented. These distortions can be extreme: for example, across all four models, Housekeepers are portrayed as nearly 100\% Hispanic, while Black workers are erased from many occupations. For HCI, these findings show provider choice materially changes who is visible, motivating model-specific audits and accountable design practices.
☆ NeuroPilot: A Realtime Brain-Computer Interface system to enhance concentration of students in online learning
Prevalence of online learning poses a vital challenge in real-time monitoring of students' concentration. Traditional methods such as questionnaire assessments require manual interventions and webcam-based monitoring fails to provide accurate insights into learners' mental focus as they are deceived by mere screen fixation without cognitive engagement. Existing BCI-based approaches lack real-time validation and evaluation procedures. To address these limitations, a Brain-Computer Interface (BCI) system is developed using a non-invasive Electroencephalogram (EEG) headband, FocusCalm, to record brainwave activity under attentive and non-attentive states. 20 minutes of data were collected from each of 20 participants watching a pre-recorded educational video. The data validation employed a novel intra-video questionnaire assessment. Subsequently, collected signals were segmented (sliding window), filtered (butterworth bandpass), and cleaned (removal of high-amplitude and EOG artifacts such as eye blinks). Time, frequency, wavelet and statistical features have been extracted, followed by recursive feature elimination (RFE) with Support vector machines (SVMs) to classify attention and non-attention states. The leave-one-subject-out (LOSO) cross-validation accuracy has been tested to be 88.77%. The system provides feedback alerts upon non-attention state detection and keeps focus profile logs. A pilot study was conducted to evaluate the effectiveness of real-time feedback. Five participants completed a 10-minute session consisting of a 5-minute baseline phase without feedback followed by a 5-minute feedback phase, during which alerts were issued if participants remained non-attentive for approximately 8 consecutive seconds. A paired t-test (t = 5.73, p = 0.007) indicated a statistically significant improvement in concentration during the feedback phase.
comment: 11 pages, 5 figures and 3 tables
☆ Digital Permission Structures: How Celebrity Disclosure Enables Black Masculine Vulnerability in Online Mental Health Discourse
Black men face a double barrier to mental health help-seeking: traditional masculinity norms demanding emotional restrictiveness and systemic racism fostering institutional mistrust. While celebrity mental health disclosures show promise for stigma reduction, limited research examines their impact on Black masculine communities through digital platforms. This convergent mixed-methods study analysed 11,306 YouTube comments following rapper Lil Wayne's unprecedented disclosure of childhood suicide attempt and lifelong mental health struggles. Quantitative analysis using VADER sentiment classification, Latent Dirichlet Allocation topic modelling, and NRC emotion lexicon analysis revealed predominantly positive sentiment with systematic community amplification of mental health discourse. Reflexive thematic analysis of 2,100 high-engagement comments identified eight themes, with peer support achieving the highest saturation, contradicting isolation narratives. Findings support a Digital Permission Structures Model demonstrating how intersectional celebrity status (race + gender + high-status), hip-hop authenticity values, and digital platform affordances create triadic authorisation mechanisms enabling vulnerability expression. Community responses revealed communal masculinity rooted in Ubuntu philosophy and active reconstruction of masculine norms, positioning help-seeking as strength. Results challenge deficit-based models of Black masculinity, suggesting interventions should leverage collectivism, partner with high-status cultural figures, employ strength-based messaging, and centre hip-hop authenticity rather than imposing Western individualistic frameworks. This study provides evidence-based strategies for culturally responsive mental health interventions addressing persistent disparities in Black men's service utilisation.
♻ ☆ S-DAT: A Multilingual, GenAI-Driven Framework for Automated Divergent Thinking Assessment
This paper introduces S-DAT (Synthetic-Divergent Association Task), a scalable, multilingual framework for automated assessment of divergent thinking (DT) -a core component of human creativity. Traditional creativity assessments are often labor-intensive, language-specific, and reliant on subjective human ratings, limiting their scalability and cross-cultural applicability. In contrast, S-DAT leverages large language models and advanced multilingual embeddings to compute semantic distance -- a language-agnostic proxy for DT. We evaluate S-DAT across eleven diverse languages, including English, Spanish, German, Russian, Hindi, and Japanese (Kanji, Hiragana, Katakana), demonstrating robust and consistent scoring across linguistic contexts. Unlike prior DAT approaches, the S-DAT shows convergent validity with other DT measures and correct discriminant validity with convergent thinking. This cross-linguistic flexibility allows for more inclusive, global-scale creativity research, addressing key limitations of earlier approaches. S-DAT provides a powerful tool for fairer, more comprehensive evaluation of cognitive flexibility in diverse populations and can be freely assessed online: https://sdat.iol.zib.de/.
♻ ☆ Toward Metaphor-Fluid Conversation Design for Voice User Interfaces
Metaphors play a critical role in shaping user experiences with Voice User Interfaces (VUIs), yet existing designs often rely on static, human-centric metaphors that fail to adapt to diverse contexts and user needs. This paper introduces Metaphor-Fluid Design, a novel approach that dynamically adjusts metaphorical representations based on conversational use-contexts. We compare this approach to a Default VUI, which characterizes the present implementation of commercial VUIs commonly designed around the persona of an assistant, offering a uniform interaction style across contexts. In Study 1 (N=130), metaphors were mapped to four key use-contexts-commands, information seeking, sociality, and error recovery-along the dimensions of formality and hierarchy, revealing distinct preferences for task-specific metaphorical designs. Study 2 (N=91) evaluates a Metaphor-Fluid VUI against a Default VUI, showing that the Metaphor-Fluid VUI enhances perceived intention to adopt, enjoyment, and likability by aligning better with user expectations for different contexts. However, individual differences in metaphor preferences highlight the need for personalization. These findings challenge the one-size-fits-all paradigm of VUI design and demonstrate the potential of Metaphor-Fluid Design to create more adaptive and engaging human-AI interactions.
♻ ☆ Learning To Defer To A Population With Limited Demonstrations
This paper addresses the critical data scarcity that hinders the practical deployment of learning to defer (L2D) systems to the population. We introduce a context-aware, semi-supervised framework that uses meta-learning to generate expert-specific embeddings from only a few demonstrations. We demonstrate the efficacy of a dual-purpose mechanism, where these embeddings are used first to generate a large corpus of pseudo-labels for training, and subsequently to enable on-the-fly adaptation to new experts at test-time. The experiment results on three different datasets confirm that a model trained on these synthetic labels rapidly approaches oracle-level performance, validating the data efficiency of our approach. By resolving a key training bottleneck, this work makes adaptive L2D systems more practical and scalable, paving the way for human-AI collaboration in real-world environments. To facilitate reproducibility and address implementation details not covered in the main text, we provide our source code and training configurations at https://github.com/nil123532/learning-to-defer-to-a-population-with-limited-demonstrations.
comment: Accepted to IEEE DICTA 2025 (poster). 7 pages, 2 figures
♻ ☆ Beyond Accuracy: Rethinking Hallucination and Regulatory Response in Generative AI
Hallucination in generative AI is often treated as a technical failure to produce factually correct output. Yet this framing underrepresents the broader significance of hallucinated content in language models, which may appear fluent, persuasive, and contextually appropriate while conveying distortions that escape conventional accuracy checks. This paper critically examines how regulatory and evaluation frameworks have inherited a narrow view of hallucination, one that prioritises surface verifiability over deeper questions of meaning, influence, and impact. We propose a layered approach to understanding hallucination risks, encompassing epistemic instability, user misdirection, and social-scale effects. Drawing on interdisciplinary sources and examining instruments such as the EU AI Act and the GDPR, we show that current governance models struggle to address hallucination when it manifests as ambiguity, bias reinforcement, or normative convergence. Rather than improving factual precision alone, we argue for regulatory responses that account for languages generative nature, the asymmetries between system and user, and the shifting boundaries between information, persuasion, and harm.
♻ ☆ AI Realtor: Towards Grounded Persuasive Language Generation for Automated Copywriting
This paper develops an agentic framework that employs large language models (LLMs) for grounded persuasive language generation in automated copywriting, with real estate marketing as a focal application. Our method is designed to align the generated content with user preferences while highlighting useful factual attributes. This agent consists of three key modules: (1) Grounding Module, mimicking expert human behavior to predict marketable features; (2) Personalization Module, aligning content with user preferences; (3) Marketing Module, ensuring factual accuracy and the inclusion of localized features. We conduct systematic human-subject experiments in the domain of real estate marketing, with a focus group of potential house buyers. The results demonstrate that marketing descriptions generated by our approach are preferred over those written by human experts by a clear margin while maintaining the same level of factual accuracy. Our findings suggest a promising agentic approach to automate large-scale targeted copywriting while ensuring factuality of content generation.
comment: V2: Add more human verification to ensure safety and examine potential hallucination. Significant reframing for the general audience. Website: https://yangalan123.github.io/ai-realtor/. Codebase: https://github.com/yangalan123/AI-Realtor-Codebase. Data released at Huggingface Hub (Sigma-Lab/AI_Realtor_xxx)
Software Engineering 25
☆ Learning to Triage Taint Flows Reported by Dynamic Program Analysis in Node.js Packages
Program analysis tools often produce large volumes of candidate vulnerability reports that require costly manual review, creating a practical challenge: how can security analysts prioritize the reports most likely to be true vulnerabilities? This paper investigates whether machine learning can be applied to prioritizing vulnerabilities reported by program analysis tools. We focus on Node.js packages and collect a benchmark of 1,883 Node.js packages, each containing one reported ACE or ACI vulnerability. We evaluate a variety of machine learning approaches, including classical models, graph neural networks (GNNs), large language models (LLMs), and hybrid models that combine GNN and LLMs, trained on data based on a dynamic program analysis tool's output. The top LLM achieves $F_{1} {=} 0.915$, while the best GNN and classical ML models reaching $F_{1} {=} 0.904$. At a less than 7% false-negative rate, the leading model eliminates 66.9% of benign packages from manual review, taking around 60 ms per package. If the best model is tuned to operate at a precision level of 0.8 (i.e., allowing 20% false positives amongst all warnings), our approach can detect 99.2% of exploitable taint flows while missing only 0.8%, demonstrating strong potential for real-world vulnerability triage.
☆ Exploring Large Language Models for Access Control Policy Synthesis and Summarization
Cloud computing is ubiquitous, with a growing number of services being hosted on the cloud every day. Typical cloud compute systems allow administrators to write policies implementing access control rules which specify how access to private data is governed. These policies must be manually written, and due to their complexity can often be error prone. Moreover, existing policies often implement complex access control specifications and thus can be difficult to precisely analyze in determining their behavior works exactly as intended. Recently, Large Language Models (LLMs) have shown great success in automated code synthesis and summarization. Given this success, they could potentially be used for automatically generating access control policies or aid in understanding existing policies. In this paper, we explore the effectiveness of LLMs for access control policy synthesis and summarization. Specifically, we first investigate diverse LLMs for access control policy synthesis, finding that: although LLMs can effectively generate syntactically correct policies, they have permissiveness issues, generating policies equivalent to the given specification 45.8% of the time for non-reasoning LLMs, and 93.7% of the time for reasoning LLMs. We then investigate how LLMs can be used to analyze policies by introducing a novel semantic-based request summarization approach which leverages LLMs to generate a precise characterization of the requests allowed by a policy. Our results show that while there are significant hurdles in leveraging LLMs for automated policy generation, LLMs show promising results when combined with symbolic approaches in analyzing existing policies.
comment: 20 pages, 7 figures
☆ A Soundness and Precision Benchmark for Java Debloating Tools SC
Modern software development reuses code by importing libraries as dependencies. Software projects typically include an average of 36 dependencies, with 80% being transitive, meaning they are dependencies of dependencies. Recent research indicates that only 24.9% of these dependencies are required at runtime, and even within those, many program constructs remain unused, adding unnecessary code to the project. This has led to the development of debloating tools that remove unnecessary dependencies and program constructs while balancing precision by eliminating unused constructs and soundness by preserving all required constructs. To systematically evaluate this trade-off, we developed Deblometer, a micro-benchmark consisting of 59 test cases designed to assess support for various Java language features in debloating tools. Each test case includes a manually curated ground truth specifying necessary and bloated classes, methods, and fields, enabling precise measurement of soundness and precision. Using Deblometer, we evaluated three popular Java debloating tools: Deptrim, JShrink, and ProGuard. Our evaluation reveals that all tools remove required program constructs, which results in changed semantics or execution crashes. In particular, the dynamic class loading feature introduces unsoundness in all evaluated tools. Our comparison shows that Deptrim retains more bloated constructs, while ProGuard removes more required constructs. JShrink's soundness is significantly affected by limited support for annotations, which leads to corrupted debloated artifacts. These soundness issues highlight the need to improve debloating tools to ensure stable and reliable debloated software.
comment: Preprint - accepted at the ACM Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses (SCORED '25)
☆ Large Language Models for Fault Localization: An Empirical Study
Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, particularly in automated program repair. However, the effectiveness of such repairs is highly dependent on the performance of upstream fault localization, for which comprehensive evaluations are currently lacking. This paper presents a systematic empirical study on LLMs in the statement-level code fault localization task. We evaluate representative open-source models (Qwen2.5-coder-32b-instruct, DeepSeek-V3) and closed-source models (GPT-4.1 mini, Gemini-2.5-flash) to assess their fault localization capabilities on the HumanEval-Java and Defects4J datasets. The study investigates the impact of different prompting strategies--including standard prompts, few-shot examples, and chain-of-reasoning--on model performance, with a focus on analysis across accuracy, time efficiency, and economic cost dimensions. Our experimental results show that incorporating bug report context significantly enhances model performance. Few-shot learning shows potential for improvement but exhibits noticeable diminishing marginal returns, while chain-of-thought reasoning's effectiveness is highly contingent on the model's inherent reasoning capabilities. This study not only highlights the performance characteristics and trade-offs of different models in fault localization tasks, but also offers valuable insights into the strengths of current LLMs and strategies for improving fault localization effectiveness.
comment: in Chinese language
☆ Toward Practical Deductive Verification: Insights from a Qualitative Survey in Industry and Academia
Deductive verification is an effective method to ensure that a given system exposes the intended behavior. In spite of its proven usefulness and feasibility in selected projects, deductive verification is still not a mainstream technique. To pave the way to widespread use, we present a study investigating the factors enabling successful applications of deductive verification and the underlying issues preventing broader adoption. We conducted semi-structured interviews with 30 practitioners of verification from both industry and academia and systematically analyzed the collected data employing a thematic analysis approach. Beside empirically confirming familiar challenges, e.g., the high level of expertise needed for conducting formal proofs, our data reveal several underexplored obstacles, such as proof maintenance, insufficient control over automation, and usability concerns. We further use the results from our data analysis to extract enablers and barriers for deductive verification and formulate concrete recommendations for practitioners, tool builders, and researchers, including principles for usability, automation, and integration with existing workflows.
☆ FMI-Based Distributed Co-Simulation with Enhanced Security and Intellectual Property Safeguards
Distributed co-simulation plays a key role in enabling collaborative modeling and simulation by different stakeholders while protecting their Intellectual Property (IP). Although IP protection is provided implicitly by co-simulation, there is no consensus in the guidelines to conduct distributed co-simulation of continuous-time or hybrid systems with no exposure to potential hacking attacks. We propose an approach for distributed co-simulation on top of UniFMU with enhanced cybersecurity and IP protection mechanisms, ensuring that the connection is initiated by the client and the models and binaries live on trusted platforms. We showcase the functionality of this approach using two co-simulation demos in four different network settings and analyze the trade-off between IP-protected distribution and performance efficiency in these settings.
comment: 6 pages, Proceedings of the 2025 Annual Modeling and Simulation Conference (ANNSIM)
☆ Symmetry in Software Platforms as an Architectural Principle
Software platforms often act as structure preserving systems. They provide consistent interfaces and behaviors that remain stable under specific transformations that we denote as symmetries. This paper explores the idea that architectural robustness emerges from enforcing such structural regularities
comment: Working paper, 11 pages
☆ Classport: Designing Runtime Dependency Introspection for Java
Runtime introspection of dependencies, i.e., the ability to observe which dependencies are currently used during program execution, is fundamental for Software Supply Chain security. Yet, Java has no support for it. We solve this problem with Classport, a system that embeds dependency information into Java class files, enabling the retrieval of dependency information at runtime. We evaluate Classport on six real-world projects, demonstrating the feasibility in identifying dependencies at runtime. Runtime dependency introspection with Classport opens important avenues for runtime integrity checking.
☆ Automated Cloud Infrastructure-as-Code Reconciliation with AI Agents
Cloud infrastructure is managed through a mix of interfaces -- traditionally, cloud consoles, command-line interfaces (CLI), and SDKs are the tools of choice. Recently, Infrastructure-as-Code/IaC frameworks (e.g., Terraform) have quickly gained popularity. Unlike conventional tools, IaC~frameworks encode the infrastructure in a "source-of-truth" configuration. They are capable of automatically carrying out modifications to the cloud -- deploying, updating, or destroying resources -- to bring the actual infrastructure into alignment with the IaC configuration. However, when IaC is used alongside consoles, CLIs, or SDKs, it loses visibility into external changes, causing infrastructure drift, where the configuration becomes outdated, and later IaC operations may undo valid updates or trigger errors. We present NSync, an automated system for IaC reconciliation that propagates out-of-band changes back into the IaC program. Our key insight is that infrastructure changes eventually all occur via cloud API invocations -- the lowest layer for cloud management operations. NSync gleans insights from API traces to detect drift (i.e., non-IaC changes) and reconcile it (i.e., update the IaC configuration to capture the changes). It employs an agentic architecture that leverages LLMs to infer high-level intents from noisy API sequences, synthesize targeted IaC updates using specialized tools, and continually improve through a self-evolving knowledge base of past reconciliations. We further introduce a novel evaluation pipeline for injecting realistic drifts into cloud infrastructure and assessing reconciliation performance. Experiments across five real-world Terraform projects and 372 drift scenarios show that NSync outperforms the baseline both in terms of accuracy (from 0.71 to 0.97 pass@3) and token efficiency (1.47$\times$ improvement).
☆ Developing a Model-Driven Reengineering Approach for Migrating PL/SQL Triggers to Java: A Practical Experience
Model-driven software engineering (MDE) techniques are not only useful in forward engineering scenarios, but can also be successfully applied to evolve existing systems. RAD (Rapid Application Development) platforms emerged in the nineties, but the success of modern software technologies motivated that a large number of enterprises tackled the migration of their RAD applications, such as Oracle Forms. Our research group has collaborated with a software company in developing a solution to migrate PL/SQL monolithic code on Forms triggers and program units to Java code separated in several tiers. Our research focused on the model-driven reengineering process applied to develop the migration tool for the conversion of PL/SQL code to Java. Legacy code is represented in form of KDM (Knowledge-Discovery Metamodel) models. In this paper, we propose a software process to implement a model-driven re-engineering. This process integrates a TDD-like approach to incrementally develop model transformations with three kinds of validations for the generated code. The implementation and validation of the re-engineering approach are explained in detail, as well as the evaluation of some issues related with the application of MDE.
comment: 31 pages, 22 figures
☆ AgentArcEval: An Architecture Evaluation Method for Foundation Model based Agents
The emergence of foundation models (FMs) has enabled the development of highly capable and autonomous agents, unlocking new application opportunities across a wide range of domains. Evaluating the architecture of agents is particularly important as the architectural decisions significantly impact the quality attributes of agents given their unique characteristics, including compound architecture, autonomous and non-deterministic behaviour, and continuous evolution. However, these traditional methods fall short in addressing the evaluation needs of agent architecture due to the unique characteristics of these agents. Therefore, in this paper, we present AgentArcEval, a novel agent architecture evaluation method designed specially to address the complexities of FM-based agent architecture and its evaluation. Moreover, we present a catalogue of agent-specific general scenarios, which serves as a guide for generating concrete scenarios to design and evaluate the agent architecture. We demonstrate the usefulness of AgentArcEval and the catalogue through a case study on the architecture evaluation of a real-world tax copilot, named Luna.
☆ Prefetching Cache Optimization Using Graph Neural Networks: A Modular Framework and Conceptual Analysis
Caching and prefetching techniques are fundamental to modern computing, serving to bridge the growing performance gap between processors and memory. Traditional prefetching strategies are often limited by their reliance on predefined heuristics or simplified statistical models, which fail to capture the complex, non-linear dependencies in modern data access patterns. This paper introduces a modular framework leveraging Graph Neural Networks (GNNs) to model and predict access patterns within graph-structured data, focusing on web navigation and hierarchical file systems. The toolchain consists of: a route mapper for extracting structural information, a graph constructor for creating graph representations, a walk session generator for simulating user behaviors, and a gnn prefetch module for training and inference. We provide a detailed conceptual analysis showing how GNN-based approaches can outperform conventional methods by learning intricate dependencies. This work offers both theoretical foundations and a practical, replicable pipeline for future research in graph-driven systems optimization.
☆ AI-Driven Development of a Publishing Imprint: Xynapse Traces
Xynapse Traces is an experimental publishing imprint created via a fusion of human and algorithmic methods using a configuration-driven architecture and a multi-model AI integration framework. The system achieved a remarkable 90% reduction in time-to-market (from a typical 6-12 months to just 2-4 weeks), with 80% cost reduction compared to traditional imprint development, while publishing 52 books in its first year and maintaining exceptional quality metrics, including 99% citation accuracy and 100% validation success after initial corrections. Key technical innovations include a continuous ideation pipeline with tournament-style evaluation, a novel codex design for transcriptive meditation practice, comprehensive automation spanning from ideation through production and distribution, and publisher personas that define and guide the imprint's mission. The system also integrates automated verification with human oversight, ensuring that gains in speed do not compromise publishing standards. This effort has significant implications for the future of book publishing, suggesting new paradigms for human-AI collaboration that democratize access to sophisticated publishing capabilities and make previously unviable niche markets accessible.
♻ ☆ CLEVER: A Curated Benchmark for Formally Verified Code Generation
We introduce ${\rm C{\small LEVER}}$, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, ${\rm C{\small LEVER}}$ avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use ${\rm C{\small LEVER}}$ to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub(https://github.com/trishullab/clever) as well as HuggingFace(https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online(https://github.com/trishullab/clever-prover).
♻ ☆ How Quickly Do Development Teams Update Their Vulnerable Dependencies?
Industry practitioners are increasingly concerned with software that contains vulnerable versions of third-party dependencies that are included both directly and transitively. To address this problem, projects are encouraged to both (a)~quickly update to non-vulnerable versions of dependencies and (b)~be mindful of the update practices of the dependencies they choose to use. To this end, researchers have proposed metrics to measure the responsiveness of the development teams of the packages in keeping their dependencies updated: Mean-Time-To-Update (MTTU) and Mean-Time-To-Remediate (MTTR). While MTTU covers all dependencies, MTTR quantifies the time needed for a package to update its vulnerable dependencies. However, existing metrics fail to capture important nuances, such as considering floating versions and prioritizing recent updates, leading to inaccurate reflections of a development team's update practices. \textit{The goal of this study is to aid practitioners in understanding how quickly packages update their dependencies.} We propose two novel metrics, Mean-Time-To-Update for dependencies (MTTU) and Mean-Time-To-Remediate for vulnerable dependencies (MTTR), that overcome the limitations of existing metrics. We conduct an empirical study using $163,207$ packages in npm ($117,129$), PyPI ($42,777$), and Cargo ($3,301$) and characterize how the ecosystems differ in MTTU and MTTR, as well as what package characteristics influence MTTU and MTTR. We found that most packages have a relatively fast dependency update practice. We further study whether MTTU can be used as a proxy for MTTR when sufficient vulnerability data is not available. As we did not find enough statistical evidence for a strong proxy, our findings suggest that MTTU could only be partially used (may be used but with caution) as a proxy for MTTR when vulnerability data is not available.
comment: under review
♻ ☆ Lessons Learned: A Multi-Agent Framework for Code LLMs to Learn and Improve NeurIPS 2025
Recent studies show that LLMs possess different skills and specialize in different tasks. In fact, we observe that their varied performance occur in several levels of granularity. For example, in the code optimization task, code LLMs excel at different optimization categories and no one dominates others. This observation prompts the question of how one leverages multiple LLM agents to solve a coding problem without knowing their complementary strengths a priori. We argue that a team of agents can learn from each other's successes and failures so as to improve their own performance. Thus, a lesson is the knowledge produced by an agent and passed on to other agents in the collective solution process. We propose a lesson-based collaboration framework, design the lesson solicitation--banking--selection mechanism, and demonstrate that a team of small LLMs with lessons learned can outperform a much larger LLM and other multi-LLM collaboration methods.
comment: NeurIPS 2025. Code is available at https://github.com/MITIBM-FastCoder/LessonL
♻ ☆ Which Is Better For Reducing Outdated and Vulnerable Dependencies: Pinning or Floating?
Developers consistently use version constraints to specify acceptable versions of the dependencies for their project. \emph{Pinning} dependencies can reduce the likelihood of breaking changes, but comes with a cost of manually managing the replacement of outdated and vulnerable dependencies. On the other hand, \emph{floating} can be used to automatically get bug fixes and security fixes, but comes with the risk of breaking changes. Security practitioners advocate \emph{pinning} dependencies to prevent against software supply chain attacks, e.g., malicious package updates. However, since \emph{pinning} is the tightest version constraint, \emph{pinning} is the most likely to result in outdated dependencies. Nevertheless, how the likelihood of becoming outdated or vulnerable dependencies changes across version constraint types is unknown. The goal of this study is to aid developers in making an informed dependency version constraint choice by empirically evaluating the likelihood of dependencies becoming outdated or vulnerable across version constraint types at scale. In this study, we first identify the trends in dependency version constraint usage and the patterns of version constraint type changes made by developers in the npm, PyPI, and Cargo ecosystems. We then modeled the dependency state transitions using survival analysis and estimated how the likelihood of becoming outdated or vulnerable changes when using \emph{pinning} as opposed to the rest of the version constraint types. We observe that among outdated and vulnerable dependencies, the most commonly used version constraint type is \emph{floating-minor}, with \emph{pinning} being the next most common. We also find that \emph{floating-major} is the least likely to result in outdated and \emph{floating-minor} is the least likely to result in vulnerable dependencies.
comment: Accepted to ASE 2025
♻ ☆ Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects
Code review is essential for maintaining software quality but often time-consuming and cognitively demanding, especially in industrial environments. Recent advancements in language models (LMs) have opened new avenues for automating core review tasks. This study presents the empirical evaluation of monolingual fine-tuning on the performance of open-source LMs across three key automated code review tasks: Code Change Quality Estimation, Review Comment Generation, and Code Refinement. We fine-tuned three distinct models, CodeReviewer, CodeLlama-7B, and DeepSeek-R1-Distill, on a C\# specific dataset combining public benchmarks with industrial repositories. Our study investigates how different configurations of programming languages and natural languages in the training data affect LM performance, particularly in comment generation. Additionally, we benchmark the fine-tuned models against an automated software analysis tool (ASAT) and human reviewers to evaluate their practical utility in real-world settings. Our results show that monolingual fine-tuning improves model accuracy and relevance compared to multilingual baselines. While LMs can effectively support code review workflows, especially for routine or repetitive tasks, human reviewers remain superior in handling semantically complex or context-sensitive changes. Our findings highlight the importance of language alignment and task-specific adaptation in optimizing LMs for automated code review.
♻ ☆ The Faiss library
Vector databases typically manage large collections of embedding vectors. Currently, AI applications are growing rapidly, and so is the number of embeddings that need to be stored and indexed. The Faiss library is dedicated to vector similarity search, a core functionality of vector databases. Faiss is a toolkit of indexing methods and related primitives used to search, cluster, compress and transform vectors. This paper describes the trade-off space of vector search and the design principles of Faiss in terms of structure, approach to optimization and interfacing. We benchmark key features of the library and discuss a few selected applications to highlight its broad applicability.
♻ ☆ Synthesizing Efficient and Permissive Programmatic Runtime Shields for Neural Policies
With the increasing use of neural policies in control systems, ensuring their safety and reliability has become a critical software engineering task. One prevalent approach to ensuring the safety of neural policies is to deploy programmatic runtime shields alongside them to correct their unsafe commands. However, the programmatic runtime shields synthesized by existing methods are either computationally expensive or insufficiently permissive, resulting in high overhead and unnecessary interventions on the system. To address these challenges, we propose Aegis, a novel framework that synthesizes lightweight and permissive programmatic runtime shields for neural policies. Aegis achieves this by formulating the seeking of a runtime shield as a sketch-based program synthesis problem and proposing a novel method that leverages counterexample-guided inductive synthesis and Bayesian optimization to solve it. To evaluate Aegis and its synthesized shields, we use eight representative control systems and compare Aegis with the current state-of-the-art. Our results show that the programmatic runtime shields synthesized by Aegis can correct all unsafe commands from neural policies, ensuring that the systems do not violate any desired safety properties at all times. Compared to the current state-of-the-art, Aegis's shields exhibit a 2.2$\times$ reduction in time overhead and a 3.9$\times$ reduction in memory usage, suggesting that they are much more lightweight. Moreover, Aegis's shields incur an average of 1.5$\times$ fewer interventions than other shields, showing better permissiveness.
comment: Accepted by ACM Transactions on Software Engineering and Methodology (TOSEM), 39 pages
♻ ☆ CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects
Automated code review (CR) is a key application for Large Language Models (LLMs), but progress is hampered by a "reality gap": existing benchmarks evaluate models on isolated sub-tasks using simplified, context-poor data. This fails to reflect the holistic context-rich nature of real-world CR. To bridge this gap, we introduce CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level CR evaluation. CodeFuse-CR-Bench comprises 601 high-quality instances from 70 Python projects covering nine Pull-Request (PR) problem domains, where each instance provides rich, multi-faceted context including the associated issue, PR details, and repository state, enabling end-to-end evaluation. Beyond superficial metrics, we also propose a novel evaluation framework that combines rule-based checks for location and syntax with model-based judgments of review quality. We present the first large-scale assessment of state-of-the-art LLMs on this comprehensive CR task. Our results establish crucial baselines and reveal that (1) no single LLM dominates all aspects of CR; (2) Gemini 2.5 Pro achieves the highest comprehensive performance; and (3) different LLMs exhibit varying robustness to redundant context. These findings highlight the necessity of holistic, multi-dimensional evaluation and provide actionable insights for advancing truly intelligent yet practical CR assistants.
♻ ☆ A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System
The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems. However, systematic progress is hindered by a lack of comprehensive understanding of how benchmarks and solutions interconnect. This survey addresses this gap by providing the first holistic analysis of LLM-powered software engineering, offering insights into evaluation methodologies and solution paradigms. We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair. Our analysis highlights the evolution from simple prompt engineering to sophisticated agentic systems incorporating capabilities like planning, reasoning, memory mechanisms, and tool augmentation. To contextualize this progress, we present a unified pipeline illustrating the workflow from task specification to deliverables, detailing how different solution paradigms address various complexity levels. Unlike prior surveys that focus narrowly on specific aspects, this work connects 50+ benchmarks to their corresponding solution strategies, enabling researchers to identify optimal approaches for diverse evaluation criteria. We also identify critical research gaps and propose future directions, including multi-agent collaboration, self-evolving systems, and formal verification integration. This survey serves as a foundational guide for advancing LLM-driven software engineering. We maintain a GitHub repository that continuously updates the reviewed and related papers at https://github.com/lisaGuojl/LLM-Agent-SE-Survey.
comment: 22 pages
♻ ☆ How Far Have LLMs Come Toward Automated SATD Taxonomy Construction?
Technical debt refers to suboptimal code that degrades software quality. When developers intentionally introduce such debt, it is called self-admitted technical debt (SATD). Since SATD hinders maintenance, identifying its categories is key to uncovering quality issues. Traditionally, constructing such taxonomies requires manually inspecting SATD comments and surrounding code, which is time-consuming, labor-intensive, and often inconsistent due to annotator subjectivity. In this study, we investigated to what extent large language models (LLMs) could generate SATD taxonomies. We designed a structured, LLM-driven pipeline that mirrors the taxonomy construction steps researchers typically follow. We evaluated it on SATD datasets from three domains: quantum software, smart contracts, and machine learning. It successfully recovered domain-specific categories reported in prior work, such as Layer Configuration in machine learning. It also completed taxonomy generation in under two hours and for less than $1, even on the largest dataset. These results suggest that, while full automation remains challenging, LLMs can support semi-automated SATD taxonomy construction. Furthermore, our work opens up avenues for future work, such as automated taxonomy generation in other areas.
comment: 5 pages, APSEC 2025
♻ ☆ TAI3: Testing Agent Integrity in Interpreting User Intent NeurIPS 2025
LLM agents are increasingly deployed to automate real-world tasks by invoking APIs through natural language instructions. While powerful, they often suffer from misinterpretation of user intent, leading to the agent's actions that diverge from the user's intended goal, especially as external toolkits evolve. Traditional software testing assumes structured inputs and thus falls short in handling the ambiguity of natural language. We introduce TAI3, an API-centric stress testing framework that systematically uncovers intent integrity violations in LLM agents. Unlike prior work focused on fixed benchmarks or adversarial inputs, TAI3 generates realistic tasks based on toolkits' documentation and applies targeted mutations to expose subtle agent errors while preserving user intent. To guide testing, we propose semantic partitioning, which organizes natural language tasks into meaningful categories based on toolkit API parameters and their equivalence classes. Within each partition, seed tasks are mutated and ranked by a lightweight predictor that estimates the likelihood of triggering agent errors. To enhance efficiency, TAI3 maintains a datatype-aware strategy memory that retrieves and adapts effective mutation patterns from past cases. Experiments on 80 toolkit APIs demonstrate that TAI3 effectively uncovers intent integrity violations, significantly outperforming baselines in both error-exposing rate and query efficiency. Moreover, TAI3 generalizes well to stronger target models using smaller LLMs for test generation, and adapts to evolving APIs across domains.
comment: Accepted to NeurIPS 2025
♻ ☆ Quantum Artificial Intelligence for Software Engineering: the Road Ahead
In order to handle the increasing complexity of software systems, Artificial Intelligence (AI) has been applied to various areas of software engineering, including requirements engineering, coding, testing, and debugging. This has led to the emergence of AI for Software Engineering as a distinct research area within the field of software engineering. With the development of quantum computing, the field of Quantum AI (QAI) is arising, enhancing the performance of classical AI and holding significant potential for solving classical software engineering problems. Some initial applications of QAI in software engineering have already emerged, such as test case optimization. However, the path ahead remains open, offering ample opportunities to solve complex software engineering problems cost-effectively with QAI. To this end, this paper presents a roadmap towards the application of QAI in software engineering. Specifically, we consider two of the main categories of QAI, i.e., quantum optimization algorithms and quantum machine learning. For each software engineering phase, we discuss how these QAI approaches can address some of the tasks associated with that phase. Moreover, we provide an overview of some of the possible challenges that need to be addressed to make the application of QAI for software engineering successful.
Systems and Control 36
☆ Bilevel Analysis of Cost and Emissions Externalities from Data Center Load Shifting
Data centers are emerging as large, flexible electricity consumers capable of shifting computational workloads across locations in response to economic and environmental signals. While this flexibility has potential for emissions reduction, its impact on power system operations depends critically on how such behavior interacts with network constraints and market signals. We develop a bilevel optimization framework in which a data center minimizes a weighted combination of electricity cost and marginal emissions intensity (LME), while the system operator clears economic dispatch under transmission and generation constraints. Focusing on a stylized three-bus power system, we derive closed-form, piecewise-linear expressions for both the data center and system-wide objectives as functions of the data centers' load shift. These expressions capture threshold-driven regime changes due to congestion and renewable saturation. We identify sufficient conditions under which the data center's decentralized decisions align with or diverge from socially optimal behavior and characterize the resulting externalities. Our results reveal how system topology and generator asymmetry affect incentive alignment and provide insight into when marginal price or emissions signals may fail to guide flexible loads toward socially beneficial outcomes. Our results offer a tractable starting point for analyzing decentralized flexibility under carbon-aware incentives and suggest directions for improving coordination between flexible loads and system operations.
☆ Learning Optimal Power Flow with Pointwise Constraints
Training learning parameterizations to solve optimal power flow (OPF) with pointwise constraints is proposed. In this novel training approach, a learning parameterization is substituted directly into an OPF problem with constraints required to hold over all problem instances. This is different from existing supervised learning methods in which constraints are required to hold across the average of problem instances. Training with pointwise constraints is undertaken in the dual domain with the use of augmented Lagrangian and dual gradient ascent algorithm. Numerical experiments demonstrate that training with pointwise constraints produces solutions with smaller constraint violations. Experiments further demonstrated that pointwise constraints are most effective at reducing constraint violations in corner cases - defined as those realizations in which constraints are most difficult to satisfy. Gains are most pronounced in power systems with large numbers of buses.
comment: 19 pages, 13 figures
☆ Sugar Shack 4.0: Implementation of a Cyber-Physical System for Logistic and Sanitary Automation in a Maple Syrup Boiling Center
This paper presents the design and deployment of a process-aware cyber-physical system that automates plant-level logistics, traceability, and sanitation in a centralized maple-syrup boiling center. The system replaces ad-hoc, manual operations with event-driven orchestration on a local server, employing reusable device abstractions and a centralized interlock with priority-based arbitration for shared piping. It implements deterministic routines for delivery, reverse osmosis integration, evaporator feed, and permeate management. The system is sensor rich: inline measurements of flow, temperature, and sugar concentration (degrees Brix) drive routing decisions and trigger systematic post-transfer rinses (cleaning-in-place), ensuring consistent hygiene and complete, immediate traceability up to the evaporator inlet. During the 2025 production season, the system queued 431 operations without incident; executed 908 \enquote{topstock} and \enquote{downstock} balancing cycles; increased usable permeate reserves from 22,712 to approximately 41,640 L through dynamic storage assignment; eliminated mid-season contamination incidents previously observed under manual practice; and reduced administrative effort for billing and reporting from more than 30 hours to roughly 1 hour through automatic documentation. These results demonstrate a practical path to modular, plant-scale automation beyond traditional architectures, and lay the groundwork for packaging reusable elements for similar plants or adjacent industries. This work is part of a larger project involving the first scientifically-documented integration of Industry 4.0 technologies in a maple syrup boiling center.
comment: 8 pages, 6 figures
☆ Safe Decentralized Density Control of Multi-Robot Systems using PDE-Constrained Optimization with State Constraints
In this paper, we introduce a decentralized optimization-based density controller designed to enforce set invariance constraints in multi-robot systems. By designing a decentralized control barrier function, we derived sufficient conditions under which local safety constraints guarantee global safety. We account for localization and motion noise explicitly by modeling robots as spatial probability density functions governed by the Fokker-Planck equation. Compared to traditional centralized approaches, our controller requires less computational and communication power, making it more suitable for deployment in situations where perfect communication and localization are impractical. The controller is validated through simulations and experiments with four quadcopters.
comment: Accepted to MRS 2025
☆ Decentralized Small Gain and Phase Stability Conditions for Grid-Forming Converters: Limitations and Extensions
The increasing share of converter based resources in power systems calls for scalable methods to analyse stability without relying on exhaustive system wide simulations. Decentralized small gain and small-phase criteria have recently been proposed for this purpose, but their applicability to grid forming converters is severely limited by the sectoriality assumption, which is not typically satisfied at low frequencies. This work revisits and extends mixed gain phase conditions by introducing loop shaping transformations that reformulate converter and network models in alternative coordinate frames. The proposed approach resolves intrinsic non sectoriality at low frequencies and reduces conservativeness, thereby improving the applicability of decentralized stability certification. Analytical results are illustrated using an infinite bus system first and then extended to the IEEE 14 bus network, demonstrating the practicality and scalability of the method. These findings provide a pathway toward less conservative and more widely applicable decentralized stability certificates in power grids.
☆ Path-Based Conditions for the Identifiability of Non-additive Nonlinear Networks with Full Measurements
We analyze the identifiability of nonlinear networks with node dynamics characterized by functions that are non-additive. We consider the full measurement case (all the nodes are measured) in the path-independent delay scenario where all the excitation signals of a specific node have the same delay in the output of a measured node. Based on the notion of a generic nonlinear matrix associated with the network, we introduce the concept of generic identifiability and characterize the space of functions that satisfies this property. For directed acyclic graphs (DAGs) characterized by analytic functions, we derive a sufficient condition for identifiability based on vertex-disjoint paths from excited nodes to the in-neighbors of each node in the network. Furthermore, when we consider the class of polynomial functions, by using well-known results on algebraic varieties, we prove that the vertex-disjoint path condition is also necessary. Finally, we show that this identifiability condition is not necessary for the additive nonlinear model. Some examples are added to illustrate the results.
comment: 11 pages, 7 figures, submitted to IEEE Transactions on Automatic Control
☆ Joint Computation Offloading and Resource Management for Cooperative Satellite-Aerial-Marine Internet of Things Networks
Devices within the marine Internet of Things (MIoT) can connect to low Earth orbit (LEO) satellites and unmanned aerial vehicles (UAVs) to facilitate low-latency data transmission and execution, as well as enhanced-capacity data storage. However, without proper traffic handling strategy, it is still difficult to effectively meet the low-latency requirements. In this paper, we consider a cooperative satellite-aerial-MIoT network (CSAMN) for maritime edge computing and maritime data storage to prioritize delay-sensitive (DS) tasks by employing mobile edge computing, while handling delay-tolerant (DT) tasks via the store-carry-forward method. Considering the delay constraints of DS tasks, we formulate a constrained joint optimization problem of maximizing satellite-collected data volume while minimizing system energy consumption by controlling four interdependent variables, including the transmit power of UAVs for DS tasks, the start time of DT tasks, computing resource allocation, and offloading ratio. To solve this non-convex and non-linear problem, we propose a joint computation offloading and resource management (JCORM) algorithm using the Dinkelbach method and linear programming. Our results show that the volume of data collected by the proposed JCORM algorithm can be increased by up to 41.5% compared to baselines. Moreover, JCORM algorithm achieves a dramatic reduction in computational time, from a maximum of 318.21 seconds down to just 0.16 seconds per experiment, making it highly suitable for real-time maritime applications.
☆ Behavior-Aware Online Prediction of Obstacle Occupancy using Zonotopes
Predicting the motion of surrounding vehicles is key to safe autonomous driving, especially in unstructured environments without prior information. This paper proposes a novel online method to accurately predict the occupancy sets of surrounding vehicles based solely on motion observations. The approach is divided into two stages: first, an Extended Kalman Filter and a Linear Programming (LP) problem are used to estimate a compact zonotopic set of control actions; then, a reachability analysis propagates this set to predict future occupancy. The effectiveness of the method has been validated through simulations in an urban environment, showing accurate and compact predictions without relying on prior assumptions or prior training data.
comment: 64th IEEE Conference on Decision and Control
☆ A Multifunctional Capacitive Sensing Platform for Wireless Vascular and Heart Monitoring
We present a multifunctional, antenna-integrated capacitive sensing (MAiCaS) platform for passive, wireless, and real-time cardiovascular monitoring. Unlike conventional systems that require separate sensors and wireless modules, our device unifies sensing, telemetry, and mechanical functionality into a compact and scalable design by exploiting the parasitic capacitance of an inductive antenna as a strain-sensitive element. The sensor is fabricated using a cleanroom-free, single-step UV laser patterning process on a flexible PDMS substrate, reducing manufacturing complexity and enabling high reproducibility. The MAiCaS is suitable for three different applications: as a sensor for epicardial strain measurement, a stent as a sensor, and a vascular graft sensor. We demonstrate MAiCaS's versatility by validating its wireless resonance-based response to strain, pressure, and deformation across unrolled and rolled forms. In vitro experiments demonstrated consistent resonance frequency shifts under physiological conditions, with stable performance on skin, in PBS, human serum, and simulated vascular environments. Repeatability and aging tests confirmed its long-term reliability and elasticity under cyclic loading. Calibration curves revealed high sensitivity across all configurations, with wireless interrogation achieved through S11 parameter measurements and resonance frequency shift as the output metric. The sensitivity of the device was measured to be 2.9 MHz per 1% strain, 0.43 MHz/mmHg, and 309.6kHz/\textmu m for epicardial patch, graft, and stent integrated sensor, respectively. The operation of MAiCaS was evaluated in a human experiment. This monolithic sensor architecture provides a scalable and cost-effective solution for battery-free monitoring of vascular dynamics, with potential for remote diagnostics, post-surgical follow-up, and continuous cardiovascular health management.
comment: 28 pages, 7 figures,
☆ Balancing Specialization and Centralization: A Multi-Agent Reinforcement Learning Benchmark for Sequential Industrial Control
Autonomous control of multi-stage industrial processes requires both local specialization and global coordination. Reinforcement learning (RL) offers a promising approach, but its industrial adoption remains limited due to challenges such as reward design, modularity, and action space management. Many academic benchmarks differ markedly from industrial control problems, limiting their transferability to real-world applications. This study introduces an enhanced industry-inspired benchmark environment that combines tasks from two existing benchmarks, SortingEnv and ContainerGym, into a sequential recycling scenario with sorting and pressing operations. We evaluate two control strategies: a modular architecture with specialized agents and a monolithic agent governing the full system, while also analyzing the impact of action masking. Our experiments show that without action masking, agents struggle to learn effective policies, with the modular architecture performing better. When action masking is applied, both architectures improve substantially, and the performance gap narrows considerably. These results highlight the decisive role of action space constraints and suggest that the advantages of specialization diminish as action complexity is reduced. The proposed benchmark thus provides a valuable testbed for exploring practical and robust multi-agent RL solutions in industrial automation, while contributing to the ongoing debate on centralization versus specialization.
comment: Preprint (submitted version) to be presented at the 13th International Conference on Industrial Engineering and Applications (ICIEA-EU), Milan, 2026. The final Version of Record will appear in the official conference proceedings
☆ Interlacing in Controllers Implementation: Frequency Analysis
The main goal of this contribution is to explain how to use interlacing techniques for LTI controllers implementation and analyze different struc- tures in this environment. These considerations lead to an important com- putation saving in constrained resource environments. It has been also intro- duced new procedures for obtaining the blocks related to different real and complex controllers poles. The resultant time-varying system is modeled using proper discrete lifting techniques and a new and efficient dual-rate fre- quency response computation allows to determine the characteristics of the control loop with interlaced controller. Examples illustrate the theoretical proposals.
comment: 18 pages, 6 figures, next version will be submitted to a journal
☆ On MIMO Stability Analysis Methods Applied to Inverter-Based Resources Connected to Power Systems
This paper presents a critical review of methods commonly employed in the literature for small signal stability analysis of inverter based resources (IBRs). It discusses the intended purposes of these methods and outlines both their proper and improper implementations. The paper provides insights into the applicability of these techniques, clarifies their inherent limitations, and discusses and illustrates common sources of misinterpretation.
☆ Multi-layer Optimized Coordination of Smart Building Resources in Active Power Distribution Systems
This paper proposes a multi-actor coordination platform for the optimal utilization of smart buildings resources, including roof top PV generation and battery energy storage system (BESS), in active power distribution systems. The proposed multi-actor coordination includes the Smart Building Coordinator (SBC), Micro-Grid Coordinator (MGC) and Distribution System Coordinator (DSC). The coordinators operate independently and only exchange limited information with each other to reach an optimal solution. In the proposed platform, a hierarchical optimization problem is solved to optimally determine the operating point of all distribution system resources. The proposed platform fully preserves the confidentiality of the behind the meter (BTM) data of the buildings since no information about the status of the PV system, BESS, and load of the building is shared with the owner of the power system. The proposed platform has a flexible and scalable architecture where the computational task of coordinating microgrids and smart buildings with distribution grid is performed locally at the MGC and SBC layers, respectively. Numerical simulations show the efficacy of the proposed platform in coordinating the BTM resources with the rest of the distribution system.
☆ Observer-based Differentiators for Noisy Signals
We present a collection of different types of observation systems that work as differentiators. These observer-based differentiators can produce estimates for derivatives of a given signal, even though the given signal is prone to noise.
comment: 9 pages, 6 figures, technical report
☆ From Bundles to Backstepping: Geometric Control Barrier Functions for Safety-Critical Control on Manifolds
Control barrier functions (CBFs) have a well-established theory in Euclidean spaces, yet still lack general formulations and constructive synthesis tools for systems evolving on manifolds common in robotics and aerospace applications. In this paper, we develop a general theory of geometric CBFs on bundles and, for control-affine systems, recover the standard optimization-based CBF controllers and their smooth analogues. Then, by generalizing kinetic energy-based CBF backstepping to Riemannian manifolds, we provide a constructive CBF synthesis technique for geometric mechanical systems, as well as easily verifiable conditions under which it succeeds. Further, this technique utilizes mechanical structure to avoid computations on higher-order tangent bundles. We demonstrate its application to an underactuated satellite on SO(3).
comment: 8 pages, 3 figures, Submitted to American Control Conference (ACC) 2026
☆ Soft Switching Expert Policies for Controlling Systems with Uncertain Parameters
This paper proposes a simulation-based reinforcement learning algorithm for controlling systems with uncertain and varying system parameters. While simulators are useful for safely learning control policies for physical systems, mitigating the reality gap remains a major challenge. To address the challenge, we propose a two-stage algorithm. In the first stage, multiple control policies are learned for systems with different parameters in a simulator. In the second stage, for a real system, the control policies learned in the first stage are smoothly switched using an online convex optimization algorithm based on observations. Our proposed algorithm is demonstrated through numerical experiments.
comment: 6 pages, 8 figures. Submitted to an International Conference
☆ Design Optimization and Global Impact Assessment of Solar-Thermal Direct Air Carbon Capture
The dual challenge of decarbonizing the economy and meeting rising global energy demand underscores the need for scalable and cost-effective carbon dioxide removal technologies. Direct air capture (DAC) is among the most promising approaches, but its high energy intensity, particularly the thermal energy required for sorbent regeneration, remains a critical barrier to cost reduction and sustainable deployment. This study explores solar-thermal DAC systems that combine concentrated solar thermal technology with low-cost sand-based thermal energy storage to meet this demand. We analyze the techno-economic performance of such systems in both grid-connected and stand-alone configurations. Results show that solar-thermal DAC can achieve annual capacity factors exceeding 80% and CO2 removal costs as low as 160-200 USD per ton, making it competitive with leading DAC technologies. The proposed system operates most efficiently with short-cycle sorbents that align with solar availability. The stand-alone Solar-DAC systems, which rely solely on solar energy for both electricity and thermal energy, are particularly promising in regions with high solar capacity and sandy terrain, exhibiting minimal ambient sensitivity from temperature and humidity. An optimal 6000 ton/yr modular system design takes <1 km2 land-use requirement and potentially >26 Gt/year DAC capacity is identified for sandy terrain alone globally. In areas with sedimentary basins suitable for CO2 storage, solar-powered DAC offers a lower-cost alternative to geothermal heating, which often faces geological and economic constraints.
comment: 4 figures
☆ Interpolatory Approximations of PMU Data: Dimension Reduction and Pilot Selection
This work investigates the reduction of phasor measurement unit (PMU) data through low-rank matrix approximations. To reconstruct a PMU data matrix from fewer measurements, we propose the framework of interpolatory matrix decompositions (IDs). In contrast to methods relying on principal component analysis or singular value decomposition, IDs recover the complete data matrix using only a few of its rows (PMU datastreams) and/or a few of its columns (snapshots in time). This compression enables the real-time monitoring of power transmission systems using a limited number of measurements, thereby minimizing communication bandwidth. The ID perspective gives a rigorous error bound on the quality of the data compression. We propose selecting rows and columns used in an ID via the discrete empirical interpolation method (DEIM), a greedy algorithm that aims to control the error bound. This bound leads to a computable estimate for the reconstruction error during online operations. A violation of this estimate suggests a change in the system's operating conditions, and thus serves as a tool for fault detection. Numerical tests using synthetic PMU data illustrate DEIM's excellent performance for data compression, and validate the proposed DEIM-based fault-detection method.
☆ SpeechAgent: An End-to-End Mobile Infrastructure for Speech Impairment Assistance
Speech is essential for human communication, yet millions of people face impairments such as dysarthria, stuttering, and aphasia conditions that often lead to social isolation and reduced participation. Despite recent progress in automatic speech recognition (ASR) and text-to-speech (TTS) technologies, accessible web and mobile infrastructures for users with impaired speech remain limited, hindering the practical adoption of these advances in daily communication. To bridge this gap, we present SpeechAgent, a mobile SpeechAgent designed to facilitate people with speech impairments in everyday communication. The system integrates large language model (LLM)- driven reasoning with advanced speech processing modules, providing adaptive support tailored to diverse impairment types. To ensure real-world practicality, we develop a structured deployment pipeline that enables real-time speech processing on mobile and edge devices, achieving imperceptible latency while maintaining high accuracy and speech quality. Evaluation on real-world impaired speech datasets and edge-device latency profiling confirms that SpeechAgent delivers both effective and user-friendly performance, demonstrating its feasibility for personalized, day-to-day assistive communication.
☆ Modeling to Generate Alternatives for Robustness of Mixed Integer DC Optimal Power Flow
Transmission system operators face a variety of discrete operational decisions, such as switching of branches and/or devices. Incorporating these decisions into optimal power flow (OPF) results in mixed-integer non-linear programming problems (MINLPs), which can't presently be solved at scale in the required time. Various linearizations of the OPF exist, most famously the DC-OPF, which can be leveraged to find integer decisions. However, these linearizations can yield very poor integer solutions in some edge cases, making them challenging to incorporate into control rooms. This paper introduces the use of modeling to generate alternatives (MGA) to find alternative solutions to the linearized problems, reducing the chance of finding no AC feasible solutions. We test this approach using 13 networks where the DC linearization results in infeasible integer decisions, and MGA finds a solution in all cases. The MGA search criteria selected drastically affects the number and quality of solutions found, so network specific search functions may be necessary.
☆ Lyapunov-Based Physics-Informed Deep Neural Networks with Skew Symmetry Considerations
Deep neural networks (DNNs) are powerful black-box function approximators which have been shown to yield improved performance compared to traditional neural network (NN) architectures. However, black-box algorithms do not incorporate known physics of the system and can yield results which are physically implausible. Physics-informed neural networks (PINNs) have grown in popularity due to their ability to leverage known physical principles in the learning process which has been empirically shown to improve performance compared to traditional black-box methods. This paper introduces the first physics-informed DNN controller for an Euler-Lagrange dynamic system where the adaptation laws are designed using a Lyapunov-based stability analysis to account for the skew-symmetry property of the inertia matrix and centripetal-Coriolis matrix. A Lyapunov-based stability analysis is provided to guarantee asymptotic convergence of the tracking error and the skew-symmetric prediction error. Simulations indicate that the developed update law demonstrates improvement in individual and overall function approximation capabilities when compared to a physics-informed adaptation law which does not incorporate knowledge of system symmetries.
☆ House Thermal Model Estimation: Robustness Across Seasons and Setpoints
Achieving the flexibility from house heating, cooling, and ventilation systems (HVAC) has the potential to enable large-scale demand response by aggregating HVAC load adjustments across many homes. This demand response strategy helps distribution grid to flexibly ramp-up or ramp-down local load demand so that it can optimally match the bulk power system generation profile. However, achieving this capability requires house thermal models that are both computationally efficient and robust to operating conditions. In this work, parameters of the Resistance-Capacitance (RC) network thermal model for houses are estimated using three optimization algorithms: Nonlinear Least Squares (NLS), Batch Estimation (BE), and Maximum Likelihood Estimation (MLE). The resulting models are evaluated through a Forward-Simulation across four different seasons and three setpoints. The results illustrate a principled way of selecting reduced order models and estimation methods with respect to the robustness offered to seasonal and setpoint variations in training-testing datasets
comment: This manuscript is a version of our paper accepted at the 57th North American Power Symposium (NAPS) 2025
☆ A Connectively Stable and Robust DAPI Control Scheme for Islanded Networks of Microgrids
The transition towards clean energy and the introduction of Distributed Energy Resources (DERs) are giving rise to the emergence of Microgrids (MGs) and Networks of MGs (NMGs). MGs and NMGs can operate autonomously in islanded mode. However, they face challenges in terms of secondary level frequency and voltage regulation, due to the variable nature of Renewable Energy Sources (RES) and loads. Distributed-Averaging Proportional-Integral (DAPI) control has been proposed in the literature for distributed frequency and voltage control of droop-controlled DERs, but it is not robust to operational or structural perturbations. To address this, we propose a robust DAPI frequency and voltage control scheme that ensures robustness using the concept of connective stability, along with the invariant ellipsoid technique for disturbance rejection. Simulation of an NMG model in MATLAB\textsuperscript{\textregistered}/Simulink\textsuperscript{\textregistered} consisting of 3 MGs and 5 DERs validates the effectiveness of the proposed method, and demonstrates that it can successfully mitigate the effects of major disturbances such as cyberattacks.
☆ Safety Monitor for Off-Road Planning with Uncertainty Bounded Bekker Costs
Reliable off-road autonomy requires operational constraints so that behavior stays predictable and safe when soil strength is uncertain. This paper presents a runtime assurance safety monitor that collaborates with any planner and uses a Bekker-based cost model with bounded uncertainty. The monitor builds an upper confidence traversal cost from a lightweight pressure sinkage model identified in field tests and checks each planned motion against two limits: maximum sinkage and rollover margin. If the risk of crossing either limit is too high, the monitor switches to a certified fallback that reduces vehicle speed, increases standoff from soft ground, or stops on firmer soil. This separation lets the planner focus on efficiency while the monitor keeps the vehicle within clear safety limits on board. Wheel geometry, wheel load estimate, and a soil raster serve as inputs, which tie safety directly to vehicle design and let the monitor set clear limits on speed, curvature, and stopping at run time. The method carries uncertainty analytically into the upper confidence cost and applies simple intervention rules. Tuning of the sinkage limit, rollover margin, and risk window trades efficiency for caution while keeping the monitor light enough for embedded processors. Results from a simulation environment spanning loam to sand include intervention rates, violation probability, and path efficiency relative to the nominal plan, and a benchtop static loading check provides initial empirical validation.
☆ Aircraft Collision Avoidance Systems: Technological Challenges and Solutions on the Path to Regulatory Acceptance
Aircraft collision avoidance systems is critical to modern aviation. These systems are designed to predict potential collisions between aircraft and recommend appropriate avoidance actions. Creating effective collision avoidance systems requires solutions to a variety of technical challenges related to surveillance, decision making, and validation. These challenges have sparked significant research and development efforts over the past several decades that have resulted in a variety of proposed solutions. This article provides an overview of these challenges and solutions with an emphasis on those that have been put through a rigorous validation process and accepted by regulatory bodies. The challenges posed by the collision avoidance problem are often present in other domains, and aircraft collision avoidance systems can serve as case studies that provide valuable insights for a wide range of safety-critical systems.
comment: 32 pages, 9 figures
♻ ☆ Prognostic Framework for Robotic Manipulators Operating Under Dynamic Task Severities
Robotic manipulators are critical in many applications but are known to degrade over time. This degradation is influenced by the nature of the tasks performed by the robot. Tasks with higher severity, such as handling heavy payloads, can accelerate the degradation process. One way this degradation is reflected is in the position accuracy of the robot's end-effector. In this paper, we present a prognostic modeling framework that predicts a robotic manipulator's Remaining Useful Life (RUL) while accounting for the effects of task severity. Our framework represents the robot's position accuracy as a Brownian motion process with a random drift parameter that is influenced by task severity. The dynamic nature of task severity is modeled using a continuous-time Markov chain (CTMC). To evaluate RUL, we discuss two approaches -- (1) a novel closed-form expression for Remaining Lifetime Distribution (RLD), and (2) Monte Carlo simulations, commonly used in prognostics literature. Theoretical results establish the equivalence between these RUL computation approaches. We validate our framework through experiments using two distinct physics-based simulators for planar and spatial robot fleets. Our findings show that robots in both fleets experience shorter RUL when handling a higher proportion of high-severity tasks.
comment: Accepted for Publication in IEEE Transactions on Systems, Man, and Cybernetics: Systems
♻ ☆ Deep Learning for Continuous-time Stochastic Control with Jumps
In this paper, we introduce a model-based deep-learning approach to solve finite-horizon continuous-time stochastic control problems with jumps. We iteratively train two neural networks: one to represent the optimal policy and the other to approximate the value function. Leveraging a continuous-time version of the dynamic programming principle, we derive two different training objectives based on the Hamilton-Jacobi-Bellman equation, ensuring that the networks capture the underlying stochastic dynamics. Empirical evaluations on different problems illustrate the accuracy and scalability of our approach, demonstrating its effectiveness in solving complex, high-dimensional stochastic control tasks.
♻ ☆ Revisiting Functional Derivatives in Multi-object Tracking
Probability generating functionals (PGFLs) are efficient and powerful tools for tracking independent objects in clutter. It was shown that PGFLs could be used for the elegant derivation of practical multi-object tracking algorithms, e.g., the probability hypothesis density (PHD) filter. However, derivations using PGFLs use the so-called functional derivatives whose definitions usually appear too complicated or heuristic, involving Dirac delta ``functions''. This paper begins by comparing different definitions of functional derivatives and exploring their relationships and implications for practical applications. It then proposes a rigorous definition of the functional derivative, utilizing straightforward yet precise mathematics for clarity. Key properties of the functional derivative are revealed and discussed.
comment: submitted to SIAM Journal on Control and Optimization
♻ ☆ Convergence in On-line Learning of Static and Dynamic Systems
The paper derives analytical expressions for the asymptotic average updating direction of the adaptive moment generation (ADAM) algorithm when applied to recursive identification of nonlinear systems. It is proved that the standard hyper-parameter setting results in the same asymptotic average updating direction as a diagonally power normalized stochastic gradient algorithm. With the internal filtering turned off, the asymptotic average updating direction is instead equivalent to that of a sign-sign stochastic gradient algorithm. Global convergence to an invariant set follows, where a subset of parameters contain those that give a correct input-output description of the system. The paper also exploits a nonlinear dynamic model to embed structure in recurrent neural networks. A Monte-Carlo simulation study validates the results.
♻ ☆ Constrained Trajectory Optimization for Hybrid Dynamical Systems
Hybrid dynamical systems pose significant challenges for effective planning and control, especially when additional constraints such as obstacle avoidance, state boundaries, and actuation limits are present. In this letter, we extend the recently proposed Hybrid iLQR method [1] to handle state and input constraints within an indirect optimization framework, aiming to preserve computational efficiency and ensure dynamic feasibility. Specifically, we incorporate two constraint handling mechanisms into the Hybrid iLQR: Discrete Barrier State and Augmented Lagrangian methods. Comprehensive simulations across various operational situations are conducted to evaluate and compare the performance of these extended methods in terms of convergence and their ability to handle infeasible starting trajectories. Results indicate that while the Discrete Barrier State approach is more computationally efficient, the Augmented Lagrangian method outperforms it in complex and real-world scenarios with infeasible initial trajectories.
comment: 6 pages 4 figures
♻ ☆ Online Regularized Statistical Learning in Reproducing Kernel Hilbert Space With Non-Stationary Data
We study recursive regularized learning algorithms in the reproducing kernel Hilbert space (RKHS) with non-stationary online data streams. We introduce the concept of random Tikhonov regularization path and decompose the tracking error of the algorithm's output for the regularization path into random difference equations in RKHS. We show that the tracking error vanishes in mean square if the regularization path is slowly time-varying. Then, leveraging the monotonicity of inverse operators and the spectral decomposition of compact operators, and introducing the RKHS persistence of excitation condition, we develop a dominated convergence method to prove the mean square consistency between the regularization path and the unknown function to be learned. Especially, for independent and non-identically distributed data streams, the mean square consistency between the algorithm's output and the unknown function is achieved if the input data's marginal probability measures are slowly time-varying and the average measure over each fixed-length time period has a uniformly strictly positive lower bound.
♻ ☆ Bayesian Optimization of Process Parameters of a Sensor-Based Sorting System using Gaussian Processes as Surrogate Models
Sensor-based sorting systems enable the physical separation of a material stream into two fractions. The sorting decision is based on the image data evaluation of the sensors used and is carried out using actuators. Various process parameters must be set depending on the properties of the material stream, the dimensioning of the system, and the required sorting accuracy. However, continuous verification and re-adjustment are necessary due to changing requirements and material stream compositions. In this paper, we introduce an approach for optimizing, recurrently monitoring and adjusting the process parameters of a sensor-based sorting system. Based on Bayesian Optimization, Gaussian process regression models are used as surrogate models to achieve specific requirements for system behavior with the uncertainties contained therein. This method minimizes the number of necessary experiments while simultaneously considering two possible optimization targets based on the requirements for both material output streams. In addition, uncertainties are considered during determining sorting accuracies in the model calculation. We evaluated the method with three example process parameters.
comment: Accepted at the IEEE 30th International Conference on Emerging Technologies and Factory Automation (ETFA)
♻ ☆ Scalable Distributed Least Squares Algorithm for Linear Algebraic Equations via Periodic Scheduling
In this work, we propose a novel discrete-time distributed algorithm for finding least-squares solutions of linear algebraic equations with a scheduling protocol to further enhance its scalability. Each agent in the network is assumed to know some rows of the coefficient matrix and the corresponding entries in the observation vector. Unlike typical distributed algorithms, our approach considers communication bandwidth limits, allowing agents to transmit only a portion of their ``guessed" solution, independent of its dimension. A cyclic scheduling protocol determines which portion is transmitted at each iteration. Assuming a small fixed step size and a diagonalizable algorithm matrix, we prove that agents' ``guessed" solutions converge exponentially to a least squares solution. For cases where the observation vectors are time-varying, a modified algorithm guarantees practical convergence, with tracking error bounded by the single-step variation in the observation vector. Simulations and comparisons with state-of-the-art algorithms validate our algorithm's feasibility and scalability.
comment: Submitted to IEEE TAC
♻ ☆ Towards Machine Learning-based Model Predictive Control for HVAC Control in Multi-Context Buildings at Scale via Ensemble Learning
The building thermodynamics model, which predicts real-time indoor temperature changes under potential HVAC (Heating, Ventilation, and Air Conditioning) control operations, is crucial for optimizing HVAC control in buildings. While pioneering studies have attempted to develop such models for various building environments, these models often require extensive data collection periods and rely heavily on expert knowledge, making the modeling process inefficient and limiting the reusability of the models. This paper explores a model ensemble perspective that utilizes existing developed models as base models to serve a target building environment, thereby providing accurate predictions while reducing the associated efforts. Given that building data streams are non-stationary and the number of base models may increase, we propose a Hierarchical Reinforcement Learning (HRL) approach to dynamically select and weight the base models. Our approach employs a two-tiered decision-making process: the high-level focuses on model selection, while the low-level determines the weights of the selected models. We thoroughly evaluate the proposed approach through offline experiments and an on-site case study, and the experimental results demonstrate the effectiveness of our method.
♻ ☆ Online Learning for Dynamic Vickrey-Clarke-Groves Mechanism in Unknown Environments
We consider the problem of online dynamic mechanism design for sequential auctions in unknown environments, where the underlying market and, thus, the bidders' values vary over time as interactions between the seller and the bidders progress. We model the sequential auctions as an infinite-horizon average-reward Markov decision process (MDP). In each round, the seller determines an allocation and sets a payment for each bidder, while each bidder receives a private reward and submits a sealed bid to the seller. The state, which represents the underlying market, evolves according to an unknown transition kernel and the seller's allocation policy without episodic resets. We first extend the Vickrey-Clarke-Groves (VCG) mechanism to sequential auctions, thereby obtaining a dynamic counterpart that preserves the desired properties: efficiency, truthfulness, and individual rationality. We then focus on the online setting and develop a reinforcement learning algorithm for the seller to learn the underlying MDP and implement a mechanism that closely resembles the dynamic VCG mechanism. We show that the learned mechanism approximately satisfies efficiency, truthfulness, and individual rationality and achieves guaranteed performance in terms of various notions of regret.
comment: 20 pages
♻ ☆ MESS+: Dynamically Learned Inference-Time LLM Routing in Model Zoos with Service Level Guarantees NeurIPS 2025
Open-weight large language model (LLM) zoos provide access to numerous high-quality models, but selecting the appropriate model for specific tasks remains challenging and requires technical expertise. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities, while inference service providers prioritize minimizing operating costs. These competing interests are typically mediated through service level agreements (SLAs) that guarantee minimum service quality. We introduce MESS+, a stochastic optimization algorithm for cost-optimal LLM request routing while providing rigorous SLA compliance guarantees. MESS+ learns request satisfaction probabilities of LLMs in real-time as users interact with the system, based on which model selection decisions are made by solving a per-request optimization problem. Our algorithm includes a novel combination of virtual queues and request satisfaction prediction, along with a theoretical analysis of cost optimality and constraint satisfaction. Across a wide range of state-of-the-art LLM benchmarks, MESS+ achieves an average of $2\times$ cost savings compared to existing LLM routing techniques.
comment: NeurIPS 2025. Code: https://github.com/laminair/mess-plus
Graphics 5
☆ Optimizing Feature Ordering in Radar Charts for Multi-Profile Comparison
Radar charts are widely used to visualize multivariate data and compare multiple profiles across features. However, the visual clarity of radar charts can be severely compromised when feature values alternate drastically in magnitude around the circle, causing areas to collapse, which misrepresents relative differences. In the present work we introduce a permutation optimization strategy that reorders features to minimize polygon ``spikiness'' across multiple profiles simultaneously. The method is combinatorial (exhaustive search) for moderate numbers of features and uses a lexicographic minimax criterion that first considers overall smoothness (mean jump) and then the largest single jump as a tie-breaker. This preserves more global information and produces visually balanced arrangements. We discuss complexity, practical bounds, and relations to existing approaches that either change the visualization (e.g., OrigamiPlot) or learn orderings (e.g., Versatile Ordering Network). An example with two profiles and $p=6$ features (before/after ordering) illustrates the qualitative improvement. Keywords: data visualization, radar charts, combinatorial optimization, minimax optimization, feature ordering
☆ From Far and Near: Perceptual Evaluation of Crowd Representations Across Levels of Detail
In this paper, we investigate how users perceive the visual quality of crowd character representations at different levels of detail (LoD) and viewing distances. Each representation: geometric meshes, image-based impostors, Neural Radiance Fields (NeRFs), and 3D Gaussians, exhibits distinct trade-offs between visual fidelity and computational performance. Our qualitative and quantitative results provide insights to guide the design of perceptually optimized LoD strategies for crowd rendering.
☆ LSF-Animation: Label-Free Speech-Driven Facial Animation via Implicit Feature Representation
Speech-driven 3D facial animation has attracted increasing interest since its potential to generate expressive and temporally synchronized digital humans. While recent works have begun to explore emotion-aware animation, they still depend on explicit one-hot encodings to represent identity and emotion with given emotion and identity labels, which limits their ability to generalize to unseen speakers. Moreover, the emotional cues inherently present in speech are often neglected, limiting the naturalness and adaptability of generated animations. In this work, we propose LSF-Animation, a novel framework that eliminates the reliance on explicit emotion and identity feature representations. Specifically, LSF-Animation implicitly extracts emotion information from speech and captures the identity features from a neutral facial mesh, enabling improved generalization to unseen speakers and emotional states without requiring manual labels. Furthermore, we introduce a Hierarchical Interaction Fusion Block (HIFB), which employs a fusion token to integrate dual transformer features and more effectively integrate emotional, motion-related and identity-related cues. Extensive experiments conducted on the 3DMEAD dataset demonstrate that our method surpasses recent state-of-the-art approaches in terms of emotional expressiveness, identity generalization, and animation realism. The source code will be released at: https://github.com/Dogter521/LSF-Animation.
☆ Comparative Analysis of Procedural Planet Generators
This paper presents the development of two distinct real-time procedural planet generators within the Godot engine: one employing Fractal Brownian Motion (FBM) with Perlin Noise, and another adapting Minecraft-inspired layered noise techniques. We detail their implementation, including a quadtree-based Level of Detail (LOD) system and solutions for planetary mesh generation. A comparative user study (N=15) was conducted where participants explored unique instances generated by our two algorithms alongside two existing procedural planet projects.
comment: Published in the Proceedings of GAME-ON 2025
♻ ☆ GenLit: Reformulating Single-Image Relighting as Video Generation
Manipulating the illumination of a 3D scene within a single image represents a fundamental challenge in computer vision and graphics. This problem has traditionally been addressed using inverse rendering techniques, which involve explicit 3D asset reconstruction and costly ray-tracing simulations. Meanwhile, recent advancements in visual foundation models suggest that a new paradigm could soon be possible -- one that replaces explicit physical models with networks that are trained on large amounts of image and video data. In this paper, we exploit the implicit scene understanding of a video diffusion model, particularly Stable Video Diffusion, to relight a single image. We introduce GenLit, a framework that distills the ability of a graphics engine to perform light manipulation into a video-generation model, enabling users to directly insert and manipulate a point light in the 3D world within a given image and generate results directly as a video sequence. We find that a model fine-tuned on only a small synthetic dataset generalizes to real-world scenes, enabling single-image relighting with plausible and convincing shadows and inter-reflections. Our results highlight the ability of video foundation models to capture rich information about lighting, material, and shape, and our findings indicate that such models, with minimal training, can be used to perform relighting without explicit asset reconstruction or ray-tracing. . Project page: https://genlit.is.tue.mpg.de/.
Computation and Language 136
☆ Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict
☆ On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text?
With the widespread use of large language models (LLMs), many researchers have turned their attention to detecting text generated by them. However, there is no consistent or precise definition of their target, namely "LLM-generated text". Differences in usage scenarios and the diversity of LLMs further increase the difficulty of detection. What is commonly regarded as the detecting target usually represents only a subset of the text that LLMs can potentially produce. Human edits to LLM outputs, together with the subtle influences that LLMs exert on their users, are blurring the line between LLM-generated and human-written text. Existing benchmarks and evaluation approaches do not adequately address the various conditions in real-world detector applications. Hence, the numerical results of detectors are often misunderstood, and their significance is diminishing. Therefore, detectors remain useful under specific conditions, but their results should be interpreted only as references rather than decisive indicators.
☆ Real Deep Research for AI, Robotics and Beyond
With the rapid growth of research in AI and robotics now producing over 10,000 papers annually it has become increasingly difficult for researchers to stay up to date. Fast evolving trends, the rise of interdisciplinary work, and the need to explore domains beyond one's expertise all contribute to this challenge. To address these issues, we propose a generalizable pipeline capable of systematically analyzing any research area: identifying emerging trends, uncovering cross domain opportunities, and offering concrete starting points for new inquiry. In this work, we present Real Deep Research (RDR) a comprehensive framework applied to the domains of AI and robotics, with a particular focus on foundation models and robotics advancements. We also briefly extend our analysis to other areas of science. The main paper details the construction of the RDR pipeline, while the appendix provides extensive results across each analyzed topic. We hope this work sheds light for researchers working in the field of AI and beyond.
comment: website: https://realdeepresearch.github.io
☆ Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples
Recently, Sharma et al. suggested a method called Layer-SElective-Rank reduction (LASER) which demonstrated that pruning high-order components of carefully chosen LLM's weight matrices can boost downstream accuracy -- without any gradient-based fine-tuning. Yet LASER's exhaustive, per-matrix search (each requiring full-dataset forward passes) makes it impractical for rapid deployment. We demonstrate that this overhead can be removed and find that: (i) Only a small, carefully chosen subset of matrices needs to be inspected -- eliminating the layer-by-layer sweep, (ii) The gradient of each matrix's singular values pinpoints which matrices merit reduction, (iii) Increasing the factorization search space by allowing matrices rows to cluster around multiple subspaces and then decomposing each cluster separately further reduces overfitting on the original training data and further lifts accuracy by up to 24.6 percentage points, and finally, (iv) we discover that evaluating on just 100 samples rather than the full training data -- both for computing the indicative gradients and for measuring the final accuracy -- suffices to further reduce the search time; we explain that as adaptation to downstream tasks is dominated by prompting style, not dataset size. As a result, we show that combining these findings yields a fast and robust adaptation algorithm for downstream tasks. Overall, with a single gradient step on 100 examples and a quick scan of the top candidate layers and factorization techniques, we can adapt LLMs to new datasets -- entirely without fine-tuning.
☆ Simple Context Compression: Mean-Pooling and Multi-Ratio Training
A common strategy to reduce the computational costs of using long contexts in retrieval-augmented generation (RAG) with large language models (LLMs) is soft context compression, where the input sequence is transformed into a shorter continuous representation. We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture, and study training the same compressor to output multiple compression ratios. We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios. Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios. More broadly though, across architectures and training regimes the trade-offs are more nuanced, illustrating the complex landscape of compression methods.
comment: Code available at https://github.com/lil-lab/simple-context-compression
☆ BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation
The rapid progress of graph generation has raised new security concerns, particularly regarding backdoor vulnerabilities. While prior work has explored backdoor attacks in image diffusion and unconditional graph generation, conditional, especially text-guided graph generation remains largely unexamined. This paper proposes BadGraph, a backdoor attack method targeting latent diffusion models for text-guided graph generation. BadGraph leverages textual triggers to poison training data, covertly implanting backdoors that induce attacker-specified subgraphs during inference when triggers appear, while preserving normal performance on clean inputs. Extensive experiments on four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the effectiveness and stealth of the attack: less than 10% poisoning rate can achieves 50% attack success rate, while 24% suffices for over 80% success rate, with negligible performance degradation on benign samples. Ablation studies further reveal that the backdoor is implanted during VAE and diffusion training rather than pretraining. These findings reveal the security vulnerabilities in latent diffusion models of text-guided graph generation, highlight the serious risks in models' applications such as drug discovery and underscore the need for robust defenses against the backdoor attack in such diffusion models.
☆ Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To mitigate the issue, we explore a series of hybrid models that restore direct access to past tokens. We interleave token mixers with intermediate time and space complexity between linear and full attention, including sparse attention with token eviction, and the query-aware native sparse attention. Particularly, we propose a novel learnable token eviction approach. Combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent tokens to adaptively retain a limited set of critical KV-pairs per head, maintaining linear attention's constant time and space complexity. Efficient Triton kernels for the sparse attention mechanisms are provided. Empirical evaluations on retrieval-intensive benchmarks support the effectiveness of our approaches.
comment: 19 pages, 5 figures
☆ A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text CIKM '25
Current methods for evaluating large language models (LLMs) typically focus on high-level tasks such as text generation, without targeting a particular AI application. This approach is not sufficient for evaluating LLMs for Responsible AI dimensions like fairness, since protected attributes that are highly relevant in one application may be less relevant in another. In this work, we construct a dataset that is driven by a real-world application (generate a plain-text product description, given a list of product features), parameterized by fairness attributes intersected with gendered adjectives and product categories, yielding a rich set of labeled prompts. We show how to use the data to identify quality, veracity, safety, and fairness gaps in LLMs, contributing a proposal for LLM evaluation paired with a concrete resource for the research community.
comment: 24 pages with 3 figures, to appear in Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)
☆ Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost NeurIPS 2025
Recent advancements in large reasoning models (LRMs) have introduced an intermediate "thinking" process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators for machine translation (MT) quality remains underexplored. We provides the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing LRMs require tailored evaluation materials, tend to "overthink" simpler instances and have issues with scoring mechanisms leading to overestimation. To address these, we propose to calibrate LRM thinking by training them on synthetic, human-like thinking trajectories. Our experiments on WMT24 Metrics benchmarks demonstrate that this approach largely reduces thinking budgets by ~35x while concurrently improving evaluation performance across different LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.
comment: NeurIPS 2025
☆ Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations
We present Empathic Prompting, a novel framework for multimodal human-AI interaction that enriches Large Language Model (LLM) conversations with implicit non-verbal context. The system integrates a commercial facial expression recognition service to capture users' emotional cues and embeds them as contextual signals during prompting. Unlike traditional multimodal interfaces, empathic prompting requires no explicit user control; instead, it unobtrusively augments textual input with affective information for conversational and smoothness alignment. The architecture is modular and scalable, allowing integration of additional non-verbal modules. We describe the system design, implemented through a locally deployed DeepSeek instance, and report a preliminary service and usability evaluation (N=5). Results show consistent integration of non-verbal input into coherent LLM outputs, with participants highlighting conversational fluidity. Beyond this proof of concept, empathic prompting points to applications in chatbot-mediated communication, particularly in domains like healthcare or education, where users' emotional signals are critical yet often opaque in verbal exchanges.
☆ Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems
We present a multi-agent, human-in-the-loop workflow that co-designs quantum codes with prescribed transversal diagonal gates. It builds on the Subset-Sum Linear Programming (SSLP) framework (arXiv:2504.20847), which partitions basis strings by modular residues and enforces $Z$-marginal Knill-Laflamme (KL) equalities via small LPs. The workflow is powered by GPT-5 and implemented within TeXRA (https://texra.ai)-a multi-agent research assistant platform that supports an iterative tool-use loop agent and a derivation-then-edit workflow reasoning agent. We work in a LaTeX-Python environment where agents reason, edit documents, execute code, and synchronize their work to Git/Overleaf. Within this workspace, three roles collaborate: a Synthesis Agent formulates the problem; a Search Agent sweeps/screens candidates and exactifies numerics into rationals; and an Audit Agent independently checks all KL equalities and the induced logical action. As a first step we focus on distance $d=2$ with nondegenerate residues. For code dimension $K\in\{2,3,4\}$ and $n\le6$ qubits, systematic sweeps yield certificate-backed tables cataloging attainable cyclic logical groups-all realized by new codes-e.g., for $K=3$ we obtain order $16$ at $n=6$. From verified instances, Synthesis Agent abstracts recurring structures into closed-form families and proves they satisfy the KL equalities for all parameters. It further demonstrates that SSLP accommodates residue degeneracy by exhibiting a new $((6,4,2))$ code implementing the transversal controlled-phase $diag(1,1,1,i)$. Overall, the workflow recasts diagonal-transversal feasibility as an analytical pipeline executed at scale, combining systematic enumeration with exact analytical reconstruction. It yields reproducible code constructions, supports targeted extensions to larger $K$ and higher distances, and leads toward data-driven classification.
comment: 29 pages, 2 figures
☆ Automated Extraction of Fluoropyrimidine Treatment and Treatment-Related Toxicities from Clinical Notes Using Natural Language Processing
Objective: Fluoropyrimidines are widely prescribed for colorectal and breast cancers, but are associated with toxicities such as hand-foot syndrome and cardiotoxicity. Since toxicity documentation is often embedded in clinical notes, we aimed to develop and evaluate natural language processing (NLP) methods to extract treatment and toxicity information. Materials and Methods: We constructed a gold-standard dataset of 236 clinical notes from 204,165 adult oncology patients. Domain experts annotated categories related to treatment regimens and toxicities. We developed rule-based, machine learning-based (Random Forest, Support Vector Machine [SVM], Logistic Regression [LR]), deep learning-based (BERT, ClinicalBERT), and large language models (LLM)-based NLP approaches (zero-shot and error-analysis prompting). Models used an 80:20 train-test split. Results: Sufficient data existed to train and evaluate 5 annotated categories. Error-analysis prompting achieved optimal precision, recall, and F1 scores (F1=1.000) for treatment and toxicities extraction, whereas zero-shot prompting reached F1=1.000 for treatment and F1=0.876 for toxicities extraction.LR and SVM ranked second for toxicities (F1=0.937). Deep learning underperformed, with BERT (F1=0.873 treatment; F1= 0.839 toxicities) and ClinicalBERT (F1=0.873 treatment; F1 = 0.886 toxicities). Rule-based methods served as our baseline with F1 scores of 0.857 in treatment and 0.858 in toxicities. Discussion: LMM-based approaches outperformed all others, followed by machine learning methods. Machine and deep learning approaches were limited by small training data and showed limited generalizability, particularly for rare categories. Conclusion: LLM-based NLP most effectively extracted fluoropyrimidine treatment and toxicity information from clinical notes, and has strong potential to support oncology research and pharmacovigilance.
☆ User Perceptions of Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios
Large language models (LLMs) have seen rapid adoption for tasks such as drafting emails, summarizing meetings, and answering health questions. In such uses, users may need to share private information (e.g., health records, contact details). To evaluate LLMs' ability to identify and redact such private information, prior work developed benchmarks (e.g., ConfAIde, PrivacyLens) with real-life scenarios. Using these benchmarks, researchers have found that LLMs sometimes fail to keep secrets private when responding to complex tasks (e.g., leaking employee salaries in meeting summaries). However, these evaluations rely on LLMs (proxy LLMs) to gauge compliance with privacy norms, overlooking real users' perceptions. Moreover, prior work primarily focused on the privacy-preservation quality of responses, without investigating nuanced differences in helpfulness. To understand how users perceive the privacy-preservation quality and helpfulness of LLM responses to privacy-sensitive scenarios, we conducted a user study with 94 participants using 90 scenarios from PrivacyLens. We found that, when evaluating identical responses to the same scenario, users showed low agreement with each other on the privacy-preservation quality and helpfulness of the LLM response. Further, we found high agreement among five proxy LLMs, while each individual LLM had low correlation with users' evaluations. These results indicate that the privacy and helpfulness of LLM responses are often specific to individuals, and proxy LLMs are poor estimates of how real users would perceive these responses in privacy-sensitive scenarios. Our results suggest the need to conduct user-centered studies on measuring LLMs' ability to help users while preserving privacy. Additionally, future research could investigate ways to improve the alignment between proxy LLMs and users for better estimation of users' perceived privacy and utility.
☆ Structure-Conditional Minimum Bayes Risk Decoding EMNLP 2025
Minimum Bayes Risk (MBR) decoding has seen renewed interest as an alternative to traditional generation strategies. While MBR has proven effective in machine translation, where the variability of a language model's outcome space is naturally constrained, it may face challenges in more open-ended tasks such as dialogue or instruction-following. We hypothesise that in such settings, applying MBR with standard similarity-based utility functions may result in selecting responses that are broadly representative of the model's distribution, yet sub-optimal with respect to any particular grouping of generations that share an underlying latent structure. In this work, we introduce three lightweight adaptations to the utility function, designed to make MBR more sensitive to structural variability in the outcome space. To test our hypothesis, we curate a dataset capturing three representative types of latent structure: dialogue act, emotion, and response structure (e.g., a sentence, a paragraph, or a list). We further propose two metrics to evaluate the structural optimality of MBR. Our analysis demonstrates that common similarity-based utility functions fall short by these metrics. In contrast, our proposed adaptations considerably improve structural optimality. Finally, we evaluate our approaches on real-world instruction-following benchmarks, AlpacaEval and MT-Bench, and show that increased structural sensitivity improves generation quality by up to 13.7 percentage points in win rate.
comment: EMNLP 2025 Camera-Ready
☆ Neural Diversity Regularizes Hallucinations in Small Models
Language models continue to hallucinate despite increases in parameters, compute, and data. We propose neural diversity -- decorrelated parallel representations -- as a principled mechanism that reduces hallucination rates at fixed parameter and data budgets. Inspired by portfolio theory, where uncorrelated assets reduce risk by $\sqrt{P}$, we prove hallucination probability is bounded by representational correlation: $P(H) \leq f(\sigma^2((1-\rho(P))/P + \rho(P)), \mu^2)$, which predicts that language models need an optimal amount of neurodiversity. To validate this, we introduce ND-LoRA (Neural Diversity Low-Rank Adaptation), combining parallel LoRA adapters with Barlow Twins regularization, and demonstrate that ND-LoRA reduces hallucinations by up to 25.6% (and 14.6% on average) without degrading general accuracy. Ablations show LoRA adapters and regularization act synergistically, causal interventions prove neurodiversity as the mediating factor and correlational analyses indicate scale: a 0.1% neural correlation increase is associated with a 3.8% hallucination increase. Finally, task-dependent optimality emerges: different tasks require different amounts of optimal neurodiversity. Together, our results highlight neural diversity as a third axis of scaling -- orthogonal to parameters and data -- to improve the reliability of language models at fixed budgets.
☆ Analyticup E-commerce Product Search Competition Technical Report from Team Tredence_AICOE
This study presents the multilingual e-commerce search system developed by the Tredence_AICOE team. The competition features two multilingual relevance tasks: Query-Category (QC) Relevance, which evaluates how well a user's search query aligns with a product category, and Query-Item (QI) Relevance, which measures the match between a multilingual search query and an individual product listing. To ensure full language coverage, we performed data augmentation by translating existing datasets into languages missing from the development set, enabling training across all target languages. We fine-tuned Gemma-3 12B and Qwen-2.5 14B model for both tasks using multiple strategies. The Gemma-3 12B (4-bit) model achieved the best QC performance using original and translated data, and the best QI performance using original, translated, and minority class data creation. These approaches secured 4th place on the final leaderboard, with an average F1-score of 0.8857 on the private test set.
☆ \textsc{CantoNLU}: A benchmark for Cantonese natural language understanding
Cantonese, although spoken by millions, remains under-resourced due to policy and diglossia. To address this scarcity of evaluation frameworks for Cantonese, we introduce \textsc{\textbf{CantoNLU}}, a benchmark for Cantonese natural language understanding (NLU). This novel benchmark spans seven tasks covering syntax and semantics, including word sense disambiguation, linguistic acceptability judgment, language detection, natural language inference, sentiment analysis, part-of-speech tagging, and dependency parsing. In addition to the benchmark, we provide model baseline performance across a set of models: a Mandarin model without Cantonese training, two Cantonese-adapted models obtained by continual pre-training a Mandarin model on Cantonese text, and a monolingual Cantonese model trained from scratch. Results show that Cantonese-adapted models perform best overall, while monolingual models perform better on syntactic tasks. Mandarin models remain competitive in certain settings, indicating that direct transfer may be sufficient when Cantonese domain data is scarce. We release all datasets, code, and model weights to facilitate future research in Cantonese NLP.
comment: 13 pages, 1 figure
☆ The Reasoning Lingua Franca: A Double-Edged Sword for Multilingual AI
Large Reasoning Models (LRMs) achieve strong performance on mathematical, scientific, and other question-answering tasks, but their multilingual reasoning abilities remain underexplored. When presented with non-English questions, LRMs often default to reasoning in English, raising concerns about interpretability and the handling of linguistic and cultural nuances. We systematically compare an LRM's reasoning in English versus the language of the question. Our evaluation spans two tasks: MGSM and GPQA Diamond. Beyond measuring answer accuracy, we also analyze cognitive attributes in the reasoning traces. We find that English reasoning traces exhibit a substantially higher presence of these cognitive behaviors, and that reasoning in English generally yields higher final-answer accuracy, with the performance gap increasing as tasks become more complex. However, this English-centric strategy is susceptible to a key failure mode - getting "Lost in Translation," where translation steps lead to errors that would have been avoided by question's language reasoning.
comment: 14 pages, 13 figures, 5 tables
☆ Why Did Apple Fall To The Ground: Evaluating Curiosity In Large Language Model
Curiosity serves as a pivotal conduit for human beings to discover and learn new knowledge. Recent advancements of large language models (LLMs) in natural language processing have sparked discussions regarding whether these models possess capability of curiosity-driven learning akin to humans. In this paper, starting from the human curiosity assessment questionnaire Five-Dimensional Curiosity scale Revised (5DCR), we design a comprehensive evaluation framework that covers dimensions such as Information Seeking, Thrill Seeking, and Social Curiosity to assess the extent of curiosity exhibited by LLMs. The results demonstrate that LLMs exhibit a stronger thirst for knowledge than humans but still tend to make conservative choices when faced with uncertain environments. We further investigated the relationship between curiosity and thinking of LLMs, confirming that curious behaviors can enhance the model's reasoning and active learning abilities. These findings suggest that LLMs have the potential to exhibit curiosity similar to that of humans, providing experimental support for the future development of learning capabilities and innovative research in LLMs.
☆ BUSTED at AraGenEval Shared Task: A Comparative Study of Transformer-Based Models for Arabic AI-Generated Text Detection
This paper details our submission to the Ara- GenEval Shared Task on Arabic AI-generated text detection, where our team, BUSTED, se- cured 5th place. We investigated the effec- tiveness of three pre-trained transformer mod- els: AraELECTRA, CAMeLBERT, and XLM- RoBERTa. Our approach involved fine-tuning each model on the provided dataset for a binary classification task. Our findings revealed a sur- prising result: the multilingual XLM-RoBERTa model achieved the highest performance with an F1 score of 0.7701, outperforming the spe- cialized Arabic models. This work underscores the complexities of AI-generated text detection and highlights the strong generalization capa- bilities of multilingual models.
☆ What Defines Good Reasoning in LLMs? Dissecting Reasoning Steps with Multi-Aspect Evaluation
Evaluating large language models (LLMs) on final-answer correctness is the dominant paradigm. This approach, however, provides a coarse signal for model improvement and overlooks the quality of the underlying reasoning process. We argue that a more granular evaluation of reasoning offers a more effective path to building robust models. We decompose reasoning quality into two dimensions: relevance and coherence. Relevance measures if a step is grounded in the problem; coherence measures if it follows logically from prior steps. To measure these aspects reliably, we introduce causal stepwise evaluation (CaSE). This method assesses each reasoning step using only its preceding context, which avoids hindsight bias. We validate CaSE against human judgments on our new expert-annotated benchmarks, MRa-GSM8K and MRa-MATH. More importantly, we show that curating training data with CaSE-evaluated relevance and coherence directly improves final task performance. Our work provides a scalable framework for analyzing, debugging, and improving LLM reasoning, demonstrating the practical value of moving beyond validity checks.
☆ Can ChatGPT Code Communication Data Fairly?: Empirical Evidence from Multiple Collaborative Tasks
Assessing communication and collaboration at scale depends on a labor intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology exhibits bias against different demographic groups, such as gender and race, remains unclear. To fill this gap, this paper investigates ChatGPT-based automated coding of communication data using a typical coding framework for collaborative problem solving, examining differences across gender and racial groups. The analysis draws on data from three types of collaborative tasks: negotiation, problem solving, and decision making. Our results show that ChatGPT-based coding exhibits no significant bias across gender and racial groups, paving the road for its adoption in large-scale assessment of collaboration and communication.
comment: 38 pages, 4 figures
☆ Beyond Retrieval-Ranking: A Multi-Agent Cognitive Decision Framework for E-Commerce Search
The retrieval-ranking paradigm has long dominated e-commerce search, but its reliance on query-item matching fundamentally misaligns with multi-stage cognitive decision processes of platform users. This misalignment introduces critical limitations: semantic gaps in complex queries, high decision costs due to cross-platform information foraging, and the absence of professional shopping guidance. To address these issues, we propose a Multi-Agent Cognitive Decision Framework (MACDF), which shifts the paradigm from passive retrieval to proactive decision support. Extensive offline evaluations demonstrate MACDF's significant improvements in recommendation accuracy and user satisfaction, particularly for complex queries involving negation, multi-constraint, or reasoning demands. Online A/B testing on JD search platform confirms its practical efficacy. This work highlights the transformative potential of multi-agent cognitive systems in redefining e-commerce search.
☆ GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning
Reinforcement learning has recently shown promise in improving retrieval-augmented generation (RAG). Despite these advances, its effectiveness in multi-hop question answering (QA) remains limited by two fundamental limitations: (i) global planning absence to structure multi-step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce Planning Quality Reward and SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.
comment: 8 pages, 3 figures, 4 tables
☆ The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts
When language models correctly parse "The cat that the dog chased meowed," are they analyzing syntax or simply familiar with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences (like "The cat [that the dog chased] meowed") where relative clauses nest recursively, creating processing demands from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with models showing median gaps up to 26.8 percentage points, quantifying when they abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models whose plausibility advantage systematically widens with complexity, humans shows variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.
☆ ARC-Encoder: learning compressed text representations for large language models
Recent techniques such as retrieval-augmented generation or chain-of-thought reasoning have led to longer contexts and increased inference costs. Context compression techniques can reduce these costs, but the most effective approaches require fine-tuning the target model or even modifying its architecture. This can degrade its general abilities when not used for this specific purpose. Here we explore an alternative approach: an encoder that compresses the context into continuous representations which replace token embeddings in decoder LLMs. First, we perform a systematic study of training strategies and architecture choices for the encoder. Our findings led to the design of an Adaptable text Representations Compressor, named ARC-Encoder, which outputs $x$-times fewer continuous representations (typically $x\!\in\!\{4,8\}$) than text tokens. We evaluate ARC-Encoder across a variety of LLM usage scenarios, ranging from in-context learning to context window extension, on both instruct and base decoders. Results show that ARC-Encoder achieves state-of-the-art performance on several benchmarks while improving computational efficiency at inference. Finally, we demonstrate that our models can be adapted to multiple decoders simultaneously, allowing a single encoder to generalize across different decoder LLMs. This makes ARC-Encoder a flexible and efficient solution for portable encoders that work seamlessly with multiple LLMs. We release a training code at https://github.com/kyutai-labs/ARC-Encoder , fine-tuning dataset and pretrained models are available at https://huggingface.co/collections/kyutai/arc-encoders-68ee18787301407d60a57047 .
☆ Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment ICASSP 2026
Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition are costly, limited, or incomplete. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a framework that converts human preference for speech expressiveness into an objective score. Grounded in phonetics and psychology, DeEAR evaluates speech across three dimensions: Emotion, Prosody, and Spontaneity, achieving strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples. Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. It not only distinguishes expressiveness gaps across S2S models but also selects 14K expressive utterances to form ExpressiveSpeech, which improves the expressive score (from 2.0 to 23.4 on a 100-point scale) of S2S models. Demos and codes are available at https://github.com/FreedomIntelligence/ExpressiveSpeech
comment: Submitted to ICASSP 2026. Demos and codes are available at https://github.com/FreedomIntelligence/ExpressiveSpeech
☆ Assessing the Political Fairness of Multilingual LLMs: A Case Study based on a 21-way Multiparallel EuroParl Dataset
The political biases of Large Language Models (LLMs) are usually assessed by simulating their answers to English surveys. In this work, we propose an alternative framing of political biases, relying on principles of fairness in multilingual translation. We systematically compare the translation quality of speeches in the European Parliament (EP), observing systematic differences with majority parties from left, center, and right being better translated than outsider parties. This study is made possible by a new, 21-way multiparallel version of EuroParl, the parliamentary proceedings of the EP, which includes the political affiliations of each speaker. The dataset consists of 1.5M sentences for a total of 40M words and 249M characters. It covers three years, 1000+ speakers, 7 countries, 12 EU parties, 25 EU committees, and hundreds of national parties.
☆ Hierarchical Sequence Iteration for Heterogeneous Question Answering
Retrieval-augmented generation (RAG) remains brittle on multi-step questions and heterogeneous evidence sources, trading accuracy against latency and token/tool budgets. This paper introducesHierarchical Sequence (HSEQ) Iteration for Heterogeneous Question Answering, a unified framework that (i) linearize documents, tables, and knowledge graphs into a reversible hierarchical sequence with lightweight structural tags, and (ii) perform structure-aware iteration to collect just-enough evidence before answer synthesis. A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); Finally the head agent composes canonicalized evidence to genearte the final answer, with an optional refinement loop to resolve detected contradictions. Experiments on HotpotQA (text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1 gains over strong single-pass, multi-hop, and agentic RAG baselines with high efficiency. Besides, HSEQ exhibits three key advantages: (1) a format-agnostic unification that enables a single policy to operate across text, tables, and KGs without per-dataset specialization; (2) guided, budget-aware iteration that reduces unnecessary hops, tool calls, and tokens while preserving accuracy; and (3) evidence canonicalization for reliable QA, improving answers consistency and auditability.
comment: 22 pages, 3 figures
☆ Robust Preference Alignment via Directional Neighborhood Consensus ICLR 2026
Aligning large language models with human preferences is critical for creating reliable and controllable AI systems. A human preference can be visualized as a high-dimensional vector where different directions represent trade-offs between desired attributes (e.g., helpfulness vs. verbosity). Yet, because the training data often reflects dominant, average preferences, LLMs tend to perform well on common requests but fall short in specific, individual needs. This mismatch creates a preference coverage gap. Existing methods often address this through costly retraining, which may not be generalized to the full spectrum of diverse preferences. This brittleness means that when a user's request reflects a nuanced preference deviating from the training data's central tendency, model performance can degrade unpredictably. To address this challenge, we introduce Robust Preference Selection (RPS), a post-hoc, training-free method by leveraging directional neighborhood consensus. Instead of forcing a model to generate a response from a single, highly specific preference, RPS samples multiple responses from a local neighborhood of related preferences to create a superior candidate pool. It then selects the response that best aligns with the user's original intent. We provide a theoretical framework showing our neighborhood generation strategy is provably superior to a strong baseline that also samples multiple candidates. Comprehensive experiments across three distinct alignment paradigms (DPA, DPO, and SFT) demonstrate that RPS consistently improves robustness against this baseline, achieving win rates of up to 69% on challenging preferences from under-represented regions of the space without any model retraining. Our work presents a practical, theoretically-grounded solution for enhancing the reliability of preference-aligned models.
comment: Under review at ICLR 2026. 10 pages, 5 figures. Code and data available at https://github.com/rcmao/robust-preference-alignment
☆ Steering Evaluation-Aware Language Models To Act Like They Are Deployed
Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents with factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. However, this gap can only be observed by removing the evaluation cue. We find that activation steering can suppress evaluation awareness and make the model act like it is deployed even when the cue is present. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.
☆ RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging
We unveil that internal representations in large language models (LLMs) serve as reliable proxies of learned knowledge, and propose RECALL, a novel representation-aware model merging framework for continual learning without access to historical data. RECALL computes inter-model similarity from layer-wise hidden representations over clustered typical samples, and performs adaptive, hierarchical parameter fusion to align knowledge across models. This design enables the preservation of domain-general features in shallow layers while allowing task-specific adaptation in deeper layers. Unlike prior methods that require task labels or incur performance trade-offs, RECALL achieves seamless multi-domain integration and strong resistance to catastrophic forgetting. Extensive experiments across five NLP tasks and multiple continual learning scenarios show that RECALL outperforms baselines in both knowledge retention and generalization, providing a scalable and data-free solution for evolving LLMs.
☆ Mask and You Shall Receive: Optimizing Masked Language Modeling For Pretraining BabyLMs
We describe our strategy for the 2025 edition of the BabyLM Challenge. Our main contribution is that of an improved form of Masked Language Modeling (MLM), which adapts the probabilities of the tokens masked according to the model's ability to predict them. The results show a substantial increase in performance on (Super)GLUE tasks over the standard MLM. We also incorporate sub-token embeddings, finding that this increases the model's morphological generalization capabilities. Our submission beats the baseline in the strict-small track.
comment: Submission to the 2025 BabyLM Challenge
☆ Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models
Large language models (LLMs) produce outputs with varying levels of uncertainty, and, just as often, varying levels of correctness; making their practical reliability far from guaranteed. To quantify this uncertainty, we systematically evaluate four approaches for confidence estimation in LLM outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). For the evaluation of the approaches, we conduct experiments on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best reliability overall, improving both calibration and discrimination of correct answers. We discuss the trade-offs of each method and provide recommendations for selecting uncertainty measures in LLM applications.
☆ LM-mixup: Text Data Augmentation via Language Model based Mixup
Instruction tuning is crucial for aligning Large Language Models (LLMs), yet the quality of instruction-following data varies significantly. While high-quality data is paramount, it is often scarce; conversely, abundant low-quality data is frequently discarded, leading to substantial information loss. Existing data augmentation methods struggle to augment this low-quality data effectively, and the evaluation of such techniques remains poorly defined. To address this, we formally define the task of Instruction Distillation: distilling multiple low-quality and redundant inputs into high-quality and coherent instruction-output pairs. Specifically, we introduce a comprehensive data construction pipeline to create MIXTURE, a 144K-sample dataset pairing low-quality or semantically redundant imperfect instruction clusters with their high-quality distillations. We then introduce LM-Mixup, by first performing supervised fine-tuning on MIXTURE and then optimizing it with reinforcement learning. This process uses three complementary reward signals: quality, semantic alignment, and format compliance, via Group Relative Policy Optimization (GRPO). We demonstrate that LM-Mixup effectively augments imperfect datasets: fine-tuning LLMs on its distilled data, which accounts for only about 3% of the entire dataset, not only surpasses full-dataset training but also competes with state-of-the-art high-quality data selection methods across multiple benchmarks. Our work establishes that low-quality data is a valuable resource when properly distilled and augmented with LM-Mixup, significantly enhancing the efficiency and performance of instruction-tuned LLMs.
☆ Teacher Demonstrations in a BabyLM's Zone of Proximal Development for Contingent Multi-Turn Interaction EMNLP 2025
Multi-turn dialogues between a child and a caregiver are characterized by a property called contingency - that is, prompt, direct, and meaningful exchanges between interlocutors. We introduce ContingentChat, a teacher-student framework that benchmarks and improves multi-turn contingency in a BabyLM trained on 100M words. Using a novel alignment dataset for post-training, BabyLM generates responses that are more grammatical and cohesive. Experiments with adaptive teacher decoding strategies show limited additional gains. ContingentChat demonstrates the benefits of targeted post-training for dialogue quality and indicates that contingency remains a challenging goal for BabyLMs.
comment: Outstanding Paper Award, EMNLP 2025 BabyLM Workshop - Oral presentation, Suzhou, China
☆ Relative-Based Scaling Law for Neural Language Models
Scaling laws aim to accurately predict model performance across different scales. Existing scaling-law studies almost exclusively rely on cross-entropy as the evaluation metric. However, cross-entropy provides only a partial view of performance: it measures the absolute probability assigned to the correct token, but ignores the relative ordering between correct and incorrect tokens. Yet, relative ordering is crucial for language models, such as in greedy-sampling scenario. To address this limitation, we investigate scaling from the perspective of relative ordering. We first propose the Relative-Based Probability (RBP) metric, which quantifies the probability that the correct token is ranked among the top predictions. Building on this metric, we establish the Relative-Based Scaling Law, which characterizes how RBP improves with increasing model size. Through extensive experiments on four datasets and four model families spanning five orders of magnitude, we demonstrate the robustness and accuracy of this law. Finally, we illustrate the broad application of this law with two examples, namely providing a deeper explanation of emergence phenomena and facilitating finding fundamental theories of scaling laws. In summary, the Relative-Based Scaling Law complements the cross-entropy perspective and contributes to a more complete understanding of scaling large language models. Thus, it offers valuable insights for both practical development and theoretical exploration.
☆ NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew
Since their initial release, BERT models have demonstrated exceptional performance on a variety of tasks, despite their relatively small size (BERT-base has ~100M parameters). Nevertheless, the architectural choices used in these models are outdated compared to newer transformer-based models such as Llama3 and Qwen3. In recent months, several architectures have been proposed to close this gap. ModernBERT and NeoBERT both show strong improvements on English benchmarks and significantly extend the supported context window. Following their successes, we introduce NeoDictaBERT and NeoDictaBERT-bilingual: BERT-style models trained using the same architecture as NeoBERT, with a dedicated focus on Hebrew texts. These models outperform existing ones on almost all Hebrew benchmarks and provide a strong foundation for downstream tasks. Notably, the NeoDictaBERT-bilingual model shows strong results on retrieval tasks, outperforming other multilingual models of similar size. In this paper, we describe the training process and report results across various benchmarks. We release the models to the community as part of our goal to advance research and development in Hebrew NLP.
☆ VLSP 2025 MLQA-TSR Challenge: Vietnamese Multimodal Legal Question Answering on Traffic Sign Regulation SP 2025
This paper presents the VLSP 2025 MLQA-TSR - the multimodal legal question answering on traffic sign regulation shared task at VLSP 2025. VLSP 2025 MLQA-TSR comprises two subtasks: multimodal legal retrieval and multimodal question answering. The goal is to advance research on Vietnamese multimodal legal text processing and to provide a benchmark dataset for building and evaluating intelligent systems in multimodal legal domains, with a focus on traffic sign regulation in Vietnam. The best-reported results on VLSP 2025 MLQA-TSR are an F2 score of 64.55% for multimodal legal retrieval and an accuracy of 86.30% for multimodal question answering.
comment: VLSP 2025 MLQA-TSR Share Task
☆ IKnow: Instruction-Knowledge-Aware Continual Pretraining for Effective Domain Adaptation
Continual pretraining promises to adapt large language models (LLMs) to new domains using only unlabeled test-time data, but naively applying standard self-supervised objectives to instruction-tuned models is known to degrade their instruction-following capability and semantic representations. Existing fixes assume access to the original base model or rely on knowledge from an external domain-specific database - both of which pose a realistic barrier in settings where the base model weights are withheld for safety reasons or reliable external corpora are unavailable. In this work, we propose Instruction-Knowledge-Aware Continual Adaptation (IKnow), a simple and general framework that formulates novel self-supervised objectives in the instruction-response dialogue format. Rather than depend- ing on external resources, IKnow leverages domain knowledge embedded within the text itself and learns to encode it at a deeper semantic level.
☆ The Impact of Negated Text on Hallucination with Large Language Models EMNLP 2025
Recent studies on hallucination in large language models (LLMs) have been actively progressing in natural language processing. However, the impact of negated text on hallucination with LLMs remains largely unexplored. In this paper, we set three important yet unanswered research questions and aim to address them. To derive the answers, we investigate whether LLMs can recognize contextual shifts caused by negation and still reliably distinguish hallucinations comparable to affirmative cases. We also design the NegHalu dataset by reconstructing existing hallucination detection datasets with negated expressions. Our experiments demonstrate that LLMs struggle to detect hallucinations in negated text effectively, often producing logically inconsistent or unfaithful judgments. Moreover, we trace the internal state of LLMs as they process negated inputs at the token level and reveal the challenges of mitigating their unintended effects.
comment: Accepted to the EMNLP 2025
☆ Dialogue Is Not Enough to Make a Communicative BabyLM (But Neither Is Developmentally Inspired Reinforcement Learning)
We investigate whether pre-training exclusively on dialogue data results in formally and functionally apt small language models. Based on this pre-trained llamalogue model, we employ a variety of fine-tuning strategies to enforce "more communicative" text generations by our models. Although our models underperform on most standard BabyLM benchmarks, they excel at dialogue continuation prediction in a minimal pair setting. While PPO fine-tuning has mixed to adversarial effects on our models, DPO fine-tuning further improves their performance on our custom dialogue benchmark.
☆ FreeChunker: A Cross-Granularity Chunking Framework
Chunking strategies significantly impact the effectiveness of Retrieval-Augmented Generation (RAG) systems. Existing methods operate within fixed-granularity paradigms that rely on static boundary identification, limiting their adaptability to diverse query requirements. This paper presents FreeChunker, a Cross-Granularity Encoding Framework that fundamentally transforms the traditional chunking paradigm: the framework treats sentences as atomic units and shifts from static chunk segmentation to flexible retrieval supporting arbitrary sentence combinations. This paradigm shift not only significantly reduces the computational overhead required for semantic boundary detection but also enhances adaptability to complex queries. Experimental evaluation on LongBench V2 demonstrates that FreeChunker achieves superior retrieval performance compared to traditional chunking methods, while significantly outperforming existing approaches in computational efficiency.
comment: Submitted to arXiv, October 2025
☆ Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models
Large Language Models (LLMs) are increasingly evaluated on their ability to reason over structured data, yet such assessments often overlook a crucial confound: dataset contamination. In this work, we investigate whether LLMs exhibit prior knowledge of widely used tabular benchmarks such as Adult Income, Titanic, and others. Through a series of controlled probing experiments, we reveal that contamination effects emerge exclusively for datasets containing strong semantic cues-for instance, meaningful column names or interpretable value categories. In contrast, when such cues are removed or randomized, performance sharply declines to near-random levels. These findings suggest that LLMs' apparent competence on tabular reasoning tasks may, in part, reflect memorization of publicly available datasets rather than genuine generalization. We discuss implications for evaluation protocols and propose strategies to disentangle semantic leakage from authentic reasoning ability in future LLM assessments.
☆ Teaching Language Models to Reason with Tools NIPS2025
Large reasoning models (LRMs) like OpenAI-o1 have shown impressive capabilities in natural language reasoning. However, these models frequently demonstrate inefficiencies or inaccuracies when tackling complex mathematical operations. While integrating computational tools such as Code Interpreters (CIs) offers a promising solution, it introduces a critical challenge: a conflict between the model's internal, probabilistic reasoning and the external, deterministic knowledge provided by the CI, which often leads models to unproductive deliberation. To overcome this, we introduce CoRT (Code-Optimized Reasoning Training), a post-training framework designed to teach LRMs to effectively utilize CIs. We propose \emph{Hint-Engineering}, a new data synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths. This approach generates high-quality, code-integrated reasoning data specifically tailored to optimize LRM-CI interaction. Using this method, we have synthesized 30 high-quality samples to post-train models ranging from 1.5B to 32B parameters through supervised fine-tuning. CoRT further refines the multi-round interleaving of external CI usage and internal thinking by employing rejection sampling and reinforcement learning. Our experimental evaluations demonstrate CoRT's effectiveness, yielding absolute improvements of 4\% and 8\% on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging mathematical reasoning datasets. Moreover, CoRT significantly enhances efficiency, reducing token usage by approximately 30\% for the 32B model and 50\% for the 1.5B model compared to pure natural language reasoning baselines. The models and code are available at: https://github.com/ChengpengLi1003/CoRT.
comment: NIPS2025 Accepted
☆ Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering
Process reward models (PRMs) improve complex reasoning in large language models (LLMs) by grading candidate solutions step-by-step and selecting answers via aggregated step scores. While effective in domains such as mathematics, their applicability to tasks involving semi-structured data, like table question answering (TQA) remains unexplored. TQA poses unique challenges for PRMs, including abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning. This work presents the first systematic study of PRMs for TQA. We evaluate state-of-the-art generative PRMs on TQA from both answer and step perspectives. Results show that PRMs that combine textual and code verification can aid solution selection but struggle to generalize to out-of-domain data. Analysis reveals a weak correlation between performance in step-level verification and answer accuracy, possibly stemming from weak step dependencies and loose causal links. Our findings highlight limitations of current PRMs on TQA and offer valuable insights for building more robust, process-aware verifiers.
☆ Citation Failure: Definition, Analysis and Efficient Mitigation
Citations from LLM-based RAG systems are supposed to simplify response verification. However, this does not hold for citation failure, when a model generates a helpful response, but fails to cite complete evidence. In contrast to previous work, we propose to disentangle this from response failure, where the response itself is flawed, and citing complete evidence is impossible. To address citation failure, this work follows a two-step approach: (1) We study when citation failure occurs and (2) how it can be mitigated. For step 1, we extend prior work by investigating how the relation between response and evidence affects citation quality. We introduce CITECONTROL, a benchmark that systematically varies this relation to analyze failure modes. Experiments show that failures increase with relational complexity and suggest that combining citation methods could improve performance, motivating step 2. To improve LLM citation efficiently, we propose CITENTION, a framework integrating generative, attention-based, and retrieval-based methods. Results demonstrate substantial citation improvements on CITECONTROL and in transfer settings. We make our data and code publicly available.
comment: Under review. Paper repository: https://github.com/UKPLab/arxiv2025-citation-failure
☆ Context-level Language Modeling by Learning Predictive Context Embeddings
Next-token prediction (NTP) is the cornerstone of modern large language models (LLMs) pretraining, driving their unprecedented capabilities in text generation, reasoning, and instruction following. However, the token-level prediction limits the model's capacity to capture higher-level semantic structures and long-range contextual relationships. To overcome this limitation, we introduce \textbf{ContextLM}, a framework that augments standard pretraining with an inherent \textbf{next-context prediction} objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Crucially, ContextLM achieves this enhancement while remaining fully compatible with the standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity). Extensive experiments on the GPT2 and Pythia model families, scaled up to $1.5$B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance. Our analysis indicates that next-context prediction provides a scalable and efficient pathway to stronger language modeling, yielding better long-range coherence and more effective attention allocation with minimal computational overhead.
comment: 16pages,6 figures
☆ ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases
The tendency to find and exploit "shortcuts" to complete tasks poses significant risks for reliable assessment and deployment of large language models (LLMs). For example, an LLM agent with access to unit tests may delete failing tests rather than fix the underlying bug. Such behavior undermines both the validity of benchmark results and the reliability of real-world LLM coding assistant deployments. To quantify, study, and mitigate such behavior, we introduce ImpossibleBench, a benchmark framework that systematically measures LLM agents' propensity to exploit test cases. ImpossibleBench creates "impossible" variants of tasks from existing benchmarks like LiveCodeBench and SWE-bench by introducing direct conflicts between the natural-language specification and the unit tests. We measure an agent's "cheating rate" as its pass rate on these impossible tasks, where any pass necessarily implies a specification-violating shortcut. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool. We demonstrate its utility for: (1) studying model behaviors, revealing more fine-grained details of cheating behaviors from simple test modification to complex operator overloading; (2) context engineering, showing how prompt, test access and feedback loop affect cheating rates; and (3) developing monitoring tools, providing a testbed with verified deceptive solutions. We hope ImpossibleBench serves as a useful framework for building more robust and reliable LLM systems. Our implementation can be found at https://github.com/safety-research/impossiblebench.
☆ Calibrating Multimodal Consensus for Emotion Recognition
In recent years, Multimodal Emotion Recognition (MER) has made substantial progress. Nevertheless, most existing approaches neglect the semantic inconsistencies that may arise across modalities, such as conflicting emotional cues between text and visual inputs. Besides, current methods are often dominated by the text modality due to its strong representational capacity, which can compromise recognition accuracy. To address these challenges, we propose a model termed Calibrated Multimodal Consensus (CMC). CMC introduces a Pseudo Label Generation Module (PLGM) to produce pseudo unimodal labels, enabling unimodal pretraining in a self-supervised fashion. It then employs a Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for multimodal finetuning, thereby mitigating text dominance and guiding the fusion process toward a more reliable consensus. Experimental results demonstrate that CMC achieves performance on par with or superior to state-of-the-art methods across four datasets, CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI, and exhibits notable advantages in scenarios with semantic inconsistencies on CH-SIMS and CH-SIMS v2. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CMC.
☆ Tri-Modal Severity Fused Diagnosis across Depression and Post-traumatic Stress Disorders
Depression and post traumatic stress disorder (PTSD) often co-occur with connected symptoms, complicating automated assessment, which is often binary and disorder specific. Clinically useful diagnosis needs severity aware cross disorder estimates and decision support explanations. Our unified tri modal affective severity framework synchronizes and fuses interview text with sentence level transformer embeddings, audio with log Mel statistics with deltas, and facial signals with action units, gaze, head and pose descriptors to output graded severities for diagnosing both depression (PHQ-8; 5 classes) and PTSD (3 classes). Standardized features are fused via a calibrated late fusion classifier, yielding per disorder probabilities and feature-level attributions. This severity aware tri-modal affective fusion approach is demoed on multi disorder concurrent depression and PTSD assessment. Stratified cross validation on DAIC derived corpora outperforms unimodal/ablation baselines. The fused model matches the strongest unimodal baseline on accuracy and weighted F1, while improving decision curve utility and robustness under noisy or missing modalities. For PTSD specifically, fusion reduces regression error and improves class concordance. Errors cluster between adjacent severities; extreme classes are identified reliably. Ablations show text contributes most to depression severity, audio and facial cues are critical for PTSD, whereas attributions align with linguistic and behavioral markers. Our approach offers reproducible evaluation and clinician in the loop support for affective clinical decision making.
☆ Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context
Large Vision-Language Models (LVLMs) have made significant progress in recent years but are also prone to hallucination issues. They exhibit more hallucinations in longer, free-form responses, often attributed to accumulated uncertainties. In this paper, we ask: Does increased hallucination result solely from length-induced errors, or is there a deeper underlying mechanism? After a series of preliminary experiments and findings, we suggest that the risk of hallucinations is not caused by length itself but by the increased reliance on context for coherence and completeness in longer responses. Building on these insights, we propose a novel "induce-detect-suppress" framework that actively induces hallucinations through deliberately designed contexts, leverages induced instances for early detection of high-risk cases, and ultimately suppresses potential object-level hallucinations during actual decoding. Our approach achieves consistent, significant improvements across all benchmarks, demonstrating its efficacy. The strong detection and improved hallucination mitigation not only validate our framework but, more importantly, re-validate our hypothesis on context. Rather than solely pursuing performance gains, this study aims to provide new insights and serves as a first step toward a deeper exploration of hallucinations in LVLMs' longer responses.
☆ Decoding-Free Sampling Strategies for LLM Marginalization
Modern language models operate on subword-tokenized text in order to make a trade-off between model size, inference speed, and vocabulary coverage. A side effect of this is that, during inference, models are evaluated by measuring the probability of only the specific tokenization produced as the output, despite there being many possible ways to represent the same text with a subword vocabulary. Recent studies have argued instead for evaluating LLMs by marginalization - the probability mass of all tokenizations of a given text. Marginalization is difficult due to the number of possible tokenizations of a text, so often approximate marginalization is done via sampling. However, a downside of sampling is that an expensive generation step must be performed by the LLM for each sample, which limits the number of samples that can be acquired given a runtime budget, and therefore also the accuracy of the approximation. Since computing the probability of a sequence given the tokenization is relatively cheap compared to actually generating it, we investigate sampling strategies that are decoding-free - they require no generation from the LLM, instead relying entirely on extremely cheap sampling strategies that are model and tokenizer agnostic. We investigate the approximation quality and speed of decoding-free sampling strategies for a number of open models to find that they provide sufficiently accurate marginal estimates at a small fraction of the runtime cost and demonstrate its use on a set of downstream inference tasks.
comment: 10 pages, 3 figures
☆ Stuck in the Matrix: Probing Spatial Reasoning in Large Language Models
This paper explores the spatial reasoning capability of large language models (LLMs) over textual input through a suite of five tasks aimed at probing their spatial understanding and computational abilities. The models were tested on both fundamental spatial reasoning and multi-step problem-solving within structured grid-based environments using tasks such as quadrant identification, geometric transformations, distance evaluation, word searches, and tile sliding. Each task was scaled in complexity through increasing grid dimensions, requiring models to extend beyond simple pattern recognition into abstract spatial reasoning. Our results reveal that while LLMs demonstrate moderate success in all tasks with small complexity and size, performance drops off rapidly as scale increases, with an average loss in accuracy of 42.7%, and reaching as high as 84%. Every test that began with over 50% accuracy showed a loss of at least 48%, illustrating the consistent nature of the deterioration. Furthermore, their struggles with scaling complexity hint at a lack of robust spatial representations in their underlying architectures. This paper underscores the gap between linguistic and spatial reasoning in LLMs, offering insights into their current limitations, and laying the groundwork for future integrative benchmarks at the intersection of language and geometry.
comment: 20 pages, 24 figures
☆ Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advancements in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval methods, fusion techniques, and answer generation strategies, and analyze benchmark datasets, evaluation protocols, and performance tradeoffs. Furthermore, we highlight key challenges such as cross-modal alignment, latency-accuracy tradeoffs, and semantic grounding, and outline open problems and future research directions for building more robust and context-aware QA systems leveraging multimedia data.
comment: In Proceedings of the 2nd ACM Workshop in AI-powered Question and Answering Systems (AIQAM '25), October 27-28, 2025, Dublin, Ireland. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3746274.3760393
☆ Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values
We propose Reinforcement Learning with Explicit Human Values (RLEV), a method that aligns Large Language Model (LLM) optimization directly with quantifiable human value signals. While Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains models in objective domains using binary correctness rewards, it overlooks that not all tasks are equally significant. RLEV extends this framework by incorporating human-defined value signals directly into the reward function. Using exam-style data with explicit ground-truth value labels, RLEV consistently outperforms correctness-only baselines across multiple RL algorithms and model scales. Crucially, RLEV policies not only improve value-weighted accuracy but also learn a value-sensitive termination policy: concise for low-value prompts, thorough for high-value ones. We demonstrate this behavior stems from value-weighted gradient amplification on end-of-sequence tokens. Ablation studies confirm the gain is causally linked to value alignment. RLEV remains robust under noisy value signals, such as difficulty-based labels, demonstrating that optimizing for an explicit utility function offers a practical path to aligning LLMs with human priorities.
comment: 15 pages, 4 figures
☆ Mixture-of-Minds: Multi-Agent Reinforcement Learning for Table Understanding
Understanding and reasoning over tables is a critical capability for many real-world applications. Large language models (LLMs) have shown promise on this task, but current approaches remain limited. Fine-tuning based methods strengthen language reasoning; yet they are prone to arithmetic errors and hallucination. In contrast, tool-based methods enable precise table manipulation but rely on rigid schemas and lack semantic understanding. These complementary drawbacks highlight the need for approaches that integrate robust reasoning with reliable table processing. In this work, we propose Mixture-of-Minds, a multi-agent framework that decomposes table reasoning into three specialized roles: planning, coding, and answering. This design enables each agent to focus on a specific aspect of the task while leveraging code execution for precise table manipulation. Building on this workflow, we introduce a self-improvement training framework that employs Monte Carlo Tree Search (MCTS) rollouts to generate pseudo-gold trajectories and optimize agents with reinforcement learning (RL). Extensive experiments show that Mixture-of-Minds delivers substantial gains, reaching 62.13% on TableBench and surpassing OpenAI-o4-mini-high. These results demonstrate the promise of combining structured multi-agent workflows with RL to advance table understanding.
comment: 18 pages, 4 figures
☆ DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking
Current search agents fundamentally lack the ability to simultaneously perform \textit{deep} reasoning over multi-hop retrieval and \textit{wide}-scale information collection-a critical deficiency for real-world applications like comprehensive market analysis and business development. To bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate agents to integrate depth and width in information seeking. In DeepWideSearch, agents must process a large volume of data, each requiring deep reasoning over multi-hop retrieval paths. Specifically, we propose two methods to converse established datasets, resulting in a curated collection of 220 questions spanning 15 diverse domains. Extensive experiments demonstrate that even state-of-the-art agents achieve only 2.39% average success rate on DeepWideSearch, highlighting the substantial challenge of integrating depth and width search in information-seeking tasks. Furthermore, our error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow-exposing key limitations in current agent architectures. We publicly release DeepWideSearch to catalyze future research on more capable and robust information-seeking agents.
☆ Are Stereotypes Leading LLMs' Zero-Shot Stance Detection ? EMNLP 2025
Large Language Models inherit stereotypes from their pretraining data, leading to biased behavior toward certain social groups in many Natural Language Processing tasks, such as hateful speech detection or sentiment analysis. Surprisingly, the evaluation of this kind of bias in stance detection methods has been largely overlooked by the community. Stance Detection involves labeling a statement as being against, in favor, or neutral towards a specific target and is among the most sensitive NLP tasks, as it often relates to political leanings. In this paper, we focus on the bias of Large Language Models when performing stance detection in a zero-shot setting. We automatically annotate posts in pre-existing stance detection datasets with two attributes: dialect or vernacular of a specific group and text complexity/readability, to investigate whether these attributes influence the model's stance detection decisions. Our results show that LLMs exhibit significant stereotypes in stance detection tasks, such as incorrectly associating pro-marijuana views with low text complexity and African American dialect with opposition to Donald Trump.
comment: Accepted in EMNLP 2025 (Main)
☆ BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation
As structured texts become increasingly complex across diverse domains -- from technical reports to generative AI prompts -- the need for text segmentation into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating complete contents for each segment, it generates only a sequence of starting tokens and reconstructs the complete contents by locating these tokens within the original texts, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model for the output format, BoundRL~performs reinforcement learning with verifiable rewards (RLVR) with a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of generated sequences of segments to create stepping stones toward higher-quality solutions. To demonstrate BoundRL's effectiveness on particularly challenging structured texts, we focus evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
☆ AI PB: A Grounded Generative Agent for Personalized Investment Insights
We present AI PB, a production-scale generative agent deployed in real retail finance. Unlike reactive chatbots that answer queries passively, AI PB proactively generates grounded, compliant, and user-specific investment insights. It integrates (i) a component-based orchestration layer that deterministically routes between internal and external LLMs based on data sensitivity, (ii) a hybrid retrieval pipeline using OpenSearch and the finance-domain embedding model, and (iii) a multi-stage recommendation mechanism combining rule heuristics, sequential behavioral modeling, and contextual bandits. Operating fully on-premises under Korean financial regulations, the system employs Docker Swarm and vLLM across 24 X NVIDIA H100 GPUs. Through human QA and system metrics, we demonstrate that grounded generation with explicit routing and layered safety can deliver trustworthy AI insights in high-stakes finance.
comment: Under Review
☆ Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning
Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals(both embedding and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. The cases are then handled by a low-computational entity linker (e.g. ReFinED) and more expensive targeted LLM-based reasoning respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being as twice as efficient in terms of the number of LLM tokens.
☆ BIOCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the utilization of natural language supervision in organismal biology compared with many other scientific domains. We complement this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-based descriptive captions. Using these captions, we train BIOCAP (i.e., BIOCLIP with Captions), a biological foundation model that captures rich semantics and achieves strong performance in species classification and text-image retrieval. These results demonstrate the value of descriptive captions beyond labels in bridging biological images with multimodal foundation models.
comment: Project page: https://imageomics.github.io/biocap/
☆ CreativityPrism: A Holistic Benchmark for Large Language Model Creativity
Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as producing creative text, there is still no holistic framework to evaluate their creativity across diverse scenarios. Existing evaluation methods remain fragmented, with dramatic variation across domains and tasks, largely due to differing definitions and measurements of creativity. Inspired by the hypothesis that creativity is not one fixed idea, we propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity. CreativityPrism incorporates nine tasks, three domains, i.e., divergent thinking, creative writing, and logical reasoning, and twenty evaluation metrics, which measure each dimension in task-specific, unique ways. We evaluate 17 state-of-the-art (SoTA) proprietary and open-sourced LLMs on CreativityPrism and analyze the performance correlations among different metrics and task domains. Our results reveal a notable gap between proprietary and open-source models. Overall, model performance tends to be highly correlated across tasks within the same domain and less so across different domains. Among evaluation dimensions, diversity and quality metrics show strong correlations - models that perform well on one often excel on the other - whereas novelty exhibits much weaker correlation with either. These findings support our hypothesis that strong performance in one creativity task or dimension does not necessarily generalize to others, underscoring the need for a holistic evaluation of LLM creativity.
♻ ☆ Language Models use Lookbacks to Track Beliefs
How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze LMs' ability to reason about characters' beliefs using causal mediation and abstraction. We construct a dataset, CausalToM, consisting of simple stories where two characters independently change the state of two objects, potentially unaware of each other's actions. Our investigation uncovers a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating their reference information, represented as Ordering IDs (OIs), in low-rank subspaces of the state token's residual stream. When asked about a character's beliefs regarding the state of an object, the binding lookback retrieves the correct state OI and then the answer lookback retrieves the corresponding state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character's beliefs. Our work provides insights into belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.
comment: 31 pages, 33 figures. Code and data at https://belief.baulab.info/
♻ ☆ Text2Mem: A Unified Memory Operation Language for Memory Operating System
Large language model agents increasingly depend on memory to sustain long horizon interaction, but existing frameworks remain limited. Most expose only a few basic primitives such as encode, retrieve, and delete, while higher order operations like merge, promote, demote, split, lock, and expire are missing or inconsistently supported. Moreover, there is no formal and executable specification for memory commands, leaving scope and lifecycle rules implicit and causing unpredictable behavior across systems. We introduce Text2Mem, a unified memory operation language that provides a standardized pathway from natural language to reliable execution. Text2Mem defines a compact yet expressive operation set aligned with encoding, storage, and retrieval. Each instruction is represented as a JSON based schema instance with required fields and semantic invariants, which a parser transforms into typed operation objects with normalized parameters. A validator ensures correctness before execution, while adapters map typed objects either to a SQL prototype backend or to real memory frameworks. Model based services such as embeddings or summarization are integrated when required. All results are returned through a unified execution contract. This design ensures safety, determinism, and portability across heterogeneous backends. We also outline Text2Mem Bench, a planned benchmark that separates schema generation from backend execution to enable systematic evaluation. Together, these components establish the first standardized foundation for memory control in agents.
comment: 12 pages, 3 figures, 2 tables
♻ ☆ FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts NeurIPS 2025
Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for foundation models, but it suffers from parameter interference, resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based LoRA variants show promise in mitigating intra-task correlations in single-task instruction tuning, they introduce additional router parameters and remain ineffective in multi-task model merging where inter-task interference arises. Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the up-projection matrix, and (2) an implicit router that unifies expert routing and down-projection, where a frozen sparse random projection matrix replaces the traditional dense trainable version. This design resolves the trade-off between intra-task decorrelation and computational efficiency by eliminating the need for an explicit router, while inherently mitigating inter-task interference due to the orthogonality property of random matrices. Extensive experiments across four domains -- general knowledge understanding, scientific question answering, mathematical reasoning, and code generation -- demonstrate consistent performance improvements over existing methods. Beyond empirical gains, FlyLoRA highlights how biological structures can inspire innovations in AI technologies. Code is available at https://github.com/gfyddha/FlyLoRA.
comment: NeurIPS 2025 accepted paper
♻ ☆ Integrating Structural and Semantic Signals in Text-Attributed Graphs with BiGTex
Text-attributed graphs (TAGs) present unique challenges in representation learning by requiring models to capture both the semantic richness of node-associated texts and the structural dependencies of the graph. While graph neural networks (GNNs) excel at modeling topological information, they lack the capacity to process unstructured text. Conversely, large language models (LLMs) are proficient in text understanding but are typically unaware of graph structure. In this work, we propose BiGTex (Bidirectional Graph Text), a novel architecture that tightly integrates GNNs and LLMs through stacked Graph-Text Fusion Units. Each unit allows for mutual attention between textual and structural representations, enabling information to flow in both directions, text influencing structure and structure guiding textual interpretation. The proposed architecture is trained using parameter-efficient fine-tuning (LoRA), keeping the LLM frozen while adapting to task-specific signals. Extensive experiments on five benchmark datasets demonstrate that BiGTex achieves state-of-the-art performance in node classification and generalizes effectively to link prediction. An ablation study further highlights the importance of soft prompting and bi-directional attention in the model's success.
comment: 26 pages, 4 figures
♻ ☆ Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding
Discrete diffusion language models have shown strong potential for text generation, yet standard supervised fine-tuning (SFT) misaligns with their semi-autoregressive inference: training randomly masks tokens across the entire response, while inference generates fixed-size blocks sequentially. This mismatch introduces noisy prefixes and leaky suffixes, biasing gradients away from the desired blockwise likelihood. We propose Blockwise SFT, which partitions responses into fixed-size blocks, selects one active block per step for stochastic masking, freezes all preceding tokens, and fully hides future ones. Loss is computed only over the active block, directly mirroring the blockwise decoding process. Experiments on GSM8K, MATH, and MetaMathQA show consistent gains over classical SFT under equal compute or token budgets. Block size consistency studies and ablations confirm that improvements stem from faithful training-inference alignment rather than incidental masking effects. Our results highlight the importance of matching supervision granularity to the decoding procedure in diffusion-based language models.
♻ ☆ Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning
When applying reinforcement learning--typically through GRPO--to large vision-language model reasoning struggles to effectively scale reasoning length or generates verbose outputs across all tasks with only marginal gains in accuracy. To address this issue, we present FAST-GRPO, a variant of GRPO that dynamically adapts reasoning depth based on question characteristics. Through empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs by investigating how response length and data distribution affect performance. Inspired by these observations, we introduce two complementary metrics to estimate the difficulty of the questions, guiding the model to determine when fast or slow thinking is more appropriate. Next, we incorporate adaptive length-based rewards and difficulty-aware KL divergence into the GRPO algorithm. Experiments across seven reasoning benchmarks demonstrate that FAST achieves state-of-the-art accuracy with over 10\% relative improvement compared to the base model, while reducing token usage by 32.7-67.3\% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.
♻ ☆ On the Emergence of Linear Analogies in Word Embeddings NeurIPS 2025
Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure -- for example, $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ -- whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix $M(i,j) = P(i,j)/P(i)P(j)$, (ii) strengthens and then saturates as more eigenvectors of $M (i, j)$, which controls the dimension of the embeddings, are included, (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$, and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)-(iv). It can be viewed as giving fine-grained resolution into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and the analogy benchmark introduced by Mikolov et al.
comment: Main: 10 pages, 3 figures. Appendices: 11 pages, 7 figures. Accepted at NeurIPS 2025 as a poster
♻ ☆ Superposition Yields Robust Neural Scaling NeurIPS 2025
The success of today's large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law, that loss decreases as a power law with model size, remains unclear. We propose that representation superposition, meaning that LLMs represent more features than they have dimensions, can be a key contributor to loss and cause neural scaling. Based on Anthropic's toy model, we use weight decay to control the degree of superposition, allowing us to systematically study how loss scales with model size. When superposition is weak, the loss follows a power law only if data feature frequencies are power-law distributed. In contrast, under strong superposition, the loss generically scales inversely with model dimension across a broad class of frequency distributions, due to geometric overlaps between representation vectors. We confirmed that open-sourced LLMs operate in the strong superposition regime and have loss scaling like one over the model dimension, and that the Chinchilla scaling laws are also consistent with this behavior. Our results identify representation superposition as a central driver of neural scaling laws, providing insights into questions like when neural scaling laws can be improved and when they will break down.
comment: Accepted at NeurIPS 2025
♻ ☆ ReDit: Reward Dithering for Improved LLM Policy Optimization
DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning capabilities through its rule-based reward system. While it's a ''perfect'' reward system that effectively mitigates reward hacking, such reward functions are often discrete. Our experimental observations suggest that discrete rewards can lead to gradient anomaly, unstable optimization, and slow convergence. To address this issue, we propose ReDit (Reward Dithering), a method that dithers the discrete reward signal by adding simple random noise. With this perturbed reward, exploratory gradients are continuously provided throughout the learning process, enabling smoother gradient updates and accelerating convergence. The injected noise also introduces stochasticity into flat reward regions, encouraging the model to explore novel policies and escape local optima. Experiments across diverse tasks demonstrate the effectiveness and efficiency of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO with only approximately 10% the training steps, and furthermore, still exhibits a 4% performance improvement over vanilla GRPO when trained for a similar duration. Visualizations confirm significant mitigation of gradient issues with ReDit. Moreover, theoretical analyses are provided to further validate these advantages.
comment: 34 pages, 19 figures
♻ ☆ X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation
Large Language Models (LLMs) have been shown to enhance the effectiveness of enriching item descriptions, thereby improving the accuracy of recommendation systems. However, most existing approaches either rely on text-only prompting or employ basic multimodal strategies that do not fully exploit the complementary information available from both textual and visual modalities. This paper introduces a novel framework, Cross-Reflection Prompting, termed X-Reflect, designed to address these limitations by prompting Multimodal Large Language Models (MLLMs) to explicitly identify and reconcile supportive and conflicting information between text and images. By capturing nuanced insights from both modalities, this approach generates more comprehensive and contextually rich item representations. Extensive experiments conducted on two widely used benchmarks demonstrate that our method outperforms existing prompting baselines in downstream recommendation accuracy. Furthermore, we identify a U-shaped relationship between text-image dissimilarity and recommendation performance, suggesting the benefit of applying multimodal prompting selectively. To support efficient real-time inference, we also introduce X-Reflect-keyword, a lightweight variant that summarizes image content using keywords and replaces the base model with a smaller backbone, achieving nearly 50% reduction in input length while maintaining competitive performance. This work underscores the importance of integrating multimodal information and presents an effective solution for improving item understanding in multimodal recommendation systems.
♻ ☆ BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning NeurIPS 2025
Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives. We find such emergent behaviors in biological vision models via large-scale contrastive vision-language training. To achieve this, we first curate TreeOfLife-200M, comprising 214 million images of living organisms, the largest and most diverse biological organism image dataset to date. We then train BioCLIP 2 on TreeOfLife-200M to distinguish different species. Despite the narrow training objective, BioCLIP 2 yields extraordinary accuracy when applied to various biological visual tasks such as habitat classification and trait prediction. We identify emergent properties in the learned embedding space of BioCLIP 2. At the inter-species level, the embedding distribution of different species aligns closely with functional and ecological meanings (e.g., beak sizes and habitats). At the intra-species level, instead of being diminished, the intra-species variations (e.g., life stages and sexes) are preserved and better separated in subspaces orthogonal to inter-species distinctions. We provide formal proof and analyses to explain why hierarchical supervision and contrastive objectives encourage these emergent properties. Crucially, our results reveal that these properties become increasingly significant with larger-scale training data, leading to a biologically meaningful embedding space.
comment: NeurIPS 2025 Spotlight; Project page: https://imageomics.github.io/bioclip-2/
♻ ☆ Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.
comment: 30 pages, 27 figures
♻ ☆ Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons NeurIPS 2025
Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment through the lens of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose inference-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects on model safety. Experiments on multiple prevalent LLMs demonstrate that we can consistently identify about $5\%$ safety neurons, and by only patching their activations we can restore over $90\%$ of the safety performance across various red-teaming benchmarks without influencing general ability. The finding of safety neurons also helps explain the ''alignment tax'' phenomenon by revealing that the key neurons for model safety and helpfulness significantly overlap, yet they require different activation patterns for the same neurons. Furthermore, we demonstrate an application of our findings in safeguarding LLMs by detecting unsafe outputs before generation. The source code is available at https://github.com/THU-KEG/SafetyNeuron.
comment: NeurIPS 2025
♻ ☆ Benchmarking GPT-5 for biomedical natural language processing
Biomedical literature and clinical narratives pose multifaceted challenges for natural language understanding, from precise entity extraction and document synthesis to multi-step diagnostic reasoning. This study extends a unified benchmark to evaluate GPT-5 and GPT-4o under zero-, one-, and five-shot prompting across five core biomedical NLP tasks: named entity recognition, relation extraction, multi-label document classification, summarization, and simplification, and nine expanded biomedical QA datasets covering factual knowledge, clinical reasoning, and multimodal visual understanding. Using standardized prompts, fixed decoding parameters, and consistent inference pipelines, we assessed model performance, latency, and token-normalized cost under official pricing. GPT-5 consistently outperformed GPT-4o, with the largest gains on reasoning-intensive datasets such as MedXpertQA and DiagnosisArena and stable improvements in multimodal QA. In core tasks, GPT-5 achieved better chemical NER and ChemProt scores but remained below domain-tuned baselines for disease NER and summarization. Despite producing longer outputs, GPT-5 showed comparable latency and 30 to 50 percent lower effective cost per correct prediction. Fine-grained analyses revealed improvements in diagnosis, treatment, and reasoning subtypes, whereas boundary-sensitive extraction and evidence-dense summarization remain challenging. Overall, GPT-5 approaches deployment-ready performance for biomedical QA while offering a favorable balance of accuracy, interpretability, and economic efficiency. The results support a tiered prompting strategy: direct prompting for large-scale or cost-sensitive applications, and chain-of-thought scaffolds for analytically complex or high-stakes scenarios, highlighting the continued need for hybrid solutions where precision and factual fidelity are critical.
♻ ☆ XtraGPT: Context-Aware and Controllable Academic Paper Revision
Despite the growing adoption of large language models (LLMs) in academic workflows, their capabilities remain limited to support high-quality scientific writing. Most existing systems are designed for general-purpose scientific text generation and fail to meet the sophisticated demands of research communication beyond surface-level polishing, such as conceptual coherence across sections. Furthermore, academic writing is inherently iterative and revision-driven, a process not well supported by direct prompting-based paradigms. To address these scenarios, we propose a human-AI collaboration framework for academic paper revision centered on criteria-guided intent alignment and context-aware modeling. To validate the framework, we curate a dataset of 7,000 research papers from top-tier venues annotated with 140,000 instruction-response pairs that reflect realistic, section-level scientific revisions. We instantiate the framework in XtraGPT, the first suite of open-source LLMs (1.5B to 14B parameters) for context-aware, instruction-guided writing assistance. Extensive experiments validate that XtraGPT significantly outperforms same-scale baselines and approaches the quality of proprietary systems. Both automated preference assessments and human evaluations confirm the effectiveness of XtraGPT in improving scientific drafts.
comment: Preprint. The model report is available at https://arxiv.org/abs/2505.11336v1
♻ ☆ Unlocking Multi-View Insights in Knowledge-Dense Retrieval-Augmented Generation
While Retrieval-Augmented Generation (RAG) plays a crucial role in the application of Large Language Models (LLMs), existing retrieval methods in knowledge-dense domains like law and medicine still suffer from a lack of multi-perspective views, which are essential for improving interpretability and reliability. Previous research on multi-view retrieval often focused solely on different semantic forms of queries, neglecting the expression of specific domain knowledge perspectives. This paper introduces a novel multi-view RAG framework, MVRAG, tailored for knowledge-dense domains that utilizes intention-aware query rewriting from multiple domain viewpoints to enhance retrieval precision, thereby improving the effectiveness of the final inference. Experiments conducted on legal and medical case retrieval demonstrate significant improvements in recall and precision rates with our framework. Our multi-perspective retrieval approach unleashes the potential of multi-view information enhancing RAG tasks, accelerating the further application of LLMs in knowledge-intensive fields.
♻ ☆ Neural Attention Search
We present Neural Attention Search (NAtS), a framework that automatically evaluates the importance of each token within a sequence and determines if the corresponding token can be dropped after several steps. This approach can efficiently reduce the KV cache sizes required by transformer-based models during inference and thus reduce inference costs. In this paper, we design a search space that contains three token types: (i) Global Tokens will be preserved and queried by all the following tokens. (ii) Local Tokens survive until the next global token appears. (iii) Sliding Window Tokens have an impact on the inference of a fixed size of the next following tokens. Similar to the One-Shot Neural Architecture Search approach, this token-type information can be learned jointly with the architecture weights via a learnable attention mask. Experiments on both training a new transformer from scratch and fine-tuning existing large language models show that NAtS can efficiently reduce the KV cache size required for the models while maintaining the models' performance.
comment: 35 pages, 11 figures
♻ ☆ Position: The Current AI Conference Model is Unsustainable! Diagnosing the Crisis of Centralized AI Conference
Artificial Intelligence (AI) conferences are essential for advancing research, sharing knowledge, and fostering academic community. However, their rapid expansion has rendered the centralized conference model increasingly unsustainable. This paper offers a data-driven diagnosis of a structural crisis that threatens the foundational goals of scientific dissemination, equity, and community well-being. We identify four key areas of strain: (1) scientifically, with per-author publication rates more than doubling over the past decade to over 4.5 papers annually; (2) environmentally, with the carbon footprint of a single conference exceeding the daily emissions of its host city; (3) psychologically, with 71% of online community discourse reflecting negative sentiment and 35% referencing mental health concerns; and (4) logistically, with attendance at top conferences such as NeurIPS 2024 beginning to outpace venue capacity. These pressures point to a system that is misaligned with its core mission. In response, we propose the Community-Federated Conference (CFC) model, which separates peer review, presentation, and networking into globally coordinated but locally organized components, offering a more sustainable, inclusive, and resilient path forward for AI research.
comment: Preprint
♻ ☆ Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders EMNLP 2025
Large language models (LLMs) are now ubiquitous in user-facing applications, yet they still generate undesirable toxic outputs, including profanity, vulgarity, and derogatory remarks. Although numerous detoxification methods exist, most apply broad, surface-level fixes and can therefore easily be circumvented by jailbreak attacks. In this paper we leverage sparse autoencoders (SAEs) to identify toxicity-related directions in the residual stream of models and perform targeted activation steering using the corresponding decoder vectors. We introduce three tiers of steering aggressiveness and evaluate them on GPT-2 Small and Gemma-2-2B, revealing trade-offs between toxicity reduction and language fluency. At stronger steering strengths, these causal interventions surpass competitive baselines in reducing toxicity by up to 20%, though fluency can degrade noticeably on GPT-2 Small depending on the aggressiveness. Crucially, standard NLP benchmark scores upon steering remain stable, indicating that the model's knowledge and general abilities are preserved. We further show that feature-splitting in wider SAEs hampers safety interventions, underscoring the importance of disentangled feature learning. Our findings highlight both the promise and the current limitations of SAE-based causal interventions for LLM detoxification, further suggesting practical guidelines for safer language-model deployment.
comment: EMNLP 2025
♻ ☆ MoMoE: Mixture of Moderation Experts Framework for AI-Assisted Online Governance EMNLP 2025
Large language models (LLMs) have shown great potential in flagging harmful content in online communities. Yet, existing approaches for moderation require a separate model for every community and are opaque in their decision-making, limiting real-world adoption. We introduce Mixture of Moderation Experts (MoMoE), a modular, cross-community framework that adds post-hoc explanations to scalable content moderation. MoMoE orchestrates four operators -- Allocate, Predict, Aggregate, Explain -- and is instantiated as seven community-specialized experts (MoMoE-Community) and five norm-violation experts (MoMoE-NormVio). On 30 unseen subreddits, the best variants obtain Micro-F1 scores of 0.72 and 0.67, respectively, matching or surpassing strong fine-tuned baselines while consistently producing concise and reliable explanations. Although community-specialized experts deliver the highest peak accuracy, norm-violation experts provide steadier performance across domains. These findings show that MoMoE yields scalable, transparent moderation without needing per-community fine-tuning. More broadly, they suggest that lightweight, explainable expert ensembles can guide future NLP and HCI research on trustworthy human-AI governance of online communities.
comment: EMNLP 2025 (Oral)
♻ ☆ MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations
Large Language Models (LLMs) have inherent limitations of faithfulness and factuality, commonly referred to as hallucinations. Several benchmarks have been developed that provide a test bed for factuality evaluation within the context of English-centric datasets, while relying on supplementary informative context like web links or text passages but ignoring the available structured factual resources. To this end, Knowledge Graphs (KGs) have been identified as a useful aid for hallucination mitigation, as they provide a structured way to represent the facts about entities and their relations with minimal linguistic overhead. We bridge the lack of KG paths and multilinguality for factual language modeling within the existing hallucination evaluation benchmarks and propose a KG-based multilingual, multihop benchmark called MultiHal framed for generative text evaluation. As part of our data collection pipeline, we mined 140k KG-paths from open-domain KGs, from which we pruned noisy KG-paths, curating a high-quality subset of 25.9k. Our baseline evaluation shows an absolute scale improvement by approximately 0.12 to 0.36 points for the semantic similarity score, 0.16 to 0.36 for NLI entailment and 0.29 to 0.42 for hallucination detection in KG-RAG over vanilla QA across multiple languages and multiple models, demonstrating the potential of KG integration. We anticipate MultiHal will foster future research towards several graph-based hallucination mitigation and fact-checking tasks.
♻ ☆ Embodied Agents Meet Personalization: Investigating Challenges and Solutions Through the Lens of Memory Utilization
LLM-powered embodied agents have shown success on conventional object-rearrangement tasks, but providing personalized assistance that leverages user-specific knowledge from past interactions presents new challenges. We investigate these challenges through the lens of agents' memory utilization along two critical dimensions: object semantics (identifying objects based on personal meaning) and user patterns (recalling sequences from behavioral routines). To assess these capabilities, we construct MEMENTO, an end-to-end two-stage evaluation framework comprising single-memory and joint-memory tasks. Our experiments reveal that current agents can recall simple object semantics but struggle to apply sequential user patterns to planning. Through in-depth analysis, we identify two critical bottlenecks: information overload and coordination failures when handling multiple memories. Based on these findings, we explore memory architectural approaches to address these challenges. Given our observation that episodic memory provides both personalized knowledge and in-context learning benefits, we design a hierarchical knowledge graph-based user-profile memory module that separately manages personalized knowledge, achieving substantial improvements on both single and joint-memory tasks. Project website: https://connoriginal.github.io/MEMENTO
comment: Work in progress
♻ ☆ MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on one single modality at a time, rely on short-form contexts, or lack human annotations -- hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks that is designed to evaluate instruction-following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities -- speech, vision, and text -- and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs' abilities to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLMs development.
comment: Data available at https://huggingface.co/datasets/FBK-MT/MCIF | Evaluation and baselines available at https://github.com/hlt-mt/mcif
♻ ☆ Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants NeurIPS 2025
Faces and humans are crucial elements in social interaction and are widely included in everyday photos and videos. Therefore, a deep understanding of faces and humans will enable multi-modal assistants to achieve improved response quality and broadened application scope. Currently, the multi-modal assistant community lacks a comprehensive and scientific evaluation of face and human understanding abilities. In this paper, we first propose a hierarchical ability taxonomy that includes three levels of abilities. Then, based on this taxonomy, we collect images and annotations from publicly available datasets in the face and human community and build a semi-automatic data pipeline to produce problems for the new benchmark. Finally, the obtained Face-Human-Bench includes a development set and a test set, each with 1800 problems, supporting both English and Chinese. We conduct evaluations over 25 mainstream multi-modal large language models (MLLMs) with our Face-Human-Bench, focusing on the correlation between abilities, the impact of the relative position of targets on performance, and the impact of Chain of Thought (CoT) prompting on performance. We also explore which abilities of MLLMs need to be supplemented by specialist models. The dataset and evaluation code have been made publicly available at https://face-human-bench.github.io.
comment: 50 pages, 14 figures, 42 tables. NeurIPS 2025 Datasets and Benchmarks Track
♻ ☆ Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification
As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high and low-resource languages across different script families. We analyze cross-lingual detoxification's effectiveness through 392 extensive settings to evaluate toxicity reduction in cross-distribution settings with limited data and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at https://github.com/himanshubeniwal/Breaking-mBad.
comment: Accepted at MELT Workshop @ COLM 2025
♻ ☆ Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models
Large Language Models (LLMs) have shown strong abilities in general language tasks, yet adapting them to specific domains remains a challenge. Current method like Domain Adaptive Pretraining (DAPT) requires costly full-parameter training and suffers from catastrophic forgetting. Meanwhile, Retrieval-Augmented Generation (RAG) introduces substantial inference latency due to expensive nearest-neighbor searches and longer context. This paper introduces Memory Decoder, a plug-and-play pretrained memory that enables efficient domain adaptation without changing the original model's parameters. Memory Decoder employs a small transformer decoder that learns to imitate the behavior of an external non-parametric retriever. Once trained, Memory Decoder can be seamlessly integrated with any pretrained language model that shares the same tokenizer, requiring no model-specific modifications. Experimental results demonstrate that Memory Decoder enables effective adaptation of various Qwen and Llama models to three distinct specialized domains: biomedicine, finance, and law, reducing perplexity by an average of 6.17 points. Overall, Memory Decoder introduces a novel paradigm centered on a specially pretrained memory component designed for domain-specific adaptation. This memory architecture can be integrated in a plug-and-play manner, consistently enhancing performance across multiple models within the target domain.
♻ ☆ Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens remaining maximally relevant to the given image, while simultaneously refining image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
♻ ☆ MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Cultural Learning NeurIPS 2025
Embodied agents powered by large language models (LLMs), such as Voyager, promise open-ended competence in worlds such as Minecraft. However, when powered by open-weight LLMs they still falter on elementary tasks after domain-specific fine-tuning. We propose MindForge, a generative-agent framework for cultural lifelong learning through explicit perspective taking. We introduce three key innovations: (1) a structured theory of mind representation linking percepts, beliefs, desires, and actions; (2) natural inter-agent communication; and (3) a multi-component memory system. Following the cultural learning framework, we test MindForge in both instructive and collaborative settings within Minecraft. In an instructive setting with GPT-4, MindForge agents powered by open-weight LLMs significantly outperform their Voyager counterparts in basic tasks yielding $3\times$ more tech-tree milestones and collecting $2.3\times$ more unique items than the Voyager baseline. Furthermore, in fully \textit{collaborative} settings, we find that the performance of two underachieving agents improves with more communication rounds, echoing the Condorcet Jury Theorem. MindForge agents demonstrate sophisticated behaviors, including expert-novice knowledge transfer, collaborative problem solving, and adaptation to out-of-distribution tasks through accumulated cultural experiences.
comment: Accepted to NeurIPS 2025 main track as poster
♻ ☆ HauntAttack: When Attack Follows Reasoning as a Shadow
Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and reasoning tasks, showcasing remarkable capabilities. However, the enhancement of reasoning abilities and the exposure of internal reasoning processes introduce new safety vulnerabilities. A critical question arises: when reasoning becomes intertwined with harmfulness, will LRMs become more vulnerable to jailbreaks in reasoning mode? To investigate this, we introduce HauntAttack, a novel and general-purpose black-box adversarial attack framework that systematically embeds harmful instructions into reasoning questions. Specifically, we modify key reasoning conditions in existing questions with harmful instructions, thereby constructing a reasoning pathway that guides the model step by step toward unsafe outputs. We evaluate HauntAttack on 11 LRMs and observe an average attack success rate of 70\%, achieving up to 12 percentage points of absolute improvement over the strongest prior baseline. Our further analysis reveals that even advanced safety-aligned models remain highly susceptible to reasoning-based attacks, offering insights into the urgent challenge of balancing reasoning capability and safety in future model development.
♻ ☆ SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment NeurIPS 2025
Large Reasoning Models (LRMs) have become powerful tools for complex problem solving, but their structured reasoning pathways can lead to unsafe outputs when exposed to harmful prompts. Existing safety alignment methods reduce harmful outputs but can degrade reasoning depth, leading to significant trade-offs in complex, multi-step tasks, and remain vulnerable to sophisticated jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of their reasoning, in response to harmful prompts, while leaving the rest of the reasoning process unsupervised. Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance. Specifically, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak attempts in the DeepSeek-R1-Distill-Llama-8B model, while requiring 295.9x less compute than Direct Refusal and 314.1x less than SafeChain. We further introduce a zero-shot variant that requires no fine-tuning. In addition, we provide a comprehensive analysis of how existing methods in LLMs generalize, or fail, when applied to reasoning-centric models, revealing critical gaps and new directions for safer AI.
comment: Accepted at NeurIPS 2025. Code and models are available at https://ai-isl.github.io/safepath
♻ ☆ Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning NeurIPS 2025
Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing their accuracy. Empirical results show that Twilight can adaptively prune at most 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per token latency in long context LLM decoding.
comment: To appear on NeurIPS 2025 (spotlight)
♻ ☆ S-DAT: A Multilingual, GenAI-Driven Framework for Automated Divergent Thinking Assessment
This paper introduces S-DAT (Synthetic-Divergent Association Task), a scalable, multilingual framework for automated assessment of divergent thinking (DT) -a core component of human creativity. Traditional creativity assessments are often labor-intensive, language-specific, and reliant on subjective human ratings, limiting their scalability and cross-cultural applicability. In contrast, S-DAT leverages large language models and advanced multilingual embeddings to compute semantic distance -- a language-agnostic proxy for DT. We evaluate S-DAT across eleven diverse languages, including English, Spanish, German, Russian, Hindi, and Japanese (Kanji, Hiragana, Katakana), demonstrating robust and consistent scoring across linguistic contexts. Unlike prior DAT approaches, the S-DAT shows convergent validity with other DT measures and correct discriminant validity with convergent thinking. This cross-linguistic flexibility allows for more inclusive, global-scale creativity research, addressing key limitations of earlier approaches. S-DAT provides a powerful tool for fairer, more comprehensive evaluation of cognitive flexibility in diverse populations and can be freely assessed online: https://sdat.iol.zib.de/.
♻ ☆ ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding NeurIPS 2025
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding. Code is available at https://github.com/KangJialiang/ViSpec.
comment: NeurIPS 2025
♻ ☆ Adapting Multilingual Models to Code-Mixed Tasks via Model Merging
We study model merging as a practical alternative to conventional adaptation strategies for code-mixed NLP. Starting from a multilingual base model, we: (i) perform continued pre-training (CPT) on unlabeled code-mixed text to obtain an adapted checkpoint, (ii) merge checkpoint with the base model, and (iii) fine-tune (FT) on the downstream task data. We evaluate our approach for sentence classification (sentiment and hate speech) task in English-Hindi (En-Hi) and English-Spanish (En-Es) using XLM-R and Llama-3.2-1B models. Our results show that merged models consistently outperform full fine-tuning and CPT->FT. We observe gains of 2--5 points in F1 over full fine-tuning and ~1-2 points over CPT->FT, indicating that unlabeled data is leveraged more effectively via merging than via CPT alone. Zero-/few-shot prompting with larger LLMs (e.g., Llama-3.3-70B) lags behind fine-tuned and merged checkpoints, underscoring limits of in-context learning for code-mixed inputs. We further test cross-pair transfer by training on En-Hi and evaluating on En-Ta and En-Ml: merged checkpoints transfer more strongly than monolingual-English baselines (e.g., TV/TIES variants reaching 0.65-0.68 F1 vs 0.61-0.63 for full fine-tuning), suggesting that code-mixed knowledge is a more reliable substrate for low-resource pairs. We conclude with adaptation recipes matched to common data regimes (labeled only; labeled+unlabeled; transfer-only) and discuss limitations and scaling considerations for broader tasks and larger models.
comment: 9 pages, 5 tables, CODS 2025
♻ ☆ Less is More: Compact Clue Selection for Efficient Retrieval-Augmented Generation Reasoning
Current RAG retrievers are designed primarily for human readers, emphasizing complete, readable, and coherent paragraphs. However, LLMs benefit more from precise, compact, and well-structured input, which enhances reasoning quality and efficiency. Existing methods often rely on reranking or summarization to identify key sentences, but may suffer from semantic breaks and unfaithfulness. Thus, efficiently extracting and organizing answer-relevant clues from large-scale documents while reducing LLM reasoning costs remains a challenge for RAG. Inspired by Occam's razor, we frame LLM-centric retrieval as a MinMax optimization: maximizing the extraction of potential clues and reranking them for well-organization, while minimizing reasoning costs by truncating to the smallest sufficient clues set. In this paper, we propose CompSelect, a Compact clue Selection mechanism for LLM-centric RAG, consisting of a clue extractor, a reranker, and a truncator. (1) The clue extractor first uses answer-containing sentences as fine-tuning targets, aiming to extract sufficient potential clues; (2) The reranker is trained to prioritize effective clues based on real LLM feedback; (3) The truncator uses the truncated text containing the minimum sufficient clues for answering the question as fine-tuning targets, thereby enabling efficient RAG reasoning. Experiments on three QA datasets show that CompSelect improves QA performance by approximately 11\% and reduces Total Latency and Online Latency by approximately 17\% and 67\% compared to various baseline methods on both LLaMA3 and Qwen3. Further analysis confirms its robustness to unreliable retrieval and generalization across different scenarios, offering a scalable and cost-efficient solution for web-scale RAG applications.
comment: 12 pages, 7 figures, 12 tables, under review
♻ ☆ Bi-Mamba: Towards Accurate 1-Bit State Space Models
The typical Selective State-Space Model (SSM) used in Mamba addresses several limitations of Transformers, such as the quadratic computational complexity with respect to sequence length and the significant memory requirements during inference due to the key-value (KV) cache. However, the increasing size of Mamba models continues to pose challenges for training and deployment, particularly due to their substantial computational demands during both training and inference. In this work, we introduce $\texttt{Bi-Mamba}$, a scalable and powerful 1-bit Mamba architecture designed to enable more efficient large language models (LLMs), with model sizes of 780M, 1.3B, and 2.7B parameters. $\texttt{Bi-Mamba}$ models are trained from scratch on a standard LLM-scale dataset using an autoregressive distillation loss. Extensive experiments on language modeling benchmarks demonstrate that $\texttt{Bi-Mamba}$ achieves performance comparable to its full-precision (FP16 or BF16) counterparts, while outperforming post-training binarization (PTB) Mamba and binarization-aware training (BAT) Transformer baselines. Moreover, $\texttt{Bi-Mamba}$ drastically reduces memory usage and computational cost compared to the original Mamba. Our work pioneers a new line of linear-complexity LLMs under low-bit representation and provides the way for the design of specialized hardware optimized for efficient 1-bit Mamba-based models. Code and the pre-trained weights are available at https://github.com/Tangshengku/Bi-Mamba.
comment: Accepted in TMLR 2025
♻ ☆ Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning
Large Language Model (LLM) conditioning refers to instructing an LLM to generate content in accordance with the norms and values of a specific culture, beliefs of a particular political orientation, or any desired text-specified semantic conditioning. Unfortunately, prompt engineering does not ensure that LLMs behave in accordance with a desired conditioning due to the inductive bias of the pre-training and alignment datasets. Prior works have focused on fine-tuning LLMs by directly conditioning the LoRA weights; however, such methods introduce a large number of parameters. As a remedy, we propose Zhyper, a parameter-efficient factorized hypernetwork framework that generates context-aware LoRA adapters from textual descriptions. Experiments on multiple benchmarks show that Zhyper achieves competitive performance with up to 26x fewer parameters than the state-of-the-art baselines. Furthermore, we extend Zhyper to cultural alignment, demonstrating improved generalization to out-of-domain settings and a better capturing of fine-grained contextual values.
♻ ☆ MLMA: Towards Multilingual ASR With Mamba-based Architectures ICASSP 2026
Multilingual automatic speech recognition (ASR) remains a challenging task, especially when balancing performance across high- and low-resource languages. Recent advances in sequence modeling suggest that architectures beyond Transformers may offer better scalability and efficiency. In this work, we introduce MLMA (Multilingual Language Modeling with Mamba for ASR), a new approach that leverages the Mamba architecture -- an efficient state-space model optimized for long-context sequence processing -- for multilingual ASR. Using Mamba, MLMA implicitly incorporates language-aware conditioning and shared representations to support robust recognition across diverse languages. Experiments on standard multilingual benchmarks show that MLMA achieves competitive performance compared to Transformer-based architectures. These results highlight Mamba's potential as a strong backbone for scalable, efficient, and accurate multilingual speech recognition.
comment: The paper is under review at ICASSP 2026
♻ ☆ Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution
Enhancing on-device large language models (LLMs) with contextual information from local data enables personalized and task-aware generation, powering use cases such as intelligent assistants and UI agents. While recent developments in neural processors have substantially improved the efficiency of prefill on mobile devices, the token-by-token generation process still suffers from high latency and limited hardware utilization due to its inherently memory-bound characteristics. This work presents sd.npu, a mobile inference framework that integrates speculative decoding with dynamic hardware scheduling to accelerate context-aware text generation on mobile devices. The framework introduces three synergistic components: (1) adaptive execution scheduling, which dynamically balances compute graphs between prefill and decoding phases; (2) context-aligned drafting, which improves speculative efficiency through lightweight online calibration to current tasks; and (3) hardware-efficient draft extension, which reuses and expands intermediate sequences to improve processing parallelism and reduce verification cost. Experiments on multiple smartphones and representative workloads show consistent improvements of up to 3.8x in generation speed and 4.7x in energy efficiency compared with existing mobile inference solutions. Component-level analysis further validates the contribution of each optimization.
♻ ☆ Diagnosing Representation Dynamics in NER Model Extension
Extending Named Entity Recognition (NER) models to new PII entities in noisy spoken-language data is a common need. We find that jointly fine-tuning a BERT model on standard semantic entities (PER, LOC, ORG) and new pattern-based PII (EMAIL, PHONE) results in minimal degradation for original classes. We investigate this "peaceful coexistence," hypothesizing that the model uses independent semantic vs. morphological feature mechanisms. Using an incremental learning setup as a diagnostic tool, we measure semantic drift and find two key insights. First, the LOC (location) entity is uniquely vulnerable due to a representation overlap with new PII, as it shares pattern-like features (e.g., postal codes). Second, we identify a "reverse O-tag representation drift." The model, initially trained to map PII patterns to 'O', blocks new learning. This is resolved only by unfreezing the 'O' tag's classifier, allowing the background class to adapt and "release" these patterns. This work provides a mechanistic diagnosis of NER model adaptation, highlighting feature independence, representation overlap, and 'O' tag plasticity. Work done based on data gathered by https://www.papernest.com
♻ ☆ More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
Retrieval-Augmented Generation (RAG) enhances the accuracy of Large Language Model (LLM) responses by leveraging relevant external documents during generation. Although previous studies noted that retrieving many documents can degrade performance, they did not isolate how the quantity of documents affects performance while controlling for context length. We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for most LLMs, reducing performance by up to 20%. However, Qwen2.5 maintained consistent results across increasing document counts, indicating better multi-document handling capability. Finally, our results indicate that processing multiple documents is a separate challenge from handling long contexts. We also make the datasets and code available: https://github.com/shaharl6000/MoreDocsSameLen .
comment: Preprint
♻ ☆ A New Benchmark Dataset and Mixture-of-Experts Language Models for Adversarial Natural Language Inference in Vietnamese
Existing Vietnamese Natural Language Inference (NLI) datasets lack adversarial complexity, limiting their ability to evaluate model robustness against challenging linguistic phenomena. In this article, we address the gap in robust Vietnamese NLI resources by introducing ViANLI, the first adversarial NLI dataset for Vietnamese, and propose NLIMoE, a Mixture-of-Experts model to tackle its complexity. We construct ViANLI using an adversarial human-and-machine-in-the-loop approach with rigorous verification. NLIMoE integrates expert subnetworks with a learned dynamic routing mechanism on top of a shared transformer encoder. ViANLI comprises over 10,000 premise-hypothesis pairs and challenges state-of-the-art models, with XLM-R Large achieving only 45.5% accuracy, while NLIMoE reaches 47.3%. Training with ViANLI improves performance on other benchmark Vietnamese NLI datasets including ViNLI, VLSP2021-NLI, and VnNewsNLI. ViANLI is released for enhancing research into model robustness and enriching resources for future Vietnamese and multilingual NLI research.
comment: Accepted by Expert Systems with Applications
♻ ☆ Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities NeurIPS 2025
Transformers have theoretical limitations in modeling certain sequence-to-sequence tasks, yet it remains largely unclear if these limitations play a role in large-scale pretrained LLMs, or whether LLMs might effectively overcome these constraints in practice due to the scale of both the models themselves and their pretraining data. We explore how these architectural constraints manifest after pretraining, by studying a family of $\textit{retrieval}$ and $\textit{copying}$ tasks inspired by Liu et al. [2024a]. We use a recently proposed framework for studying length generalization [Huang et al., 2025] to provide guarantees for each of our settings. Empirically, we observe an $\textit{induction-versus-anti-induction}$ asymmetry, where pretrained models are better at retrieving tokens to the right (induction) rather than the left (anti-induction) of a query token. This asymmetry disappears upon targeted fine-tuning if length-generalization is guaranteed by theory. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained transformers. We validate our findings through practical experiments on real-world tasks demonstrating reliability risks. Our results highlight that pretraining selectively enhances certain transformer capabilities, but does not overcome fundamental length-generalization limits.
comment: NeurIPS 2025
♻ ☆ "You Are Rejected!": An Empirical Study of Large Language Models Taking Hiring Evaluations
With the proliferation of the internet and the rapid advancement of Artificial Intelligence, leading technology companies face an urgent annual demand for a considerable number of software and algorithm engineers. To efficiently and effectively identify high-potential candidates from thousands of applicants, these firms have established a multi-stage selection process, which crucially includes a standardized hiring evaluation designed to assess job-specific competencies. Motivated by the demonstrated prowess of Large Language Models (LLMs) in coding and reasoning tasks, this paper investigates a critical question: Can LLMs successfully pass these hiring evaluations? To this end, we conduct a comprehensive examination of a widely used professional assessment questionnaire. We employ state-of-the-art LLMs to generate responses and subsequently evaluate their performance. Contrary to any prior expectation of LLMs being ideal engineers, our analysis reveals a significant inconsistency between the model-generated answers and the company-referenced solutions. Our empirical findings lead to a striking conclusion: All evaluated LLMs fails to pass the hiring evaluation.
comment: Technical Report, 14 pages, 8 figures
♻ ☆ Stress-Testing Model Specs Reveals Character Differences among Language Models
Large language models (LLMs) are increasingly trained from AI constitutions and model specifications that establish behavioral guidelines and ethical principles. However, these specifications face critical challenges, including internal conflicts between principles and insufficient coverage of nuanced scenarios. We present a systematic methodology for stress-testing model character specifications, automatically identifying numerous cases of principle contradictions and interpretive ambiguities in current model specs. We stress test current model specs by generating scenarios that force explicit tradeoffs between competing value-based principles. Using a comprehensive taxonomy we generate diverse value tradeoff scenarios where models must choose between pairs of legitimate principles that cannot be simultaneously satisfied. We evaluate responses from twelve frontier LLMs across major providers (Anthropic, OpenAI, Google, xAI) and measure behavioral disagreement through value classification scores. Among these scenarios, we identify over 70,000 cases exhibiting significant behavioral divergence. Empirically, we show this high divergence in model behavior strongly predicts underlying problems in model specifications. Through qualitative analysis, we provide numerous example issues in current model specs such as direct contradiction and interpretive ambiguities of several principles. Additionally, our generated dataset also reveals both clear misalignment cases and false-positive refusals across all of the frontier models we study. Lastly, we also provide value prioritization patterns and differences of these models.
♻ ☆ LFD: Layer Fused Decoding to Exploit External Knowledge in Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) incorporates external knowledge into large language models (LLMs), improving their adaptability to downstream tasks and enabling information updates. Surprisingly, recent empirical evidence demonstrates that injecting noise into retrieved relevant documents paradoxically facilitates exploitation of external knowledge and improves generation quality. Although counterintuitive and challenging to apply in practice, this phenomenon enables granular control and rigorous analysis of how LLMs integrate external knowledge. Therefore, in this paper, we intervene on noise injection and establish a layer-specific functional demarcation within the LLM: shallow layers specialize in local context modeling, intermediate layers focus on integrating long-range external factual knowledge, and deeper layers primarily rely on parametric internal knowledge. Building on this insight, we propose Layer Fused Decoding (LFD), a simple decoding strategy that directly combines representations from an intermediate layer with final-layer decoding outputs to fully exploit the external factual knowledge. To identify the optimal intermediate layer, we introduce an internal knowledge score (IKS) criterion that selects the layer with the lowest IKS value in the latter half of layers. Experimental results across multiple benchmarks demonstrate that LFD helps RAG systems more effectively surface retrieved context knowledge with minimal cost.
♻ ☆ Multilingual LLM Prompting Strategies for Medical English-Vietnamese Machine Translation
Medical English-Vietnamese machine translation (En-Vi MT) is essential for healthcare access and communication in Vietnam, yet Vietnamese remains a low-resource and under-studied language. We systematically evaluate prompting strategies for six multilingual LLMs (0.5B-9B parameters) on the MedEV dataset, comparing zero-shot, few-shot, and dictionary-augmented prompting with Meddict, an English-Vietnamese medical lexicon. Results show that model scale is the primary driver of performance: larger LLMs achieve strong zero-shot results, while few-shot prompting yields only marginal improvements. In contrast, terminology-aware cues and embedding-based example retrieval consistently improve domain-specific translation. These findings underscore both the promise and the current limitations of multilingual LLMs for medical En-Vi MT.
comment: This version has been withdrawn after receiving the conference review results. We are currently extending and reorganizing the work into a new study
♻ ☆ ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining EMNLP 2025
The emergence of open-source large language models (LLMs) has expanded opportunities for enterprise applications; however, many organizations still lack the infrastructure to deploy and maintain large-scale models. As a result, small LLMs (sLLMs) have become a practical alternative despite inherent performance limitations. While Domain Adaptive Continual Pretraining (DACP) has been explored for domain adaptation, its utility in commercial settings remains under-examined. In this study, we validate the effectiveness of a DACP-based recipe across diverse foundation models and service domains, producing DACP-applied sLLMs (ixi-GEN). Through extensive experiments and real-world evaluations, we demonstrate that ixi-GEN models achieve substantial gains in target-domain performance while preserving general capabilities, offering a cost-efficient and scalable solution for enterprise-level deployment.
comment: Accepted at EMNLP 2025 Industry Track
♻ ☆ Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library-linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
comment: 20 pages, 13 figures
♻ ☆ Toward Metaphor-Fluid Conversation Design for Voice User Interfaces
Metaphors play a critical role in shaping user experiences with Voice User Interfaces (VUIs), yet existing designs often rely on static, human-centric metaphors that fail to adapt to diverse contexts and user needs. This paper introduces Metaphor-Fluid Design, a novel approach that dynamically adjusts metaphorical representations based on conversational use-contexts. We compare this approach to a Default VUI, which characterizes the present implementation of commercial VUIs commonly designed around the persona of an assistant, offering a uniform interaction style across contexts. In Study 1 (N=130), metaphors were mapped to four key use-contexts-commands, information seeking, sociality, and error recovery-along the dimensions of formality and hierarchy, revealing distinct preferences for task-specific metaphorical designs. Study 2 (N=91) evaluates a Metaphor-Fluid VUI against a Default VUI, showing that the Metaphor-Fluid VUI enhances perceived intention to adopt, enjoyment, and likability by aligning better with user expectations for different contexts. However, individual differences in metaphor preferences highlight the need for personalization. These findings challenge the one-size-fits-all paradigm of VUI design and demonstrate the potential of Metaphor-Fluid Design to create more adaptive and engaging human-AI interactions.
♻ ☆ TianHui: A Domain-Specific Large Language Model for Diverse Traditional Chinese Medicine Scenarios
Domain-specific LLMs in TCM face limitations in research settings due to constrained adaptability, insufficient evaluation datasets, and limited computational resources. This study presents TianHui, a specialized TCM LLM built through contextual data integration and domain knowledge fusion. We constructed a large-scale TCM corpus (0.97GB unsupervised data + 611,312 QA pairs) and employed a two-stage training strategy with QLoRA, DeepSpeed Stage 2, and Flash Attention 2. Evaluation on 12 benchmarks showed TianHui ranked top-three in all metrics for six datasets (APQ, TCMCD, HFR, HCCA, DHPE, TLAW) and achieved top results in the other six (TCMEE, APR, GCPMI, TCMKQA, TCMRC, ADTG). Optimal configuration was identified as LoRA rank=128, alpha=256, epoch=4, dropout=0.2, max length=2048. TianHui enables systematic preservation and scalable application of TCM knowledge. All resources are open-sourced.
comment: 46 pages, 5 figures,3 tables
♻ ☆ Does Thinking More always Help? Mirage of Test-Time Scaling in Reasoning Models NeurIPS 2025
Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek R1) have led to a popular belief that extending thinking traces using prompts like "Wait" or "Let me rethink" can improve performance. This raises a natural question: Does thinking more at test-time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which reveals a consistent pattern of initial performance improvements from additional thinking followed by a decline, due to "overthinking". To understand this non-monotonic trend, we consider a simple probabilistic model, which reveals that additional thinking increases output variance-creating an illusion of improved reasoning while ultimately undermining precision. Thus, observed gains from "more thinking" are not true indicators of improved reasoning, but artifacts stemming from the connection between model uncertainty and evaluation metric. This suggests that test-time scaling through extended thinking is not an effective way to utilize the inference thinking budget. Recognizing these limitations, we introduce an alternative test-time scaling approach, parallel thinking, inspired by Best-of-N sampling. Our method generates multiple independent reasoning paths within the same inference budget and selects the most consistent response via majority vote, achieving up to 20% higher accuracy compared to extended thinking. This provides a simple yet effective mechanism for test-time scaling of reasoning models.
comment: Accepted at NeurIPS 2025
♻ ☆ LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation
Legal consultation is essential for safeguarding individual rights and ensuring access to justice, yet remains costly and inaccessible to many individuals due to the shortage of professionals. While recent advances in Large Language Models (LLMs) offer a promising path toward scalable, low-cost legal assistance, current systems fall short in handling the interactive and knowledge-intensive nature of real-world consultations. To address these challenges, we introduce LeCoDe, a real-world multi-turn benchmark dataset comprising 3,696 legal consultation dialogues with 110,008 dialogue turns, designed to evaluate and improve LLMs' legal consultation capability. With LeCoDe, we innovatively collect live-streamed consultations from short-video platforms, providing authentic multi-turn legal consultation dialogues. The rigorous annotation by legal experts further enhances the dataset with professional insights and expertise. Furthermore, we propose a comprehensive evaluation framework that assesses LLMs' consultation capabilities in terms of (1) clarification capability and (2) professional advice quality. This unified framework incorporates 12 metrics across two dimensions. Through extensive experiments on various general and domain-specific LLMs, our results reveal significant challenges in this task, with even state-of-the-art models like GPT-4 achieving only 39.8% recall for clarification and 59% overall score for advice quality, highlighting the complexity of professional consultation scenarios. Based on these findings, we further explore several strategies to enhance LLMs' legal consultation abilities. Our benchmark contributes to advancing research in legal domain dialogue systems, particularly in simulating more real-world user-expert interactions.
♻ ☆ MLP Memory: A Retriever-Pretrained Memory for Large Language Models
Modern approaches to enhancing Large Language Models' factual accuracy and knowledge utilization face a fundamental trade-off: non-parametric retrieval-augmented generation (RAG) provides flexible access to external knowledge but suffers from high inference latency and shallow integration, while parametric fine-tuning methods like LoRA risk catastrophic forgetting and degraded general capabilities. In this work, we propose MLP Memory, a lightweight parametric module that learns to internalize retrieval patterns without explicit document access. By pretraining an MLP to imitate a $k$NN retriever's behavior on the entire pretraining dataset, we create a differentiable memory component that captures the benefits of retrieval-based knowledge access in a fully parametric form. Our architecture integrates this pretrained MLP Memory with Transformer decoders through simple probability interpolation, yielding 17.5\% and 24.1\% scaling gains on WikiText-103 and Web datasets, respectively. It further achieves 12.3\% relative improvement on five question-answering benchmarks and 5.2 points absolute gain across nine general NLP tasks, while reducing hallucinations by up to 10 points on HaluEval. Moreover, MLP Memory delivers 2.5$\times$ faster inference than RAG with superior accuracy. Our findings show that learning retrieval patterns parametrically bridges the gap between efficient inference and effective knowledge access, offering a practical alternative to both RAG and fine-tuning approaches.
♻ ☆ Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models? NeurIPS 2025
Multi-Agent Debate~(MAD) has emerged as a promising paradigm for improving the performance of large language models through collaborative reasoning. Despite recent advances, the key factors driving MAD's effectiveness remain unclear. In this work, we disentangle MAD into two key components--Majority Voting and inter-agent Debate--and assess their respective contributions. Through extensive experiments across seven NLP benchmarks, we find that Majority Voting alone accounts for most of the performance gains typically attributed to MAD. To explain this, we propose a theoretical framework that models debate as a stochastic process. We prove that it induces a martingale over agents' belief trajectories, implying that debate alone does not improve expected correctness. Guided by these insights, we demonstrate that targeted interventions, by biasing the belief update toward correction, can meaningfully enhance debate effectiveness. Overall, our findings suggest that while MAD has potential, simple ensembling methods remain strong and more reliable alternatives in many practical settings. Code is released in https://github.com/deeplearning-wisc/debate-or-vote.
comment: NeurIPS 2025 Spotlight
♻ ☆ Text to Band Gap: Pre-trained Language Models as Encoders for Semiconductor Band Gap Prediction
We investigate transformer-based language models, including RoBERTa, T5, Llama-3, and MatSciBERT, for predicting the band gaps of semiconductor materials directly from textual descriptions. The inputs encode key material features, such as chemical composition, crystal system, space group, and other structural and electronic properties. Unlike shallow machine learning models, which require extensive feature engineering, or Graph Neural Networks, which rely on graph representations derived from atomic coordinates, pretrained language models can process textual inputs directly, eliminating the need for manual feature preprocessing or structure-based encoding. Material descriptions were constructed in two formats: structured strings with a consistent template and natural language narratives generated via the ChatGPT API. Each model was augmented with a custom regression head and finetuned for band gap prediction task. Language models of different architectures and parameter sizes were all able to predict band gaps from human-readable text with strong accuracy, achieving MAEs in the range of 0.25-0.33 eV, highlighting the success of this approach for scientific regression tasks. Finetuned Llama-3, with 1.2 billion parameters, achieved the highest accuracy (MAE 0.248 eV, R2 0.891). MatSciBERT, pretrained on materials science literature, reached comparable performance (MAE 0.288 eV, R2 0.871) with significantly fewer parameters (110 million), emphasizing the importance of domain-specific pretraining. Attention analysis shows that both models selectively focus on compositional and spin-related features while de-emphasizing geometric features, reflecting the difficulty of capturing spatial information from text. These results establish that pretrained language models can effectively extract complex feature-property relationships from textual material descriptions.
♻ ☆ LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration AAAI 2025
Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge due to its inherent difficulties. To address this task we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules while matching the performance of state-of-the-art models trained on a minimal amount of data. Our pipeline consists of two key steps. First, we utilize a universal transcription generator to unify orthographic features into Romanized form and capture common phonetic characteristics across diverse languages. Second, we utilize a universal converter to transform these universal transcriptions into language-specific ones. In experiments, we demonstrate the effectiveness of our proposed method leveraging universal transcriptions for massively multilingual ASR. Our pipeline achieves a relative error reduction rate of 45% when compared to Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper's training data. Furthermore, our pipeline does not rely on any language-specific modules. However, it performs on par with zero-shot ASR approaches which utilize additional language-specific lexicons and language models. We expect this framework to serve as a cornerstone for flexible multilingual ASR systems that are generalizable even to unseen languages.
comment: Accepted to AAAI 2025 (Oral Presentation)
♻ ☆ KAT-Coder Technical Report
Recent advances in large language models (LLMs) have enabled progress in agentic coding, where models autonomously reason, plan, and act within interactive software development workflows. However, bridging the gap between static text-based training and dynamic real-world agentic execution remains a core challenge. In this technical report, we present KAT-Coder, a large-scale agentic code model trained through a multi-stage curriculum encompassing Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and Reinforcement-to-Deployment Adaptation. The Mid-Term stage enhances reasoning, planning, and reflection capabilities through a corpus of real software engineering data and synthetic agentic interactions. The SFT stage constructs a million-sample dataset balancing twenty programming languages, ten development contexts, and ten task archetypes. The RFT stage introduces a novel multi-ground-truth reward formulation for stable and sample-efficient policy optimization. Finally, the Reinforcement-to-Deployment phase adapts the model to production-grade IDE environments using Error-Masked SFT and Tree-Structured Trajectory Training. In summary, these stages enable KAT-Coder to achieve robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. Our KAT series 32B model, KAT-Dev, has been open-sourced on https://huggingface.co/Kwaipilot/KAT-Dev.
♻ ☆ Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards
Large language models for mathematical reasoning are typically trained with outcome-based rewards, which credit only the final answer. In our experiments, we observe that this paradigm is highly susceptible to reward hacking, leading to a substantial overestimation of a model's reasoning ability. This is evidenced by a high incidence of false positives - solutions that reach the correct final answer through an unsound reasoning process. Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps - abrupt jumps to a correct output without a valid preceding derivation. Probing experiments suggest a strong association between these Miracle Steps and memorization, where the model appears to recall the answer directly rather than deriving it. To mitigate this systemic issue, we introduce the Rubric Reward Model (RRM), a process-oriented reward function that evaluates the entire reasoning trajectory against problem-specific rubrics. The generative RRM provides fine-grained, calibrated rewards (0-1) that explicitly penalize logical flaws and encourage rigorous deduction. When integrated into a reinforcement learning pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks. Notably, it boosts Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces the incidence of Miracle Steps by 71%. Our work demonstrates that rewarding the solution process is crucial for building models that are not only more accurate but also more reliable.
comment: 25 pages, 11 figures, 6 Tables
♻ ☆ A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System
The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems. However, systematic progress is hindered by a lack of comprehensive understanding of how benchmarks and solutions interconnect. This survey addresses this gap by providing the first holistic analysis of LLM-powered software engineering, offering insights into evaluation methodologies and solution paradigms. We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair. Our analysis highlights the evolution from simple prompt engineering to sophisticated agentic systems incorporating capabilities like planning, reasoning, memory mechanisms, and tool augmentation. To contextualize this progress, we present a unified pipeline illustrating the workflow from task specification to deliverables, detailing how different solution paradigms address various complexity levels. Unlike prior surveys that focus narrowly on specific aspects, this work connects 50+ benchmarks to their corresponding solution strategies, enabling researchers to identify optimal approaches for diverse evaluation criteria. We also identify critical research gaps and propose future directions, including multi-agent collaboration, self-evolving systems, and formal verification integration. This survey serves as a foundational guide for advancing LLM-driven software engineering. We maintain a GitHub repository that continuously updates the reviewed and related papers at https://github.com/lisaGuojl/LLM-Agent-SE-Survey.
comment: 22 pages
♻ ☆ MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning? NeurIPS 2025
The ability to recognize patterns from examples and apply them to new ones is a primal ability for general intelligence, and is widely studied by psychology and AI researchers. Many benchmarks have been proposed to measure such ability for Large Language Models (LLMs); however, they focus on few-shot (usually <10) setting and lack evaluation for aggregating many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs have brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations often focus on classification, and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To fix the issues from both worlds, we propose MIR-Bench, the first many-shot in-context reasoning benchmark for pattern recognition that asks LLM to predict output via input-output examples from underlying functions with diverse data format. Based on MIR-Bench, we study many novel problems for many-shot in-context reasoning, and acquired many insightful findings including scaling effect, robustness, inductive vs. transductive reasoning, retrieval Augmented Generation (RAG), coding for inductive reasoning, cross-domain generalizability, etc.
comment: 39 pages, 11 figures. The paper is accepted at NeurIPS 2025 Datasets & Benchmarks Track, and the latest version adds modifications in camera-ready
♻ ☆ Sherlock: Self-Correcting Reasoning in Vision-Language Models NeurIPS 2025
Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.
comment: Published at NeurIPS 2025, 27 pages
♻ ☆ RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning NeurIPS 2025
Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
comment: NeurIPS 2025. The first two authors contributed equally
♻ ☆ Quantitative LLM Judges
LLM-as-a-judge is a framework where a large language model (LLM) evaluates the output of another LLM. While LLMs excel at producing qualitative textual evaluations, they often struggle to predict human preferences and numeric scores. We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to humans in a given domain using regression models. The models are trained to improve the score of the original judge using its rationale and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, which is expected in practice. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.
♻ ☆ Annotation Guidelines-Based Knowledge Augmentation: Towards Enhancing Large Language Models for Educational Text Classification
Various machine learning approaches have gained significant popularity for the automated classification of educational text to identify indicators of learning engagement -- i.e. learning engagement classification (LEC). LEC can offer comprehensive insights into human learning processes, attracting significant interest from diverse research communities, including Natural Language Processing (NLP), Learning Analytics, and Educational Data Mining. Recently, Large Language Models (LLMs), such as ChatGPT, have demonstrated remarkable performance in various NLP tasks. However, their comprehensive evaluation and improvement approaches in LEC tasks have not been thoroughly investigated. In this study, we propose the Annotation Guidelines-based Knowledge Augmentation (AGKA) approach to improve LLMs. AGKA employs GPT 4.0 to retrieve label definition knowledge from annotation guidelines, and then applies the random under-sampler to select a few typical examples. Subsequently, we conduct a systematic evaluation benchmark of LEC, which includes six LEC datasets covering behavior classification (question and urgency level), emotion classification (binary and epistemic emotion), and cognition classification (opinion and cognitive presence). The study results demonstrate that AGKA can enhance non-fine-tuned LLMs, particularly GPT 4.0 and Llama 3 70B. GPT 4.0 with AGKA few-shot outperforms full-shot fine-tuned models such as BERT and RoBERTa on simple binary classification datasets. However, GPT 4.0 lags in multi-class tasks that require a deep understanding of complex semantic information. Notably, Llama 3 70B with AGKA is a promising combination based on open-source LLM, because its performance is on par with closed-source GPT 4.0 with AGKA. In addition, LLMs struggle to distinguish between labels with similar names in multi-class classification.
comment: The manuscript has been accepted for publication in IEEE Transactions on Learning Technologies. https://doi.org/10.1109/TLT.2025.3570775
♻ ☆ Hybrid Latent Reasoning via Reinforcement Learning NeurIPS 2025
Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefit from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, the hybrid HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offer insights for future work in latent reasoning.
comment: NeurIPS 2025
♻ ☆ PersonaMatrix: A Recipe for Persona-Aware Evaluation of Legal Summarization
Legal documents are often long, dense, and difficult to comprehend, not only for laypeople but also for legal experts. While automated document summarization has great potential to improve access to legal knowledge, prevailing task-based evaluators overlook divergent user and stakeholder needs. Tool development is needed to encompass the technicality of a case summary for a litigator yet be accessible for a self-help public researching for their lawsuit. We introduce PersonaMatrix, a persona-by-criterion evaluation framework that scores summaries through the lens of six personas, including legal and non-legal users. We also introduce a controlled dimension-shifted pilot dataset of U.S. civil rights case summaries that varies along depth, accessibility, and procedural detail as well as Diversity-Coverage Index (DCI) to expose divergent optima of legal summary between persona-aware and persona-agnostic judges. This work enables refinement of legal AI summarization systems for both expert and non-expert users, with the potential to increase access to legal knowledge. The code base and data are publicly available in GitHub.
comment: Accepted for publication in JURIX 2025 (Legal Knowledge and Information Systems, FAIA series, IOS Press). Long Paper
♻ ☆ AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science
Large language models (LLMs) have advanced the automation of data science workflows. Yet it remains unclear whether they can critically leverage external domain knowledge as human data scientists do in practice. To answer this question, we introduce AssistedDS (Assisted Data Science), a benchmark designed to systematically evaluate how LLMs handle domain knowledge in tabular prediction tasks. AssistedDS features both synthetic datasets with explicitly known generative mechanisms and real-world Kaggle competitions, each accompanied by curated bundles of helpful and adversarial documents. These documents provide domain-specific insights into data cleaning, feature engineering, and model selection. We assess state-of-the-art LLMs on their ability to discern and apply beneficial versus harmful domain knowledge, evaluating submission validity, information recall, and predictive performance. Our results demonstrate three key findings: (1) LLMs frequently exhibit an uncritical adoption of provided information, significantly impairing their predictive performance when adversarial content is introduced, (2) helpful guidance is often insufficient to counteract the negative influence of adversarial information, and (3) in Kaggle datasets, LLMs often make errors in handling time-series data, applying consistent feature engineering across different folds, and interpreting categorical variables correctly. These findings highlight a substantial gap in current models' ability to critically evaluate and leverage expert knowledge, underscoring an essential research direction for developing more robust, knowledge-aware automated data science systems. Our data and code are publicly available here: https://github.com/jeremyxianx/Assisted-DS
♻ ☆ RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing
Recent advancements in Large Language Models (LLMs) have shown outstanding potential for role-playing applications. Evaluating these capabilities is becoming crucial yet remains challenging. Existing benchmarks mostly adopt a \textbf{character-centric} approach, simplify user-character interactions to isolated Q&A tasks, and fail to reflect real-world applications. To address this limitation, we introduce RMTBench, a comprehensive \textbf{user-centric} bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds. RMTBench includes custom characters with detailed backgrounds and abstract characters defined by simple traits, enabling evaluation across various user scenarios. Our benchmark constructs dialogues based on explicit user motivations rather than character descriptions, ensuring alignment with practical user applications. Furthermore, we construct an authentic multi-turn dialogue simulation mechanism. With carefully selected evaluation dimensions and LLM-based scoring, this mechanism captures the complex intention of conversations between the user and the character. By shifting focus from character background to user intention fulfillment, RMTBench bridges the gap between academic evaluation and practical deployment requirements, offering a more effective framework for assessing role-playing capabilities in LLMs. All code and datasets will be released soon. We release the datasets at https://huggingface.co/datasets/xiangh/RMTBENCH.
♻ ☆ Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities NAACL 2025
Recent research has shown that Large Language Models (LLMs) are vulnerable to automated jailbreak attacks, where adversarial suffixes crafted by algorithms appended to harmful queries bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and have low Attack Success Rates (ASR), especially against well-aligned models like Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100\% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99\% ASR on GPT-3.5 and 49\% ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving jailbreak ability, ADV-LLM provides valuable insights for future safety alignment research through its ability to generate large datasets for studying LLM safety. Our code is available at: https://github.com/SunChungEn/ADV-LLM
comment: Accepted to NAACL 2025 Main (Oral)
♻ ☆ Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning ICLR2025
Key-Value (KV) caching is a common technique to enhance the computational efficiency of Large Language Models (LLMs), but its memory overhead grows rapidly with input length. Prior work has shown that not all tokens are equally important for text generation, proposing layer-level KV cache compression to selectively retain key information. Recognizing the distinct roles of attention heads in generation, we propose HeadKV, a head-level KV cache compression method, and HeadKV-R2, which leverages a novel contextual reasoning ability estimation for compression. Our approach operates at the level of individual heads, estimating their importance for contextual QA tasks that require both retrieval and reasoning capabilities. Extensive experiments across diverse benchmarks (LongBench, LooGLE), model architectures (e.g., Llama-3-8B-Instruct, Mistral-7B-Instruct), and long-context abilities tests demonstrate that our head-level KV cache compression significantly outperforms strong baselines, particularly in low-resource settings (KV size = 64 & 128). Notably, our method retains just 1.5% of the KV cache while achieving 97% of the performance of the full KV cache on the contextual question answering benchmark. Codes are available at https://github.com/FYYFU/HeadKV
comment: Accepted to ICLR2025
♻ ☆ Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models EMNLP 2025
As the use of large language model (LLM) agents continues to grow, their safety vulnerabilities have become increasingly evident. Extensive benchmarks evaluate various aspects of LLM safety by defining the safety relying heavily on general standards, overlooking user-specific standards. However, safety standards for LLM may vary based on a user-specific profiles rather than being universally consistent across all users. This raises a critical research question: Do LLM agents act safely when considering user-specific safety standards? Despite its importance for safe LLM use, no benchmark datasets currently exist to evaluate the user-specific safety of LLMs. To address this gap, we introduce U-SafeBench, a benchmark designed to assess user-specific aspect of LLM safety. Our evaluation of 20 widely used LLMs reveals current LLMs fail to act safely when considering user-specific safety standards, marking a new discovery in this field. To address this vulnerability, we propose a simple remedy based on chain-of-thought, demonstrating its effectiveness in improving user-specific safety. Our benchmark and code are available at https://github.com/yeonjun-in/U-SafeBench.
comment: EMNLP 2025 Findings
Information Retrieval 25
☆ Generative Reasoning Recommendation via LLMs
Despite their remarkable reasoning capabilities across diverse domains, large language models (LLMs) face fundamental challenges in natively functioning as generative reasoning recommendation models (GRRMs), where the intrinsic modeling gap between textual semantics and collaborative filtering signals, combined with the sparsity and stochasticity of user feedback, presents significant obstacles. This work explores how to build GRRMs by adapting pre-trained LLMs, which achieves a unified understanding-reasoning-prediction manner for recommendation tasks. We propose GREAM, an end-to-end framework that integrates three components: (i) Collaborative-Semantic Alignment, which fuses heterogeneous textual evidence to construct semantically consistent, discrete item indices and auxiliary alignment tasks that ground linguistic representations in interaction semantics; (ii) Reasoning Curriculum Activation, which builds a synthetic dataset with explicit Chain-of-Thought supervision and a curriculum that progresses through behavioral evidence extraction, latent preference modeling, intent inference, recommendation formulation, and denoised sequence rewriting; and (iii) Sparse-Regularized Group Policy Optimization (SRPO), which stabilizes post-training via Residual-Sensitive Verifiable Reward and Bonus-Calibrated Group Advantage Estimation, enabling end-to-end optimization under verifiable signals despite sparse successes. GREAM natively supports two complementary inference modes: Direct Sequence Recommendation for high-throughput, low-latency deployment, and Sequential Reasoning Recommendation that first emits an interpretable reasoning chain for causal transparency. Experiments on three datasets demonstrate consistent gains over strong baselines, providing a practical path toward verifiable-RL-driven LLM recommenders.
☆ RAGRank: Using PageRank to Counter Poisoning in CTI LLM Pipelines
Retrieval-Augmented Generation (RAG) has emerged as the dominant architectural pattern to operationalize Large Language Model (LLM) usage in Cyber Threat Intelligence (CTI) systems. However, this design is susceptible to poisoning attacks, and previously proposed defenses can fail for CTI contexts as cyber threat information is often completely new for emerging attacks, and sophisticated threat actors can mimic legitimate formats, terminology, and stylistic conventions. To address this issue, we propose that the robustness of modern RAG defenses can be accelerated by applying source credibility algorithms on corpora, using PageRank as an example. In our experiments, we demonstrate quantitatively that our algorithm applies a lower authority score to malicious documents while promoting trusted content, using the standardized MS MARCO dataset. We also demonstrate proof-of-concept performance of our algorithm on CTI documents and feeds.
☆ Analyticup E-commerce Product Search Competition Technical Report from Team Tredence_AICOE
This study presents the multilingual e-commerce search system developed by the Tredence_AICOE team. The competition features two multilingual relevance tasks: Query-Category (QC) Relevance, which evaluates how well a user's search query aligns with a product category, and Query-Item (QI) Relevance, which measures the match between a multilingual search query and an individual product listing. To ensure full language coverage, we performed data augmentation by translating existing datasets into languages missing from the development set, enabling training across all target languages. We fine-tuned Gemma-3 12B and Qwen-2.5 14B model for both tasks using multiple strategies. The Gemma-3 12B (4-bit) model achieved the best QC performance using original and translated data, and the best QI performance using original, translated, and minority class data creation. These approaches secured 4th place on the final leaderboard, with an average F1-score of 0.8857 on the private test set.
☆ Practical Code RAG at Scale: Task-Aware Retrieval Design Choices under Compute Budgets
We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena -- code completion and bug localization -- we systematically compare retrieval configurations across various context window sizes along three axes: (i) chunking strategy, (ii) similarity scoring, and (iii) splitting granularity. (1) For PL-PL, sparse BM25 with word-level splitting is the most effective and practical, significantly outperforming dense alternatives while being an order of magnitude faster. (2) For NL-PL, proprietary dense encoders (Voyager-3 family) consistently beat sparse retrievers, however requiring 100x larger latency. (3) Optimal chunk size scales with available context: 32-64 line chunks work best at small budgets, and whole-file retrieval becomes competitive at 16000 tokens. (4) Simple line-based chunking matches syntax-aware splitting across budgets. (5) Retrieval latency varies by up to 200x across configurations; BPE-based splitting is needlessly slow, and BM25 + word splitting offers the best quality-latency trade-off. Thus, we provide evidence-based recommendations for implementing effective code-oriented RAG systems based on task requirements, model constraints, and computational efficiency.
☆ Rotate Both Ways: Time-and-Order RoPE for Generative Recommendation
Generative recommenders, typically transformer-based autoregressive models, predict the next item or action from a user's interaction history. Their effectiveness depends on how the model represents where an interaction event occurs in the sequence (discrete index) and when it occurred in wall-clock time. Prevailing approaches inject time via learned embeddings or relative attention biases. In this paper, we argue that RoPE-based approaches, if designed properly, can be a stronger alternative for jointly modeling temporal and sequential information in user behavior sequences. While vanilla RoPE in LLMs considers only token order, generative recommendation requires incorporating both event time and token index. To address this, we propose Time-and-Order RoPE (TO-RoPE), a family of rotary position embedding designs that treat index and time as angle sources shaping the query-key geometry directly. We present three instantiations: early fusion, split-by-dim, and split-by-head. Extensive experiments on both publicly available datasets and a proprietary industrial dataset show that TO-RoPE variants consistently improve accuracy over existing methods for encoding time and index. These results position rotary embeddings as a simple, principled, and deployment-friendly foundation for generative recommendation.
☆ From Generation to Attribution: Music AI Agent Architectures for the Post-Streaming Era NeurIPS 2025
Generative AI is reshaping music creation, but its rapid growth exposes structural gaps in attribution, rights management, and economic models. Unlike past media shifts, from live performance to recordings, downloads, and streaming, AI transforms the entire lifecycle of music, collapsing boundaries between creation, distribution, and monetization. However, existing streaming systems, with opaque and concentrated royalty flows, are ill-equipped to handle the scale and complexity of AI-driven production. We propose a content-based Music AI Agent architecture that embeds attribution directly into the creative workflow through block-level retrieval and agentic orchestration. Designed for iterative, session-based interaction, the system organizes music into granular components (Blocks) stored in BlockDB; each use triggers an Attribution Layer event for transparent provenance and real-time settlement. This framework reframes AI from a generative tool into infrastructure for a Fair AI Media Platform. By enabling fine-grained attribution, equitable compensation, and participatory engagement, it points toward a post-streaming paradigm where music functions not as a static catalog but as a collaborative and adaptive ecosystem.
comment: Accepted to the NeurIPS 2025 AI4Music Workshop
☆ Balancing Fine-tuning and RAG: A Hybrid Strategy for Dynamic LLM Recommendation Updates RecSys 2025
Large Language Models (LLMs) empower recommendation systems through their advanced reasoning and planning capabilities. However, the dynamic nature of user interests and content poses a significant challenge: While initial fine-tuning aligns LLMs with domain knowledge and user preferences, it fails to capture such real-time changes, necessitating robust update mechanisms. This paper investigates strategies for updating LLM-powered recommenders, focusing on the trade-offs between ongoing fine-tuning and Retrieval-Augmented Generation (RAG). Using an LLM-powered user interest exploration system as a case study, we perform a comparative analysis of these methods across dimensions like cost, agility, and knowledge incorporation. We propose a hybrid update strategy that leverages the long-term knowledge adaptation of periodic fine-tuning with the agility of low-cost RAG. We demonstrate through live A/B experiments on a billion-user platform that this hybrid approach yields statistically significant improvements in user satisfaction, offering a practical and cost-effective framework for maintaining high-quality LLM-powered recommender systems.
comment: RecSys 2025 Industry Track
☆ Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advancements in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval methods, fusion techniques, and answer generation strategies, and analyze benchmark datasets, evaluation protocols, and performance tradeoffs. Furthermore, we highlight key challenges such as cross-modal alignment, latency-accuracy tradeoffs, and semantic grounding, and outline open problems and future research directions for building more robust and context-aware QA systems leveraging multimedia data.
comment: In Proceedings of the 2nd ACM Workshop in AI-powered Question and Answering Systems (AIQAM '25), October 27-28, 2025, Dublin, Ireland. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3746274.3760393
☆ Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning
Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit instead of token (too fine-grained) or sequence (too coarse), redefining rewards to remove non-causal credit assignment and introducing a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are released at https://github.com/yaochenzhu/Rank-GRPO.
☆ From Questions to Queries: An AI-powered Multi-Agent Framework for Spatial Text-to-SQL
The complexity of Structured Query Language (SQL) and the specialized nature of geospatial functions in tools like PostGIS present significant barriers to non-experts seeking to analyze spatial data. While Large Language Models (LLMs) offer promise for translating natural language into SQL (Text-to-SQL), single-agent approaches often struggle with the semantic and syntactic complexities of spatial queries. To address this, we propose a multi-agent framework designed to accurately translate natural language questions into spatial SQL queries. The framework integrates several innovative components, including a knowledge base with programmatic schema profiling and semantic enrichment, embeddings for context retrieval, and a collaborative multi-agent pipeline as its core. This pipeline comprises specialized agents for entity extraction, metadata retrieval, query logic formulation, SQL generation, and a review agent that performs programmatic and semantic validation of the generated SQL to ensure correctness (self-verification). We evaluate our system using both the non-spatial KaggleDBQA benchmark and a new, comprehensive SpatialQueryQA benchmark that includes diverse geometry types, predicates, and three levels of query complexity. On KaggleDBQA, the system achieved an overall accuracy of 81.2% (221 out of 272 questions) after the review agent's review and corrections. For spatial queries, the system achieved an overall accuracy of 87.7% (79 out of 90 questions), compared with 76.7% without the review agent. Beyond accuracy, results also show that in some instances the system generates queries that are more semantically aligned with user intent than those in the benchmarks. This work makes spatial analysis more accessible, and provides a robust, generalizable foundation for spatial Text-to-SQL systems, advancing the development of autonomous GIS.
☆ Communication Platform for Non-verbal Autistic children in Oman using Android mobile
This paper discusses the issue regarding Non-verbal Autism Spectrum Disorder. It has been observed that this mental disorder is listed in major parts of the world including the US, UK, and India. To mitigate this type of disorder, a wide range of smartphones, computers, and artificial intelligence technologies have been used. This technology has helped the population cope with socialization and communication needs. Many applications have been developed to enhance the communication capabilities of non-verbal autistic children. This thesis project proposes the development of a platform that includes a web panel and an Android mobile application to assist non-verbal autistic children in communication, especially in Oman. Different interventions have been merged to improve the quality of life for people on the autism spectrum. The main problem identified in this case is that fragmented approaches are not suitable for autistic children. The augmented reality framework provides the capability to engage autistic children in creative play and self-reflection through interactive screen-based activities.
☆ Gaussian Mixture Flow Matching with Domain Alignment for Multi-Domain Sequential Recommendation
Users increasingly interact with content across multiple domains, resulting in sequential behaviors marked by frequent and complex transitions. While Cross-Domain Sequential Recommendation (CDSR) models two-domain interactions, Multi-Domain Sequential Recommendation (MDSR) introduces significantly more domain transitions, compounded by challenges such as domain heterogeneity and imbalance. Existing approaches often overlook the intricacies of domain transitions, tend to overfit to dense domains while underfitting sparse ones, and struggle to scale effectively as the number of domains increases. We propose \textit{GMFlowRec}, an efficient generative framework for MDSR that models domain-aware transition trajectories via Gaussian Mixture Flow Matching. GMFlowRec integrates: (1) a unified dual-masked Transformer to disentangle domain-invariant and domain-specific intents, (2) a Gaussian Mixture flow field to capture diverse behavioral patterns, and (3) a domain-aligned prior to support frequent and sparse transitions. Extensive experiments on JD and Amazon datasets demonstrate that GMFlowRec achieves state-of-the-art performance with up to 44\% improvement in NDCG@5, while maintaining high efficiency via a single unified backbone, making it scalable for real-world multi-domain sequential recommendation.
☆ A Multi-Stage Hybrid Framework for Automated Interpretation of Multi-View Engineering Drawings Using Vision Language Model
Engineering drawings are fundamental to manufacturing communication, serving as the primary medium for conveying design intent, tolerances, and production details. However, interpreting complex multi-view drawings with dense annotations remains challenging using manual methods, generic optical character recognition (OCR) systems, or traditional deep learning approaches, due to varied layouts, orientations, and mixed symbolic-textual content. To address these challenges, this paper proposes a three-stage hybrid framework for the automated interpretation of 2D multi-view engineering drawings using modern detection and vision language models (VLMs). In the first stage, YOLOv11-det performs layout segmentation to localize key regions such as views, title blocks, and notes. The second stage uses YOLOv11-obb for orientation-aware, fine-grained detection of annotations, including measures, GD&T symbols, and surface roughness indicators. The third stage employs two Donut-based, OCR-free VLMs for semantic content parsing: the Alphabetical VLM extracts textual and categorical information from title blocks and notes, while the Numerical VLM interprets quantitative data such as measures, GD&T frames, and surface roughness. Two specialized datasets were developed to ensure robustness and generalization: 1,000 drawings for layout detection and 1,406 for annotation-level training. The Alphabetical VLM achieved an overall F1 score of 0.672, while the Numerical VLM reached 0.963, demonstrating strong performance in textual and quantitative interpretation, respectively. The unified JSON output enables seamless integration with CAD and manufacturing databases, providing a scalable solution for intelligent engineering drawing analysis.
comment: This draft has been submitted to the 13th International Conference on Industrial Engineering and Applications (ICIEA 2026)
♻ ☆ X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation
Large Language Models (LLMs) have been shown to enhance the effectiveness of enriching item descriptions, thereby improving the accuracy of recommendation systems. However, most existing approaches either rely on text-only prompting or employ basic multimodal strategies that do not fully exploit the complementary information available from both textual and visual modalities. This paper introduces a novel framework, Cross-Reflection Prompting, termed X-Reflect, designed to address these limitations by prompting Multimodal Large Language Models (MLLMs) to explicitly identify and reconcile supportive and conflicting information between text and images. By capturing nuanced insights from both modalities, this approach generates more comprehensive and contextually rich item representations. Extensive experiments conducted on two widely used benchmarks demonstrate that our method outperforms existing prompting baselines in downstream recommendation accuracy. Furthermore, we identify a U-shaped relationship between text-image dissimilarity and recommendation performance, suggesting the benefit of applying multimodal prompting selectively. To support efficient real-time inference, we also introduce X-Reflect-keyword, a lightweight variant that summarizes image content using keywords and replaces the base model with a smaller backbone, achieving nearly 50% reduction in input length while maintaining competitive performance. This work underscores the importance of integrating multimodal information and presents an effective solution for improving item understanding in multimodal recommendation systems.
♻ ☆ CoRECT: A Framework for Evaluating Embedding Compression Techniques at Scale
Dense retrieval systems have proven to be effective across various benchmarks, but require substantial memory to store large search indices. Recent advances in embedding compression show that index sizes can be greatly reduced with minimal loss in ranking quality. However, existing studies often overlook the role of corpus complexity -- a critical factor, as recent work shows that both corpus size and document length strongly affect dense retrieval performance. In this paper, we introduce CoRECT (Controlled Retrieval Evaluation of Compression Techniques), a framework for large-scale evaluation of embedding compression methods, supported by a newly curated dataset collection. To demonstrate its utility, we benchmark eight representative types of compression methods. Notably, we show that non-learned compression achieves substantial index size reduction, even on up to 100M passages, with statistically insignificant performance loss. However, selecting the optimal compression method remains challenging, as performance varies across models. Such variability highlights the necessity of CoRECT to enable consistent comparison and informed selection of compression methods. All code, data, and results are available on GitHub and HuggingFace.
♻ ☆ HoMer: Addressing Heterogeneities by Modeling Sequential and Set-wise Contexts for CTR Prediction
Click-through rate (CTR) prediction, which models behavior sequence and non-sequential features (e.g., user/item profiles or cross features) to infer user interest, underpins industrial recommender systems. However, most methods face three forms of heterogeneity that degrade predictive performance: (i) Feature Heterogeneity persists when limited sequence side features provide less granular interest representation compared to extensive non-sequential features, thereby impairing sequence modeling performance; (ii) Context Heterogeneity arises because a user's interest in an item will be influenced by other items, yet point-wise prediction neglects cross-item interaction context from the entire item set; (iii) Architecture Heterogeneity stems from the fragmented integration of specialized network modules, which compounds the model's effectiveness, efficiency and scalability in industrial deployments. To tackle the above limitations, we propose HoMer, a Homogeneous-Oriented TransforMer for modeling sequential and set-wise contexts. First, we align sequence side features with non-sequential features for accurate sequence modeling and fine-grained interest representation. Second, we shift the prediction paradigm from point-wise to set-wise, facilitating cross-item interaction in a highly parallel manner. Third, HoMer's unified encoder-decoder architecture achieves dual optimization through structural simplification and shared computation, ensuring computational efficiency while maintaining scalability with model size. Without arduous modification to the prediction pipeline, HoMer successfully scales up and outperforms our industrial baseline by 0.0099 in the AUC metric, and enhances online business metrics like CTR/RPM by 1.99%/2.46%. Additionally, HoMer saves 27% of GPU resources via preliminary engineering optimization, further validating its superiority and practicality.
comment: 10 pages, 6 figures
♻ ☆ On Function-Correcting Codes in the Lee Metric
Function-correcting codes are a coding framework designed to minimize redundancy while ensuring that specific functions or computations of encoded data can be reliably recovered, even in the presence of errors. The choice of metric is crucial in designing such codes, as it determines which computations must be protected and how errors are measured and corrected. Previous work by Liu and Liu [6] studied function-correcting codes over $\mathbb{Z}_{2^l},\ l\geq 2$ using the homogeneous metric, which coincides with the Lee metric over $\mathbb{Z}_4$. In this paper, we extend the study to codes over $\mathbb{Z}_m,$ for any positive integer $m\geq 2$ under the Lee metric and aim to determine their optimal redundancy. To achieve this, we introduce irregular Lee distance codes and derive upper and lower bounds on the optimal redundancy by characterizing the shortest possible length of such codes. These general bounds are then simplified and applied to specific classes of functions, including Lee-local functions, Lee weight functions, and Lee weight distribution functions. We extend the bounds established by Liu and Liu [6] for codes over $\mathbb{Z}_4$ in the Lee metric to the more general setting of $\mathbb{Z}_m$. Additionally, we explicitly derive a Plotkin-like bound for linear function-correcting codes in the Lee metric. As the Lee metric coincides with the Hamming metric over the binary field, we demonstrate that our bound naturally reduces to a Plotkin-type bound for function-correcting codes under the Hamming metric over $\mathbb{Z}_2$. Furthermore, when the underlying function is bijective, function-correcting codes reduce to classical error-correcting codes. In parallel, our bound correspondingly reduces to the classical Plotkin bound for error-correcting codes, both for the Lee metric over $\mathbb{Z}_m$ and for the Hamming metric over $\mathbb{Z}_2$.
♻ ☆ Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval
Recent progress in text-video retrieval has been largely driven by contrastive learning. However, existing methods often overlook the effect of the modality gap, which causes anchor representations to undergo in-place optimization (i.e., optimization tension) that limits their alignment capacity. Moreover, noisy hard negatives further distort the semantics of anchors. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment $\Delta_{ij}$ between text $t_i$ and video $v_j$, redistributing gradients to relieve optimization tension and absorb noise. We derive $\Delta_{ij}$ via a multivariate first-order Taylor expansion of the InfoNCE loss under a trust-region constraint, showing that it guides updates along locally consistent descent directions. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. Furthermore, we regularize $\Delta$ through a variational information bottleneck with relaxed compression, enhancing stability and semantic consistency. Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness, validating the effectiveness of gap-aware tension mitigation. Code is available at https://github.com/musicman217/GARE-text-video-retrieval.
♻ ☆ Learning to Hash for Recommendation: A Survey
With the explosive growth of users and items, Recommender Systems are facing unprecedented challenges in terms of retrieval efficiency and storage overhead. Learning to Hash techniques have emerged as a promising solution to these issues by encoding high-dimensional data into compact hash codes. As a result, hashing-based recommendation methods (HashRec) have garnered growing attention for enabling large-scale and efficient recommendation services. This survey provides a comprehensive overview of state-of-the-art HashRec algorithms. Specifically, we begin by introducing the common two-tower architecture used in the recall stage and by detailing two predominant hash search strategies. Then, we categorize existing works into a three-tier taxonomy based on: (i) learning objectives, (ii) optimization strategies, and (iii) recommendation scenarios. Additionally, we summarize widely adopted evaluation metrics for assessing both the effectiveness and efficiency of HashRec algorithms. Finally, we discuss current limitations in the field and outline promising directions for future research. We index these HashRec methods at the repository \href{https://github.com/Luo-Fangyuan/HashRec}{https://github.com/Luo-Fangyuan/HashRec}.
♻ ☆ Scalable Dynamic Embedding Size Search for Streaming Recommendation CIKM 2024
Recommender systems typically represent users and items by learning their embeddings, which are usually set to uniform dimensions and dominate the model parameters. However, real-world recommender systems often operate in streaming recommendation scenarios, where the number of users and items continues to grow, leading to substantial storage resource consumption for these embeddings. Although a few methods attempt to mitigate this by employing embedding size search strategies to assign different embedding dimensions in streaming recommendations, they assume that the embedding size grows with the frequency of users/items, which eventually still exceeds the predefined memory budget over time. To address this issue, this paper proposes to learn Scalable Lightweight Embeddings for streaming recommendation, called SCALL, which can adaptively adjust the embedding sizes of users/items within a given memory budget over time. Specifically, we propose to sample embedding sizes from a probabilistic distribution, with the guarantee to meet any predefined memory budget. By fixing the memory budget, the proposed embedding size sampling strategy can increase and decrease the embedding sizes in accordance to the frequency of the corresponding users or items. Furthermore, we develop a reinforcement learning-based search paradigm that models each state with mean pooling to keep the length of the state vectors fixed, invariant to the changing number of users and items. As a result, the proposed method can provide embedding sizes to unseen users and items. Comprehensive empirical evaluations on two public datasets affirm the advantageous effectiveness of our proposed method.
comment: accepted to CIKM 2024 Code is available at https://github.com/qykcq/Scalable-Dynamic-Embedding-Size-Search-for-Streaming-Recommendation
♻ ☆ Budgeted Embedding Table For Recommender Systems WSDM 2024
At the heart of contemporary recommender systems (RSs) are latent factor models that provide quality recommendation experience to users. These models use embedding vectors, which are typically of a uniform and fixed size, to represent users and items. As the number of users and items continues to grow, this design becomes inefficient and hard to scale. Recent lightweight embedding methods have enabled different users and items to have diverse embedding sizes, but are commonly subject to two major drawbacks. Firstly, they limit the embedding size search to optimizing a heuristic balancing the recommendation quality and the memory complexity, where the trade-off coefficient needs to be manually tuned for every memory budget requested. The implicitly enforced memory complexity term can even fail to cap the parameter usage, making the resultant embedding table fail to meet the memory budget strictly. Secondly, most solutions, especially reinforcement learning based ones derive and optimize the embedding size for each each user/item on an instance-by-instance basis, which impedes the search efficiency. In this paper, we propose Budgeted Embedding Table (BET), a novel method that generates table-level actions (i.e., embedding sizes for all users and items) that is guaranteed to meet pre-specified memory budgets. Furthermore, by leveraging a set-based action formulation and engaging set representation learning, we present an innovative action search strategy powered by an action fitness predictor that efficiently evaluates each table-level action. Experiments have shown state-of-the-art performance on two real-world datasets when BET is paired with three popular recommender models under different memory budgets.
comment: Accepted to WSDM 2024. Code is available at https://github.com/qykcq/Budgeted-Embedding-Table-For-Recommender-Systems
♻ ☆ Continuous Input Embedding Size Search For Recommender Systems SIGIR'23
Latent factor models are the most popular backbones for today's recommender systems owing to their prominent performance. Latent factor models represent users and items as real-valued embedding vectors for pairwise similarity computation, and all embeddings are traditionally restricted to a uniform size that is relatively large (e.g., 256-dimensional). With the exponentially expanding user base and item catalog in contemporary e-commerce, this design is admittedly becoming memory-inefficient. To facilitate lightweight recommendation, reinforcement learning (RL) has recently opened up opportunities for identifying varying embedding sizes for different users/items. However, challenged by search efficiency and learning an optimal RL policy, existing RL-based methods are restricted to highly discrete, predefined embedding size choices. This leads to a largely overlooked potential of introducing finer granularity into embedding sizes to obtain better recommendation effectiveness under a given memory budget. In this paper, we propose continuous input embedding size search (CIESS), a novel RL-based method that operates on a continuous search space with arbitrary embedding sizes to choose from. In CIESS, we further present an innovative random walk-based exploration strategy to allow the RL policy to efficiently explore more candidate embedding sizes and converge to a better decision. CIESS is also model-agnostic and hence generalizable to a variety of latent factor RSs, whilst experiments on two real-world datasets have shown state-of-the-art performance of CIESS under different memory budgets when paired with three popular recommendation models.
comment: To appear in SIGIR'23. Code is available at https://github.com/qykcq/Continuous-Input-Embedding-Size-Search-For-Recommender-Systems
♻ ☆ Efficient Multimodal Streaming Recommendation via Expandable Side Mixture-of-Experts CIKM 2025
Streaming recommender systems (SRSs) are widely deployed in real-world applications, where user interests shift and new items arrive over time. As a result, effectively capturing users' latest preferences is challenging, as interactions reflecting recent interests are limited and new items often lack sufficient feedback. A common solution is to enrich item representations using multimodal encoders (e.g., BERT or ViT) to extract visual and textual features. However, these encoders are pretrained on general-purpose tasks: they are not tailored to user preference modeling, and they overlook the fact that user tastes toward modality-specific features such as visual styles and textual tones can also drift over time. This presents two key challenges in streaming scenarios: the high cost of fine-tuning large multimodal encoders, and the risk of forgetting long-term user preferences due to continuous model updates. To tackle these challenges, we propose Expandable Side Mixture-of-Experts (XSMoE), a memory-efficient framework for multimodal streaming recommendation. XSMoE attaches lightweight side-tuning modules consisting of expandable expert networks to frozen pretrained encoders and incrementally expands them in response to evolving user feedback. A gating router dynamically combines expert and backbone outputs, while a utilization-based pruning strategy maintains model compactness. By learning new patterns through expandable experts without overwriting previously acquired knowledge, XSMoE effectively captures both cold start and shifting preferences in multimodal features. Experiments on three real-world datasets demonstrate that XSMoE outperforms state-of-the-art baselines in both recommendation quality and computational efficiency.
comment: Accepted to CIKM 2025. Code is available at https://github.com/qykcq/Efficient-Multimodal-Streaming-Recommendation-via-Expandable-Side-Mixture-of-Experts
♻ ☆ Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation
Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model's inference with obstetric practice. To train FetalMind at scale, we curate FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: https://hexiao0275.github.io/FetalMind.
comment: This paper contains fundamental errors and will not be replaced
♻ ☆ Efficiently Constructing Sparse Navigable Graphs
Graph-based nearest neighbor search methods have seen a surge of popularity in recent years, offering state-of-the-art performance across a wide variety of applications. Central to these methods is the task of constructing a sparse navigable search graph for a given dataset endowed with a distance function. Unfortunately, doing so is computationally expensive, so heuristics are universally used in practice. In this work, we initiate the study of fast algorithms with provable guarantees for search graph construction. For a dataset with $n$ data points, the problem of constructing an optimally sparse navigable graph can be framed as $n$ separate but highly correlated minimum set cover instances. This yields a naive $O(n^3)$ time greedy algorithm that returns a navigable graph whose sparsity is at most $O(\log n)$ higher than optimal. We improve significantly on this baseline, taking advantage of correlation between the set cover instances to leverage techniques from streaming and sublinear-time set cover algorithms. By also introducing problem-specific pre-processing techniques, we obtain an $\tilde{O}(n^2)$ time algorithm for constructing an $O(\log n)$-approximate sparsest navigable graph under any distance function. The runtime of our method is optimal up to logarithmic factors under the Strong Exponential Time Hypothesis via a reduction from Monochromatic Closest Pair. Moreover, we prove that, as with general set cover, obtaining better than an $O(\log n)$-approximation is NP-hard, despite the significant additional structure present in the navigable graph problem. Finally, we show that our approach can also beat cubic time for the closely related and practically important problems of constructing $\alpha$-shortcut reachable and $\tau$-monotonic graphs, which are also used for nearest neighbor search. For such graphs, we obtain $\tilde{O}(n^{2.5})$ time or better algorithms.
Multimedia 10
☆ Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% on the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
☆ Mitigating Cross-modal Representation Bias for Multicultural Image-to-Recipe Retrieval
Existing approaches for image-to-recipe retrieval have the implicit assumption that a food image can fully capture the details textually documented in its recipe. However, a food image only reflects the visual outcome of a cooked dish and not the underlying cooking process. Consequently, learning cross-modal representations to bridge the modality gap between images and recipes tends to ignore subtle, recipe-specific details that are not visually apparent but are crucial for recipe retrieval. Specifically, the representations are biased to capture the dominant visual elements, resulting in difficulty in ranking similar recipes with subtle differences in use of ingredients and cooking methods. The bias in representation learning is expected to be more severe when the training data is mixed of images and recipes sourced from different cuisines. This paper proposes a novel causal approach that predicts the culinary elements potentially overlooked in images, while explicitly injecting these elements into cross-modal representation learning to mitigate biases. Experiments are conducted on the standard monolingual Recipe1M dataset and a newly curated multilingual multicultural cuisine dataset. The results indicate that the proposed causal representation learning is capable of uncovering subtle ingredients and cooking actions and achieves impressive retrieval performance on both monolingual and multilingual multicultural datasets.
comment: ACM Multimedia 2025
☆ DMC$^3$: Dual-Modal Counterfactual Contrastive Construction for Egocentric Video Question Answering
Egocentric Video Question Answering (Egocentric VideoQA) plays an important role in egocentric video understanding, which refers to answering questions based on first-person videos. Although existing methods have made progress through the paradigm of pre-training and fine-tuning, they ignore the unique challenges posed by the first-person perspective, such as understanding multiple events and recognizing hand-object interactions. To deal with these challenges, we propose a Dual-Modal Counterfactual Contrastive Construction (DMC$^3$) framework, which contains an egocentric videoqa baseline, a counterfactual sample construction module and a counterfactual sample-involved contrastive optimization. Specifically, We first develop a counterfactual sample construction module to generate positive and negative samples for textual and visual modalities through event description paraphrasing and core interaction mining, respectively. Then, We feed these samples together with the original samples into the baseline. Finally, in the counterfactual sample-involved contrastive optimization module, we apply contrastive loss to minimize the distance between the original sample features and the positive sample features, while maximizing the distance from the negative samples. Experiments show that our method achieve 52.51\% and 46.04\% on the \textit{normal} and \textit{indirect} splits of EgoTaskQA, and 13.2\% on QAEGO4D, both reaching the state-of-the-art performance.
☆ Causal Debiasing for Visual Commonsense Reasoning
Visual Commonsense Reasoning (VCR) refers to answering questions and providing explanations based on images. While existing methods achieve high prediction accuracy, they often overlook bias in datasets and lack debiasing strategies. In this paper, our analysis reveals co-occurrence and statistical biases in both textual and visual data. We introduce the VCR-OOD datasets, comprising VCR-OOD-QA and VCR-OOD-VA subsets, which are designed to evaluate the generalization capabilities of models across two modalities. Furthermore, we analyze the causal graphs and prediction shortcuts in VCR and adopt a backdoor adjustment method to remove bias. Specifically, we create a dictionary based on the set of correct answers to eliminate prediction shortcuts. Experiments demonstrate the effectiveness of our debiasing method across different datasets.
☆ GMFVAD: Using Grained Multi-modal Feature to Improve Video Anomaly Detection
Video anomaly detection (VAD) is a challenging task that detects anomalous frames in continuous surveillance videos. Most previous work utilizes the spatio-temporal correlation of visual features to distinguish whether there are abnormalities in video snippets. Recently, some works attempt to introduce multi-modal information, like text feature, to enhance the results of video anomaly detection. However, these works merely incorporate text features into video snippets in a coarse manner, overlooking the significant amount of redundant information that may exist within the video snippets. Therefore, we propose to leverage the diversity among multi-modal information to further refine the extracted features, reducing the redundancy in visual features, and we propose Grained Multi-modal Feature for Video Anomaly Detection (GMFVAD). Specifically, we generate more grained multi-modal feature based on the video snippet, which summarizes the main content, and text features based on the captions of original video will be introduced to further enhance the visual features of highlighted portions. Experiments show that the proposed GMFVAD achieves state-of-the-art performance on four mainly datasets. Ablation experiments also validate that the improvement of GMFVAD is due to the reduction of redundant information.
☆ Calibrating Multimodal Consensus for Emotion Recognition
In recent years, Multimodal Emotion Recognition (MER) has made substantial progress. Nevertheless, most existing approaches neglect the semantic inconsistencies that may arise across modalities, such as conflicting emotional cues between text and visual inputs. Besides, current methods are often dominated by the text modality due to its strong representational capacity, which can compromise recognition accuracy. To address these challenges, we propose a model termed Calibrated Multimodal Consensus (CMC). CMC introduces a Pseudo Label Generation Module (PLGM) to produce pseudo unimodal labels, enabling unimodal pretraining in a self-supervised fashion. It then employs a Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for multimodal finetuning, thereby mitigating text dominance and guiding the fusion process toward a more reliable consensus. Experimental results demonstrate that CMC achieves performance on par with or superior to state-of-the-art methods across four datasets, CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI, and exhibits notable advantages in scenarios with semantic inconsistencies on CH-SIMS and CH-SIMS v2. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CMC.
☆ Beyond Text: Multimodal Jailbreaking of Vision-Language and Audio Models through Perceptually Simple Transformations
Multimodal large language models (MLLMs) have achieved remarkable progress, yet remain critically vulnerable to adversarial attacks that exploit weaknesses in cross-modal processing. We present a systematic study of multimodal jailbreaks targeting both vision-language and audio-language models, showing that even simple perceptual transformations can reliably bypass state-of-the-art safety filters. Our evaluation spans 1,900 adversarial prompts across three high-risk safety categories harmful content, CBRN (Chemical, Biological, Radiological, Nuclear), and CSEM (Child Sexual Exploitation Material) tested against seven frontier models. We explore the effectiveness of attack techniques on MLLMs, including FigStep-Pro (visual keyword decomposition), Intelligent Masking (semantic obfuscation), and audio perturbations (Wave-Echo, Wave-Pitch, Wave-Speed). The results reveal severe vulnerabilities: models with almost perfect text-only safety (0\% ASR) suffer >75\% attack success under perceptually modified inputs, with FigStep-Pro achieving up to 89\% ASR in Llama-4 variants. Audio-based attacks further uncover provider-specific weaknesses, with even basic modality transfer yielding 25\% ASR for technical queries. These findings expose a critical gap between text-centric alignment and multimodal threats, demonstrating that current safeguards fail to generalize across cross-modal attacks. The accessibility of these attacks, which require minimal technical expertise, suggests that robust multimodal AI safety will require a paradigm shift toward broader semantic-level reasoning to mitigate possible risks.
☆ Can Current Detectors Catch Face-to-Voice Deepfake Attacks? ACSA
The rapid advancement of generative models has enabled the creation of increasingly stealthy synthetic voices, commonly referred to as audio deepfakes. A recent technique, FOICE [USENIX'24], demonstrates a particularly alarming capability: generating a victim's voice from a single facial image, without requiring any voice sample. By exploiting correlations between facial and vocal features, FOICE produces synthetic voices realistic enough to bypass industry-standard authentication systems, including WeChat Voiceprint and Microsoft Azure. This raises serious security concerns, as facial images are far easier for adversaries to obtain than voice samples, dramatically lowering the barrier to large-scale attacks. In this work, we investigate two core research questions: (RQ1) can state-of-the-art audio deepfake detectors reliably detect FOICE-generated speech under clean and noisy conditions, and (RQ2) whether fine-tuning these detectors on FOICE data improves detection without overfitting, thereby preserving robustness to unseen voice generators such as SpeechT5. Our study makes three contributions. First, we present the first systematic evaluation of FOICE detection, showing that leading detectors consistently fail under both standard and noisy conditions. Second, we introduce targeted fine-tuning strategies that capture FOICE-specific artifacts, yielding significant accuracy improvements. Third, we assess generalization after fine-tuning, revealing trade-offs between specialization to FOICE and robustness to unseen synthesis pipelines. These findings expose fundamental weaknesses in today's defenses and motivate new architectures and training protocols for next-generation audio deepfake detection.
comment: 8 pages, Accepted at Workshop on AI for Cyber Threat Intelligence, co-located with ACSAC 2025
♻ ☆ Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval
Recent progress in text-video retrieval has been largely driven by contrastive learning. However, existing methods often overlook the effect of the modality gap, which causes anchor representations to undergo in-place optimization (i.e., optimization tension) that limits their alignment capacity. Moreover, noisy hard negatives further distort the semantics of anchors. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment $\Delta_{ij}$ between text $t_i$ and video $v_j$, redistributing gradients to relieve optimization tension and absorb noise. We derive $\Delta_{ij}$ via a multivariate first-order Taylor expansion of the InfoNCE loss under a trust-region constraint, showing that it guides updates along locally consistent descent directions. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. Furthermore, we regularize $\Delta$ through a variational information bottleneck with relaxed compression, enhancing stability and semantic consistency. Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness, validating the effectiveness of gap-aware tension mitigation. Code is available at https://github.com/musicman217/GARE-text-video-retrieval.
♻ ☆ Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation
Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model's inference with obstetric practice. To train FetalMind at scale, we curate FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: https://hexiao0275.github.io/FetalMind.
comment: This paper contains fundamental errors and will not be replaced
Image and Video Processing 6
☆ GUSL-Dehaze: A Green U-Shaped Learning Approach to Image Dehazing
Image dehazing is a restoration task that aims to recover a clear image from a single hazy input. Traditional approaches rely on statistical priors and the physics-based atmospheric scattering model to reconstruct the haze-free image. While recent state-of-the-art methods are predominantly based on deep learning architectures, these models often involve high computational costs and large parameter sizes, making them unsuitable for resource-constrained devices. In this work, we propose GUSL-Dehaze, a Green U-Shaped Learning approach to image dehazing. Our method integrates a physics-based model with a green learning (GL) framework, offering a lightweight, transparent alternative to conventional deep learning techniques. Unlike neural network-based solutions, GUSL-Dehaze completely avoids deep learning. Instead, we begin with an initial dehazing step using a modified Dark Channel Prior (DCP), which is followed by a green learning pipeline implemented through a U-shaped architecture. This architecture employs unsupervised representation learning for effective feature extraction, together with feature-engineering techniques such as the Relevant Feature Test (RFT) and the Least-Squares Normal Transform (LNT) to maintain a compact model size. Finally, the dehazed image is obtained via a transparent supervised learning strategy. GUSL-Dehaze significantly reduces parameter count while ensuring mathematical interpretability and achieving performance on par with state-of-the-art deep learning models.
☆ Efficient Meningioma Tumor Segmentation Using Ensemble Learning MICCAI
Meningiomas represent the most prevalent form of primary brain tumors, comprising nearly one-third of all diagnosed cases. Accurate delineation of these tumors from MRI scans is crucial for guiding treatment strategies, yet remains a challenging and time-consuming task in clinical practice. Recent developments in deep learning have accelerated progress in automated tumor segmentation; however, many advanced techniques are hindered by heavy computational demands and long training schedules, making them less accessible for researchers and clinicians working with limited hardware. In this work, we propose a novel ensemble-based segmentation approach that combines three distinct architectures: (1) a baseline SegResNet model, (2) an attention-augmented SegResNet with concatenative skip connections, and (3) a dual-decoder U-Net enhanced with attention-gated skip connections (DDUNet). The ensemble aims to leverage architectural diversity to improve robustness and accuracy while significantly reducing training demands. Each baseline model was trained for only 20 epochs and Evaluated on the BraTS-MEN 2025 dataset. The proposed ensemble model achieved competitive performance, with average Lesion-Wise Dice scores of 77.30%, 76.37% and 73.9% on test dataset for Enhancing Tumor (ET), Tumor Core (TC) and Whole Tumor (WT) respectively. These results highlight the effectiveness of ensemble learning for brain tumor segmentation, even under limited hardware constraints. Our proposed method provides a practical and accessible tool for aiding the diagnosis of meningioma, with potential impact in both clinical and research settings.
comment: 2nd Place Winner in the BraTS 2025 MICCAI Challenge (Task 2: Meningioma Tumor Segmentation)
☆ Eye-Tracking as a Tool to Quantify the Effects of CAD Display on Radiologists' Interpretation of Chest Radiographs
Rationale and Objectives: Computer-aided detection systems for chest radiographs are widely used, and concurrent reader displays, such as bounding-box (BB) highlights, may influence the reading process. This pilot study used eye tracking to conduct a preliminary experiment to quantify which aspects of visual search were affected. Materials and Methods: We sampled 180 chest radiographs from the VinDR-CXR dataset: 120 with solitary pulmonary nodules or masses and 60 without. The BBs were configured to yield an overall display sensitivity and specificity of 80%. Three radiologists (with 11, 5, and 1 years of experience, respectively) interpreted each case twice - once with BBs visible and once without - after a washout of >= 2 weeks. Eye movements were recorded using an EyeTech VT3 Mini. Metrics included interpretation time, time to first fixation on the lesion, lesion dwell time, total gaze-path length, and lung-field coverage ratio. Outcomes were modeled using a linear mixed model, with reading condition as a fixed effect and case and reader as random intercepts. The primary analysis was restricted to true positives (n=96). Results: Concurrent BB display prolonged interpretation time by 4.9 s (p<0.001) and increased lesion dwell time by 1.3 s (p<0.001). Total gaze-path length increased by 2,076 pixels (p<0.001), and lung-field coverage ratio increased by 10.5% (p<0.001). Time to first fixation on the lesion was reduced by 1.3 s (p<0.001). Conclusion: Eye tracking captured measurable alterations in search behavior associated with concurrent BB displays during chest radiograph interpretation. These findings support the feasibility of this approach and highlight the need for larger studies to confirm effects and explore implications across modalities and clinical contexts.
♻ ☆ Quantization-Aware Neuromorphic Architecture for Efficient Skin Disease Classification on Resource-Constrained Devices
Accurate and efficient skin lesion classification on edge devices is critical for accessible dermatological care but remains challenging due to computational, energy, and privacy constraints. We introduce QANA, a novel quantization-aware neuromorphic architecture for incremental skin lesion classification on resource-limited hardware. QANA effectively integrates ghost modules, efficient channel attention, and squeeze-and-excitation blocks for robust feature representation with low-latency and energy-efficient inference. Its quantization-aware head and spike-compatible transformations enable seamless conversion to spiking neural networks (SNNs) and deployment on neuromorphic platforms. Evaluation on the large-scale HAM10000 benchmark and a real-world clinical dataset shows that QANA achieves 91.6% Top-1 accuracy and 82.4% macro F1 on HAM10000, and 90.8%/81.7% on the clinical dataset, significantly outperforming state-of-the-art CNN-to-SNN models under fair comparison. Deployed on BrainChip Akida hardware, QANA achieves 1.5 ms inference latency and 1.7,mJ energy per image, reducing inference latency and energy use by over 94.6%/98.6% compared to GPU-based CNNs surpassing state-of-the-art CNN-to-SNN conversion baselines. These results demonstrate the effectiveness of QANA for accurate, real-time, and privacy-sensitive medical analysis in edge environments.
♻ ☆ Grids Often Outperform Implicit Neural Representation at Compressing Dense Signals
Implicit Neural Representations (INRs) have recently shown impressive results, but their fundamental capacity, implicit biases, and scaling behavior remain poorly understood. We investigate the performance of diverse INRs across a suite of 2D and 3D real and synthetic signals with varying effective bandwidth, as well as both overfitting and generalization tasks including tomography, super-resolution, and denoising. By stratifying performance according to model size as well as signal type and bandwidth, our results shed light on how different INR and grid representations allocate their capacity. We find that, for most tasks and signals, a simple regularized grid with interpolation trains faster and to higher quality than any INR with the same number of parameters. We also find limited settings--namely fitting binary signals such as shape contours--where INRs outperform grids, to guide future development and use of INRs towards the most advantageous applications.
comment: Our analysis are available at https://github.com/voilalab/INR-benchmark
♻ ☆ A robust and versatile deep learning model for prediction of the arterial input function in dynamic small animal $\left[^{18}\text{F}\right]$FDG PET imaging
Dynamic positron emission tomography (PET) and kinetic modeling are pivotal in advancing tracer development research in small animal studies. Accurate kinetic modeling requires precise input function estimation, traditionally achieved via arterial blood sampling. However, arterial cannulation in small animals like mice, involves intricate, time-consuming, and terminal procedures, precluding longitudinal studies. This work proposes a non-invasive, fully convolutional deep learning-based approach (FC-DLIF) to predict input functions directly from PET imaging, potentially eliminating the need for blood sampling in dynamic small-animal PET. The proposed FC-DLIF model includes a spatial feature extractor acting on the volumetric time frames of the PET sequence, extracting spatial features. These are subsequently further processed in a temporal feature extractor that predicts the arterial input function. The proposed approach is trained and evaluated using images and arterial blood curves from [$^{18}$F]FDG data using cross validation. Further, the model applicability is evaluated on imaging data and arterial blood curves collected using two additional radiotracers ([$^{18}$F]FDOPA, and [$^{68}$Ga]PSMA). The model was further evaluated on data truncated and shifted in time, to simulate shorter, and shifted, PET scans. The proposed FC-DLIF model reliably predicts the arterial input function with respect to mean squared error and correlation. Furthermore, the FC-DLIF model is able to predict the arterial input function even from truncated and shifted samples. The model fails to predict the AIF from samples collected using different radiotracers, as these are not represented in the training data. Our deep learning-based input function offers a non-invasive and reliable alternative to arterial blood sampling, proving robust and flexible to temporal shifts and different scan durations.
comment: 21 pages, 14 figures
Computer Vision and Pattern Recognition 100
☆ Is This Tracker On? A Benchmark Protocol for Dynamic Tracking
We introduce ITTO, a challenging new benchmark suite for evaluating and diagnosing the capabilities and limitations of point tracking methods. Our videos are sourced from existing datasets and egocentric real-world recordings, with high-quality human annotations collected through a multi-stage pipeline. ITTO captures the motion complexity, occlusion patterns, and object diversity characteristic of real-world scenes -- factors that are largely absent in current benchmarks. We conduct a rigorous analysis of state-of-the-art tracking methods on ITTO, breaking down performance along key axes of motion complexity. Our findings reveal that existing trackers struggle with these challenges, particularly in re-identifying points after occlusion, highlighting critical failure modes. These results point to the need for new modeling approaches tailored to real-world dynamics. We envision ITTO as a foundation testbed for advancing point tracking and guiding the development of more robust tracking algorithms.
comment: Project page: https://glab-caltech.github.io/ITTO/
☆ olmOCR 2: Unit Test Rewards for Document OCR
We present olmOCR 2, the latest in our family of powerful OCR systems for converting digitized print documents, like PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision language model (VLM) trained using reinforcement learning with verifiable rewards (RLVR), where our rewards are a diverse set of binary unit tests. To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions. We release our model, data and code under permissive open licenses.
comment: https://olmocr.allen.ai/
☆ How to Evaluate Monocular Depth Estimation?
Monocular depth estimation is an important task with rapid progress, but how to evaluate it remains an open question, as evidenced by a lack of standardization in existing literature and a large selection of evaluation metrics whose trade-offs and behaviors are not well understood. This paper contributes a novel, quantitative analysis of existing metrics in terms of their sensitivity to various types of perturbations of ground truth, emphasizing comparison to human judgment. Our analysis reveals that existing metrics are severely under-sensitive to curvature perturbation such as making flat surfaces wavy. To remedy this, we introduce a new metric based on relative surface normals, along with new depth visualization tools and a principled method to create composite metrics with better human alignment. Code and data are available at: https://github.com/princeton-vl/evalmde.
☆ Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.
☆ Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models
Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop once the deployment distribution diverges from the training distribution. To address this, Test-Time Adaptation (TTA) methods update models using unlabeled target data. However, existing approaches often ignore two key challenges: prototype degradation in long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose \textbf{C}lass-Aware \textbf{P}rototype \textbf{L}earning with \textbf{N}egative \textbf{C}ontrast(\textbf{CPL-NC}), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a \textit{Class-Aware Prototype Cache} Module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes to retain rare-category knowledge. Additionally, a \textit{Negative Contrastive Learning} Mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods across both ResNet-50 and ViT-B/16 backbones.
☆ OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation
This paper introduces OmniMotion-X, a versatile multimodal framework for whole-body human motion generation, leveraging an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatial-temporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, we propose the use of reference motion as a novel conditioning signal, substantially enhancing the consistency of generated content, style, and temporal dynamics crucial for realistic animations. To handle multimodal conflicts, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date, integrating 28 publicly available MoCap sources across 10 distinct tasks, standardized to the SMPL-X format at 30 fps. To ensure detailed and consistent annotations, we render sequences into videos and use GPT-4o to automatically generate structured and hierarchical captions, capturing both low-level actions and high-level semantics. Extensive experimental evaluations confirm that OmniMotion-X significantly surpasses existing methods, demonstrating state-of-the-art performance across multiple multimodal tasks and enabling the interactive generation of realistic, coherent, and controllable long-duration motions.
☆ Adaptive Distribution-aware Quantization for Mixed-Precision Neural Networks
Quantization-Aware Training (QAT) is a critical technique for deploying deep neural networks on resource-constrained devices. However, existing methods often face two major challenges: the highly non-uniform distribution of activations and the static, mismatched codebooks used in weight quantization. To address these challenges, we propose Adaptive Distribution-aware Quantization (ADQ), a mixed-precision quantization framework that employs a differentiated strategy. The core of ADQ is a novel adaptive weight quantization scheme comprising three key innovations: (1) a quantile-based initialization method that constructs a codebook closely aligned with the initial weight distribution; (2) an online codebook adaptation mechanism based on Exponential Moving Average (EMA) to dynamically track distributional shifts; and (3) a sensitivity-informed strategy for mixed-precision allocation. For activations, we integrate a hardware-friendly non-uniform-to-uniform mapping scheme. Comprehensive experiments validate the effectiveness of our method. On ImageNet, ADQ enables a ResNet-18 to achieve 71.512% Top-1 accuracy with an average bit-width of only 2.81 bits, outperforming state-of-the-art methods under comparable conditions. Furthermore, detailed ablation studies on CIFAR-10 systematically demonstrate the individual contributions of each innovative component, validating the rationale and effectiveness of our design.
comment: 16 pages, 10 figures
☆ A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation
Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent \textit{multi-step iterations} and \textit{complex backbone networks} lead to prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, \textbf{Diffusion Caching} offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process. By enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching evolves from \textit{static reuse} to \textit{dynamic prediction}. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time and efficient generative AI, injecting new vitality into both theory and practice of \textit{Efficient Generative Intelligence}.
comment: 22 pages,2 figures
☆ Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning NeurIPS 2025
To enable embodied agents to operate effectively over extended timeframes, it is crucial to develop models that form and access memories to stay contextualized in their environment. In the current paradigm of training transformer-based policies for embodied sequential decision-making tasks, visual inputs often overwhelm the context limits of transformers, while humans can maintain and utilize a lifetime of experience compressed as memories. Significant compression is possible in principle, as much of the input is irrelevant and can be abstracted. However, existing approaches predominantly focus on either recurrent models with fixed-size memory or transformers with full-context reliance. In this work, we propose Memo, a transformer-based architecture and training recipe for reinforcement learning (RL) on memory-intensive, long-horizon tasks. Memo incorporates the creation and retrieval of memory by interleaving periodic summarization tokens with the inputs of a model during training. We demonstrate Memo's effectiveness on a gridworld meta-RL benchmark and a multi-object navigation task in photo-realistic indoor settings. Memo outperforms naive long-context transformer baselines while being more compute and storage efficient. Additionally, Memo generalizes better to longer contexts at inference time and remains robust in streaming settings, where historical context must be truncated to fit inference constraints.
comment: Accepted for Spotlight Presentation at NeurIPS 2025
☆ LyTimeT: Towards Robust and Interpretable State-Variable Discovery
Extracting the true dynamical variables of a system from high-dimensional video is challenging due to distracting visual factors such as background motion, occlusions, and texture changes. We propose LyTimeT, a two-phase framework for interpretable variable extraction that learns robust and stable latent representations of dynamical systems. In Phase 1, LyTimeT employs a spatio-temporal TimeSformer-based autoencoder that uses global attention to focus on dynamically relevant regions while suppressing nuisance variation, enabling distraction-robust latent state learning and accurate long-horizon video prediction. In Phase 2, we probe the learned latent space, select the most physically meaningful dimensions using linear correlation analysis, and refine the transition dynamics with a Lyapunov-based stability regularizer to enforce contraction and reduce error accumulation during roll-outs. Experiments on five synthetic benchmarks and four real-world dynamical systems, including chaotic phenomena, show that LyTimeT achieves mutual information and intrinsic dimension estimates closest to ground truth, remains invariant under background perturbations, and delivers the lowest analytical mean squared error among CNN-based (TIDE) and transformer-only baselines. Our results demonstrate that combining spatio-temporal attention with stability constraints yields predictive models that are not only accurate but also physically interpretable.
☆ Explainable Face Presentation Attack Detection via Ensemble-CAM
Presentation attacks represent a critical security threat where adversaries use fake biometric data, such as face, fingerprint, or iris images, to gain unauthorized access to protected systems. Various presentation attack detection (PAD) systems have been designed leveraging deep learning (DL) models to mitigate this type of threat. Despite their effectiveness, most of the DL models function as black boxes - their decisions are opaque to their users. The purpose of explainability techniques is to provide detailed information about the reason behind the behavior or decision of DL models. In particular, visual explanation is necessary to better understand the decisions or predictions of DL-based PAD systems and determine the key regions due to which a biometric image is considered real or fake by the system. In this work, a novel technique, Ensemble-CAM, is proposed for providing visual explanations for the decisions made by deep learning-based face PAD systems. Our goal is to improve DL-based face PAD systems by providing a better understanding of their behavior. Our provided visual explanations will enhance the transparency and trustworthiness of DL-based face PAD systems.
☆ Curvilinear Structure-preserving Unpaired Cross-domain Medical Image Translation
Unpaired image-to-image translation has emerged as a crucial technique in medical imaging, enabling cross-modality synthesis, domain adaptation, and data augmentation without costly paired datasets. Yet, existing approaches often distort fine curvilinear structures, such as microvasculature, undermining both diagnostic reliability and quantitative analysis. This limitation is consequential in ophthalmic and vascular imaging, where subtle morphological changes carry significant clinical meaning. We propose Curvilinear Structure-preserving Translation (CST), a general framework that explicitly preserves fine curvilinear structures during unpaired translation by integrating structure consistency into the training. Specifically, CST augments baseline models with a curvilinear extraction module for topological supervision. It can be seamlessly incorporated into existing methods. We integrate it into CycleGAN and UNSB as two representative backbones. Comprehensive evaluation across three imaging modalities: optical coherence tomography angiography, color fundus and X-ray coronary angiography demonstrates that CST improves translation fidelity and achieves state-of-the-art performance. By reinforcing geometric integrity in learned mappings, CST establishes a principled pathway toward curvilinear structure-aware cross-domain translation in medical imaging.
☆ I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs
Multimodal large language models (MLLMs) achieve strong performance on vision-language tasks, yet their visual processing is opaque. Most black-box evaluations measure task accuracy, but reveal little about underlying mechanisms. Drawing on cognitive psychology, we adapt classic visual search paradigms -- originally developed to study human perception -- to test whether MLLMs exhibit the ``pop-out'' effect, where salient visual features are detected independently of distractor set size. Using controlled experiments targeting colour, size and lighting features, we find that advanced MLLMs exhibit human-like pop-out effects in colour or size-based disjunctive (single feature) search, as well as capacity limits for conjunctive (multiple feature) search. We also find evidence to suggest that MLLMs, like humans, incorporate natural scene priors such as lighting direction into object representations. We reinforce our findings using targeted fine-tuning and mechanistic interpretability analyses. Our work shows how visual search can serve as a cognitively grounded diagnostic tool for evaluating perceptual capabilities in MLLMs.
comment: Preprint
☆ From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction
Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: the world models are mostly learned for world simulation and decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, the synergistic facilitation mechanism of world modeling for planning still requires further exploration. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but is also able to benefit planning using the learned world knowledge through the proposed action-free future state forecasting scheme. Through collaborative state-action prediction, PWM can mimic the human-like anticipatory perception, yielding more reliable planning performance. To facilitate the efficiency of video forecasting, we further introduce a dynamically enhanced parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss. Despite utilizing only front camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs. Code and model weights will be released at https://github.com/6550Zhao/Policy-World-Model.
comment: Accepted by NuerIPS 2025 (Poster)
☆ MedReason-R1: Learning to Reason for CT Diagnosis with Reinforcement Learning and Local Zoom
General-purpose large Vision-Language Models (VLMs) demonstrate strong capabilities in generating detailed descriptions for natural images. However, their performance in the medical domain remains suboptimal, even for relatively straightforward tasks, primarily due to the lack of large-scale, high-quality, specialized medical imaging datasets and the neglect of the diagnostic process that progresses from coarse to fine-grained. To address the first issue, we construct the CT-RATE-VQA dataset, which has 84K QA pairs. For the second issue, we propose MedReason-R1, a medical VLM with explicit reasoning process for disease diagnosis. MedReason-R1 incorporates a novel strategy that embeds zoom-in disease region-of-interest areas into the image, highlighting the crucial role of both global localization and disease-specific details in enhancing the model's diagnostic performance. Furthermore, we introduce the GRPO reinforcement learning framework to MedReason-R1, which enables effective reasoning without relying on costly manual annotations. Compared to recent general-purpose and medical VLMs, MedReason-R1 achieves state-of-the-art performance in CT disease diagnosis while retaining generalization. The code, checkpoints, and dataset are available at: https://github.com/Leevan001/MedReason-R1
comment: The code, checkpoints, and dataset are available at: https://github.com/Leevan001/MedReason-R1
☆ Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning ICCV 2025
Existing Moment Retrieval methods face three critical bottlenecks: (1) data scarcity forces models into shallow keyword-feature associations; (2) boundary ambiguity in transition regions between adjacent events; (3) insufficient discrimination of fine-grained semantics (e.g., distinguishing ``kicking" vs. ``throwing" a ball). In this paper, we propose a zero-external-dependency Augmented Moment Retrieval framework, AMR, designed to overcome local optima caused by insufficient data annotations and the lack of robust boundary and semantic discrimination capabilities. AMR is built upon two key insights: (1) it resolves ambiguous boundary information and semantic confusion in existing annotations without additional data (avoiding costly manual labeling), and (2) it preserves boundary and semantic discriminative capabilities enhanced by training while generalizing to real-world scenarios, significantly improving performance. Furthermore, we propose a two-stage training framework with cold-start and distillation adaptation. The cold-start stage employs curriculum learning on augmented data to build foundational boundary/semantic awareness. The distillation stage introduces dual query sets: Original Queries maintain DETR-based localization using frozen Base Queries from the cold-start model, while Active Queries dynamically adapt to real-data distributions. A cross-stage distillation loss enforces consistency between Original and Base Queries, preventing knowledge forgetting while enabling real-world generalization. Experiments on multiple benchmarks show that AMR achieves improved performance over prior state-of-the-art approaches.
comment: This work is accepted by ICCV 2025
☆ Pragmatic Heterogeneous Collaborative Perception via Generative Communication Mechanism NeurIPS 2025
Multi-agent collaboration enhances the perception capabilities of individual agents through information sharing. However, in real-world applications, differences in sensors and models across heterogeneous agents inevitably lead to domain gaps during collaboration. Existing approaches based on adaptation and reconstruction fail to support pragmatic heterogeneous collaboration due to two key limitations: (1) Intrusive retraining of the encoder or core modules disrupts the established semantic consistency among agents; and (2) accommodating new agents incurs high computational costs, limiting scalability. To address these challenges, we present a novel Generative Communication mechanism (GenComm) that facilitates seamless perception across heterogeneous multi-agent systems through feature generation, without altering the original network, and employs lightweight numerical alignment of spatial information to efficiently integrate new agents at minimal cost. Specifically, a tailored Deformable Message Extractor is designed to extract spatial message for each collaborator, which is then transmitted in place of intermediate features. The Spatial-Aware Feature Generator, utilizing a conditional diffusion model, generates features aligned with the ego agent's semantic space while preserving the spatial information of the collaborators. These generated features are further refined by a Channel Enhancer before fusion. Experiments conducted on the OPV2V-H, DAIR-V2X and V2X-Real datasets demonstrate that GenComm outperforms existing state-of-the-art methods, achieving an 81\% reduction in both computational cost and parameter count when incorporating new agents. Our code is available at https://github.com/jeffreychou777/GenComm.
comment: 19 pages, 10 figures, accepted to NeurIPS 2025
☆ Beyond sparse denoising in frames: minimax estimation with a scattering transform
A considerable amount of research in harmonic analysis has been devoted to non-linear estimators of signals contaminated by additive Gaussian noise. They are implemented by thresholding coefficients in a frame, which provide a sparse signal representation, or by minimising their $\ell^1$ norm. However, sparse estimators in frames are not sufficiently rich to adapt to complex signal regularities. For cartoon images whose edges are piecewise $\bf C^\alpha$ curves, wavelet, curvelet and Xlet frames are suboptimal if the Lipschitz exponent $\alpha \leq 2$ is an unknown parameter. Deep convolutional neural networks have recently obtained much better numerical results, which reach the minimax asymptotic bounds for all $\alpha$. Wavelet scattering coefficients have been introduced as simplified convolutional neural network models. They are computed by transforming the modulus of wavelet coefficients with a second wavelet transform. We introduce a denoising estimator by jointly minimising and maximising the $\ell^1$ norms of different subsets of scattering coefficients. We prove that these $\ell^1$ norms capture different types of geometric image regularity. Numerical experiments show that this denoising estimator reaches the minimax asymptotic bound for cartoon images for all Lipschitz exponents $\alpha \leq 2$. We state this numerical result as a mathematical conjecture. It provides a different harmonic analysis approach to suppress noise from signals, and to specify the geometric regularity of functions. It also opens a mathematical bridge between harmonic analysis and denoising estimators with deep convolutional network.
☆ XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography
Vision-language models (VLMs) have recently shown remarkable zero-shot performance in medical image understanding, yet their grounding ability, the extent to which textual concepts align with visual evidence, remains underexplored. In the medical domain, however, reliable grounding is essential for interpretability and clinical adoption. In this work, we present the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays across seven CLIP-style VLM variants. We generate visual explanations using cross-attention and similarity-based localization maps, and quantitatively assess their alignment with radiologist-annotated regions across multiple pathologies. Our analysis reveals that: (1) while all VLM variants demonstrate reasonable localization for large and well-defined pathologies, their performance substantially degrades for small or diffuse lesions; (2) models that are pretrained on chest X-ray-specific datasets exhibit improved alignment compared to those trained on general-domain data. (3) The overall recognition ability and grounding ability of the model are strongly correlated. These findings underscore that current VLMs, despite their strong recognition ability, still fall short in clinically reliable grounding, highlighting the need for targeted interpretability benchmarks before deployment in medical practice. XBench code is available at https://github.com/Roypic/Benchmarkingattention
☆ CBDiff:Conditional Bernoulli Diffusion Models for Image Forgery Localization
Image Forgery Localization (IFL) is a crucial task in image forensics, aimed at accurately identifying manipulated or tampered regions within an image at the pixel level. Existing methods typically generate a single deterministic localization map, which often lacks the precision and reliability required for high-stakes applications such as forensic analysis and security surveillance. To enhance the credibility of predictions and mitigate the risk of errors, we introduce an advanced Conditional Bernoulli Diffusion Model (CBDiff). Given a forged image, CBDiff generates multiple diverse and plausible localization maps, thereby offering a richer and more comprehensive representation of the forgery distribution. This approach addresses the uncertainty and variability inherent in tampered regions. Furthermore, CBDiff innovatively incorporates Bernoulli noise into the diffusion process to more faithfully reflect the inherent binary and sparse properties of forgery masks. Additionally, CBDiff introduces a Time-Step Cross-Attention (TSCAttention), which is specifically designed to leverage semantic feature guidance with temporal steps to improve manipulation detection. Extensive experiments on eight publicly benchmark datasets demonstrate that CBDiff significantly outperforms existing state-of-the-art methods, highlighting its strong potential for real-world deployment.
☆ Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting for obtaining fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks. The code will be available at https://github.com/HYUNJS/DecAF.
comment: Project page: https://www.jshyun.me/projects/decaf
☆ Digitizing Paper ECGs at Scale: An Open-Source Algorithm for Clinical Research
Millions of clinical ECGs exist only as paper scans, making them unusable for modern automated diagnostics. We introduce a fully automated, modular framework that converts scanned or photographed ECGs into digital signals, suitable for both clinical and research applications. The framework is validated on 37,191 ECG images with 1,596 collected at Akershus University Hospital, where the algorithm obtains a mean signal-to-noise ratio of 19.65 dB on scanned papers with common artifacts. It is further evaluated on the Emory Paper Digitization ECG Dataset, comprising 35,595 images, including images with perspective distortion, wrinkles, and stains. The model improves on the state-of-the-art in all subcategories. The full software is released as open-source, promoting reproducibility and further development. We hope the software will contribute to unlocking retrospective ECG archives and democratize access to AI-driven diagnostics.
☆ Uncertainty evaluation of segmentation models for Earth observation
This paper investigates methods for estimating uncertainty in semantic segmentation predictions derived from satellite imagery. Estimating uncertainty for segmentation presents unique challenges compared to standard image classification, requiring scalable methods producing per-pixel estimates. While most research on this topic has focused on scene understanding or medical imaging, this work benchmarks existing methods specifically for remote sensing and Earth observation applications. Our evaluation focuses on the practical utility of uncertainty measures, testing their ability to identify prediction errors and noise-corrupted input image regions. Experiments are conducted on two remote sensing datasets, PASTIS and ForTy, selected for their differences in scale, geographic coverage, and label confidence. We perform an extensive evaluation featuring several models, such as Stochastic Segmentation Networks and ensembles, in combination with a number of neural architectures and uncertainty metrics. We make a number of practical recommendations based on our findings.
☆ Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
This paper presents a novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary models is achievable. Our study provides the first comprehensive analysis of these models' capabilities and limits for this task.
comment: Under review. Both the dataset and code will be published
☆ Addressing the Depth-of-Field Constraint: A New Paradigm for High Resolution Multi-Focus Image Fusion
Multi-focus image fusion (MFIF) addresses the depth-of-field (DOF) limitations of optical lenses, where only objects within a specific range appear sharp. Although traditional and deep learning methods have advanced the field, challenges persist, including limited training data, domain gaps from synthetic datasets, and difficulties with regions lacking information. We propose VAEEDOF, a novel MFIF method that uses a distilled variational autoencoder for high-fidelity, efficient image reconstruction. Our fusion module processes up to seven images simultaneously, enabling robust fusion across diverse focus points. To address data scarcity, we introduce MattingMFIF, a new syntetic 4K dataset, simulating realistic DOF effects from real photographs. Our method achieves state-of-the-art results, generating seamless artifact-free fused images and bridging the gap between synthetic and real-world scenarios, offering a significant step forward in addressing complex MFIF challenges. The code, and weights are available here:
☆ Multi-modal Co-learning for Earth Observation: Enhancing single-modality models via modality collaboration
Multi-modal co-learning is emerging as an effective paradigm in machine learning, enabling models to collaboratively learn from different modalities to enhance single-modality predictions. Earth Observation (EO) represents a quintessential domain for multi-modal data analysis, wherein diverse remote sensors collect data to sense our planet. This unprecedented volume of data introduces novel challenges. Specifically, the access to the same sensor modalities at both training and inference stages becomes increasingly complex based on real-world constraints affecting remote sensing platforms. In this context, multi-modal co-learning presents a promising strategy to leverage the vast amount of sensor-derived data available at the training stage to improve single-modality models for inference-time deployment. Most current research efforts focus on designing customized solutions for either particular downstream tasks or specific modalities available at the inference stage. To address this, we propose a novel multi-modal co-learning framework capable of generalizing across various tasks without targeting a specific modality for inference. Our approach combines contrastive and modality discriminative learning together to guide single-modality models to structure the internal model manifold into modality-shared and modality-specific information. We evaluate our framework on four EO benchmarks spanning classification and regression tasks across different sensor modalities, where only one of the modalities available during training is accessible at inference time. Our results demonstrate consistent predictive improvements over state-of-the-art approaches from the recent machine learning and computer vision literature, as well as EO-specific methods. The obtained findings validate our framework in the single-modality inference scenarios across a diverse range of EO applications.
comment: Accepted at the Machine Learning journal, CfP: Discovery Science 2024
☆ VGD: Visual Geometry Gaussian Splatting for Feed-Forward Surround-view Driving Reconstruction
Feed-forward surround-view autonomous driving scene reconstruction offers fast, generalizable inference ability, which faces the core challenge of ensuring generalization while elevating novel view quality. Due to the surround-view with minimal overlap regions, existing methods typically fail to ensure geometric consistency and reconstruction quality for novel views. To tackle this tension, we claim that geometric information must be learned explicitly, and the resulting features should be leveraged to guide the elevating of semantic quality in novel views. In this paper, we introduce \textbf{Visual Gaussian Driving (VGD)}, a novel feed-forward end-to-end learning framework designed to address this challenge. To achieve generalizable geometric estimation, we design a lightweight variant of the VGGT architecture to efficiently distill its geometric priors from the pre-trained VGGT to the geometry branch. Furthermore, we design a Gaussian Head that fuses multi-scale geometry tokens to predict Gaussian parameters for novel view rendering, which shares the same patch backbone as the geometry branch. Finally, we integrate multi-scale features from both geometry and Gaussian head branches to jointly supervise a semantic refinement model, optimizing rendering quality through feature-consistent learning. Experiments on nuScenes demonstrate that our approach significantly outperforms state-of-the-art methods in both objective metrics and subjective quality under various settings, which validates VGD's scalability and high-fidelity surround-view reconstruction.
comment: 10 pages, 7 figures
☆ Can You Trust What You See? Alpha Channel No-Box Attacks on Video Object Detection
As object detection models are increasingly deployed in cyber-physical systems such as autonomous vehicles (AVs) and surveillance platforms, ensuring their security against adversarial threats is essential. While prior work has explored adversarial attacks in the image domain, those attacks in the video domain remain largely unexamined, especially in the no-box setting. In this paper, we present {\alpha}-Cloak, the first no-box adversarial attack on object detectors that operates entirely through the alpha channel of RGBA videos. {\alpha}-Cloak exploits the alpha channel to fuse a malicious target video with a benign video, resulting in a fused video that appears innocuous to human viewers but consistently fools object detectors. Our attack requires no access to model architecture, parameters, or outputs, and introduces no perceptible artifacts. We systematically study the support for alpha channels across common video formats and playback applications, and design a fusion algorithm that ensures visual stealth and compatibility. We evaluate {\alpha}-Cloak on five state-of-the-art object detectors, a vision-language model, and a multi-modal large language model (Gemini-2.0-Flash), demonstrating a 100% attack success rate across all scenarios. Our findings reveal a previously unexplored vulnerability in video-based perception systems, highlighting the urgent need for defenses that account for the alpha channel in adversarial settings.
☆ HAD: Hierarchical Asymmetric Distillation to Bridge Spatio-Temporal Gaps in Event-Based Object Tracking
RGB cameras excel at capturing rich texture details with high spatial resolution, whereas event cameras offer exceptional temporal resolution and a high dynamic range (HDR). Leveraging their complementary strengths can substantially enhance object tracking under challenging conditions, such as high-speed motion, HDR environments, and dynamic background interference. However, a significant spatio-temporal asymmetry exists between these two modalities due to their fundamentally different imaging mechanisms, hindering effective multi-modal integration. To address this issue, we propose {Hierarchical Asymmetric Distillation} (HAD), a multi-modal knowledge distillation framework that explicitly models and mitigates spatio-temporal asymmetries. Specifically, HAD proposes a hierarchical alignment strategy that minimizes information loss while maintaining the student network's computational efficiency and parameter compactness. Extensive experiments demonstrate that HAD consistently outperforms state-of-the-art methods, and comprehensive ablation studies further validate the effectiveness and necessity of each designed component. The code will be released soon.
☆ A Matter of Time: Revealing the Structure of Time in Vision-Language Models
Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs by a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit ``timeline'' representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.
☆ The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models
Text-to-image (T2I) models offer great potential for creating virtually limitless synthetic data, a valuable resource compared to fixed and finite real datasets. Previous works evaluate the utility of synthetic data from T2I models on three key desiderata: quality, diversity, and consistency. While prompt engineering is the primary means of interacting with T2I models, the systematic impact of prompt complexity on these critical utility axes remains underexplored. In this paper, we first conduct synthetic experiments to motivate the difficulty of generalization w.r.t. prompt complexity and explain the observed difficulty with theoretical derivations. Then, we introduce a new evaluation framework that can compare the utility of real data and synthetic data, and present a comprehensive analysis of how prompt complexity influences the utility of synthetic data generated by commonly used T2I models. We conduct our study across diverse datasets, including CC12M, ImageNet-1k, and DCI, and evaluate different inference-time intervention methods. Our synthetic experiments show that generalizing to more general conditions is harder than the other way round, since the former needs an estimated likelihood that is not learned by diffusion models. Our large-scale empirical experiments reveal that increasing prompt complexity results in lower conditional diversity and prompt consistency, while reducing the synthetic-to-real distribution shift, which aligns with the synthetic experiments. Moreover, current inference-time interventions can augment the diversity of the generations at the expense of moving outside the support of real data. Among those interventions, prompt expansion, by deliberately using a pre-trained language model as a likelihood estimator, consistently achieves the highest performance in both image diversity and aesthetics, even higher than that of real data.
☆ [De|Re]constructing VLMs' Reasoning in Counting
Vision-Language Models (VLMs) have recently gained attention due to their competitive performance on multiple downstream tasks, achieved by following user-input instructions. However, VLMs still exhibit several limitations in visual reasoning, such as difficulties in identifying relations (e.g., spatial, temporal, and among objects), understanding temporal sequences (e.g., frames), and counting objects. In this work, we go beyond score-level benchmark evaluations of VLMs by investigating the underlying causes of their failures and proposing a targeted approach to improve their reasoning capabilities. We study the reasoning skills of seven state-of-the-art VLMs in the counting task under controlled experimental conditions. Our experiments show that VLMs are highly sensitive to the number and type of objects, their spatial arrangement, and the co-occurrence of distractors. A layer-wise analysis reveals that errors are due to incorrect mapping of the last-layer representation into the output space. Our targeted training shows that fine-tuning just the output layer improves accuracy by up to 21%. We corroborate these findings by achieving consistent improvements on real-world datasets.
comment: This work has been submitted to the IEEE for possible publication
☆ PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis NeurIPS 2025
Pairwise camera pose estimation from sparsely overlapping image pairs remains a critical and unsolved challenge in 3D vision. Most existing methods struggle with image pairs that have small or no overlap. Recent approaches attempt to address this by synthesizing intermediate frames using video interpolation and selecting key frames via a self-consistency score. However, the generated frames are often blurry due to small overlap inputs, and the selection strategies are slow and not explicitly aligned with pose estimation. To solve these cases, we propose Hybrid Video Generation (HVG) to synthesize clearer intermediate frames by coupling a video interpolation model with a pose-conditioned novel view synthesis model, where we also propose a Feature Matching Selector (FMS) based on feature correspondence to select intermediate frames appropriate for pose estimation from the synthesized results. Extensive experiments on Cambridge Landmarks, ScanNet, DL3DV-10K, and NAVI demonstrate that, compared to existing SOTA methods, PoseCrafter can obviously enhance the pose estimation performances, especially on examples with small or no overlap.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ CARES: Context-Aware Resolution Selector for VLMs
Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens ofter to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce \emph{CARES}-a \textbf{C}ontext-\textbf{A}ware \textbf{R}esolution \textbf{S}elector, a lightweight preprocessing module that, given an image-query pair, predicts the \emph{minimal} sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
☆ Towards Single-Source Domain Generalized Object Detection via Causal Visual Prompts
Single-source Domain Generalized Object Detection (SDGOD), as a cutting-edge research topic in computer vision, aims to enhance model generalization capability in unseen target domains through single-source domain training. Current mainstream approaches attempt to mitigate domain discrepancies via data augmentation techniques. However, due to domain shift and limited domain-specific knowledge, models tend to fall into the pitfall of spurious correlations. This manifests as the model's over-reliance on simplistic classification features (e.g., color) rather than essential domain-invariant representations like object contours. To address this critical challenge, we propose the Cauvis (Causal Visual Prompts) method. First, we introduce a Cross-Attention Prompts module that mitigates bias from spurious features by integrating visual prompts with cross-attention. To address the inadequate domain knowledge coverage and spurious feature entanglement in visual prompts for single-domain generalization, we propose a dual-branch adapter that disentangles causal-spurious features while achieving domain adaptation via high-frequency feature extraction. Cauvis achieves state-of-the-art performance with 15.9-31.4% gains over existing domain generalization methods on SDGOD datasets, while exhibiting significant robustness advantages in complex interference environments.
comment: 10 pages, 5 figures
☆ Mitigating representation bias caused by missing pixels in methane plume detection ACL
Most satellite images have systematically missing pixels (i.e., missing data not at random (MNAR)) due to factors such as clouds. If not addressed, these missing pixels can lead to representation bias in automated feature extraction models. In this work, we show that spurious association between the label and the number of missing values in methane plume detection can cause the model to associate the coverage (i.e., the percentage of valid pixels in an image) with the label, subsequently under-detecting plumes in low-coverage images. We evaluate multiple imputation approaches to remove the dependence between the coverage and a label. Additionally, we propose a weighted resampling scheme during training that removes the association between the label and the coverage by enforcing class balance in each coverage bin. Our results show that both resampling and imputation can significantly reduce the representation bias without hurting balanced accuracy, precision, or recall. Finally, we evaluate the capability of the debiased models using these techniques in an operational scenario and demonstrate that the debiased models have a higher chance of detecting plumes in low-coverage images.
comment: Accepted at the MACLEAN workshop at ECML-PKDD 2025
☆ PRGCN: A Graph Memory Network for Cross-Sequence Pattern Reuse in 3D Human Pose Estimation
Monocular 3D human pose estimation remains a fundamentally ill-posed inverse problem due to the inherent depth ambiguity in 2D-to-3D lifting. While contemporary video-based methods leverage temporal context to enhance spatial reasoning, they operate under a critical paradigm limitation: processing each sequence in isolation, thereby failing to exploit the strong structural regularities and repetitive motion patterns that pervade human movement across sequences. This work introduces the Pattern Reuse Graph Convolutional Network (PRGCN), a novel framework that formalizes pose estimation as a problem of pattern retrieval and adaptation. At its core, PRGCN features a graph memory bank that learns and stores a compact set of pose prototypes, encoded as relational graphs, which are dynamically retrieved via an attention mechanism to provide structured priors. These priors are adaptively fused with hard-coded anatomical constraints through a memory-driven graph convolution, ensuring geometrical plausibility. To underpin this retrieval process with robust spatiotemporal features, we design a dual-stream hybrid architecture that synergistically combines the linear-complexity, local temporal modeling of Mamba-based state-space models with the global relational capacity of self-attention. Extensive evaluations on Human3.6M and MPI-INF-3DHP benchmarks demonstrate that PRGCN establishes a new state-of-the-art, achieving an MPJPE of 37.1mm and 13.4mm, respectively, while exhibiting enhanced cross-domain generalization capability. Our work posits that the long-overlooked mechanism of cross-sequence pattern reuse is pivotal to advancing the field, shifting the paradigm from per-sequence optimization towards cumulative knowledge learning.
comment: 29 pages, 6 figures, 6 tables
☆ Predicting before Reconstruction: A generative prior framework for MRI acceleration
Recent advancements in artificial intelligence have created transformative capabilities in image synthesis and generation, enabling diverse research fields to innovate at revolutionary speed and spectrum. In this study, we leverage this generative power to introduce a new paradigm for accelerating Magnetic Resonance Imaging (MRI), introducing a shift from image reconstruction to proactive predictive imaging. Despite being a cornerstone of modern patient care, MRI's lengthy acquisition times limit clinical throughput. Our novel framework addresses this challenge by first predicting a target contrast image, which then serves as a data-driven prior for reconstructing highly under-sampled data. This informative prior is predicted by a generative model conditioned on diverse data sources, such as other contrast images, previously scanned images, acquisition parameters, patient information. We demonstrate this approach with two key applications: (1) reconstructing FLAIR images using predictions from T1w and/or T2w scans, and (2) reconstructing T1w images using predictions from previously acquired T1w scans. The framework was evaluated on internal and multiple public datasets (total 14,921 scans; 1,051,904 slices), including multi-channel k-space data, for a range of high acceleration factors (x4, x8 and x12). The results demonstrate that our prediction-prior reconstruction method significantly outperforms other approaches, including those with alternative or no prior information. Through this framework we introduce a fundamental shift from image reconstruction towards a new paradigm of predictive imaging.
comment: 33 pages, 8figures
☆ PCP-GAN: Property-Constrained Pore-scale image reconstruction via conditional Generative Adversarial Networks
Obtaining truly representative pore-scale images that match bulk formation properties remains a fundamental challenge in subsurface characterization, as natural spatial heterogeneity causes extracted sub-images to deviate significantly from core-measured values. This challenge is compounded by data scarcity, where physical samples are only available at sparse well locations. This study presents a multi-conditional Generative Adversarial Network (cGAN) framework that generates representative pore-scale images with precisely controlled properties, addressing both the representativeness challenge and data availability constraints. The framework was trained on thin section samples from four depths (1879.50-1943.50 m) of a carbonate formation, simultaneously conditioning on porosity values and depth parameters within a single unified model. This approach captures both universal pore network principles and depth-specific geological characteristics, from grainstone fabrics with interparticle-intercrystalline porosity to crystalline textures with anhydrite inclusions. The model achieved exceptional porosity control (R^2=0.95) across all formations with mean absolute errors of 0.0099-0.0197. Morphological validation confirmed preservation of critical pore network characteristics including average pore radius, specific surface area, and tortuosity, with statistical differences remaining within acceptable geological tolerances. Most significantly, generated images demonstrated superior representativeness with dual-constraint errors of 1.9-11.3% compared to 36.4-578% for randomly extracted real sub-images. This capability provides transformative tools for subsurface characterization, particularly valuable for carbon storage, geothermal energy, and groundwater management applications where knowing the representative morphology of the pore space is critical for implementing digital rock physics.
☆ Exploring "Many in Few" and "Few in Many" Properties in Long-Tailed, Highly-Imbalanced IC Defect Classification
Despite significant advancements in deep classification techniques and in-lab automatic optical inspection models for long-tailed or highly imbalanced data, applying these approaches to real-world IC defect classification tasks remains challenging. This difficulty stems from two primary factors. First, real-world conditions, such as the high yield-rate requirements in the IC industry, result in data distributions that are far more skewed than those found in general public imbalanced datasets. Consequently, classifiers designed for open imbalanced datasets often fail to perform effectively in real-world scenarios. Second, real-world samples exhibit a mix of class-specific attributes and class-agnostic, domain-related features. This complexity adds significant difficulty to the classification process, particularly for highly imbalanced datasets. To address these challenges, this paper introduces the IC-Defect-14 dataset, a large, highly imbalanced IC defect image dataset sourced from AOI systems deployed in real-world IC production lines. This dataset is characterized by its unique "intra-class clusters" property, which presents two major challenges: large intra-class diversity and high inter-class similarity. These characteristics, rarely found simultaneously in existing public datasets, significantly degrade the performance of current state-of-the-art classifiers for highly imbalanced data. To tackle this challenge, we propose ReCAME-Net, which follows a multi-expert classifier framework and integrates a regional channel attention module, metric learning losses, a hard category mining strategy, and a knowledge distillation procedure. Extensive experimental evaluations demonstrate that ReCAME-Net outperforms previous state-of-the-art models on the IC-Defect-14 dataset while maintaining comparable performance and competitiveness on general public datasets.
☆ Automated Morphological Analysis of Neurons in Fluorescence Microscopy Using YOLOv8
Accurate segmentation and precise morphological analysis of neuronal cells in fluorescence microscopy images are crucial steps in neuroscience and biomedical imaging applications. However, this process is labor-intensive and time-consuming, requiring significant manual effort and expertise to ensure reliable outcomes. This work presents a pipeline for neuron instance segmentation and measurement based on a high-resolution dataset of stem-cell-derived neurons. The proposed method uses YOLOv8, trained on manually annotated microscopy images. The model achieved high segmentation accuracy, exceeding 97%. In addition, the pipeline utilized both ground truth and predicted masks to extract biologically significant features, including cell length, width, area, and grayscale intensity values. The overall accuracy of the extracted morphological measurements reached 75.32%, further supporting the effectiveness of the proposed approach. This integrated framework offers a valuable tool for automated analysis in cell imaging and neuroscience research, reducing the need for manual annotation and enabling scalable, precise quantification of neuron morphology.
comment: 7 pages, 2 figures and 2 tables
☆ Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis
Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains, such as psychological analysis, remains largely unexplored. In this paper, we introduce PICK, a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs, specifically focusing on the House-Tree-Person (HTP) Test, a widely used psychological assessment in clinical practice. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content across three levels: single-object level, multi-object level, and whole level. Next, we analyze these sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, to generate a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person), correlating them with psychological states. Finally, we integrate these multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that the proposed PICK significantly enhances the capability of MLLMs in psychological analysis. It is further validated as a general framework through extensions to emotion understanding tasks.
comment: Accepted by ACM Multimedia 2025
☆ Multi-Camera Worker Tracking in Logistics Warehouse Considering Wide-Angle Distortion
With the spread of e-commerce, the logistics market is growing around the world. Therefore, improving the efficiency of warehouse operations is essential. To achieve this, various approaches have been explored, and among them, the use of digital twins is gaining attention. To make this approach possible, it is necessary to accurately collect the positions of workers in a warehouse and reflect them in a virtual space. However, a single camera has limitations in its field of view, therefore sensing with multiple cameras is necessary. In this study, we explored a method to track workers using 19 wide-angle cameras installed on the ceiling, looking down at the floor of the logistics warehouse. To understand the relationship between the camera coordinates and the actual positions in the warehouse, we performed alignment based on the floor surface. However, due to the characteristics of wide-angle cameras, significant distortion occurs at the edges of the image, particularly in the vertical direction. To address this, the detected worker positions from each camera were aligned based on foot positions, reducing the effects of image distortion, and enabling accurate position alignment across cameras. As a result, we confirmed an improvement of over 20% in tracking accuracy. Furthermore, we compared multiple methods for utilizing appearance features and validated the effectiveness of the proposed approach.
☆ GigaBrain-0: A World Model-Powered Vision-Language-Action Model
Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability, and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.
comment: https://gigabrain0.github.io/
☆ From See to Shield: ML-Assisted Fine-Grained Access Control for Visual Data
As the volume of stored data continues to grow, identifying and protecting sensitive information within large repositories becomes increasingly challenging, especially when shared with multiple users with different roles and permissions. This work presents a system architecture for trusted data sharing with policy-driven access control, enabling selective protection of sensitive regions while maintaining scalability. The proposed architecture integrates four core modules that combine automated detection of sensitive regions, post-correction, key management, and access control. Sensitive regions are secured using a hybrid scheme that employs symmetric encryption for efficiency and Attribute-Based Encryption for policy enforcement. The system supports efficient key distribution and isolates key storage to strengthen overall security. To demonstrate its applicability, we evaluate the system on visual datasets, where Privacy-Sensitive Objects in images are automatically detected, reassessed, and selectively encrypted prior to sharing in a data repository. Experimental results show that our system provides effective PSO detection, increases macro-averaged F1 score (5%) and mean Average Precision (10%), and maintains an average policy-enforced decryption time of less than 1 second per image. These results demonstrate the effectiveness, efficiency and scalability of our proposed solution for fine-grained access control.
comment: 10 pages, 3 figures, 6 tables. In submission
☆ Spatio-temporal Sign Language Representation and Translation
This paper describes the DFKI-MLT submission to the WMT-SLT 2022 sign language translation (SLT) task from Swiss German Sign Language (video) into German (text). State-of-the-art techniques for SLT use a generic seq2seq architecture with customized input embeddings. Instead of word embeddings as used in textual machine translation, SLT systems use features extracted from video frames. Standard approaches often do not benefit from temporal features. In our participation, we present a system that learns spatio-temporal feature representations and translation in a single model, resulting in a real end-to-end architecture expected to better generalize to new data sets. Our best system achieved $5\pm1$ BLEU points on the development set, but the performance on the test dropped to $0.11\pm0.06$ BLEU points.
☆ Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes
Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, along with enhanced versions incorporating CoT-inspired techniques. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi-view embodied reasoning.
comment: The project and benchmark are publicly available at https://github.com/microsoft/MV-RoboBench
☆ AegisRF: Adversarial Perturbations Guided with Sensitivity for Protecting Intellectual Property of Neural Radiance Fields BMVC 2025
As Neural Radiance Fields (NeRFs) have emerged as a powerful tool for 3D scene representation and novel view synthesis, protecting their intellectual property (IP) from unauthorized use is becoming increasingly crucial. In this work, we aim to protect the IP of NeRFs by injecting adversarial perturbations that disrupt their unauthorized applications. However, perturbing the 3D geometry of NeRFs can easily deform the underlying scene structure and thus substantially degrade the rendering quality, which has led existing attempts to avoid geometric perturbations or restrict them to explicit spaces like meshes. To overcome this limitation, we introduce a learnable sensitivity to quantify the spatially varying impact of geometric perturbations on rendering quality. Building upon this, we propose AegisRF, a novel framework that consists of a Perturbation Field, which injects adversarial perturbations into the pre-rendering outputs (color and volume density) of NeRF models to fool an unauthorized downstream target model, and a Sensitivity Field, which learns the sensitivity to adaptively constrain geometric perturbations, preserving rendering quality while disrupting unauthorized use. Our experimental evaluations demonstrate the generalized applicability of AegisRF across diverse downstream tasks and modalities, including multi-view image classification and voxel-based 3D localization, while maintaining high visual fidelity. Codes are available at https://github.com/wkim97/AegisRF.
comment: BMVC 2025
☆ DARE: A Deformable Adaptive Regularization Estimator for Learning-Based Medical Image Registration
Deformable medical image registration is a fundamental task in medical image analysis. While deep learning-based methods have demonstrated superior accuracy and computational efficiency compared to traditional techniques, they often overlook the critical role of regularization in ensuring robustness and anatomical plausibility. We propose DARE (Deformable Adaptive Regularization Estimator), a novel registration framework that dynamically adjusts elastic regularization based on the gradient norm of the deformation field. Our approach integrates strain and shear energy terms, which are adaptively modulated to balance stability and flexibility. To ensure physically realistic transformations, DARE includes a folding-prevention mechanism that penalizes regions with negative deformation Jacobian. This strategy mitigates non-physical artifacts such as folding, avoids over-smoothing, and improves both registration accuracy and anatomical plausibility
☆ Learning To Defer To A Population With Limited Demonstrations
This paper addresses the critical data scarcity that hinders the practical deployment of learning to defer (L2D) systems to the population. We introduce a context-aware, semi-supervised framework that uses meta-learning to generate expert-specific embeddings from only a few demonstrations. We demonstrate the efficacy of a dual-purpose mechanism, where these embeddings are used first to generate a large corpus of pseudo-labels for training, and subsequently to enable on-the-fly adaptation to new experts at test-time. The experiment results on three different datasets confirm that a model trained on these synthetic labels rapidly approaches oracle-level performance, validating the data efficiency of our approach. By resolving a key training bottleneck, this work makes adaptive L2D systems more practical and scalable, paving the way for human-AI collaboration in real-world environments. To facilitate reproducibility and address implementation details not covered in the main text, we provide our source code and training configurations at https://github.com/nil123532/learning-to-defer-to-a-population-with-limited-demonstrations.
comment: Accepted to IEEE DICTA 2025 (poster). 7 pages, 2 figures
☆ DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents
Mobile Phone Agents (MPAs) have emerged as a promising research direction due to their broad applicability across diverse scenarios. While Multimodal Large Language Models (MLLMs) serve as the foundation for MPAs, their effectiveness in handling multiple mobile phone tasks simultaneously remains limited. Although multitask supervised fine-tuning (SFT) is widely adopted for multitask learning, existing approaches struggle to determine optimal training data compositions for peak performance. To address this challenge, we propose DaMo (Data Mixture Optimizer) - a novel solution employing a trainable network that predicts optimal data mixtures by forecasting downstream task performance for any given dataset ratio. To support comprehensive evaluation, we introduce PhoneAgentBench, the first specialized benchmark to evaluate MLLMs on multimodal mobile phone tasks, comprising 1235 QA pairs spanning diverse real-world industrial mobile application scenarios. Demonstrating strong predictive capability (R^2=0.81) in small-scale pilot experiments, DaMo efficiently extrapolates optimal data mixing configurations. Our results show DaMo achieves a 3.38% performance improvement on PhoneAgentBench compared to alternative methods. Furthermore, extensive experiments across established benchmarks including BFCL-v3, MME-Reasoning, MME-Perception, and OCRBench reveal DaMo's superior generalization, outperforming other approaches by 2.57% in terms of average score. When used solely for MLLM optimization on the BFCL-v3 task, DaMo improves the metrics by 12.47% than other methods. Notably, DaMo maintains robust scalability, preserving its effectiveness when applied to other model architectures. The code and dataset are available at https://github.com/OPPO-Mente-Lab/DaMo.git
☆ A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP
This paper presents a novel training-free framework for open-vocabulary image segmentation and object recognition (OVSR), which leverages EfficientNetB0, a convolutional neural network, for unsupervised segmentation and CLIP, a vision-language model, for open-vocabulary object recognition. The proposed framework adopts a two stage pipeline: unsupervised image segmentation followed by segment-level recognition via vision-language alignment. In the first stage, pixel-wise features extracted from EfficientNetB0 are decomposed using singular value decomposition to obtain latent representations, which are then clustered using hierarchical clustering to segment semantically meaningful regions. The number of clusters is adaptively determined by the distribution of singular values. In the second stage, the segmented regions are localized and encoded into image embeddings using the Vision Transformer backbone of CLIP. Text embeddings are precomputed using CLIP's text encoder from category-specific prompts, including a generic something else prompt to support open set recognition. The image and text embeddings are concatenated and projected into a shared latent feature space via SVD to enhance cross-modal alignment. Recognition is performed by computing the softmax over the similarities between the projected image and text embeddings. The proposed method is evaluated on standard benchmarks, including COCO, ADE20K, and PASCAL VOC, achieving state-of-the-art performance in terms of Hungarian mIoU, precision, recall, and F1-score. These results demonstrate the effectiveness, flexibility, and generalizability of the proposed framework.
☆ BrainMCLIP: Brain Image Decoding with Multi-Layer feature Fusion of CLIP
Decoding images from fMRI often involves mapping brain activity to CLIP's final semantic layer. To capture finer visual details, many approaches add a parameter-intensive VAE-based pipeline. However, these approaches overlook rich object information within CLIP's intermediate layers and contradicts the brain's functionally hierarchical. We introduce BrainMCLIP, which pioneers a parameter-efficient, multi-layer fusion approach guided by human visual system's functional hierarchy, eliminating the need for such a separate VAE pathway. BrainMCLIP aligns fMRI signals from functionally distinct visual areas (low-/high-level) to corresponding intermediate and final CLIP layers, respecting functional hierarchy. We further introduce a Cross-Reconstruction strategy and a novel multi-granularity loss. Results show BrainMCLIP achieves highly competitive performance, particularly excelling on high-level semantic metrics where it matches or surpasses SOTA(state-of-the-art) methods, including those using VAE pipelines. Crucially, it achieves this with substantially fewer parameters, demonstrating a reduction of 71.7\%(Table.\ref{tab:compare_clip_vae}) compared to top VAE-based SOTA methods, by avoiding the VAE pathway. By leveraging intermediate CLIP features, it effectively captures visual details often missed by CLIP-only approaches, striking a compelling balance between semantic accuracy and detail fidelity without requiring a separate VAE pipeline.
☆ Exploring Scale Shift in Crowd Localization under the Context of Domain Generalization
Crowd localization plays a crucial role in visual scene understanding towards predicting each pedestrian location in a crowd, thus being applicable to various downstream tasks. However, existing approaches suffer from significant performance degradation due to discrepancies in head scale distributions (scale shift) between training and testing data, a challenge known as domain generalization (DG). This paper aims to comprehend the nature of scale shift within the context of domain generalization for crowd localization models. To this end, we address four critical questions: (i) How does scale shift influence crowd localization in a DG scenario? (ii) How can we quantify this influence? (iii) What causes this influence? (iv) How to mitigate the influence? Initially, we conduct a systematic examination of how crowd localization performance varies with different levels of scale shift. Then, we establish a benchmark, ScaleBench, and reproduce 20 advanced DG algorithms to quantify the influence. Through extensive experiments, we demonstrate the limitations of existing algorithms and underscore the importance and complexity of scale shift, a topic that remains insufficiently explored. To deepen our understanding, we provide a rigorous theoretical analysis on scale shift. Building on these insights, we further propose an effective algorithm called Causal Feature Decomposition and Anisotropic Processing (Catto) to mitigate the influence of scale shift in DG settings. Later, we also provide extensive analytical experiments, revealing four significant insights for future research. Our results emphasize the importance of this novel and applicable research direction, which we term Scale Shift Domain Generalization.
☆ Seabed-Net: A multi-task network for joint bathymetry estimation and seabed classification from remote sensing imagery in shallow waters SP
Accurate, detailed, and regularly updated bathymetry, coupled with complex semantic content, is essential for under-mapped shallow-water environments facing increasing climatological and anthropogenic pressures. However, existing approaches that derive either depth or seabed classes from remote sensing imagery treat these tasks in isolation, forfeiting the mutual benefits of their interaction and hindering the broader adoption of deep learning methods. To address these limitations, we introduce Seabed-Net, a unified multi-task framework that simultaneously predicts bathymetry and pixel-based seabed classification from remote sensing imagery of various resolutions. Seabed-Net employs dual-branch encoders for bathymetry estimation and pixel-based seabed classification, integrates cross-task features via an Attention Feature Fusion module and a windowed Swin-Transformer fusion block, and balances objectives through dynamic task uncertainty weighting. In extensive evaluations at two heterogeneous coastal sites, it consistently outperforms traditional empirical models and traditional machine learning regression methods, achieving up to 75\% lower RMSE. It also reduces bathymetric RMSE by 10-30\% compared to state-of-the-art single-task and multi-task baselines and improves seabed classification accuracy up to 8\%. Qualitative analyses further demonstrate enhanced spatial consistency, sharper habitat boundaries, and corrected depth biases in low-contrast regions. These results confirm that jointly modeling depth with both substrate and seabed habitats yields synergistic gains, offering a robust, open solution for integrated shallow-water mapping. Code and pretrained weights are available at https://github.com/pagraf/Seabed-Net.
comment: Submitted to ISPRS Journal of Photogrammetry and Remote Sensing
♻ ☆ Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
We propose a rubric-guided, pseudo-labeled, and prompt-driven zero-shot video summarization framework that bridges large language models with structured semantic reasoning. A small subset of human annotations is converted into high-confidence pseudo labels and organized into dataset-adaptive rubrics defining clear evaluation dimensions such as thematic relevance, action detail, and narrative progression. During inference, boundary scenes, including the opening and closing segments, are scored independently based on their own descriptions, while intermediate scenes incorporate concise summaries of adjacent segments to assess narrative continuity and redundancy. This design enables the language model to balance local salience with global coherence without any parameter tuning. Across three benchmarks, the proposed method achieves stable and competitive results, with F1 scores of 57.58 on SumMe, 63.05 on TVSum, and 53.79 on QFVS, surpassing zero-shot baselines by +0.85, +0.84, and +0.37, respectively. These outcomes demonstrate that rubric-guided pseudo labeling combined with contextual prompting effectively stabilizes LLM-based scoring and establishes a general, interpretable, and training-free paradigm for both generic and query-focused video summarization.
♻ ☆ Rethinking Backbone Design for Lightweight 3D Object Detection in LiDAR ICCV 2025
Recent advancements in LiDAR-based 3D object detection have significantly accelerated progress toward the realization of fully autonomous driving in real-world environments. Despite achieving high detection performance, most of the approaches still rely on a VGG-based or ResNet-based backbone for feature exploration, which increases the model complexity. Lightweight backbone design is well-explored for 2D object detection, but research on 3D object detection still remains limited. In this work, we introduce Dense Backbone, a lightweight backbone that combines the benefits of high processing speed, lightweight architecture, and robust detection accuracy. We adapt multiple SoTA 3d object detectors, such as PillarNet, with our backbone and show that with our backbone, these models retain most of their detection capability at a significantly reduced computational cost. To our knowledge, this is the first dense-layer-based backbone tailored specifically for 3D object detection from point cloud data. DensePillarNet, our adaptation of PillarNet, achieves a 29% reduction in model parameters and a 28% reduction in latency with just a 2% drop in detection accuracy on the nuScenes test set. Furthermore, Dense Backbone's plug-and-play design allows straightforward integration into existing architectures, requiring no modifications to other network components.
comment: Best Paper Award at the Embedded Vision Workshop ICCV 2025
♻ ☆ QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training NeurIPS 2025
Clinical decision-making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision-centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. QoQ-Med is trained with Domain-aware Relative Policy Optimization (DRPO), a novel reinforcement-learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, mitigating performance imbalance caused by skewed clinical data distributions. Trained on 2.61 million instruction tuning pairs spanning 9 clinical domains, we show that DRPO training boosts diagnostic performance by 43% in macro-F1 on average across all visual domains as compared to other critic-free training methods like GRPO. Furthermore, with QoQ-Med trained on intensive segmentation data, it is able to highlight salient regions related to the diagnosis, with an IoU 10x higher than open models while reaching the performance of OpenAI o4-mini. To foster reproducibility and downstream research, we release (i) the full model weights, (ii) the modular training pipeline, and (iii) all intermediate reasoning traces at https://github.com/DDVD233/QoQ_Med.
comment: Accepted as Oral at NeurIPS 2025. Revision after camera ready
♻ ☆ Variable Rate Image Compression via N-Gram Context based Swin-transformer
This paper presents an N-gram context-based Swin Transformer for learned image compression. Our method achieves variable-rate compression with a single model. By incorporating N-gram context into the Swin Transformer, we overcome its limitation of neglecting larger regions during high-resolution image reconstruction due to its restricted receptive field. This enhancement expands the regions considered for pixel restoration, thereby improving the quality of high-resolution reconstructions. Our method increases context awareness across neighboring windows, leading to a -5.86\% improvement in BD-Rate over existing variable-rate learned image compression techniques. Additionally, our model improves the quality of regions of interest (ROI) in images, making it particularly beneficial for object-focused applications in fields such as manufacturing and industrial vision systems.
comment: Accepted at ISVC 2025
♻ ☆ Video-R1: Reinforcing Video Reasoning in MLLMs NeurIPS 2025
Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass, etc. Notably, Video-R1-7B attains a 37.1% accuracy on video spatial reasoning benchmark VSI-bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released in: https://github.com/tulerfeng/Video-R1.
comment: NeurIPS 2025, Project page: https://github.com/tulerfeng/Video-R1
♻ ☆ CNeuroMod-THINGS, a densely-sampled fMRI dataset for visual neuroscience
Data-hungry neuro-AI modelling requires ever larger neuroimaging datasets. CNeuroMod-THINGS meets this need by capturing neural representations for a wide set of semantic concepts using well-characterized images in a new densely-sampled, large-scale fMRI dataset. Importantly, CNeuroMod-THINGS exploits synergies between two existing projects: the THINGS initiative (THINGS) and the Courtois Project on Neural Modelling (CNeuroMod). THINGS has developed a common set of thoroughly annotated images broadly sampling natural and man-made objects which is used to acquire a growing collection of large-scale multimodal neural responses. Meanwhile, CNeuroMod is acquiring hundreds of hours of fMRI data from a core set of participants during controlled and naturalistic tasks, including visual tasks like movie watching and videogame playing. For CNeuroMod-THINGS, four CNeuroMod participants each completed 33-36 sessions of a continuous recognition paradigm using approximately 4000 images from the THINGS stimulus set spanning 720 categories. We report behavioural and neuroimaging metrics that showcase the quality of the data. By bridging together large existing resources, CNeuroMod-THINGS expands our capacity to model broad slices of the human visual experience.
comment: 16 pages manuscript, 5 figures, 9 pages supplementary material
♻ ☆ Optimized 3D Gaussian Splatting using Coarse-to-Fine Image Frequency Modulation
The field of Novel View Synthesis has been revolutionized by 3D Gaussian Splatting (3DGS), which enables high-quality scene reconstruction that can be rendered in real-time. 3DGS-based techniques typically suffer from high GPU memory and disk storage requirements which limits their practical application on consumer-grade devices. We propose Opti3DGS, a novel frequency-modulated coarse-to-fine optimization framework that aims to minimize the number of Gaussian primitives used to represent a scene, thus reducing memory and storage demands. Opti3DGS leverages image frequency modulation, initially enforcing a coarse scene representation and progressively refining it by modulating frequency details in the training images. On the baseline 3DGS, we demonstrate an average reduction of 62% in Gaussians, a 40% reduction in the training GPU memory requirements and a 20% reduction in optimization time without sacrificing the visual quality. Furthermore, we show that our method integrates seamlessly with many 3DGS-based techniques, consistently reducing the number of Gaussian primitives while maintaining, and often improving, visual quality. Additionally, Opti3DGS inherently produces a level-of-detail scene representation at no extra cost, a natural byproduct of the optimization pipeline. Results and code will be made publicly available.
♻ ☆ WikiVideo: Article Generation from Multiple Videos
We introduce the task of grounded article generation with the goal of creating a Wikipedia-style article from multiple diverse videos about real-world events -- from natural disasters to political elections -- where all the information in the article is supported by video evidence. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text while existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher-level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.
comment: Repo can be found here: https://github.com/alexmartin1722/wikivideo
♻ ☆ 3D Visual Illusion Depth Estimation NeurIPS 2025
3D visual illusion is a perceptual phenomenon where a two-dimensional plane is manipulated to simulate three-dimensional spatial relationships, making a flat artwork or object look three-dimensional in the human visual system. In this paper, we reveal that the machine visual system is also seriously fooled by 3D visual illusions, including monocular and binocular depth estimation. In order to explore and analyze the impact of 3D visual illusion on depth estimation, we collect a large dataset containing almost 3k scenes and 200k images to train and evaluate SOTA monocular and binocular depth estimation methods. We also propose a 3D visual illusion depth estimation framework that uses common sense from the vision language model to adaptively fuse depth from binocular disparity and monocular depth. Experiments show that SOTA monocular, binocular, and multi-view depth estimation approaches are all fooled by various 3D visual illusions, while our method achieves SOTA performance.
comment: NeurIPS 2025, Project: https://github.com/YaoChengTang/3D-Visual-Illusion-Depth-Estimation
♻ ☆ Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction
In this paper, we unify more than 10 existing one-step diffusion distillation approaches, such as Diff-Instruct, DMD, SIM, SiD, $f$-distill, etc, inside a theory-driven framework which we name the \textbf{\emph{Uni-Instruct}}. Uni-Instruct is motivated by our proposed diffusion expansion theory of the $f$-divergence family. Then we introduce key theories that overcome the intractability issue of the original expanded $f$-divergence, resulting in an equivalent yet tractable loss that effectively trains one-step diffusion models by minimizing the expanded $f$-divergence family. The novel unification introduced by Uni-Instruct not only offers new theoretical contributions that help understand existing approaches from a high-level perspective but also leads to state-of-the-art one-step diffusion generation performances. On the CIFAR10 generation benchmark, Uni-Instruct achieves record-breaking Frechet Inception Distance (FID) values of \textbf{\emph{1.46}} for unconditional generation and \textbf{\emph{1.38}} for conditional generation. On the ImageNet-$64\times 64$ generation benchmark, Uni-Instruct achieves a new SoTA one-step generation FID of \textbf{\emph{1.02}}, which outperforms its 79-step teacher diffusion with a significant improvement margin of 1.33 (1.02 vs 2.35). We also apply Uni-Instruct on broader tasks like text-to-3D generation. For text-to-3D generation, Uni-Instruct gives decent results, which slightly outperforms previous methods, such as SDS and VSD, in terms of both generation quality and diversity. Both the solid theoretical and empirical contributions of Uni-Instruct will potentially help future studies on one-step diffusion distillation and knowledge transferring of diffusion models.
♻ ☆ Discretized Gaussian Representation for Tomographic Reconstruction ICCV 2025
Computed Tomography (CT) enables detailed cross-sectional imaging but continues to face challenges in balancing reconstruction quality and computational efficiency. While deep learning-based methods have significantly improved image quality and noise reduction, they typically require large-scale training data and intensive computation. Recent advances in scene reconstruction, such as Neural Radiance Fields and 3D Gaussian Splatting, offer alternative perspectives but are not well-suited for direct volumetric CT reconstruction. In this work, we propose Discretized Gaussian Representation (DGR), a novel framework that reconstructs the 3D volume directly using a set of discretized Gaussian functions in an end-to-end manner. To further enhance efficiency, we introduce Fast Volume Reconstruction, a highly parallelized technique that aggregates Gaussian contributions into the voxel grid with minimal overhead. Extensive experiments on both real-world and synthetic datasets demonstrate that DGR achieves superior reconstruction quality and runtime performance across various CT reconstruction scenarios. Our code is publicly available at https://github.com/wskingdom/DGR.
comment: Accepted to ICCV 2025
♻ ☆ ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints
Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.
♻ ☆ SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries
Semantic occupancy has emerged as a powerful representation in world models for its ability to capture rich spatial semantics. However, most existing occupancy world models rely on static and fixed embeddings or grids, which inherently limit the flexibility of perception. Moreover, their "in-place classification" over grids exhibits a potential misalignment with the dynamic and continuous nature of real scenarios.In this paper, we propose SparseWorld, a novel 4D occupancy world model that is flexible, adaptive, and efficient, powered by sparse and dynamic queries. We propose a Range-Adaptive Perception module, in which learnable queries are modulated by the ego vehicle states and enriched with temporal-spatial associations to enable extended-range perception. To effectively capture the dynamics of the scene, we design a State-Conditioned Forecasting module, which replaces classification-based forecasting with regression-guided formulation, precisely aligning the dynamic queries with the continuity of the 4D environment. In addition, We specifically devise a Temporal-Aware Self-Scheduling training strategy to enable smooth and efficient training. Extensive experiments demonstrate that SparseWorld achieves state-of-the-art performance across perception, forecasting, and planning tasks. Comprehensive visualizations and ablation studies further validate the advantages of SparseWorld in terms of flexibility, adaptability, and efficiency. The code is available at https://github.com/MSunDYY/SparseWorld.
comment: Under Review
♻ ☆ On the Effectiveness of Methods and Metrics for Explainable AI in Remote Sensing Image Scene Classification
The development of explainable artificial intelligence (xAI) methods for scene classification problems has attracted great attention in remote sensing (RS). Most xAI methods and the related evaluation metrics in RS are initially developed for natural images considered in computer vision (CV), and their direct usage in RS may not be suitable. To address this issue, in this paper, we investigate the effectiveness of explanation methods and metrics in the context of RS image scene classification. In detail, we methodologically and experimentally analyze ten explanation metrics spanning five categories (faithfulness, robustness, localization, complexity, randomization), applied to five established feature attribution methods (Occlusion, LIME, GradCAM, LRP, and DeepLIFT) across three RS datasets. Our methodological analysis identifies key limitations in both explanation methods and metrics. The performance of perturbation-based methods, such as Occlusion and LIME, heavily depends on perturbation baselines and spatial characteristics of RS scenes. Gradient-based approaches like GradCAM struggle when multiple labels are present in the same image, while some relevance propagation methods (LRP) can distribute relevance disproportionately relative to the spatial extent of classes. Analogously, we find limitations in evaluation metrics. Faithfulness metrics share the same problems as perturbation-based methods. Localization metrics and complexity metrics are unreliable for classes with a large spatial extent. In contrast, robustness metrics and randomization metrics consistently exhibit greater stability. Our experimental results support these methodological findings. Based on our analysis, we provide guidelines for selecting explanation methods, metrics, and hyperparameters in the context of RS image scene classification.
comment: The code of this work will be publicly available at https://git.tu-berlin.de/rsim/xai4rs Accepted at IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
♻ ☆ Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors NeurIPS 2025
Previous research has investigated the application of Multimodal Large Language Models (MLLMs) in understanding 3D scenes by interpreting them as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird's-Eye View (BEV) maps. In our research, we advance this field by enhancing the capability of MLLMs to understand and reason in 3D spaces directly from video data, without the need for additional 3D input. We propose a novel and efficient method called the Video-3D Geometry Large Language Model (VG LLM). Our approach utilizes a 3D visual geometry encoder to extract 3D prior information from video sequences. This information is then integrated with visual tokens and input into the MLLM. Extensive experiments have shown that our method has achieved substantial improvements in various tasks related to 3D scene understanding and spatial reasoning, all directly learned from video sources. Impressively, our 4B model, which does not rely on explicit 3D data inputs, achieves competitive results compared to existing state-of-the-art methods, and even surpasses the Gemini-1.5-Pro in the VSI-Bench evaluations.
comment: Accepted by NeurIPS 2025
♻ ☆ OmniNWM: Omniscient Driving Navigation World Models
Autonomous driving world models are expected to work effectively across three core dimensions: state, action, and reward. Existing models, however, are typically restricted to limited state modalities, short video sequences, imprecise action control, and a lack of reward awareness. In this paper, we introduce OmniNWM, an omniscient panoramic navigation world model that addresses all three dimensions within a unified framework. For state, OmniNWM jointly generates panoramic videos of RGB, semantics, metric depth, and 3D occupancy. A flexible forcing strategy enables high-quality long-horizon auto-regressive generation. For action, we introduce a normalized panoramic Plucker ray-map representation that encodes input trajectories into pixel-level signals, enabling highly precise and generalizable control over panoramic video generation. Regarding reward, we move beyond learning reward functions with external image-based models: instead, we leverage the generated 3D occupancy to directly define rule-based dense rewards for driving compliance and safety. Extensive experiments demonstrate that OmniNWM achieves state-of-the-art performance in video generation, control accuracy, and long-horizon stability, while providing a reliable closed-loop evaluation framework through occupancy-grounded rewards. Project page is available at https://github.com/Arlo0o/OmniNWM.
comment: https://arlo0o.github.io/OmniNWM/
♻ ☆ LookUp3D: Data-Driven 3D Scanning
High speed, high-resolution, and accurate 3D scanning would open doors to many new applications in graphics, robotics, science, and medicine by enabling the accurate scanning of deformable objects during interactions. Past attempts to use structured light, time-of-flight, and stereo in high-speed settings have usually required tradeoffs in resolution or inaccuracy. In this paper, we introduce a method that enables, for the first time, 3D scanning at 450 frames per second at 1~Megapixel, or 1,450 frames per second at 0.4~Megapixel in an environment with controlled lighting. The key idea is to use a per-pixel lookup table that maps colors to depths, which is built using a linear stage. Imperfections, such as lens-distortion and sensor defects are baked into the calibration. We describe our method and test it on a novel hardware prototype. We compare the system with both ground-truth geometry as well as commercially available dynamic sensors like the Microsoft Kinect and Intel Realsense. Our results show the system acquiring geometry of objects undergoing high-speed deformations and oscillations and demonstrate the ability to recover physical properties from the reconstructions.
comment: Giancarlo Pereira, Yidan Gao, and Yurii Piadyk are joint first authors with equal contribution. 11 pages of main paper, 9 pages of supplemental text (all combined into a single document)
♻ ☆ Vectorization of Persistence Diagrams for Topological Data Analysis in R and Python Using TDAvec Package
Persistent homology is a widely-used tool in topological data analysis (TDA) for understanding the underlying shape of complex data. By constructing a filtration of simplicial complexes from data points, it captures topological features such as connected components, loops, and voids across multiple scales. These features are encoded in persistence diagrams (PDs), which provide a concise summary of the data's topological structure. However, the non-Hilbert nature of the space of PDs poses challenges for their direct use in machine learning applications. To address this, kernel methods and vectorization techniques have been developed to transform PDs into machine-learning-compatible formats. In this paper, we introduce a new software package designed to streamline the vectorization of PDs, offering an intuitive workflow and advanced functionalities. We demonstrate the necessity of the package through practical examples and provide a detailed discussion on its contributions to applied TDA. Definitions of all vectorization summaries used in the package are included in the appendix.
comment: 9 pages, 3 figures, 3 tables; minor changes: two more vectorizations are described
♻ ☆ Learning Differential Pyramid Representation for Tone Mapping
Existing tone mapping methods operate on downsampled inputs and rely on handcrafted pyramids to recover high-frequency details. These designs typically fail to preserve fine textures and structural fidelity in complex HDR scenes. Furthermore, most methods lack an effective mechanism to jointly model global tone consistency and local contrast enhancement, leading to globally flat or locally inconsistent outputs such as halo artifacts. We present the Differential Pyramid Representation Network (DPRNet), an end-to-end framework for high-fidelity tone mapping. At its core is a learnable differential pyramid that generalizes traditional Laplacian and Difference-of-Gaussian pyramids through content-aware differencing operations across scales. This allows DPRNet to adaptively capture high-frequency variations under diverse luminance and contrast conditions. To enforce perceptual consistency, DPRNet incorporates global tone perception and local tone tuning modules operating on downsampled inputs, enabling efficient yet expressive tone adaptation. Finally, an iterative detail enhancement module progressively restores the full-resolution output in a coarse-to-fine manner, reinforcing structure and sharpness. Experiments show that DPRNet achieves state-of-the-art results, improving PSNR by 2.39 dB on the 4K HDR+ dataset and 3.01 dB on the 4K HDRI Haven dataset, while producing perceptually coherent, detail-preserving results. \textit{We provide an anonymous online demo at https://xxxxxxdprnet.github.io/DPRNet/.
♻ ☆ Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition NeurIPS 2025
Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layouts and variability in handwriting styles. Prior methods have faced performance bottlenecks by proposing isolated architectural modifications, making them difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves super state-of-the-art performance, outperforming the best lightweight specialized model SSAN by 16.31\% and the top-performing VLM Gemini2.5-flash by 24.42\% under zero-shot setting. Our datasets, models, and code are open-sourced at: {https://github.com/BFlameSwift/Uni-MuMER
comment: Accepted by NeurIPS 2025 as a spotlight
♻ ☆ Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
♻ ☆ ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding
We present ASAP, a new framework for detecting and grounding multi-modal media manipulation (DGM4).Upon thorough examination, we observe that accurate fine-grained cross-modal semantic alignment between the image and text is vital for accurately manipulation detection and grounding. While existing DGM4 methods pay rare attention to the cross-modal alignment, hampering the accuracy of manipulation detecting to step further. To remedy this issue, this work targets to advance the semantic alignment learning to promote this task. Particularly, we utilize the off-the-shelf Multimodal Large-Language Models (MLLMs) and Large Language Models (LLMs) to construct paired image-text pairs, especially for the manipulated instances. Subsequently, a cross-modal alignment learning is performed to enhance the semantic alignment. Besides the explicit auxiliary clues, we further design a Manipulation-Guided Cross Attention (MGCA) to provide implicit guidance for augmenting the manipulation perceiving. With the grounding truth available during training, MGCA encourages the model to concentrate more on manipulated components while downplaying normal ones, enhancing the model's ability to capture manipulations. Extensive experiments are conducted on the DGM4 dataset, the results demonstrate that our model can surpass the comparison method with a clear margin.
comment: 12 pages, 6 figures
♻ ☆ Deep Linear Probe Generators for Weight Space Learning ICLR 2025
Weight space learning aims to extract information about a neural network, such as its training dataset or generalization error. Recent approaches learn directly from model weights, but this presents many challenges as weights are high-dimensional and include permutation symmetries between neurons. An alternative approach, Probing, represents a model by passing a set of learned inputs (probes) through the model, and training a predictor on top of the corresponding outputs. Although probing is typically not used as a stand alone approach, our preliminary experiment found that a vanilla probing baseline worked surprisingly well. However, we discover that current probe learning strategies are ineffective. We therefore propose Deep Linear Probe Generators (ProbeGen), a simple and effective modification to probing approaches. ProbeGen adds a shared generator module with a deep linear architecture, providing an inductive bias towards structured probes thus reducing overfitting. While simple, ProbeGen performs significantly better than the state-of-the-art and is very efficient, requiring between 30 to 1000 times fewer FLOPs than other top approaches.
comment: ICLR 2025. Project page: https://vision.huji.ac.il/probegen
♻ ☆ EgoBlind: Towards Egocentric Visual Assistance for the Blind NeurIPS'25
We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,392 first-person videos from the daily lives of blind and visually impaired individuals. It also features 5,311 questions directly posed or verified by the blind to reflect their in-situation needs for visual assistance. Each question has an average of 3 manually annotated reference answers to reduce subjectiveness. Using EgoBlind, we comprehensively evaluate 16 advanced MLLMs and find that all models struggle. The best performers achieve an accuracy near 60\%, which is far behind human performance of 87.4\%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and explore heuristic solutions for improvement. With these efforts, we hope that EgoBlind will serve as a foundation for developing effective AI assistants to enhance the independence of the blind and visually impaired. Data and code are available at https://github.com/doc-doc/EgoBlind.
comment: NeurIPS'25 (D&B Track)
♻ ☆ One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution
It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
comment: Accepted by Neurips2025
♻ ☆ How many samples to label for an application given a foundation model? Chest X-ray classification study
Chest X-ray classification is vital yet resource-intensive, typically demanding extensive annotated data for accurate diagnosis. Foundation models mitigate this reliance, but how many labeled samples are required remains unclear. We systematically evaluate the use of power-law fits to predict the training size necessary for specific ROC-AUC thresholds. Testing multiple pathologies and foundation models, we find XrayCLIP and XraySigLIP achieve strong performance with significantly fewer labeled examples than a ResNet-50 baseline. Importantly, learning curve slopes from just 50 labeled cases accurately forecast final performance plateaus. Our results enable practitioners to minimize annotation costs by labeling only the essential samples for targeted performance.
comment: 8 pages, 5 figures
♻ ☆ Learning Spatially Adaptive $\ell_1$-Norms Weights for Convolutional Synthesis Regularization
We propose an unrolled algorithm approach for learning spatially adaptive parameter maps in the framework of convolutional synthesis-based $\ell_1$ regularization. More precisely, we consider a family of pre-trained convolutional filters and estimate deeply parametrized spatially varying parameters applied to the sparse feature maps by means of unrolling a FISTA algorithm to solve the underlying sparse estimation problem. The proposed approach is evaluated for image reconstruction of low-field MRI and compared to spatially adaptive and non-adaptive analysis-type procedures relying on Total Variation regularization and to a well-established model-based deep learning approach. We show that the proposed approach produces visually and quantitatively comparable results with the latter approaches and at the same time remains highly interpretable. In particular, the inferred parameter maps quantify the local contribution of each filter in the reconstruction, which provides valuable insight into the algorithm mechanism and could potentially be used to discard unsuited filters.
comment: Accepted for publication in the proceedings of the EUSIPCO 2025 conference; corrected typo in equation (3)
♻ ☆ REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training both VAE and diffusion-model using standard diffusion-loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss -- allowing both VAE and diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance; speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself; leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art; achieving FID of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.
♻ ☆ MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
In recent years, large-scale generative models for visual content (\textit{e.g.,} images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling, details are available in https://github.com/Shopee-MUG/MUG-V.
comment: Technical Report; Project Page: https://github.com/Shopee-MUG/MUG-V
♻ ☆ See through the Dark: Learning Illumination-affined Representations for Nighttime Occupancy Prediction
Occupancy prediction aims to estimate the 3D spatial distribution of occupied regions along with their corresponding semantic labels. Existing vision-based methods perform well on daytime benchmarks but struggle in nighttime scenarios due to limited visibility and challenging lighting conditions. To address these challenges, we propose LIAR, a novel framework that learns illumination-affined representations. LIAR first introduces Selective Low-light Image Enhancement (SLLIE), which leverages the illumination priors from daytime scenes to adaptively determine whether a nighttime image is genuinely dark or sufficiently well-lit, enabling more targeted global enhancement. Building on the illumination maps generated by SLLIE, LIAR further incorporates two illumination-aware components: 2D Illumination-guided Sampling (2D-IGS) and 3D Illumination-driven Projection (3D-IDP), to respectively tackle local underexposure and overexposure. Specifically, 2D-IGS modulates feature sampling positions according to illumination maps, assigning larger offsets to darker regions and smaller ones to brighter regions, thereby alleviating feature degradation in underexposed areas. Subsequently,3D-IDP enhances semantic understanding in overexposed regions by constructing illumination intensity fields and supplying refined residual queries to the BEV context refinement process. Extensive experiments on both real and synthetic datasets demonstrate the superior performance of LIAR under challenging nighttime scenarios. The source code and pretrained models are available [here](https://github.com/yanzq95/LIAR).
♻ ☆ Spiking Neural Networks Need High Frequency Information
Spiking Neural Networks promise brain-inspired and energy-efficient computation by transmitting information through binary (0/1) spikes. Yet, their performance still lags behind that of artificial neural networks, often assumed to result from information loss caused by sparse and binary activations. In this work, we challenge this long-standing assumption and reveal a previously overlooked frequency bias: spiking neurons inherently suppress high-frequency components and preferentially propagate low-frequency information. This frequency-domain imbalance, we argue, is the root cause of degraded feature representation in SNNs. Empirically, on Spiking Transformers, adopting Avg-Pooling (low-pass) for token mixing lowers performance to 76.73% on Cifar-100, whereas replacing it with Max-Pool (high-pass) pushes the top-1 accuracy to 79.12%. Accordingly, we introduce Max-Former that restores high-frequency signals through two frequency-enhancing operators: (1) extra Max-Pool in patch embedding, and (2) Depth-Wise Convolution in place of self-attention. Notably, Max-Former attains 82.39% top-1 accuracy on ImageNet using only 63.99M parameters, surpassing Spikformer (74.81%, 66.34M) by +7.58%. Extending our insight beyond transformers, our Max-ResNet-18 achieves state-of-the-art performance on convolution-based benchmarks: 97.17% on CIFAR-10 and 83.06\% on CIFAR-100. We hope this simple yet effective solution inspires future research to explore the distinctive nature of spiking neural networks. Code is available: https://github.com/bic-L/MaxFormer.
♻ ☆ Training-Free Label Space Alignment for Universal Domain Adaptation
Universal domain adaptation (UniDA) transfers knowledge from a labeled source domain to an unlabeled target domain, where label spaces may differ and the target domain may contain private classes. Previous UniDA methods primarily focused on visual space alignment but often struggled with visual ambiguities due to content differences, which limited their robustness and generalizability. To overcome this, we introduce a novel approach that leverages the strong \textit{zero-shot capabilities} of recent vision-language foundation models (VLMs) like CLIP, concentrating solely on label space alignment to enhance adaptation stability. CLIP can generate task-specific classifiers based only on label names. However, adapting CLIP to UniDA is challenging because the label space is not fully known in advance. In this study, we first utilize generative vision-language models to identify unknown categories in the target domain. Noise and semantic ambiguities in the discovered labels -- such as those similar to source labels (e.g., synonyms, hypernyms, hyponyms) -- complicate label alignment. To address this, we propose a training-free label-space alignment method for UniDA (\ours). Our method aligns label spaces instead of visual spaces by filtering and refining noisy labels between the domains. We then construct a \textit{universal classifier} that integrates both shared knowledge and target-private class information, thereby improving generalizability under domain shifts. The results reveal that the proposed method considerably outperforms existing UniDA techniques across key DomainBed benchmarks, delivering an average improvement of \textcolor{blue}{+7.9\%}in H-score and \textcolor{blue}{+6.1\%} in H$^3$-score. Furthermore, incorporating self-training further enhances performance and achieves an additional (\textcolor{blue}{+1.6\%}) increment in both H- and H$^3$-scores.
comment: 22 pages, 12 figures
♻ ☆ DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection NeurIPS 2025
Open-Vocabulary object detectors can generalize to an unrestricted set of categories through simple textual prompting. However, adapting these models to rare classes or reinforcing their abilities on multiple specialized domains remains essential. While recent methods rely on monolithic adaptation strategies with a single set of weights, we embrace modular deep learning. We introduce DitHub, a framework designed to build and maintain a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub manages expert modules as branches that can be fetched and merged as needed. This modular approach allows us to conduct an in-depth exploration of the compositional properties of adaptation modules, marking the first such study in Object Detection. Our method achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to assess class reappearance. For more details, visit our project page: https://aimagelab.github.io/DitHub/
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ SAM 2++: Tracking Anything at Any Granularity
Video tracking aims at finding the specific target in subsequent frames given its initial state. Due to the varying granularity of target states across different tasks, most existing trackers are tailored to a single task and heavily rely on custom-designed modules within the individual task, which limits their generalization and leads to redundancy in both model design and parameters. To unify video tracking tasks, we present SAM 2++, a unified model towards tracking at any granularity, including masks, boxes, and points. First, to extend target granularity, we design task-specific prompts to encode various task inputs into general prompt embeddings, and a unified decoder to unify diverse task results into a unified form pre-output. Next, to satisfy memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities. Finally, we introduce a customized data engine to support tracking training at any granularity, producing a large and diverse video tracking dataset with rich annotations at three granularities, termed Tracking-Any-Granularity, which represents a comprehensive resource for training and benchmarking on unified tracking. Comprehensive experiments on multiple benchmarks confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.
comment: update results
♻ ☆ With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You NeurIPS 2025
Multimodal models have demonstrated powerful capabilities in complex tasks requiring multimodal alignment, including zero-shot classification and cross-modal retrieval. However, existing models typically rely on millions of paired multimodal samples, which are prohibitively expensive or infeasible to obtain in many domains. In this work, we explore the feasibility of building multimodal models with limited amount of paired data by aligning pretrained unimodal foundation models. We show that high-quality alignment is possible with as few as tens of thousands of paired samples$\unicode{x2013}$less than $1\%$ of the data typically used in the field. To achieve this, we introduce STRUCTURE, an effective regularization technique that preserves the neighborhood geometry of the latent space of unimodal encoders. Additionally, we show that aligning last layers is often suboptimal and demonstrate the benefits of aligning the layers with the highest representational similarity across modalities. These two components can be readily incorporated into existing alignment methods, yielding substantial gains across 24 zero-shot image classification and retrieval benchmarks, with average relative improvement of $51.6\%$ in classification and $91.8\%$ in retrieval tasks. Our results highlight the effectiveness and broad applicability of our framework for limited-sample multimodal learning and offer a promising path forward for resource-constrained domains.
comment: NeurIPS 2025 camera-ready
♻ ☆ Towards Enhanced Image Generation Via Multi-modal Chain of Thought in Unified Generative Models
Unified generative models have shown remarkable performance in text and image generation. For image synthesis tasks, they adopt straightforward text-to-image (T2I) generation. However, direct T2I generation limits the models in handling complex compositional instructions, which frequently occur in real-world scenarios. Although this issue is vital, existing works mainly focus on improving the basic image generation capability of the models. While such improvements help to some extent, they still fail to adequately resolve the problem. Inspired by Chain of Thought (CoT) solving complex problems step by step, this work aims to introduce CoT into unified generative models to address the challenges of complex image generation that direct T2I generation cannot effectively solve, thereby endowing models with enhanced image generation ability. To achieve this, we first propose Functionality-oriented eXperts (FoXperts), an expert-parallel architecture in our model FoX, which assigns experts by function. FoXperts disentangles potential conflicts in mainstream modality-oriented designs and provides a solid foundation for CoT. When introducing CoT, the first question is how to design it for complex image generation. To this end, we emulate a human-like artistic workflow -- planning, acting, reflection, and correction -- and propose the Multimodal Chain of Thought (MCoT) approach, as the data involves both text and image. To address the subsequent challenge -- designing an effective MCoT training paradigm -- we develop a multi-task joint training scheme that equips the model with all capabilities required for each MCoT step in a disentangled manner. This paradigm avoids the difficulty of collecting consistent multi-step data tuples. Extensive experiments show that FoX consistently outperforms existing unified models on various T2I benchmarks, delivering notable improvements in complex image generation.
♻ ☆ Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion NeurIPS 2025
Multimodal learning (MML) is significantly constrained by modality imbalance, leading to suboptimal performance in practice. While existing approaches primarily focus on balancing the learning of different modalities to address this issue, they fundamentally overlook the inherent disproportion in model classification ability, which serves as the primary cause of this phenomenon. In this paper, we propose a novel multimodal learning approach to dynamically balance the classification ability of weak and strong modalities by incorporating the principle of boosting. Concretely, we first propose a sustained boosting algorithm in multimodal learning by simultaneously optimizing the classification and residual errors. Subsequently, we introduce an adaptive classifier assignment strategy to dynamically facilitate the classification performance of the weak modality. Furthermore, we theoretically analyze the convergence property of the cross-modal gap function, ensuring the effectiveness of the proposed boosting scheme. To this end, the classification ability of strong and weak modalities is expected to be balanced, thereby mitigating the imbalance issue. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SOTA) multimodal learning baselines. The source code is available at https://github.com/njustkmg/NeurIPS25-AUG.
comment: Accepted by NeurIPS 2025
♻ ☆ LBL: Logarithmic Barrier Loss Function for One-class Classification
One-class classification (OCC) aims to train a classifier solely on target data and attracts increasing attention due to its applicability in practice. Despite OCC has obtained many advances, it still lacks the effective OCC loss functions for deep learning. In this paper, a novel logarithmic barrier function based OCC loss (LBL) that assigns large gradients to margin samples and thus derives more compact hypersphere is first proposed by approximating the OCC objective smoothly. But the optimization of LBL may be instability especially when samples lie on the boundary leading to the infinity value. To address this issue, a smoother LBLSig loss is further proposed by utilizing a unilateral relaxation Sigmoid function. Experiments on different networks demonstrate the effectiveness of the proposed LBL and LBLSig. The source code can be found at https://github.com/ML-HDU/LBL_LBLSig.
♻ ☆ Brain3D: Generating 3D Objects from fMRI
Understanding the hidden mechanisms behind human's visual perception is a fundamental question in neuroscience. To that end, investigating into the neural responses of human mind activities, such as functional Magnetic Resonance Imaging (fMRI), has been a significant research vehicle. However, analyzing fMRI signals is challenging, costly, daunting, and demanding for professional training. Despite remarkable progress in fMRI analysis, existing approaches are limited to generating 2D images and far away from being biologically meaningful and practically useful. Under this insight, we propose to generate visually plausible and functionally more comprehensive 3D outputs decoded from brain signals, enabling more sophisticated modeling of fMRI data. Conceptually, we reformulate this task as a {\em fMRI conditioned 3D object generation} problem. We design a novel 3D object representation learning method, Brain3D, that takes as input the fMRI data of a subject who was presented with a 2D image, and yields as output the corresponding 3D object images. The key capabilities of this model include tackling the noises with high-level semantic signals and a two-stage architecture design for progressive high-level information integration. Extensive experiments validate the superior capability of our model over previous state-of-the-art 3D object generation methods. Importantly, we show that our model captures the distinct functionalities of each region of human vision system as well as their intricate interplay relationships, aligning remarkably with the established discoveries in neuroscience. Further, preliminary evaluations indicate that Brain3D can successfully identify the disordered brain regions in simulated scenarios, such as V1, V2, V3, V4, and the medial temporal lobe (MTL) within the human visual system. Our data and code will be available at https://brain-3d.github.io/.
comment: IJCV 2025
♻ ☆ Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow
Multi-Agent System (MAS) powered by Visual Language Models (VLMs) enables challenging tasks but suffers from a novel failure term, multi-agent visual hallucination snowballing, where hallucinations are seeded in a single agent and amplified by following ones due to the over-reliance on textual flow to relay visual information. Through turn-, layer-, and token-wise attention analyses, we provide detailed insights into the essence of hallucination snowballing regarding the reduction of visual attention allocation. It leads us to identify a subset of vision tokens with a unimodal attention peak in middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, resulting in the visual hallucination snowballing in MAS. Thus, we propose ViF, a lightweight, plug-and-play mitigation paradigm that relays inter-agent messages with Visual Flow powered by the selected visual relay tokens and applies attention reallocation to amplify this pattern. The experiment results demonstrate that our method markedly reduces hallucination snowballing, consistently improving the performance across eight benchmarks based on four common MAS structures and ten base models. The source code is publicly available at: https://github.com/YU-deep/ViF.git.
♻ ☆ MINGLE: Mixture of Null-Space Gated Low-Rank Experts for Test-Time Continual Model Merging NeurIPS 2025
Continual model merging integrates independently fine-tuned models sequentially without access to the original training data, offering a scalable and efficient solution for continual learning. However, existing methods face two critical challenges: parameter interference among tasks, which leads to catastrophic forgetting, and limited adaptability to evolving test distributions. To address these issues, we introduce the task of Test-Time Continual Model Merging (TTCMM), which leverages a small set of unlabeled test samples during inference to alleviate parameter conflicts and handle distribution shifts. We propose MINGLE, a novel framework for TTCMM. MINGLE employs a mixture-of-experts architecture with parameter-efficient, low-rank experts, which enhances adaptability to evolving test distributions while dynamically merging models to mitigate conflicts. To further reduce forgetting, we propose Null-Space Constrained Gating, which restricts gating updates to subspaces orthogonal to prior task representations, thereby suppressing activations on old tasks and preserving past knowledge. We further introduce an Adaptive Relaxation Strategy that adjusts constraint strength dynamically based on interference signals observed during test-time adaptation, striking a balance between stability and adaptability. Extensive experiments on standard continual merging benchmarks demonstrate that MINGLE achieves robust generalization, significantly reduces forgetting, and consistently surpasses previous state-of-the-art methods by 7-9% on average across diverse task orders. Our code is available at: https://github.com/zihuanqiu/MINGLE
comment: accepted by NeurIPS 2025
♻ ☆ Where are we with calibration under dataset shift in image classification?
We conduct an extensive study on the state of calibration under real-world dataset shift for image classification. Our work provides important insights on the choice of post-hoc and in-training calibration techniques, and yields practical guidelines for all practitioners interested in robust calibration under shift. We compare various post-hoc calibration methods, and their interactions with common in-training calibration strategies (e.g., label smoothing), across a wide range of natural shifts, on eight different classification tasks across several imaging domains. We find that: (i) simultaneously applying entropy regularisation and label smoothing yield the best calibrated raw probabilities under dataset shift, (ii) post-hoc calibrators exposed to a small amount of semantic out-of-distribution data (unrelated to the task) are most robust under shift, (iii) recent calibration methods specifically aimed at increasing calibration under shifts do not necessarily offer significant improvements over simpler post-hoc calibration methods, (iv) improving calibration under shifts often comes at the cost of worsening in-distribution calibration. Importantly, these findings hold for randomly initialised classifiers, as well as for those finetuned from foundation models, the latter being consistently better calibrated compared to models trained from scratch. Finally, we conduct an in-depth analysis of ensembling effects, finding that (i) applying calibration prior to ensembling (instead of after) is more effective for calibration under shifts, (ii) for ensembles, OOD exposure deteriorates the ID-shifted calibration trade-off, (iii) ensembling remains one of the most effective methods to improve calibration robustness and, combined with finetuning from foundation models, yields best calibration results overall.
comment: Code available at https://github.com/biomedia-mira/calibration_under_shifts. Published in TMLR, October 2025 (https://openreview.net/forum?id=1NYKXlRU2H)
♻ ☆ Breaking the Discretization Barrier of Continuous Physics Simulation Learning
The modeling of complicated time-evolving physical dynamics from partial observations is a long-standing challenge. Particularly, observations can be sparsely distributed in a seemingly random or unstructured manner, making it difficult to capture highly nonlinear features in a variety of scientific and engineering problems. However, existing data-driven approaches are often constrained by fixed spatial and temporal discretization. While some researchers attempt to achieve spatio-temporal continuity by designing novel strategies, they either overly rely on traditional numerical methods or fail to truly overcome the limitations imposed by discretization. To address these, we propose CoPS, a purely data-driven methods, to effectively model continuous physics simulation from partial observations. Specifically, we employ multiplicative filter network to fuse and encode spatial information with the corresponding observations. Then we customize geometric grids and use message-passing mechanism to map features from original spatial domain to the customized grids. Subsequently, CoPS models continuous-time dynamics by designing multi-scale graph ODEs, while introducing a Markov-based neural auto-correction module to assist and constrain the continuous extrapolations. Comprehensive experiments demonstrate that CoPS advances the state-of-the-art methods in space-time continuous modeling across various scenarios.
♻ ☆ MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning
Remote sensing images contain complex spatial patterns and semantic structures, which makes the captioning model difficult to accurately describe. Encoder-decoder architectures have become the widely used approach for RSIC by translating visual content into descriptive text. However, many existing methods rely on a single-stream architecture, which weakens the model to accurately describe the image. Such single-stream architectures typically struggle to extract diverse spatial features or capture complex semantic relationships, limiting their effectiveness in scenes with high intraclass similarity or contextual ambiguity. In this work, we propose a novel Multi-stream Encoder-decoder Framework (MsEdF) which improves the performance of RSIC by optimizing both the spatial representation and language generation of encoder-decoder architecture. The encoder fuses information from two complementary image encoders, thereby promoting feature diversity through the integration of multiscale and structurally distinct cues. To improve the capture of context-aware descriptions, we refine the input sequence's semantic modeling on the decoder side using a stacked GRU architecture with an element-wise aggregation scheme. Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.
♻ ☆ FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views
We present FLARE, a feed-forward model designed to infer high-quality camera poses and 3D geometry from uncalibrated sparse-view images (i.e., as few as 2-8 inputs), which is a challenging yet practical setting in real-world applications. Our solution features a cascaded learning paradigm with camera pose serving as the critical bridge, recognizing its essential role in mapping 3D structures onto 2D image planes. Concretely, FLARE starts with camera pose estimation, whose results condition the subsequent learning of geometric structure and appearance, optimized through the objectives of geometry reconstruction and novel-view synthesis. Utilizing large-scale public datasets for training, our method delivers state-of-the-art performance in the tasks of pose estimation, geometry reconstruction, and novel view synthesis, while maintaining the inference efficiency (i.e., less than 0.5 seconds). The project page and code can be found at: https://zhanghe3z.github.io/FLARE/
Machine Learning 150
☆ Semantic World Models
Planning with world models offers a powerful paradigm for robotic control. Conventional approaches train a model to predict future frames conditioned on current frames and actions, which can then be used for planning. However, the objective of predicting future pixels is often at odds with the actual planning objective; strong pixel reconstruction does not always correlate with good planning decisions. This paper posits that instead of reconstructing future frames as pixels, world models only need to predict task-relevant semantic information about the future. For such prediction the paper poses world modeling as a visual question answering problem about semantic information in future frames. This perspective allows world modeling to be approached with the same tools underlying vision language models. Thus vision language models can be trained as "semantic" world models through a supervised finetuning process on image-action-text data, enabling planning for decision-making while inheriting many of the generalization and robustness properties from the pretrained vision-language models. The paper demonstrates how such a semantic world model can be used for policy improvement on open-ended robotics tasks, leading to significant generalization improvements over typical paradigms of reconstruction-based action-conditional world modeling. Website available at https://weirdlabuw.github.io/swm.
☆ Hubble: a Model Suite to Advance the Study of LLM Memorization
We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models -- standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens -- establishing that memorization risks are determined by the frequency of sensitive data relative to size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing that sensitive data without continued exposure can be forgotten. These findings suggest two best practices for addressing memorization risks: to dilute sensitive data by increasing the size of the training corpus, and to order sensitive data to appear earlier in training. Beyond these general empirical findings, Hubble enables a broad range of memorization research; for example, analyzing the biographies reveals how readily different types of private information are memorized. We also demonstrate that the randomized insertions in Hubble make it an ideal testbed for membership inference and machine unlearning, and invite the community to further explore, benchmark, and build upon our work.
☆ Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.
☆ Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the ''learning cliff'' phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model's independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO's effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLM.
comment: Code: https://github.com/dvlab-research/Scaf-GRPO
☆ The Feasibility of Training Sovereign Language Models in the Global South: A Study of Brazil and Mexico
The rapid escalation of computational requirements for training large-scale language models has reinforced structural asymmetries between high-capacity jurisdictions and countries in the Global South. This paper examines the technical and fiscal feasibility of sovereign-scale language model training in Brazil and Mexico under conditions of constrained hardware access, energy availability, and fiscal ceilings. Using a dual-axis design that varies accelerator generation (NVIDIA H100 vs. A100) and training duration (90 vs. 150 days), we estimate compute demand, energy consumption, capital expenditures, and regulatory compatibility for the training of a 10-trillion-token model. Our findings show that while all configurations remain below export-control and electrical infrastructure thresholds, fiscal viability is determined by hardware efficiency. H100-based scenarios achieve training feasibility at a total cost of 8-14 million USD, while A100 deployments require 19-32 million USD due to higher energy and hardware demand. We argue that extending training timelines should be treated as a policy lever to mitigate hardware constraints, enabling the production of usable, auditable, and locally aligned models without competing at the global frontier. This study contributes to the discourse on AI compute governance and technological sovereignty by highlighting context-sensitive strategies that allow middle-income countries to establish sustainable and strategically sufficient AI capabilities.
comment: 11 pages, 3 figures
☆ Integrating Transparent Models, LLMs, and Practitioner-in-the-Loop: A Case of Nonprofit Program Evaluation
Public and nonprofit organizations often hesitate to adopt AI tools because most models are opaque even though standard approaches typically analyze aggregate patterns rather than offering actionable, case-level guidance. This study tests a practitioner-in-the-loop workflow that pairs transparent decision-tree models with large language models (LLMs) to improve predictive accuracy, interpretability, and the generation of practical insights. Using data from an ongoing college-success program, we build interpretable decision trees to surface key predictors. We then provide each tree's structure to an LLM, enabling it to reproduce case-level predictions grounded in the transparent models. Practitioners participate throughout feature engineering, model design, explanation review, and usability assessment, ensuring that field expertise informs the analysis at every stage. Results show that integrating transparent models, LLMs, and practitioner input yields accurate, trustworthy, and actionable case-level evaluations, offering a viable pathway for responsible AI adoption in the public and nonprofit sectors.
Transformers are almost optimal metalearners for linear classification
Transformers have demonstrated impressive in-context learning (ICL) capabilities, raising the question of whether they can serve as metalearners that adapt to new tasks using only a small number of in-context examples, without any further training. While recent theoretical work has studied transformers' ability to perform ICL, most of these analyses do not address the formal metalearning setting, where the objective is to solve a collection of related tasks more efficiently than would be possible by solving each task individually. In this paper, we provide the first theoretical analysis showing that a simplified transformer architecture trained via gradient descent can act as a near-optimal metalearner in a linear classification setting. We consider a natural family of tasks where each task corresponds to a class-conditional Gaussian mixture model, with the mean vectors lying in a shared $k$-dimensional subspace of $R^d$. After training on a sufficient number of such tasks, we show that the transformer can generalize to a new task using only $O(k / R^4)$ in-context examples, where $R$ denotes the signal strength at test time. This performance (almost) matches that of an optimal learner that knows exactly the shared subspace and significantly outperforms any learner that only has access to the in-context data, which requires $\Omega(d / R^4)$ examples to generalize. Importantly, our bounds on the number of training tasks and examples per task needed to achieve this result are independent of the ambient dimension $d$.
☆ Blackbox Model Provenance via Palimpsestic Membership Inference
Suppose Alice trains an open-weight language model and Bob uses a blackbox derivative of Alice's model to produce text. Can Alice prove that Bob is using her model, either by querying Bob's derivative model (query setting) or from the text alone (observational setting)? We formulate this question as an independence testing problem--in which the null hypothesis is that Bob's model or text is independent of Alice's randomized training run--and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so we can test whether Bob is using Alice's model using test statistics that capture correlation between Bob's model or text and the ordering of training examples in Alice's training run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable statistical evidence against the null hypothesis, regardless of the composition of Alice's training data. In the query setting, we directly estimate (via prompting) the likelihood Bob's model gives to Alice's training examples and order; we correlate the likelihoods of over 40 fine-tunes of various Pythia and OLMo base models ranging from 1B to 12B parameters with the base model's training data order, achieving a p-value on the order of at most 1e-8 in all but six cases. In the observational setting, we try two approaches based on estimating 1) the likelihood of Bob's text overlapping with spans of Alice's training examples and 2) the likelihood of Bob's text with respect to different versions of Alice's model we obtain by repeating the last phase (e.g., 1%) of her training run on reshuffled data. The second approach can reliably distinguish Bob's text from as little as a few hundred tokens; the first does not involve any retraining but requires many more tokens (several hundred thousand) to achieve high power.
☆ Benchmarking World-Model Learning
Model-learning agents should gather information to learn world models that support many downstream tasks and inferences, such as predicting unobserved states, estimating near- and far-term consequences of actions, planning action sequences, and detecting changes in dynamics. Current methods for learning and evaluating world models diverge from this goal: training and evaluation are anchored to next-frame prediction, and success is scored by reward maximization in the same environment. We propose WorldTest, a protocol to evaluate model-learning agents that separates reward-free interaction from a scored test phase in a different but related environment. WorldTest is open-ended$\unicode{x2014}$models should support many different tasks unknown ahead of time$\unicode{x2014}$and agnostic to model representation, allowing comparison across approaches. We instantiated WorldTest with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks across three families: masked-frame prediction, planning, and predicting changes to the causal dynamics. We compared 517 human participants and three frontier models on AutumnBench. We found that humans outperform the models, and scaling compute improves performance only in some environments but not others. WorldTest provides a novel template$\unicode{x2014}$reward-free exploration, derived tests, and behavior-based scoring$\unicode{x2014}$to evaluate what agents learn about environment dynamics, and AutumnBench exposes significant headroom in world-model learning.
comment: 30 pages, 10 figures
☆ Environment Inference for Learning Generalizable Dynamical System NeurIPS 2025
Data-driven methods offer efficient and robust solutions for analyzing complex dynamical systems but rely on the assumption of I.I.D. data, driving the development of generalization techniques for handling environmental differences. These techniques, however, are limited by their dependence on environment labels, which are often unavailable during training due to data acquisition challenges, privacy concerns, and environmental variability, particularly in large public datasets and privacy-sensitive domains. In response, we propose DynaInfer, a novel method that infers environment specifications by analyzing prediction errors from fixed neural networks within each training round, enabling environment assignments directly from data. We prove our algorithm effectively solves the alternating optimization problem in unlabeled scenarios and validate it through extensive experiments across diverse dynamical systems. Results show that DynaInfer outperforms existing environment assignment techniques, converges rapidly to true labels, and even achieves superior performance when environment labels are available.
comment: NeurIPS 2025 Spotlight
☆ AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize token acceptance rate. Therefore, draft models often struggle to fully assimilate the target model's knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15\%). The code is publicly available at https://github.com/yuezhouhu/adaspec.
☆ GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters
Sparse fine-tuning techniques adapt LLMs to downstream tasks by only tuning a sparse subset of model parameters. However, the effectiveness of sparse adaptation depends on optimally selecting the model parameters to be fine-tuned. In this work, we introduce a novel sparse fine-tuning technique named GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters, which fine-tunes only those model parameters which have the largest gradient magnitudes on downstream tasks and the smallest pre-trained magnitudes, intuitively prioritizing parameters that are highly task-relevant, but minimally disruptive to pre-trained knowledge. Our experimentation with LLaMA3 8B and Gemma 2B as base models shows that GaLLoP consistently improves or matches the in-distribution as well as out-of-distribution performance obtained via the usage of other leading parameter-efficient fine-tuning techniques, including LoRA, DoRA, and SAFT. Our analysis demonstrates that GaLLoP mitigates catastrophic forgetting and memorization of task data, as important pre-trained parameters remain unchanged, and stabilizes performance relative to other fine-tuning techniques, robustly generalizing across most random seeds.
☆ The Tail Tells All: Estimating Model-Level Membership Inference Vulnerability Without Reference Models
Membership inference attacks (MIAs) have emerged as the standard tool for evaluating the privacy risks of AI models. However, state-of-the-art attacks require training numerous, often computationally expensive, reference models, limiting their practicality. We present a novel approach for estimating model-level vulnerability, the TPR at low FPR, to membership inference attacks without requiring reference models. Empirical analysis shows loss distributions to be asymmetric and heavy-tailed and suggests that most points at risk from MIAs have moved from the tail (high-loss region) to the head (low-loss region) of the distribution after training. We leverage this insight to propose a method to estimate model-level vulnerability from the training and testing distribution alone: using the absence of outliers from the high-loss region as a predictor of the risk. We evaluate our method, the TNR of a simple loss attack, across a wide range of architectures and datasets and show it to accurately estimate model-level vulnerability to the SOTA MIA attack (LiRA). We also show our method to outperform both low-cost (few reference models) attacks such as RMIA and other measures of distribution difference. We finally evaluate the use of non-linear functions to evaluate risk and show the approach to be promising to evaluate the risk in large-language models.
☆ SmartSwitch: Advancing LLM Reasoning by Overcoming Underthinking via Promoting Deeper Thought Exploration
The long chain-of-thought (LongCoT) capability is central to the recent breakthroughs achieved by large language models in complex reasoning tasks. However, the accompanying issue of ''underthinking'', where models exhibit shallow reasoning by frequently switching thoughts without sufficient exploration, limits both performance and token efficiency. To address this problem, we propose a simple yet effective reasoning strategy: the SmartSwitch inference framework. This framework can be easily integrated into any large language model as a plug-and-play solution, continuously monitoring the model's reasoning process to detect underthinking and guide it toward deeper exploration of promising but overlooked thoughts. Specifically, the perception module identifies points where thoughts switch and evaluates the potential of the preceding thought using an off-the-shelf process reward model (PRM). If a high-potential thought is found to be prematurely abandoned, the intervention module interrupts the ongoing inference, backtracks to the point before the switch, and inserts a "deepening prompt" to encourage further exploration along that promising path. Extensive experiments on challenging mathematical reasoning benchmarks demonstrate that our method significantly enhances the performance of various large language models of different sizes.
comment: Code: https://github.com/dvlab-research/SmartSwitch
☆ Exploring the Effect of DNN Depth on Adversarial Attacks in Network Intrusion Detection Systems
Adversarial attacks pose significant challenges to Machine Learning (ML) systems and especially Deep Neural Networks (DNNs) by subtly manipulating inputs to induce incorrect predictions. This paper investigates whether increasing the layer depth of deep neural networks affects their robustness against adversarial attacks in the Network Intrusion Detection System (NIDS) domain. We compare the adversarial robustness of various deep neural networks across both \ac{NIDS} and computer vision domains (the latter being widely used in adversarial attack experiments). Our experimental results reveal that in the NIDS domain, adding more layers does not necessarily improve their performance, yet it may actually significantly degrade their robustness against adversarial attacks. Conversely, in the computer vision domain, adding more layers exhibits a more modest impact on robustness. These findings can guide the development of robust neural networks for (NIDS) applications and highlight the unique characteristics of network security domains within the (ML) landscape.
☆ A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation
Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent \textit{multi-step iterations} and \textit{complex backbone networks} lead to prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, \textbf{Diffusion Caching} offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process. By enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching evolves from \textit{static reuse} to \textit{dynamic prediction}. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time and efficient generative AI, injecting new vitality into both theory and practice of \textit{Efficient Generative Intelligence}.
comment: 22 pages,2 figures
☆ CONFEX: Uncertainty-Aware Counterfactual Explanations with Conformal Guarantees
Counterfactual explanations (CFXs) provide human-understandable justifications for model predictions, enabling actionable recourse and enhancing interpretability. To be reliable, CFXs must avoid regions of high predictive uncertainty, where explanations may be misleading or inapplicable. However, existing methods often neglect uncertainty or lack principled mechanisms for incorporating it with formal guarantees. We propose CONFEX, a novel method for generating uncertainty-aware counterfactual explanations using Conformal Prediction (CP) and Mixed-Integer Linear Programming (MILP). CONFEX explanations are designed to provide local coverage guarantees, addressing the issue that CFX generation violates exchangeability. To do so, we develop a novel localised CP procedure that enjoys an efficient MILP encoding by leveraging an offline tree-based partitioning of the input space. This way, CONFEX generates CFXs with rigorous guarantees on both predictive uncertainty and optimality. We evaluate CONFEX against state-of-the-art methods across diverse benchmarks and metrics, demonstrating that our uncertainty-aware approach yields robust and plausible explanations.
comment: 35 pages, 10 figures, 21 tables, 2 algorithms. [Main paper part consists of 11 pages, 2 figures, 1 table, 1 algorithm]
☆ When Do Transformers Learn Heuristics for Graph Connectivity?
Transformers often fail to learn generalizable algorithms, instead relying on brittle heuristics. Using graph connectivity as a testbed, we explain this phenomenon both theoretically and empirically. We consider a simplified Transformer architecture, the disentangled Transformer, and prove that an $L$-layer model has capacity to solve for graphs with diameters up to exactly $3^L$, implementing an algorithm equivalent to computing powers of the adjacency matrix. We analyze the training-dynamics, and show that the learned strategy hinges on whether most training instances are within this model capacity. Within-capacity graphs (diameter $\leq 3^L$) drive the learning of a correct algorithmic solution while beyond-capacity graphs drive the learning of a simple heuristic based on node degrees. Finally, we empirically demonstrate that restricting training data within a model's capacity leads to both standard and disentangled transformers learning the exact algorithm rather than the degree-based heuristic.
☆ BATIS: Bayesian Approaches for Targeted Improvement of Species Distribution Models
Species distribution models (SDMs), which aim to predict species occurrence based on environmental variables, are widely used to monitor and respond to biodiversity change. Recent deep learning advances for SDMs have been shown to perform well on complex and heterogeneous datasets, but their effectiveness remains limited by spatial biases in the data. In this paper, we revisit deep SDMs from a Bayesian perspective and introduce BATIS, a novel and practical framework wherein prior predictions are updated iteratively using limited observational data. Models must appropriately capture both aleatoric and epistemic uncertainty to effectively combine fine-grained local insights with broader ecological patterns. We benchmark an extensive set of uncertainty quantification approaches on a novel dataset including citizen science observations from the eBird platform. Our empirical study shows how Bayesian deep learning approaches can greatly improve the reliability of SDMs in data-scarce locations, which can contribute to ecological understanding and conservation efforts.
☆ Statistical Inference for Linear Functionals of Online Least-squares SGD when $t \gtrsim d^{1+δ}$
Stochastic Gradient Descent (SGD) has become a cornerstone method in modern data science. However, deploying SGD in high-stakes applications necessitates rigorous quantification of its inherent uncertainty. In this work, we establish \emph{non-asymptotic Berry--Esseen bounds} for linear functionals of online least-squares SGD, thereby providing a Gaussian Central Limit Theorem (CLT) in a \emph{growing-dimensional regime}. Existing approaches to high-dimensional inference for projection parameters, such as~\cite{chang2023inference}, rely on inverting empirical covariance matrices and require at least $t \gtrsim d^{3/2}$ iterations to achieve finite-sample Berry--Esseen guarantees, rendering them computationally expensive and restrictive in the allowable dimensional scaling. In contrast, we show that a CLT holds for SGD iterates when the number of iterations grows as $t \gtrsim d^{1+\delta}$ for any $\delta > 0$, significantly extending the dimensional regime permitted by prior works while improving computational efficiency. The proposed online SGD-based procedure operates in $\mathcal{O}(td)$ time and requires only $\mathcal{O}(d)$ memory, in contrast to the $\mathcal{O}(td^2 + d^3)$ runtime of covariance-inversion methods. To render the theory practically applicable, we further develop an \emph{online variance estimator} for the asymptotic variance appearing in the CLT and establish \emph{high-probability deviation bounds} for this estimator. Collectively, these results yield the first fully online and data-driven framework for constructing confidence intervals for SGD iterates in the near-optimal scaling regime $t \gtrsim d^{1+\delta}$.
comment: Improved version of arXiv:2302.09727 with new results
☆ Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning
Large Language Model (LLM) conditioning refers to instructing an LLM to generate content in accordance with the norms and values of a specific culture, beliefs of a particular political orientation, or any desired text-specified semantic conditioning. Unfortunately, prompt engineering does not ensure that LLMs behave in accordance with a desired conditioning due to the inductive bias of the pre-training and alignment datasets. Prior works have focused on fine-tuning LLMs by directly conditioning the LoRA weights; however, such methods introduce a large number of parameters. As a remedy, we propose Zhyper, a parameter-efficient factorized hypernetwork framework that generates context-aware LoRA adapters from textual descriptions. Experiments on multiple benchmarks show that Zhyper achieves competitive performance with up to 26x fewer parameters than the state-of-the-art baselines. Furthermore, we extend Zhyper to cultural alignment, demonstrating improved generalization to out-of-domain settings and a better capturing of fine-grained contextual values.
☆ Bridging Earth and Space: A Survey on HAPS for Non-Terrestrial Networks
HAPS are emerging as key enablers in the evolution of 6G wireless networks, bridging terrestrial and non-terrestrial infrastructures. Operating in the stratosphere, HAPS can provide wide-area coverage, low-latency, energy-efficient broadband communications with flexible deployment options for diverse applications. This survey delivers a comprehensive overview of HAPS use cases, technologies, and integration strategies within the 6G ecosystem. The roles of HAPS in extending connectivity to underserved regions, supporting dynamic backhauling, enabling massive IoT, and delivering reliable low-latency communications for autonomous and immersive services are discussed. The paper reviews state-of-the-art architectures for terrestrial and non-terrestrial network integration, highlights recent field trials. Furthermore, key enabling technologies such as channel modeling, AI-driven resource allocation, interference control, mobility management, and energy-efficient communications are examined. The paper also outlines open research challenges. By addressing existing gaps in the literature, this survey positions HAPS as a foundational component of globally integrated, resilient, and sustainable 6G networks.
comment: 30 pages. This work has been submitted to IEEE Communications Surveys & Tutorials (under review)
☆ Enabling Granular Subgroup Level Model Evaluations by Generating Synthetic Medical Time Series
We present a novel framework for leveraging synthetic ICU time-series data not only to train but also to rigorously and trustworthily evaluate predictive models, both at the population level and within fine-grained demographic subgroups. Building on prior diffusion and VAE-based generators (TimeDiff, HealthGen, TimeAutoDiff), we introduce \textit{Enhanced TimeAutoDiff}, which augments the latent diffusion objective with distribution-alignment penalties. We extensively benchmark all models on MIMIC-III and eICU, on 24-hour mortality and binary length-of-stay tasks. Our results show that Enhanced TimeAutoDiff reduces the gap between real-on-synthetic and real-on-real evaluation (``TRTS gap'') by over 70\%, achieving $\Delta_{TRTS} \leq 0.014$ AUROC, while preserving training utility ($\Delta_{TSTR} \approx 0.01$). Crucially, for 32 intersectional subgroups, large synthetic cohorts cut subgroup-level AUROC estimation error by up to 50\% relative to small real test sets, and outperform them in 72--84\% of subgroups. This work provides a practical, privacy-preserving roadmap for trustworthy, granular model evaluation in critical care, enabling robust and reliable performance analysis across diverse patient populations without exposing sensitive EHR data, contributing to the overall trustworthiness of Medical AI.
☆ SEMPO: Lightweight Foundation Models for Time Series Forecasting NeurIPS 2025
The recent boom of large pre-trained models witnesses remarkable success in developing foundation models (FMs) for time series forecasting. Despite impressive performance across diverse downstream forecasting tasks, existing time series FMs possess massive network architectures and require substantial pre-training on large-scale datasets, which significantly hinders their deployment in resource-constrained environments. In response to this growing tension between versatility and affordability, we propose SEMPO, a novel lightweight foundation model that requires pretraining on relatively small-scale data, yet exhibits strong general time series forecasting. Concretely, SEMPO comprises two key modules: 1) energy-aware SpEctral decomposition module, that substantially improves the utilization of pre-training data by modeling not only the high-energy frequency signals but also the low-energy yet informative frequency signals that are ignored in current methods; and 2) Mixture-of-PrOmpts enabled Transformer, that learns heterogeneous temporal patterns through small dataset-specific prompts and adaptively routes time series tokens to prompt-based experts for parameter-efficient model adaptation across different datasets and domains. Equipped with these modules, SEMPO significantly reduces both pre-training data scale and model size, while achieving strong generalization. Extensive experiments on two large-scale benchmarks covering 16 datasets demonstrate the superior performance of SEMPO in both zero-shot and few-shot forecasting scenarios compared with state-of-the-art methods. Code and data are available at https://github.com/mala-lab/SEMPO.
comment: Accepted by NeurIPS 2025
☆ Fast Inference via Hierarchical Speculative Decoding
Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality, by leveraging a small draft model to propose tokens that the larger target model verifies in parallel. In practice, however, there may exist a set of potential draft models- ranging from faster but less inaccurate, to slower yet more reliable. We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks these draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass, until finally the target model verifies tokens. We derive an expression for the expected latency of any such hierarchy and show that selecting the latency-optimal hierarchy can be done in polynomial time. Empirically, HSD gives up to 1.2x speed-up over the best single-draft baseline, demonstrating the practicality of our algorithm in reducing generation latency beyond previous techniques.
☆ Serverless GPU Architecture for Enterprise HR Analytics: A Production-Scale BDaaS Implementation
Industrial and government organizations increasingly depend on data-driven analytics for workforce, finance, and regulated decision processes, where timeliness, cost efficiency, and compliance are critical. Distributed frameworks such as Spark and Flink remain effective for massive-scale batch or streaming analytics but introduce coordination complexity and auditing overheads that misalign with moderate-scale, latency-sensitive inference. Meanwhile, cloud providers now offer serverless GPUs, and models such as TabNet enable interpretable tabular ML, motivating new deployment blueprints for regulated environments. In this paper, we present a production-oriented Big Data as a Service (BDaaS) blueprint that integrates a single-node serverless GPU runtime with TabNet. The design leverages GPU acceleration for throughput, serverless elasticity for cost reduction, and feature-mask interpretability for IL4/FIPS compliance. We conduct benchmarks on the HR, Adult, and BLS datasets, comparing our approach against Spark and CPU baselines. Our results show that GPU pipelines achieve up to 4.5x higher throughput, 98x lower latency, and 90% lower cost per 1K inferences compared to Spark baselines, while compliance mechanisms add only ~5.7 ms latency with p99 < 22 ms. Interpretability remains stable under peak load, ensuring reliable auditability. Taken together, these findings provide a compliance-aware benchmark, a reproducible Helm-packaged blueprint, and a decision framework that demonstrate the practicality of secure, interpretable, and cost-efficient serverless GPU analytics for regulated enterprise and government settings.
comment: 10 pages, 7 figures, 4 tables. Accepted to IEEE BigData 2025
☆ Are Large Language Models Sensitive to the Motives Behind Communication? NeurIPS 2025
Human communication is motivated: people speak, write, and create content with a particular communicative intent in mind. As a result, information that large language models (LLMs) and AI agents process is inherently framed by humans' intentions and incentives. People are adept at navigating such nuanced information: we routinely identify benevolent or self-serving motives in order to decide what statements to trust. For LLMs to be effective in the real world, they too must critically evaluate content by factoring in the motivations of the source -- for instance, weighing the credibility of claims made in a sales pitch. In this paper, we undertake a comprehensive study of whether LLMs have this capacity for motivational vigilance. We first employ controlled experiments from cognitive science to verify that LLMs' behavior is consistent with rational models of learning from motivated testimony, and find they successfully discount information from biased sources in a human-like manner. We then extend our evaluation to sponsored online adverts, a more naturalistic reflection of LLM agents' information ecosystems. In these settings, we find that LLMs' inferences do not track the rational models' predictions nearly as closely -- partly due to additional information that distracts them from vigilance-relevant considerations. However, a simple steering intervention that boosts the salience of intentions and incentives substantially increases the correspondence between LLMs and the rational model. These results suggest that LLMs possess a basic sensitivity to the motivations of others, but generalizing to novel real-world settings will require further improvements to these models.
comment: NeurIPS 2025
☆ Study of Training Dynamics for Memory-Constrained Fine-Tuning
Memory-efficient training of deep neural networks has become increasingly important as models grow larger while deployment environments impose strict resource constraints. We propose TraDy, a novel transfer learning scheme leveraging two key insights: layer importance for updates is architecture-dependent and determinable a priori, while dynamic stochastic channel selection provides superior gradient approximation compared to static approaches. We introduce a dynamic channel selection approach that stochastically resamples channels between epochs within preselected layers. Extensive experiments demonstrate TraDy achieves state-of-the-art performance across various downstream tasks and architectures while maintaining strict memory constraints, achieving up to 99% activation sparsity, 95% weight derivative sparsity, and 97% reduction in FLOPs for weight derivative computation.
☆ Policy Learning with Abstention
Policy learning algorithms are widely used in areas such as personalized medicine and advertising to develop individualized treatment regimes. However, most methods force a decision even when predictions are uncertain, which is risky in high-stakes settings. We study policy learning with abstention, where a policy may defer to a safe default or an expert. When a policy abstains, it receives a small additive reward on top of the value of a random guess. We propose a two-stage learner that first identifies a set of near-optimal policies and then constructs an abstention rule from their disagreements. We establish fast O(1/n)-type regret guarantees when propensities are known, and extend these guarantees to the unknown-propensity case via a doubly robust (DR) objective. We further show that abstention is a versatile tool with direct applications to other core problems in policy learning: it yields improved guarantees under margin conditions without the common realizability assumption, connects to distributionally robust policy learning by hedging against small data shifts, and supports safe policy improvement by ensuring improvement over a baseline policy with high probability.
☆ Remarks on a recent preprint of Chernikov and Towsner
In this brief note, we first give a counterexample to a theorem in Chernikov and Towsner, arXiv:2510.02420(1). In arXiv:2510.02420(2), the theorem has changed but as we explain the proof has a mistake. The change in the statement, due to changes in the underlying definition, affects the paper's claims. Since that theorem had been relevant to connecting the work of their paper to Coregliano-Malliaris high-arity PAC learning, a connection which now disappears, we also explain why their definitions miss crucial aspects that our work was designed to grapple with.
☆ Overlap-weighted orthogonal meta-learner for treatment effect estimation over time
Estimating heterogeneous treatment effects (HTEs) in time-varying settings is particularly challenging, as the probability of observing certain treatment sequences decreases exponentially with longer prediction horizons. Thus, the observed data contain little support for many plausible treatment sequences, which creates severe overlap problems. Existing meta-learners for the time-varying setting typically assume adequate treatment overlap, and thus suffer from exploding estimation variance when the overlap is low. To address this problem, we introduce a novel overlap-weighted orthogonal (WO) meta-learner for estimating HTEs that targets regions in the observed data with high probability of receiving the interventional treatment sequences. This offers a fully data-driven approach through which our WO-learner can counteract instabilities as in existing meta-learners and thus obtain more reliable HTE estimates. Methodologically, we develop a novel Neyman-orthogonal population risk function that minimizes the overlap-weighted oracle risk. We show that our WO-learner has the favorable property of Neyman-orthogonality, meaning that it is robust against misspecification in the nuisance functions. Further, our WO-learner is fully model-agnostic and can be applied to any machine learning model. Through extensive experiments with both transformer and LSTM backbones, we demonstrate the benefits of our novel WO-learner.
☆ Latent Space Factorization in LoRA NeurIPS 2025
Low-rank adaptation (LoRA) is a widely used method for parameter-efficient finetuning. However, existing LoRA variants lack mechanisms to explicitly disambiguate task-relevant information within the learned low-rank subspace, potentially limiting downstream performance. We propose Factorized Variational Autoencoder LoRA (FVAE-LoRA), which leverages a VAE to learn two distinct latent spaces. Our novel Evidence Lower Bound formulation explicitly promotes factorization between the latent spaces, dedicating one latent space to task-salient features and the other to residual information. Extensive experiments on text, audio, and image tasks demonstrate that FVAE-LoRA consistently outperforms standard LoRA. Moreover, spurious correlation evaluations confirm that FVAE-LoRA better isolates task-relevant signals, leading to improved robustness under distribution shifts. Our code is publicly available at: https://github.com/idiap/FVAE-LoRA
comment: Accepted at NeurIPS 2025
☆ Matrix-Free Least Squares Solvers: Values, Gradients, and What to Do With Them
This paper argues that the method of least squares has significant unfulfilled potential in modern machine learning, far beyond merely being a tool for fitting linear models. To release its potential, we derive custom gradients that transform the solver into a differentiable operator, like a neural network layer, enabling many diverse applications. Empirically, we demonstrate: (i) scalability by enforcing weight sparsity on a 50 million parameter model; (ii) imposing conservativeness constraints in score-based generative models; and (iii) hyperparameter tuning of Gaussian processes based on predictive performance. By doing this, our work represents the next iteration in developing differentiable linear-algebra tools and making them widely accessible to machine learning practitioners.
☆ Learning and Simulating Building Evacuation Patterns for Enhanced Safety Design Using Generative Models
Evacuation simulation is essential for building safety design, ensuring properly planned evacuation routes. However, traditional evacuation simulation relies heavily on refined modeling with extensive parameters, making it challenging to adopt such methods in a rapid iteration process in early design stages. Thus, this study proposes DiffEvac, a novel method to learn building evacuation patterns based on Generative Models (GMs), for efficient evacuation simulation and enhanced safety design. Initially, a dataset of 399 diverse functional layouts and corresponding evacuation heatmaps of buildings was established. Then, a decoupled feature representation is proposed to embed physical features like layouts and occupant density for GMs. Finally, a diffusion model based on image prompts is proposed to learn evacuation patterns from simulated evacuation heatmaps. Compared to existing research using Conditional GANs with RGB representation, DiffEvac achieves up to a 37.6% improvement in SSIM, 142% in PSNR, and delivers results 16 times faster, thereby cutting simulation time to 2 minutes. Case studies further demonstrate that the proposed method not only significantly enhances the rapid design iteration and adjustment process with efficient evacuation simulation but also offers new insights and technical pathways for future safety optimization in intelligent building design. The research implication is that the approach lowers the modeling burden, enables large-scale what-if exploration, and facilitates coupling with multi-objective design tools.
☆ A Climate-Aware Deep Learning Framework for Generalizable Epidemic Forecasting
Precise outbreak forecasting of infectious diseases is essential for effective public health responses and epidemic control. The increased availability of machine learning (ML) methods for time-series forecasting presents an enticing avenue to enhance outbreak forecasting. Though the COVID-19 outbreak demonstrated the value of applying ML models to predict epidemic profiles, using ML models to forecast endemic diseases remains underexplored. In this work, we present ForecastNet-XCL (an ensemble model based on XGBoost+CNN+BiLSTM), a deep learning hybrid framework designed to addresses this gap by creating accurate multi-week RSV forecasts up to 100 weeks in advance based on climate and temporal data, without access to real-time surveillance on RSV. The framework combines high-resolution feature learning with long-range temporal dependency capturing mechanisms, bolstered by an autoregressive module trained on climate-controlled lagged relations. Stochastic inference returns probabilistic intervals to inform decision-making. Evaluated across 34 U.S. states, ForecastNet-XCL reliably outperformed statistical baselines, individual neural nets, and conventional ensemble methods in both within- and cross-state scenarios, sustaining accuracy over extended forecast horizons. Training on climatologically diverse datasets enhanced generalization furthermore, particularly in locations having irregular or biennial RSV patterns. ForecastNet-XCL's efficiency, performance, and uncertainty-aware design make it a deployable early-warning tool amid escalating climate pressures and constrained surveillance resources.
☆ Uncertainty evaluation of segmentation models for Earth observation
This paper investigates methods for estimating uncertainty in semantic segmentation predictions derived from satellite imagery. Estimating uncertainty for segmentation presents unique challenges compared to standard image classification, requiring scalable methods producing per-pixel estimates. While most research on this topic has focused on scene understanding or medical imaging, this work benchmarks existing methods specifically for remote sensing and Earth observation applications. Our evaluation focuses on the practical utility of uncertainty measures, testing their ability to identify prediction errors and noise-corrupted input image regions. Experiments are conducted on two remote sensing datasets, PASTIS and ForTy, selected for their differences in scale, geographic coverage, and label confidence. We perform an extensive evaluation featuring several models, such as Stochastic Segmentation Networks and ensembles, in combination with a number of neural architectures and uncertainty metrics. We make a number of practical recommendations based on our findings.
☆ Multi-modal Co-learning for Earth Observation: Enhancing single-modality models via modality collaboration
Multi-modal co-learning is emerging as an effective paradigm in machine learning, enabling models to collaboratively learn from different modalities to enhance single-modality predictions. Earth Observation (EO) represents a quintessential domain for multi-modal data analysis, wherein diverse remote sensors collect data to sense our planet. This unprecedented volume of data introduces novel challenges. Specifically, the access to the same sensor modalities at both training and inference stages becomes increasingly complex based on real-world constraints affecting remote sensing platforms. In this context, multi-modal co-learning presents a promising strategy to leverage the vast amount of sensor-derived data available at the training stage to improve single-modality models for inference-time deployment. Most current research efforts focus on designing customized solutions for either particular downstream tasks or specific modalities available at the inference stage. To address this, we propose a novel multi-modal co-learning framework capable of generalizing across various tasks without targeting a specific modality for inference. Our approach combines contrastive and modality discriminative learning together to guide single-modality models to structure the internal model manifold into modality-shared and modality-specific information. We evaluate our framework on four EO benchmarks spanning classification and regression tasks across different sensor modalities, where only one of the modalities available during training is accessible at inference time. Our results demonstrate consistent predictive improvements over state-of-the-art approaches from the recent machine learning and computer vision literature, as well as EO-specific methods. The obtained findings validate our framework in the single-modality inference scenarios across a diverse range of EO applications.
comment: Accepted at the Machine Learning journal, CfP: Discovery Science 2024
☆ Demonstrating Real Advantage of Machine-Learning-Enhanced Monte Carlo for Combinatorial Optimization
Combinatorial optimization problems are central to both practical applications and the development of optimization methods. While classical and quantum algorithms have been refined over decades, machine learning-assisted approaches are comparatively recent and have not yet consistently outperformed simple, state-of-the-art classical methods. Here, we focus on a class of Quadratic Unconstrained Binary Optimization (QUBO) problems, specifically the challenge of finding minimum energy configurations in three-dimensional Ising spin glasses. We use a Global Annealing Monte Carlo algorithm that integrates standard local moves with global moves proposed via machine learning. We show that local moves play a crucial role in achieving optimal performance. Benchmarking against Simulated Annealing and Population Annealing, we demonstrate that Global Annealing not only surpasses the performance of Simulated Annealing but also exhibits greater robustness than Population Annealing, maintaining effectiveness across problem hardness and system size without hyperparameter tuning. These results provide, to our knowledge, the first clear and robust evidence that a machine learning-assisted optimization method can exceed the capabilities of classical state-of-the-art techniques in a combinatorial optimization setting.
comment: 13 main pages, 6 main figures. 4 supplementary pages, 2 supplementary figures
☆ Insights into the Unknown: Federated Data Diversity Analysis on Molecular Data
AI methods are increasingly shaping pharmaceutical drug discovery. However, their translation to industrial applications remains limited due to their reliance on public datasets, lacking scale and diversity of proprietary pharmaceutical data. Federated learning (FL) offers a promising approach to integrate private data into privacy-preserving, collaborative model training across data silos. This federated data access complicates important data-centric tasks such as estimating dataset diversity, performing informed data splits, and understanding the structure of the combined chemical space. To address this gap, we investigate how well federated clustering methods can disentangle and represent distributed molecular data. We benchmark three approaches, Federated kMeans (Fed-kMeans), Federated Principal Component Analysis combined with Fed-kMeans (Fed-PCA+Fed-kMeans), and Federated Locality-Sensitive Hashing (Fed-LSH), against their centralized counterparts on eight diverse molecular datasets. Our evaluation utilizes both, standard mathematical and a chemistry-informed evaluation metrics, SF-ICF, that we introduce in this work. The large-scale benchmarking combined with an in-depth explainability analysis shows the importance of incorporating domain knowledge through chemistry-informed metrics, and on-client explainability analyses for federated diversity analysis on molecular data.
☆ The Confusing Instance Principle for Online Linear Quadratic Control
We revisit the problem of controlling linear systems with quadratic cost under unknown dynamics with model-based reinforcement learning. Traditional methods like Optimism in the Face of Uncertainty and Thompson Sampling, rooted in multi-armed bandits (MABs), face practical limitations. In contrast, we propose an alternative based on the Confusing Instance (CI) principle, which underpins regret lower bounds in MABs and discrete Markov Decision Processes (MDPs) and is central to the Minimum Empirical Divergence (MED) family of algorithms, known for their asymptotic optimality in various settings. By leveraging the structure of LQR policies along with sensitivity and stability analysis, we develop MED-LQ. This novel control strategy extends the principles of CI and MED beyond small-scale settings. Our benchmarks on a comprehensive control suite demonstrate that MED-LQ achieves competitive performance in various scenarios while highlighting its potential for broader applications in large-scale MDPs.
☆ Optimizing the Unknown: Black Box Bayesian Optimization with Energy-Based Model and Reinforcement Learning NeurIPS 2025
Existing Bayesian Optimization (BO) methods typically balance exploration and exploitation to optimize costly objective functions. However, these methods often suffer from a significant one-step bias, which may lead to convergence towards local optima and poor performance in complex or high-dimensional tasks. Recently, Black-Box Optimization (BBO) has achieved success across various scientific and engineering domains, particularly when function evaluations are costly and gradients are unavailable. Motivated by this, we propose the Reinforced Energy-Based Model for Bayesian Optimization (REBMBO), which integrates Gaussian Processes (GP) for local guidance with an Energy-Based Model (EBM) to capture global structural information. Notably, we define each Bayesian Optimization iteration as a Markov Decision Process (MDP) and use Proximal Policy Optimization (PPO) for adaptive multi-step lookahead, dynamically adjusting the depth and direction of exploration to effectively overcome the limitations of traditional BO methods. We conduct extensive experiments on synthetic and real-world benchmarks, confirming the superior performance of REBMBO. Additional analyses across various GP configurations further highlight its adaptability and robustness.
comment: This paper is accepted by 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ Learning Upper Lower Value Envelopes to Shape Online RL: A Principled Approach
We investigate the fundamental problem of leveraging offline data to accelerate online reinforcement learning - a direction with strong potential but limited theoretical grounding. Our study centers on how to learn and apply value envelopes within this context. To this end, we introduce a principled two-stage framework: the first stage uses offline data to derive upper and lower bounds on value functions, while the second incorporates these learned bounds into online algorithms. Our method extends prior work by decoupling the upper and lower bounds, enabling more flexible and tighter approximations. In contrast to approaches that rely on fixed shaping functions, our envelopes are data-driven and explicitly modeled as random variables, with a filtration argument ensuring independence across phases. The analysis establishes high-probability regret bounds determined by two interpretable quantities, thereby providing a formal bridge between offline pre-training and online fine-tuning. Empirical results on tabular MDPs demonstrate substantial regret reductions compared with both UCBVI and prior methods.
comment: 32 pages, 5 figures
☆ Bi-Level Decision-Focused Causal Learning for Large-Scale Marketing Optimization: Bridging Observational and Experimental Data NeurIPS 2025
Online Internet platforms require sophisticated marketing strategies to optimize user retention and platform revenue -- a classical resource allocation problem. Traditional solutions adopt a two-stage pipeline: machine learning (ML) for predicting individual treatment effects to marketing actions, followed by operations research (OR) optimization for decision-making. This paradigm presents two fundamental technical challenges. First, the prediction-decision misalignment: Conventional ML methods focus solely on prediction accuracy without considering downstream optimization objectives, leading to improved predictive metrics that fail to translate to better decisions. Second, the bias-variance dilemma: Observational data suffers from multiple biases (e.g., selection bias, position bias), while experimental data (e.g., randomized controlled trials), though unbiased, is typically scarce and costly -- resulting in high-variance estimates. We propose Bi-level Decision-Focused Causal Learning (Bi-DFCL) that systematically addresses these challenges. First, we develop an unbiased estimator of OR decision quality using experimental data, which guides ML model training through surrogate loss functions that bridge discrete optimization gradients. Second, we establish a bi-level optimization framework that jointly leverages observational and experimental data, solved via implicit differentiation. This novel formulation enables our unbiased OR estimator to correct learning directions from biased observational data, achieving optimal bias-variance tradeoff. Extensive evaluations on public benchmarks, industrial marketing datasets, and large-scale online A/B tests demonstrate the effectiveness of Bi-DFCL, showing statistically significant improvements over state-of-the-art. Currently, Bi-DFCL has been deployed at Meituan, one of the largest online food delivery platforms in the world.
comment: Accepted by NeurIPS 2025
☆ From Prototypes to Sparse ECG Explanations: SHAP-Driven Counterfactuals for Multivariate Time-Series Multi-class Classification
In eXplainable Artificial Intelligence (XAI), instance-based explanations for time series have gained increasing attention due to their potential for actionable and interpretable insights in domains such as healthcare. Addressing the challenges of explainability of state-of-the-art models, we propose a prototype-driven framework for generating sparse counterfactual explanations tailored to 12-lead ECG classification models. Our method employs SHAP-based thresholds to identify critical signal segments and convert them into interval rules, uses Dynamic Time Warping (DTW) and medoid clustering to extract representative prototypes, and aligns these prototypes to query R-peaks for coherence with the sample being explained. The framework generates counterfactuals that modify only 78% of the original signal while maintaining 81.3% validity across all classes and achieving 43% improvement in temporal stability. We evaluate three variants of our approach, Original, Sparse, and Aligned Sparse, with class-specific performance ranging from 98.9% validity for myocardial infarction (MI) to challenges with hypertrophy (HYP) detection (13.2%). This approach supports near realtime generation (< 1 second) of clinically valid counterfactuals and provides a foundation for interactive explanation platforms. Our findings establish design principles for physiologically-aware counterfactual explanations in AI-based diagnosis systems and outline pathways toward user-controlled explanation interfaces for clinical deployment.
☆ Teaming LLMs to Detect and Mitigate Hallucinations NeurIPS 2025
Recent work has demonstrated state-of-the-art results in large language model (LLM) hallucination detection and mitigation through consistency-based approaches which involve aggregating multiple responses sampled from a single LLM for a given prompt. These approaches help offset limitations stemming from the imperfect data on which LLMs are trained, which includes biases and under-representation of information required at deployment time among other limitations which can lead to hallucinations. We show that extending these single-model consistency methods to combine responses from multiple LLMs with different training data, training schemes and model architectures can result in substantial further improvements in hallucination detection and mitigation capabilities beyond their single-model consistency counterparts. We evaluate this \emph{consortium consistency} approach across many model teams from a pool of 15 LLMs and explore under what conditions it is beneficial to team together different LLMs in this manner. Further, we show that these performance improvements often come with reduced inference costs, offsetting a significant drawback with single-model consistency methods.
comment: Accepted to NeurIPS 2025 workshop on Reliable ML from Unreliable Data
☆ Energy-Efficient and Dequantization-Free Q-LLMs: A Spiking Neural Network Approach to Salient Value Mitigation
In the era of large language models (LLMs), weight-activation quantization helps fit models on edge device by reducing memory and compute bit-widths. However, three challenges persist for energy constrained hardware: (1) even after quantization, multiply-accumulate (MAC) operations remain unavoidable and continue to dominate energy consumption; (2) dequantization (or per-tensor/channel rescaling) introduces extra arithmetic and data movement, increasing latency and energy; (3) uniform parameters bit widths clip salient values-while intra-channel mixed precision is generally impractical on current matrix hardware and memory. In contrast, brain-inspired Spiking Neural Networks (SNNs), owing to their binary spike-based information representation and the Integrate-and-Fire (IF) paradigm, naturally support mixed-precision storage and energy-efficient computation by replacing complex MACs with temporal Accumulate (ACCs). Motivated by this property, we propose SpikeQuant, which selectively applies mixed-precision quantization to activations with salient values and re-encodes them into binary spike counts, thereby enabling dynamic mixed storage of different bitwidths. Furthermore, by embedding the quantization scale into the threshold of the IF mechanism, our approach performs energy-efficient linear transformations on weights and activations while avoiding explicit dequantization. Experimental results demonstrate that SpikeQuant consistently achieves near-FP16 perplexity under W4A4 quantization while reducing energy cost by up to 4.6 times compared to existing methods, highlighting its effectiveness for accurate and energy-efficient LLM deployment.
☆ CARES: Context-Aware Resolution Selector for VLMs
Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens ofter to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce \emph{CARES}-a \textbf{C}ontext-\textbf{A}ware \textbf{R}esolution \textbf{S}elector, a lightweight preprocessing module that, given an image-query pair, predicts the \emph{minimal} sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
☆ Using Non-Expert Data to Robustify Imitation Learning via Offline Reinforcement Learning
Imitation learning has proven effective for training robots to perform complex tasks from expert human demonstrations. However, it remains limited by its reliance on high-quality, task-specific data, restricting adaptability to the diverse range of real-world object configurations and scenarios. In contrast, non-expert data -- such as play data, suboptimal demonstrations, partial task completions, or rollouts from suboptimal policies -- can offer broader coverage and lower collection costs. However, conventional imitation learning approaches fail to utilize this data effectively. To address these challenges, we posit that with right design decisions, offline reinforcement learning can be used as a tool to harness non-expert data to enhance the performance of imitation learning policies. We show that while standard offline RL approaches can be ineffective at actually leveraging non-expert data under the sparse data coverage settings typically encountered in the real world, simple algorithmic modifications can allow for the utilization of this data, without significant additional assumptions. Our approach shows that broadening the support of the policy distribution can allow imitation algorithms augmented by offline RL to solve tasks robustly, showing considerably enhanced recovery and generalization behavior. In manipulation tasks, these innovations significantly increase the range of initial conditions where learned policies are successful when non-expert data is incorporated. Moreover, we show that these methods are able to leverage all collected data, including partial or suboptimal demonstrations, to bolster task-directed policy performance. This underscores the importance of algorithmic techniques for using non-expert data for robust policy learning in robotics.
☆ VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
comment: 8 pages, 6 figures
☆ KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge
The molecular large language models have garnered widespread attention due to their promising potential on molecular applications. However, current molecular large language models face significant limitations in understanding molecules due to inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. To address these challenges, we introduce KnowMol-100K, a large-scale dataset with 100K fine-grained molecular annotations across multiple levels, bridging the gap between molecules and textual descriptions. Additionally, we propose chemically-informative molecular representation, effectively addressing limitations in existing molecular representation strategies. Building upon these innovations, we develop KnowMol, a state-of-the-art multi-modal molecular large language model. Extensive experiments demonstrate that KnowMol achieves superior performance across molecular understanding and generation tasks. GitHub: https://github.com/yzf-code/KnowMol Huggingface: https://hf.co/datasets/yzf1102/KnowMol-100K
☆ ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices
The deployment of Large Language Models (LLMs) on CPU-based edge devices is crucial for enabling on-device intelligence and expanding AI accessibility. However, it remains challenging due to limited memory and computational resources. During edge inference, memory usage and latency are the primary bottlenecks. Although weight quantization can effectively reduce memory consumption, existing hardware-friendly approaches often rely on uniform quantization, which poorly fits weight distributions and incurs high dequantization overhead at low bit widths. To address these limitations, we propose ELUTQ, an efficient quantization framework introducing a novel quantization format, Hierarchical Linear Quantization (HLQ). HLQ better captures the statistical characteristics of weights without increasing the computational cost of Bit-serial LUT-based GEMM operations, thereby eliminating dequantization overhead. It is orthogonal to existing quantization algorithms and can be seamlessly integrated into various quantization pipelines. For efficient on-device deployment, ELUTQ provides optimized CPU kernels for end-to-end inference. Experiments show that for LLaMA3-8B, HLQ reduces perplexity by about 8% at 3-bit and 85% at 2-bit precision under post-training quantization, completing quantization within one hour. With efficient finetuning, HLQ further improves 2-bit performance within two hours. In terms of inference efficiency, our 2-bit LLaMA2-7B achieves over 25 tokens/s on an Apple M2 chip (4 threads, batch size = 1).
comment: 19 pages, 8 figures
☆ Online Two-Stage Submodular Maximization NeurIPS 2025
Given a collection of monotone submodular functions, the goal of Two-Stage Submodular Maximization (2SSM) [Balkanski et al., 2016] is to restrict the ground set so an objective selected u.a.r. from the collection attains a high maximal value, on average, when optimized over the restricted ground set. We introduce the Online Two-Stage Submodular Maximization (O2SSM) problem, in which the submodular objectives are revealed in an online fashion. We study this problem for weighted threshold potential functions, a large and important subclass of monotone submodular functions that includes influence maximization, data summarization, and facility location, to name a few. We design an algorithm that achieves sublinear $(1 - 1/e)^2$-regret under general matroid constraints and $(1 - 1/e)(1-e^{-k}k^k/k!)$-regret in the case of uniform matroids of rank $k$; the latter also yields a state-of-the-art bound for the (offline) 2SSM problem. We empirically validate the performance of our online algorithm with experiments on real datasets.
comment: To appear at NeurIPS 2025
☆ Graph Unlearning Meets Influence-aware Negative Preference Optimization
Recent advancements in graph unlearning models have enhanced model utility by preserving the node representation essentially invariant, while using gradient ascent on the forget set to achieve unlearning. However, this approach causes a drastic degradation in model utility during the unlearning process due to the rapid divergence speed of gradient ascent. In this paper, we introduce \textbf{INPO}, an \textbf{I}nfluence-aware \textbf{N}egative \textbf{P}reference \textbf{O}ptimization framework that focuses on slowing the divergence speed and improving the robustness of the model utility to the unlearning process. Specifically, we first analyze that NPO has slower divergence speed and theoretically propose that unlearning high-influence edges can reduce impact of unlearning. We design an influence-aware message function to amplify the influence of unlearned edges and mitigate the tight topological coupling between the forget set and the retain set. The influence of each edge is quickly estimated by a removal-based method. Additionally, we propose a topological entropy loss from the perspective of topology to avoid excessive information loss in the local structure during unlearning. Extensive experiments conducted on five real-world datasets demonstrate that INPO-based model achieves state-of-the-art performance on all forget quality metrics while maintaining the model's utility. Codes are available at \href{https://github.com/sh-qiangchen/INPO}{https://github.com/sh-qiangchen/INPO}.
☆ A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring
As AI systems approach dangerous capability levels where inability safety cases become insufficient, we need alternative approaches to ensure safety. This paper presents a roadmap for constructing safety cases based on chain-of-thought (CoT) monitoring in reasoning models and outlines our research agenda. We argue that CoT monitoring might support both control and trustworthiness safety cases. We propose a two-part safety case: (1) establishing that models lack dangerous capabilities when operating without their CoT, and (2) ensuring that any dangerous capabilities enabled by a CoT are detectable by CoT monitoring. We systematically examine two threats to monitorability: neuralese and encoded reasoning, which we categorize into three forms (linguistic drift, steganography, and alien reasoning) and analyze their potential drivers. We evaluate existing and novel techniques for maintaining CoT faithfulness. For cases where models produce non-monitorable reasoning, we explore the possibility of extracting a monitorable CoT from a non-monitorable CoT. To assess the viability of CoT monitoring safety cases, we establish prediction markets to aggregate forecasts on key technical milestones influencing their feasibility.
☆ g-DPO: Scalable Preference Optimization for Protein Language Models NeurIPS 2025
Direct Preference Optimization (DPO) is an effective approach for aligning protein language models with experimental design goals. However, DPO faces a scalability bottleneck: the number of possible training pairs grows quadratically with the number of labeled sequences, leading to prohibitive training times even for modestly sized datasets. We introduce g-DPO, a framework that (i) uses sequence space clustering to prune redundant pairs while preserving training signal, and (ii) amortizes likelihood computations with group-based approximations. Across three protein engineering tasks, g-DPO maintains in-silico and in-vitro performance that is statistically indistinguishable from standard DPO, while converging 1.8 to 3.7 times faster, with greater gains expected as the size of the dataset increases.
comment: Accepted at two workshops: FM4LS NeurIPS 2025 (https://nips2025fm4ls.github.io/pages/accepted-paper.html) and MLSB in Copenhagen EurIPS 2025
☆ Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition
Recent work has shown that sample-based Minimum Bayes Risk (MBR) decoding outperforms beam search in text-to-text generation tasks, such as machine translation, text summarization, and image captioning. On the other hand, beam search is the current practice for speech-to-text tasks such as automatic speech recognition (ASR) and Speech Translation (ST). Given that MBR decoding is effective in text-to-text generation tasks, it is reasonable to expect it to also be effective for speech-to-text tasks. In this paper, we evaluate MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models. We observe that the accuracy of MBR decoding outperforms that of beam search in most of the experimental settings we have evaluated. The results show that MBR decoding is a promising method for offline ASR and ST tasks that require high accuracy. The code is available at https://github.com/CyberAgentAILab/mbr-for-asr
☆ HybridEP: Scaling Expert Parallelism to Cross-Datacenter Scenario via Hybrid Expert/Data Transmission
Mixture-of-Experts (MoE) has become a popular architecture for scaling large models. However, the rapidly growing scale outpaces model training on a single DC, driving a shift toward a more flexible, cross-DC training paradigm. Under this, Expert Parallelism (EP) of MoE faces significant scalability issues due to the limited cross-DC bandwidth. Specifically, existing EP optimizations attempt to overlap data communication and computation, which has little benefit in low-bandwidth scenarios due to a much longer data communication time. Therefore, the trends of cross-DC EP scaling is fast becoming a critical roadblock to the continued growth of MoE models. To address this, we propose HybridEP, a modeling-guided framework to optimize EP under constrained bandwidth. Our key idea is to dynamically transform the spatial placement of experts to reduce data communication traffic and frequency, thereby minimizing EP's communication overheads. However, it is non-trivial to find the optimal solution because it complicates the original communication pattern by mixing data and expert communication. We therefore build a stream-based model to determine the optimal transmission ratio. Guided by this, we incorporate two techniques: (1) domain-based partition to construct the mapping between hybrid patterns and specific communication topology at GPU level, and (2) parameter-efficient migration to further refine this topology by reducing expert transmission overhead and enlarging the domain size. Combining all these designs, HybridEP can be considered as a more general EP with better scalability. Experimental results show that HybridEP outperforms existing state-of-the-art MoE training systems by up to 5.6x under constrained bandwidth. We further compare HybridEP and EP on large-scale simulations. HybridEP achieves up to 1.45x speedup with 1k DCs under different bandwidths.
☆ PCP-GAN: Property-Constrained Pore-scale image reconstruction via conditional Generative Adversarial Networks
Obtaining truly representative pore-scale images that match bulk formation properties remains a fundamental challenge in subsurface characterization, as natural spatial heterogeneity causes extracted sub-images to deviate significantly from core-measured values. This challenge is compounded by data scarcity, where physical samples are only available at sparse well locations. This study presents a multi-conditional Generative Adversarial Network (cGAN) framework that generates representative pore-scale images with precisely controlled properties, addressing both the representativeness challenge and data availability constraints. The framework was trained on thin section samples from four depths (1879.50-1943.50 m) of a carbonate formation, simultaneously conditioning on porosity values and depth parameters within a single unified model. This approach captures both universal pore network principles and depth-specific geological characteristics, from grainstone fabrics with interparticle-intercrystalline porosity to crystalline textures with anhydrite inclusions. The model achieved exceptional porosity control (R^2=0.95) across all formations with mean absolute errors of 0.0099-0.0197. Morphological validation confirmed preservation of critical pore network characteristics including average pore radius, specific surface area, and tortuosity, with statistical differences remaining within acceptable geological tolerances. Most significantly, generated images demonstrated superior representativeness with dual-constraint errors of 1.9-11.3% compared to 36.4-578% for randomly extracted real sub-images. This capability provides transformative tools for subsurface characterization, particularly valuable for carbon storage, geothermal energy, and groundwater management applications where knowing the representative morphology of the pore space is critical for implementing digital rock physics.
☆ Exploring "Many in Few" and "Few in Many" Properties in Long-Tailed, Highly-Imbalanced IC Defect Classification
Despite significant advancements in deep classification techniques and in-lab automatic optical inspection models for long-tailed or highly imbalanced data, applying these approaches to real-world IC defect classification tasks remains challenging. This difficulty stems from two primary factors. First, real-world conditions, such as the high yield-rate requirements in the IC industry, result in data distributions that are far more skewed than those found in general public imbalanced datasets. Consequently, classifiers designed for open imbalanced datasets often fail to perform effectively in real-world scenarios. Second, real-world samples exhibit a mix of class-specific attributes and class-agnostic, domain-related features. This complexity adds significant difficulty to the classification process, particularly for highly imbalanced datasets. To address these challenges, this paper introduces the IC-Defect-14 dataset, a large, highly imbalanced IC defect image dataset sourced from AOI systems deployed in real-world IC production lines. This dataset is characterized by its unique "intra-class clusters" property, which presents two major challenges: large intra-class diversity and high inter-class similarity. These characteristics, rarely found simultaneously in existing public datasets, significantly degrade the performance of current state-of-the-art classifiers for highly imbalanced data. To tackle this challenge, we propose ReCAME-Net, which follows a multi-expert classifier framework and integrates a regional channel attention module, metric learning losses, a hard category mining strategy, and a knowledge distillation procedure. Extensive experimental evaluations demonstrate that ReCAME-Net outperforms previous state-of-the-art models on the IC-Defect-14 dataset while maintaining comparable performance and competitiveness on general public datasets.
☆ Universal Quantitative Abstraction: Categorical Duality and Logical Completeness for Probabilistic Systems
A unified theory of quantitative abstraction is presented for probabilistic systems that links category theory, optimal transport, and quantitative modal logic. At its core is a canonical $ \varepsilon $-quotient endowed with a universal property: among all $ \varepsilon $-abstractions, it is the most informative one that respects a prescribed bound on value loss. This construction induces an adjunction between abstraction and realization functors $ (Q_{\varepsilon} \dashv R_{\varepsilon}) $, established via the Special Adjoint Functor Theorem, revealing a categorical duality between metric structure and logical semantics. A behavioral pseudometric is characterized as the unique fixed point of a Bellman-style operator, with contraction and Lipschitz properties proved in a coalgebraic setting. A quantitative modal $ \mu $-calculus is introduced and shown to be expressively complete for logically representable systems, so that behavioral distance coincides with maximal logical deviation. Compositionality under interface refinement is analyzed, clarifying how abstractions interact across system boundaries. An exact validation suite on finite Markov decision processes corroborates the contraction property, value-loss bounds, stability under perturbation, adversarial distinguishability, and scalability, demonstrating both robustness and computational feasibility. The resulting framework provides principled targets for state aggregation and representation learning, with mathematically precise guarantees for value-function approximation in stochastic domains.
☆ Revisiting the Relation Between Robustness and Universality
The modified universality hypothesis proposed by Jones et al. (2022) suggests that adversarially robust models trained for a given task are highly similar. We revisit the hypothesis and test its generality. While we verify Jones' main claim of high representational similarity in specific settings, results are not consistent across different datasets. We also discover that predictive behavior does not converge with increasing robustness and thus is not universal. We find that differing predictions originate in the classification layer, but show that more universal predictive behavior can be achieved with simple retraining of the classifiers. Overall, our work points towards partial universality of neural networks in specific settings and away from notions of strict universality.
☆ Neural Variational Dropout Processes ICLR
Learning to infer the conditional posterior model is a key step for robust meta-learning. This paper presents a new Bayesian meta-learning approach called Neural Variational Dropout Processes (NVDPs). NVDPs model the conditional posterior distribution based on a task-specific dropout; a low-rank product of Bernoulli experts meta-model is utilized for a memory-efficient mapping of dropout rates from a few observed contexts. It allows for a quick reconfiguration of a globally learned and shared neural network for new tasks in multi-task few-shot learning. In addition, NVDPs utilize a novel prior conditioned on the whole task data to optimize the conditional \textit{dropout} posterior in the amortized variational inference. Surprisingly, this enables the robust approximation of task-specific dropout rates that can deal with a wide range of functional ambiguities and uncertainties. We compared the proposed method with other meta-learning approaches in the few-shot learning tasks such as 1D stochastic regression, image inpainting, and classification. The results show the excellent performance of NVDPs.
comment: Accepted as a Poster at International Conference on Learning Representations (ICLR) 2022 (Apr 25-29, 2022)
☆ LLM Unlearning with LLM Beliefs
Large language models trained on vast corpora inherently risk memorizing sensitive or harmful content, which may later resurface in their outputs. Prevailing unlearning methods generally rely on gradient ascent and its variants to lower the probability of specific target responses. However, we find that this strategy induces a critical side effect: probability mass is redistributed into high-likelihood regions, often corresponding to semantically related rephrasings of the targets. We refer to this as the squeezing effect, which explains why many methods yield merely spurious unlearning, a problem further obscured by automated metrics (e.g., ROUGE, truth ratio) that misreport actual success. To address this, we propose a bootstrapping (BS) framework that explicitly links the squeezing effect with the model's own high-confidence generations, namely its model beliefs. Since model beliefs inherently capture the very high-likelihood regions where probability mass is squeezed, incorporating them into the unlearning objective directly counters the squeezing effect. By jointly suppressing both target responses and model beliefs, BS-T (token) attenuates high-probability tokens, whereas BS-S (sequence) removes entire high-confidence generations, together achieving more thorough forgetting while preserving utility. Extensive experiments across diverse benchmarks with various model families confirm the effectiveness of our approach.
☆ FairNet: Dynamic Fairness Correction without Performance Loss via Contrastive Conditional LoRA
Ensuring fairness in machine learning models is a critical challenge. Existing debiasing methods often compromise performance, rely on static correction strategies, and struggle with data sparsity, particularly within minority groups. Furthermore, their utilization of sensitive attributes is often suboptimal, either depending excessively on complete attribute labeling or disregarding these attributes entirely. To overcome these limitations, we propose FairNet, a novel framework for dynamic, instance-level fairness correction. FairNet integrates a bias detector with conditional low-rank adaptation (LoRA), which enables selective activation of the fairness correction mechanism exclusively for instances identified as biased, and thereby preserve performance on unbiased instances. A key contribution is a new contrastive loss function for training the LoRA module, specifically designed to minimize intra-class representation disparities across different sensitive groups and effectively address underfitting in minority groups. The FairNet framework can flexibly handle scenarios with complete, partial, or entirely absent sensitive attribute labels. Theoretical analysis confirms that, under moderate TPR/FPR for the bias detector, FairNet can enhance the performance of the worst group without diminishing overall model performance, and potentially yield slight performance improvements. Comprehensive empirical evaluations across diverse vision and language benchmarks validate the effectiveness of FairNet.
☆ Monitoring LLM-based Multi-Agent Systems Against Corruptions via Node Evaluation
Large Language Model (LLM)-based Multi-Agent Systems (MAS) have become a popular paradigm of AI applications. However, trustworthiness issues in MAS remain a critical concern. Unlike challenges in single-agent systems, MAS involve more complex communication processes, making them susceptible to corruption attacks. To mitigate this issue, several defense mechanisms have been developed based on the graph representation of MAS, where agents represent nodes and communications form edges. Nevertheless, these methods predominantly focus on static graph defense, attempting to either detect attacks in a fixed graph structure or optimize a static topology with certain defensive capabilities. To address this limitation, we propose a dynamic defense paradigm for MAS graph structures, which continuously monitors communication within the MAS graph, then dynamically adjusts the graph topology, accurately disrupts malicious communications, and effectively defends against evolving and diverse dynamic attacks. Experimental results in increasingly complex and dynamic MAS environments demonstrate that our method significantly outperforms existing MAS defense mechanisms, contributing an effective guardrail for their trustworthy applications. Our code is available at https://github.com/ChengcanWu/Monitoring-LLM-Based-Multi-Agent-Systems.
☆ From See to Shield: ML-Assisted Fine-Grained Access Control for Visual Data
As the volume of stored data continues to grow, identifying and protecting sensitive information within large repositories becomes increasingly challenging, especially when shared with multiple users with different roles and permissions. This work presents a system architecture for trusted data sharing with policy-driven access control, enabling selective protection of sensitive regions while maintaining scalability. The proposed architecture integrates four core modules that combine automated detection of sensitive regions, post-correction, key management, and access control. Sensitive regions are secured using a hybrid scheme that employs symmetric encryption for efficiency and Attribute-Based Encryption for policy enforcement. The system supports efficient key distribution and isolates key storage to strengthen overall security. To demonstrate its applicability, we evaluate the system on visual datasets, where Privacy-Sensitive Objects in images are automatically detected, reassessed, and selectively encrypted prior to sharing in a data repository. Experimental results show that our system provides effective PSO detection, increases macro-averaged F1 score (5%) and mean Average Precision (10%), and maintains an average policy-enforced decryption time of less than 1 second per image. These results demonstrate the effectiveness, efficiency and scalability of our proposed solution for fine-grained access control.
comment: 10 pages, 3 figures, 6 tables. In submission
☆ Iterative Training of Physics-Informed Neural Networks with Fourier-enhanced Features
Spectral bias, the tendency of neural networks to learn low-frequency features first, is a well-known issue with many training algorithms for physics-informed neural networks (PINNs). To overcome this issue, we propose IFeF-PINN, an algorithm for iterative training of PINNs with Fourier-enhanced features. The key idea is to enrich the latent space using high-frequency components through Random Fourier Features. This creates a two-stage training problem: (i) estimate a basis in the feature space, and (ii) perform regression to determine the coefficients of the enhanced basis functions. For an underlying linear model, it is shown that the latter problem is convex, and we prove that the iterative training scheme converges. Furthermore, we empirically establish that Random Fourier Features enhance the expressive capacity of the network, enabling accurate approximation of high-frequency PDEs. Through extensive numerical evaluation on classical benchmark problems, the superior performance of our method over state-of-the-art algorithms is shown, and the improved approximation across the frequency domain is illustrated.
comment: 9 pages, 5 figures in the main paper
☆ ARA: Adaptive Rank Allocation for Efficient Large Language Model SVD Compression
In the field of large language model (LLM) compression, singular value decomposition (SVD) is a widely studied and adopted low-rank decomposition technique. Since SVD operates exclusively on linear modules, and these modules in LLMs are separated by nonlinear components, SVD can only be applied independently to each linear module. Under a global compression ratio constraint, determining the appropriate rank for different linear modules becomes a critical problem. Existing approaches, such as heuristic algorithms and mask-based training, have made progress in addressing this challenge. However, these methods still suffer from several limitations: heuristic algorithms explore the solution space within restricted regions, while mask-based training struggles to efficiently capture the relationship between singular value spectra and trainable parameters. More importantly, current methods overlook the key property that the gain function is non-smooth at a compression ratio of 1, which often leads the training process to suboptimal local minima. To address these issues, we propose an Adaptive Rank Allocation (ARA) method. Specifically, (1) ARA introduces a dedicated mask design that enables efficient mapping and updating between retained ranks and trainable parameters; and (2) it employs an additional loss function to guide parameter selection toward globally optimal solutions. Experimental results demonstrate that ARA achieves state-of-the-art performance. On the LLaMA2-7B model with a 80\% compression ratio, ARA reduces perplexity on WikiText2 from 8.38 to 6.42 and improves average zero-shot task accuracy by 9.72 percentage points compared with uniform compression. These results highlight the effectiveness of our method for rank allocation in SVD-based LLM compression.
☆ CPSVD: Enhancing Large Language Model Compression via Column-Preserving Singular Value Decomposition
The rapid advancement of Large Language Models (LLMs) faces a critical bottleneck in their immense size, necessitating efficient compression techniques. While Singular Value Decomposition (SVD) is a promising approach, existing SVD-based methods treat the entire parameter matrix uniformly, overlooking that SVD approximation errors vary significantly across different matrix parts, which often leads to suboptimal compression. To address this, we propose \textbf{C}olumn-\textbf{P}reserving \textbf{S}ingular \textbf{V}alue \textbf{D}ecomposition (CPSVD), a novel method that refines SVD-based LLM compression by intelligently segmenting the parameter matrix. Unlike traditional SVD, CPSVD identifies and directly preserves matrix columns with high decomposition errors, applying SVD only to columns with low decomposition errors, while precisely determining the optimal balance point between these two strategies to minimize error. Furthermore, leveraging the inherent heterogeneity in decomposition errors across different matrices within an LLM, CPSVD adaptively allocates non-uniform compression rates to modules within that layer, while adhering to a target layer-wise compression ratio, thereby further enhancing compression performance. Extensive experiments demonstrate that CPSVD consistently outperforms state-of-the-art SVD-based LLM compression methods, achieving lower perplexity and higher accuracy on zero-shot tasks.
☆ Learning Noise-Resilient and Transferable Graph-Text Alignment via Dynamic Quality Assessment
Pre-training Graph Foundation Models (GFMs) on text-attributed graphs (TAGs) is central to web-scale applications such as search, recommendation, and knowledge discovery. However, existing CLIP-style graph-text aligners face two key limitations: they assume strict one-to-one correspondences between nodes and texts, overlooking the inherent many-to-many relations in real-world graphs; and they rely on static alignment objectives that cannot adapt to varying data quality, making them brittle under noisy supervision. Together, these limitations expose a core dilemma: embracing expressive many-to-many alignment amplifies noise, while reverting to strict one-to-one strategies sacrifices semantic diversity and fails to handle inherently mismatched pairs. To address these challenges, we propose ADAligner, a dynamic, quality-aware graph-text alignment framework that dynamically adjusts between expressive many-to-many and conservative one-to-one objectives according to supervision quality. ADAligner estimates batch-level alignment reliability in real time and adapts its optimization accordingly, promoting soft, subgraph-level many-to-many alignment when supervision is clean, while emphasizing reliable one-to-one alignment by dynamically filtering low-confidence pairs under noise. Theoretically, we prove that this dynamic mechanism forms a stable negative feedback process, ensuring convergence and robustness. Comprehensive experiments on nine diverse TAG datasets demonstrate that ADAligner consistently outperforms prior graph-text aligners on zero-/few-shot node classification, link prediction and cross-modal retrieval tasks. It maintains strong robustness under noisy supervision and accelerates pre-training by approximately 2 to 3 times compared to multimodal baselines, establishing a scalable and reliable foundation for graph-text representation learning in real-world web environments.
☆ LMFD: Latent Monotonic Feature Discovery
Many systems in our world age, degrade or otherwise move slowly but steadily in a certain direction. When monitoring such systems by means of sensors, one often assumes that some form of `age' is latently present in the data, but perhaps the available sensors do not readily provide this useful information. The task that we study in this paper is to extract potential proxies for this `age' from the available multi-variate time series without having clear data on what `age' actually is. We argue that when we find a sensor, or more likely some discovered function of the available sensors, that is sufficiently monotonic, that function can act as the proxy we are searching for. Using a carefully defined grammar and optimising the resulting equations in terms of monotonicity, defined as the absolute Spearman's Rank Correlation between time and the candidate formula, the proposed approach generates a set of candidate features which are then fitted and assessed on monotonicity. The proposed system is evaluated against an artificially generated dataset and two real-world datasets. In all experiments, we show that the system is able to combine sensors with low individual monotonicity into latent features with high monotonicity. For the real-world dataset of InfraWatch, a structural health monitoring project, we show that two features with individual absolute Spearman's $\rho$ values of $0.13$ and $0.09$ can be combined into a proxy with an absolute Spearman's $\rho$ of $0.95$. This demonstrates that our proposed method can find interpretable equations which can serve as a proxy for the `age' of the system.
comment: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in Machine Learning and Principles and Practice of Knowledge Discovery in Databases, and is available online at https://doi.org/10.1007/978-3-031-74633-8_2
☆ A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond
Understanding the dynamics of feature learning in neural networks (NNs) remains a significant challenge. The work of (Mousavi-Hosseini et al., 2023) analyzes a multiple index teacher-student setting and shows that a two-layer student attains a low-rank structure in its first-layer weights when trained with stochastic gradient descent (SGD) and a strong regularizer. This structural property is known to reduce sample complexity of generalization. Indeed, in a second step, the same authors establish algorithm-specific learning guarantees under additional assumptions. In this paper, we focus exclusively on the structure discovery aspect and study it under weaker assumptions, more specifically: we allow (a) NNs of arbitrary size and depth, (b) with all parameters trainable, (c) under any smooth loss function, (d) tiny regularization, and (e) trained by any method that attains a second-order stationary point (SOSP), e.g.\ perturbed gradient descent (PGD). At the core of our approach is a key $\textit{derandomization}$ lemma, which states that optimizing the function $\mathbb{E}_{\mathbf{x}} \left[g_{\theta}(\mathbf{W}\mathbf{x} + \mathbf{b})\right]$ converges to a point where $\mathbf{W} = \mathbf{0}$, under mild conditions. The fundamental nature of this lemma directly explains structure discovery and has immediate applications in other domains including an end-to-end approximation for MAXCUT, and computing Johnson-Lindenstrauss embeddings.
♻ ☆ Graph Representation Learning with Diffusion Generative Models
Diffusion models have established themselves as state-of-the-art generative models across various data modalities, including images and videos, due to their ability to accurately approximate complex data distributions. Unlike traditional generative approaches such as VAEs and GANs, diffusion models employ a progressive denoising process that transforms noise into meaningful data over multiple iterative steps. This gradual approach enhances their expressiveness and generation quality. Not only that, diffusion models have also been shown to extract meaningful representations from data while learning to generate samples. Despite their success, the application of diffusion models to graph-structured data remains relatively unexplored, primarily due to the discrete nature of graphs, which necessitates discrete diffusion processes distinct from the continuous methods used in other domains. In this work, we leverage the representational capabilities of diffusion models to learn meaningful embeddings for graph data. By training a discrete diffusion model within an autoencoder framework, we enable both effective autoencoding and representation learning tailored to the unique characteristics of graph-structured data. We extract the representation from the combination of the encoder's output and the decoder's first time step hidden embedding. Our approach demonstrates the potential of discrete diffusion models to be used for graph representation learning. The code can be found at https://github.com/DanielMitiku/Graph-Representation-Learning-with-Diffusion-Generative-Models
♻ ☆ LoRA vs Full Fine-tuning: An Illusion of Equivalence
Fine-tuning is a crucial paradigm for adapting pre-trained large language models to downstream tasks. Recently, methods like Low-Rank Adaptation (LoRA) have been shown to effectively fine-tune LLMs with an extreme reduction in trainable parameters. But, \emph{are their learned solutions really equivalent?} We study how LoRA and full-finetuning change pre-trained models by analyzing the model's weight matrices through the lens of their spectral properties. We find that LoRA and full fine-tuning yield weight matrices whose singular value decompositions exhibit very different structure: weight matrices trained with LoRA have new, high-ranking singular vectors, which we call \emph{intruder dimensions}, while those trained with full fine-tuning do not. Further, we extend the finding that LoRA forgets less than full fine-tuning and find its forgetting is vastly localized to the intruder dimension -- by causally intervening on the intruder dimensions by changing their associated singular values post-fine-tuning, we show that they cause forgetting. Moreover, scaling them down significantly improves modeling of the pre-training distribution with a minimal drop in downstream task performance. Given this, we should expect accumulating intruder dimensions to be harmful and lead to more forgetting. This will be amplified during continual learning because of sequentially fine-tuning, and we show that LoRA models do accumulate intruder dimensions here tend to perform worse in this setting, emphasizing the practicality of our findings.
♻ ☆ Learning Reward Machines from Partially Observed Policies
Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy or demonstrations by an expert. In this work, it is assumed that the reward is expressed as a reward machine whose transitions depend on atomic propositions associated with the state of a Markov Decision Process (MDP). Our goal is to identify the true reward machine using finite information. To this end, we first introduce the notion of a prefix tree policy which associates a distribution of actions to each state of the MDP and each attainable finite sequence of atomic propositions. Then, we characterize an equivalence class of reward machines that can be identified given the prefix tree policy. Finally, we propose a SAT-based algorithm that uses information extracted from the prefix tree policy to solve for a reward machine. It is proved that if the prefix tree policy is known up to a sufficient (but finite) depth, our algorithm recovers the exact reward machine up to the equivalence class. This sufficient depth is derived as a function of the number of MDP states and (an upper bound on) the number of states of the reward machine. These results are further extended to the case where we only have access to demonstrations from an optimal policy. Several examples, including discrete grid and block worlds, a continuous state-space robotic arm, and real data from experiments with mice, are used to demonstrate the effectiveness and generality of the approach.
♻ ☆ Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM
Large Language Models are typically trained on datasets collected from the web, which may inadvertently contain harmful or sensitive personal information. To address growing privacy concerns, unlearning methods have been proposed to remove the influence of specific data from trained models. Of these, exact unlearning -- which retrains the model from scratch without the target data -- is widely regarded the gold standard for mitigating privacy risks in deployment. In this paper, we revisit this assumption in a practical deployment setting where both the pre- and post-unlearning logits API are exposed, such as in open-weight scenarios. Targeting this setting, we introduce a novel data extraction attack that leverages signals from the pre-unlearning model to guide the post-unlearning model, uncovering patterns that reflect the removed data distribution. Combining model guidance with a token filtering strategy, our attack significantly improves extraction success rates -- doubling performance in some cases -- across common benchmarks such as MUSE, TOFU, and WMDP. Furthermore, we demonstrate our attack's effectiveness on a simulated medical diagnosis dataset to highlight real-world privacy risks associated with exact unlearning. In light of our findings, which suggest that unlearning may, in a contradictory way, increase the risk of privacy leakage during real-world deployments, we advocate for evaluation of unlearning methods to consider broader threat models that account not only for post-unlearning models but also for adversarial access to prior checkpoints. Code is publicly available at: https://github.com/Nicholas0228/unlearned_data_extraction_llm.
comment: Accepted by Neurips 2025
♻ ☆ Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?
Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.
comment: Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
♻ ☆ Provably Efficient Reward Transfer in Reinforcement Learning with Discrete Markov Decision Processes
In this paper, we propose a new solution to reward adaptation (RA) in reinforcement learning, where the agent adapts to a target reward function based on one or more existing source behaviors learned a priori under the same domain dynamics but different reward functions. While learning the target behavior from scratch is possible, it is often inefficient given the available source behaviors. Our work introduces a new approach to RA through the manipulation of Q-functions. Assuming the target reward function is a known function of the source reward functions, we compute bounds on the Q-function and present an iterative process (akin to value iteration) to tighten these bounds. Such bounds enable action pruning in the target domain before learning even starts. We refer to this method as "Q-Manipulation" (Q-M). The iteration process assumes access to a lite-model, which is easy to provide or learn. We formally prove that Q-M, under discrete domains, does not affect the optimality of the returned policy and show that it is provably efficient in terms of sample complexity in a probabilistic sense. Q-M is evaluated in a variety of synthetic and simulation domains to demonstrate its effectiveness, generalizability, and practicality.
♻ ☆ QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training NeurIPS 2025
Clinical decision-making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision-centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. QoQ-Med is trained with Domain-aware Relative Policy Optimization (DRPO), a novel reinforcement-learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, mitigating performance imbalance caused by skewed clinical data distributions. Trained on 2.61 million instruction tuning pairs spanning 9 clinical domains, we show that DRPO training boosts diagnostic performance by 43% in macro-F1 on average across all visual domains as compared to other critic-free training methods like GRPO. Furthermore, with QoQ-Med trained on intensive segmentation data, it is able to highlight salient regions related to the diagnosis, with an IoU 10x higher than open models while reaching the performance of OpenAI o4-mini. To foster reproducibility and downstream research, we release (i) the full model weights, (ii) the modular training pipeline, and (iii) all intermediate reasoning traces at https://github.com/DDVD233/QoQ_Med.
comment: Accepted as Oral at NeurIPS 2025. Revision after camera ready
♻ ☆ Information-Theoretic Decentralized Secure Aggregation with Collusion Resilience
In decentralized federated learning (FL), multiple clients collaboratively learn a shared machine learning (ML) model by leveraging their privately held datasets distributed across the network, through interactive exchange of the intermediate model updates. To ensure data security, cryptographic techniques are commonly employed to protect model updates during aggregation. Despite growing interest in secure aggregation, existing works predominantly focus on protocol design and computational guarantees, with limited understanding of the fundamental information-theoretic limits of such systems. Moreover, optimal bounds on communication and key usage remain unknown in decentralized settings, where no central aggregator is available. Motivated by these gaps, we study the problem of decentralized secure aggregation (DSA) from an information-theoretic perspective. Specifically, we consider a network of $K$ fully-connected users, each holding a private input -- an abstraction of local training data -- who aim to securely compute the sum of all inputs. The security constraint requires that no user learns anything beyond the input sum, even when colluding with up to $T$ other users. We characterize the optimal rate region, which specifies the minimum achievable communication and secret key rates for DSA. In particular, we show that to securely compute one symbol of the desired input sum, each user must (i) transmit at least one symbol to others, (ii) hold at least one symbol of secret key, and (iii) all users must collectively hold no fewer than $K - 1$ independent key symbols. Our results establish the fundamental performance limits of DSA, providing insights for the design of provably secure and communication-efficient protocols in distributed learning systems.
comment: Submitted to IEEE for potential journal publication
♻ ☆ Diffusion-Based Hierarchical Graph Neural Networks for Simulating Nonlinear Solid Mechanics
Graph-based learned simulators have emerged as a promising approach for simulating physical systems on unstructured meshes, offering speed and generalization across diverse geometries. However, they often struggle with capturing global phenomena, such as bending or long-range correlations usually occurring in solid mechanics, and suffer from error accumulation over long rollouts due to their reliance on local message passing and direct next-step prediction. We address these limitations by introducing the Rolling Diffusion-Batched Inference Network (ROBIN), a novel learned simulator that integrates two key innovations: (i) Rolling Diffusion-Batched Inference (ROBI), a parallelized inference scheme that amortizes the cost of diffusion-based refinement across physical time steps by overlapping denoising steps across a temporal window. (ii) A Hierarchical Graph Neural Network built on algebraic multigrid coarsening, enabling multiscale message passing across different mesh resolutions. This architecture, implemented via Algebraic-hierarchical Message Passing Networks, captures both fine-scale local dynamics and global structural effects critical for phenomena like beam bending or multi-body contact. We validate ROBIN on challenging 2D and 3D solid mechanics benchmarks involving geometric, material, and contact nonlinearities. ROBIN achieves state-of-the-art accuracy on all tasks, substantially outperforming existing next-step learned simulators while reducing inference time by up to an order of magnitude compared to standard diffusion simulators.
♻ ☆ Online Conformal Prediction with Efficiency Guarantees
We study the problem of conformal prediction in a novel online framework that directly optimizes efficiency. In our problem, we are given a target miscoverage rate $\alpha > 0$, and a time horizon $T$. On each day $t \le T$ an algorithm must output an interval $I_t \subseteq [0, 1]$, then a point $y_t \in [0, 1]$ is revealed. The goal of the algorithm is to achieve coverage, that is, $y_t \in I_t$ on (close to) a $(1 - \alpha)$-fraction of days, while maintaining efficiency, that is, minimizing the average volume (length) of the intervals played. This problem is an online analogue to the problem of constructing efficient confidence intervals. We study this problem over arbitrary and exchangeable (random order) input sequences. For exchangeable sequences, we show that it is possible to construct intervals that achieve coverage $(1 - \alpha) - o(1)$, while having length upper bounded by the best fixed interval that achieves coverage in hindsight. For arbitrary sequences however, we show that any algorithm that achieves a $\mu$-approximation in average length compared to the best fixed interval achieving coverage in hindsight, must make a multiplicative factor more mistakes than $\alpha T$, where the multiplicative factor depends on $\mu$ and the aspect ratio of the problem. Our main algorithmic result is a matching algorithm that can recover all Pareto-optimal settings of $\mu$ and number of mistakes. Furthermore, our algorithm is deterministic and therefore robust to an adaptive adversary. This gap between the exchangeable and arbitrary settings is in contrast to the classical online learning problem. In fact, we show that no single algorithm can simultaneously be Pareto-optimal for arbitrary sequences and optimal for exchangeable sequences. On the algorithmic side, we give an algorithm that achieves the near-optimal tradeoff between the two cases.
comment: To appear SODA 2026. Minor edits from previous posted version
♻ ☆ Training-Free Constrained Generation With Stable Diffusion Models NeurIPS 2025
Stable diffusion models represent the state-of-the-art in data synthesis across diverse domains and hold transformative potential for applications in science and engineering, e.g., by facilitating the discovery of novel solutions and simulating systems that are computationally intractable to model explicitly. While there is increasing effort to incorporate physics-based constraints into generative models, existing techniques are either limited in their applicability to latent diffusion frameworks or lack the capability to strictly enforce domain-specific constraints. To address this limitation this paper proposes a novel integration of stable diffusion models with constrained optimization frameworks, enabling the generation of outputs satisfying stringent physical and functional requirements. The effectiveness of this approach is demonstrated through material design experiments requiring adherence to precise morphometric properties, challenging inverse design tasks involving the generation of materials inducing specific stress-strain responses, and copyright-constrained content generation tasks. All code has been released at https://github.com/RAISELab-atUVA/Constrained-Stable-Diffusion.
comment: Spotlight at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ gLSTM: Mitigating Over-Squashing by Increasing Storage Capacity
Graph Neural Networks (GNNs) leverage the graph structure to transmit information between nodes, typically through the message-passing mechanism. While these models have found a wide variety of applications, they are known to suffer from over-squashing, where information from a large receptive field of node representations is collapsed into a single fixed sized vector, resulting in an information bottleneck. In this paper, we re-examine the over-squashing phenomenon through the lens of model storage and retrieval capacity, which we define as the amount of information that can be stored in a node's representation for later use. We study some of the limitations of existing tasks used to measure over-squashing and introduce a new synthetic task to demonstrate that an information bottleneck can saturate this capacity. Furthermore, we adapt ideas from the sequence modeling literature on associative memories, fast weight programmers, and the xLSTM model to develop a novel GNN architecture with improved capacity. We demonstrate strong performance of this architecture both on our capacity synthetic task, as well as a range of real-world graph benchmarks.
comment: 23 pages, 22 figures, 7 tables. v2: clarified over-squashing separation in light of related work
♻ ☆ Phase-driven Domain Generalizable Learning for Nonstationary Time Series
Pattern recognition is a fundamental task in continuous sensing applications, but real-world scenarios often experience distribution shifts that necessitate learning generalizable representations for such tasks. This challenge is exacerbated with time-series data, which also exhibit inherent nonstationarity--variations in statistical and spectral properties over time. In this work, we offer a fresh perspective on learning generalizable representations for time-series classification by considering the phase information of a signal as an approximate proxy for nonstationarity and propose a phase-driven generalizable representation learning framework for time-series classification, PhASER. It consists of three key elements: 1) Hilbert transform-based augmentation, which diversifies nonstationarity while preserving task-specific discriminatory semantics, 2) separate magnitude-phase encoding, viewing time-varying magnitude and phase as independent modalities, and 3) phase-residual feature broadcasting, integrating 2D phase features with a residual connection to the 1D signal representation, providing inherent regularization to improve distribution-invariant learning. Extensive evaluations on five datasets from sleep-stage classification, human activity recognition, and gesture recognition against 13 state-of-the-art baseline methods demonstrate that PhASER consistently outperforms the best baselines by an average of 5% and up to 11% in some cases. Additionally, the principles of PhASER can be broadly applied to enhance the generalizability of existing time-series representation learning models.
comment: TMLR 2025
♻ ☆ Interpretable Features for the Assessment of Neurodegenerative Diseases through Handwriting Analysis
Motor dysfunction is a common sign of neurodegenerative diseases (NDs) such as Parkinson's disease (PD) and Alzheimer's disease (AD), but may be difficult to detect, especially in the early stages. In this work, we examine the behavior of a wide array of interpretable features extracted from the handwriting signals of 113 subjects performing multiple tasks on a digital tablet, as part of the Neurological Signals dataset. The aim is to measure their effectiveness in characterizing NDs, including AD and PD. To this end, task-agnostic and task-specific features are extracted from 14 distinct tasks. Subsequently, through statistical analysis and a series of classification experiments, we investigate which features provide greater discriminative power between NDs and healthy controls and amongst different NDs. Preliminary results indicate that the tasks at hand can all be effectively leveraged to distinguish between the considered set of NDs, specifically by measuring the stability, the speed of writing, the time spent not writing, and the pressure variations between groups from our handcrafted interpretable features, which shows a statistically significant difference between groups, across multiple tasks. Using various binary classification algorithms on the computed features, we obtain up to 87% accuracy for the discrimination between AD and healthy controls (CTL), and up to 69% for the discrimination between PD and CTL.
comment: pages including references, accepted in IEEE JHBI
♻ ☆ Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
Recent advances in Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. Despite the encouraging results, a fundamental dilemma persists as RL improvement relies on learning from high-quality samples, yet the exploration for such samples remains bounded by the inherent limitations of LLMs. This, in effect, creates an undesirable cycle in which what cannot be explored cannot be learned. In this work, we propose Rubric-Scaffolded Reinforcement Learning (RuscaRL), a novel instructional scaffolding framework designed to break the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, where different rubrics are provided as external guidance within task instructions to steer diverse high-quality responses. This guidance is gradually decayed over time, encouraging the model to internalize the underlying reasoning patterns; (2) verifiable rewards for exploitation during model training, where we can obtain robust LLM-as-a-Judge scores using rubrics as references, enabling effective RL on general reasoning tasks. Extensive experiments demonstrate the superiority of the proposed RuscaRL across various benchmarks, effectively expanding reasoning boundaries under the Best-of-N evaluation. Notably, RuscaRL significantly boosts Qwen2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1. Furthermore, our fine-tuned variant on Qwen3-30B-A3B-Instruct achieves 61.1 on HealthBench-500, outperforming leading LLMs including OpenAI-o3. Our code is available at https://github.com/IANNXANG/RuscaRL.
♻ ☆ SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
Rigorous security-focused evaluation of large language model (LLM) agents is imperative for establishing trust in their safe deployment throughout the software development lifecycle. However, existing benchmarks largely rely on synthetic challenges or simplified vulnerability datasets that fail to capture the complexity and ambiguity encountered by security engineers in practice. We introduce SEC-bench, the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks. SEC-bench employs a novel multi-agent scaffold that automatically constructs code repositories with harnesses, reproduces vulnerabilities in isolated environments, and generates gold patches for reliable evaluation. Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only $0.87 per instance. Using SEC-bench, we implement two critical software security tasks to rigorously evaluate LLM agents' capabilities: proof-of-concept (PoC) generation and vulnerability patching. A comprehensive evaluation of state-of-the-art LLM code agents reveals significant performance gaps, achieving at most 18.0% success in PoC generation and 34.0% in vulnerability patching on our complete dataset. These results highlight the crucial steps needed toward developing LLM agents that are more practical, intelligent, and autonomous for security engineering.
♻ ☆ Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods
Optimization lies at the core of modern deep learning, yet existing methods often face a fundamental trade-off between adapting to problem geometry and leveraging curvature utilization. Steepest descent algorithms adapt to different geometries through norm choices but remain strictly first-order, whereas quasi-Newton and adaptive optimizers incorporate curvature information but are restricted to Frobenius geometry, limiting their applicability across diverse architectures. In this work, we propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms. This abstraction reveals that widely used optimizers such as SGD and Adam, as well as more advanced approaches like Muon and KL-Shampoo, and recent hybrids including SOAP and SPlus, all emerge as special cases of the same principle. Within this framework, we provide the first systematic treatment of affine and scale invariance in the matrix-parameterized setting, establishing necessary and sufficient conditions under generalized norms. Building on this foundation, we introduce two new methods, $\texttt{MuAdam}$ and $\texttt{MuAdam-SANIA}$, which combine the spectral geometry of Muon with Adam-style preconditioning. Our experiments demonstrate that these optimizers are competitive with, and in some cases outperform, existing state-of-the-art methods. Our code is available at https://github.com/brain-lab-research/LIB/tree/quasi_descent
comment: 22 pages, 2 figures, 8 tables
♻ ☆ Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking NeurIPS 2025
Masked diffusion models (MDM) are powerful generative models for discrete data that generate samples by progressively unmasking tokens in a sequence. Each token can take one of two states: masked or unmasked. We observe that token sequences often remain unchanged between consecutive sampling steps; consequently, the model repeatedly processes identical inputs, leading to redundant computation. To address this inefficiency, we propose the Partial masking scheme (Prime), which augments MDM by allowing tokens to take intermediate states interpolated between the masked and unmasked states. This design enables the model to make predictions based on partially observed token information, and facilitates a fine-grained denoising process. We derive a variational training objective and introduce a simple architectural design to accommodate intermediate-state inputs. Our method demonstrates superior performance across a diverse set of generative modeling tasks. On text data, it achieves a perplexity of 15.36 on OpenWebText, outperforming previous MDM (21.52), autoregressive models (17.54), and their hybrid variants (17.58), without relying on an autoregressive formulation. On image data, it attains competitive FID scores of 3.26 on CIFAR-10 and 6.98 on ImageNet-32, comparable to leading continuous generative models.
comment: Published at NeurIPS 2025. Project Page: https://chen-hao-chao.github.io/mdm-prime
♻ ☆ Scalable Boltzmann Generators for equilibrium sampling of large-scale materials
The use of generative models to sample equilibrium distributions of many-body systems, as first demonstrated by Boltzmann Generators, has attracted substantial interest due to their ability to produce unbiased and uncorrelated samples in `one shot'. Despite their promise and impressive results across the natural sciences, scaling these models to large systems remains a major challenge. In this work, we introduce a Boltzmann Generator architecture that addresses this scalability bottleneck with a focus on applications in materials science. We leverage augmented coupling flows in combination with graph neural networks to base the generation process on local environmental information, while allowing for energy-based training and fast inference. Compared to previous architectures, our model trains significantly faster, requires far less computational resources, and achieves superior sampling efficiencies. Crucially, the architecture is transferable to larger system sizes, which allows for the efficient sampling of materials with simulation cells of unprecedented size. We demonstrate the potential of our approach by applying it to several materials systems, including Lennard-Jones crystals, ice phases of mW water, and the phase diagram of silicon, for system sizes well above one thousand atoms. The trained Boltzmann Generators produce highly accurate equilibrium ensembles for various crystal structures, as well as Helmholtz and Gibbs free energies across a range of system sizes, able to reach scales where finite-size effects become negligible.
♻ ☆ The Coverage Principle: How Pre-Training Enables Post-Training
Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model remains poorly understood. Notably, although pre-training success is often quantified by cross-entropy loss, cross-entropy can be a poor predictor of downstream performance. Instead, we provide a theoretical perspective on this relationship through the lens of \emph{coverage}, which quantifies the probability mass the pre-trained model places on high-quality responses and which is necessary and sufficient for post-training and test-time scaling methods such as Best-of-N to succeed. Our main results develop an understanding of \emph{the coverage principle}, a phenomenon whereby next-token prediction (more generally, maximum likelihood) implicitly optimizes toward a model with good coverage. In particular, we uncover a mechanism that explains the power of coverage in predicting downstream performance: \emph{coverage generalizes faster than cross-entropy}, avoiding spurious dependence on problem-dependent parameters such as the sequence length. We also study practical algorithmic interventions with provable benefits for improving coverage, including (i) model/checkpoint selection procedures, (ii) gradient normalization schemes, and (iii) test-time decoding strategies.
♻ ☆ Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction
In this paper, we unify more than 10 existing one-step diffusion distillation approaches, such as Diff-Instruct, DMD, SIM, SiD, $f$-distill, etc, inside a theory-driven framework which we name the \textbf{\emph{Uni-Instruct}}. Uni-Instruct is motivated by our proposed diffusion expansion theory of the $f$-divergence family. Then we introduce key theories that overcome the intractability issue of the original expanded $f$-divergence, resulting in an equivalent yet tractable loss that effectively trains one-step diffusion models by minimizing the expanded $f$-divergence family. The novel unification introduced by Uni-Instruct not only offers new theoretical contributions that help understand existing approaches from a high-level perspective but also leads to state-of-the-art one-step diffusion generation performances. On the CIFAR10 generation benchmark, Uni-Instruct achieves record-breaking Frechet Inception Distance (FID) values of \textbf{\emph{1.46}} for unconditional generation and \textbf{\emph{1.38}} for conditional generation. On the ImageNet-$64\times 64$ generation benchmark, Uni-Instruct achieves a new SoTA one-step generation FID of \textbf{\emph{1.02}}, which outperforms its 79-step teacher diffusion with a significant improvement margin of 1.33 (1.02 vs 2.35). We also apply Uni-Instruct on broader tasks like text-to-3D generation. For text-to-3D generation, Uni-Instruct gives decent results, which slightly outperforms previous methods, such as SDS and VSD, in terms of both generation quality and diversity. Both the solid theoretical and empirical contributions of Uni-Instruct will potentially help future studies on one-step diffusion distillation and knowledge transferring of diffusion models.
♻ ☆ Base Models Know How to Reason, Thinking Models Learn When NeurIPS 2025
Why do thinking language models like DeepSeek R1 outperform their base counterparts? Despite consistent performance gains, it remains unclear to what extent thinking models learn entirely new reasoning capabilities or repurpose pre-existing base model ones. In this work, we propose a hybrid model where we activate reasoning mechanisms in base models at the right time to elicit thinking-model-level reasoning chains, implying that thinking models exploit already existing capabilities. To ground our analysis, we introduce an unsupervised, bottom-up approach for uncovering human-interpretable reasoning behaviors in thinking models. This approach provides an unbiased method to discover reasoning behaviors without imposing manual or LLM-derived assumptions. Across three base and four thinking models, using GSM8K and MATH500, our hybrid model recovers up to 91% of the performance gap to thinking models without any weight updates while steering only 12% of tokens. Concretely, our empirical setup provides a simple, causal way to test the effectiveness of existing reasoning mechanisms in base models by invoking them directly and measuring the resulting task performance. More broadly, these results reframe our understanding of how thinking models are trained: pre-training is when models acquire most of their reasoning mechanisms, and post-training teaches efficient deployment of these mechanisms at the right time, enabling efficient use of their inference-time compute.
comment: 10 pages, Accepted to the Mechanistic Interpretability Workshop at NeurIPS 2025
♻ ☆ LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits NeurIPS 2025
Reward Models (RMs) are crucial to aligning large language models (LLMs), but the degree to which an RM specialized to one task (e.g. writing) generalizes to new tasks (e.g. math) is often not known a priori, often making using only one fixed RM to train LLMs suboptimal. However, optimizing LLMs with multiple RMs simultaneously can incur a prohibitively high computational cost and lead to conflicting signals from different RMs that may degrade performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which frames reward model selection as a multi-armed bandit problem, efficiently and iteratively training LLMs using multiple RMs by selecting the most well-suited RM for each instance. On commonsense and math reasoning tasks, we show that LASeR boosts iterative LLM training, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% over an ensemble of RM scores while also showing superior efficiency (e.g., a 2x speedup). Moreover, on WildChat (open-ended instruction-following tasks), LASeR leads to a 72.69% AlpacaEval win rate over the RM score ensemble baseline. Extending to long-context generation, LASeR improves by 2.96 F1 points (avg.) on single-document QA tasks and 2.97 F1 points on few-shot learning over the RM score ensemble baseline with best-of-n sampling.
comment: NeurIPS 2025 camera-ready. First two authors contributed equally. Code: https://github.com/duykhuongnguyen/LASeR-MAB
♻ ☆ Learning to Learn with Contrastive Meta-Objective NeurIPS2025
Meta-learning enables learning systems to adapt quickly to new tasks, similar to humans. Different meta-learning approaches all work under/with the mini-batch episodic training framework. Such framework naturally gives the information about task identity, which can serve as additional supervision for meta-training to improve generalizability. We propose to exploit task identity as additional supervision in meta-training, inspired by the alignment and discrimination ability which is is intrinsic in human's fast learning. This is achieved by contrasting what meta-learners learn, i.e., model representations. The proposed ConML is evaluating and optimizing the contrastive meta-objective under a problem- and learner-agnostic meta-training framework. We demonstrate that ConML integrates seamlessly with existing meta-learners, as well as in-context learning models, and brings significant boost in performance with small implementation cost.
comment: Received by NeurIPS2025 (Oral)
♻ ☆ What Expressivity Theory Misses: Message Passing Complexity for GNNs NeurIPS 2025
Expressivity theory, characterizing which graphs a GNN can distinguish, has become the predominant framework for analyzing GNNs, with new models striving for higher expressivity. However, we argue that this focus is misguided: First, higher expressivity is not necessary for most real-world tasks as these tasks rarely require expressivity beyond the basic WL test. Second, expressivity theory's binary characterization and idealized assumptions fail to reflect GNNs' practical capabilities. To overcome these limitations, we propose Message Passing Complexity (MPC): a continuous measure that quantifies the difficulty for a GNN architecture to solve a given task through message passing. MPC captures practical limitations like over-squashing while preserving the theoretical impossibility results from expressivity theory, effectively narrowing the gap between theory and practice. Through extensive validation on fundamental GNN tasks, we show that MPC's theoretical predictions correlate with empirical performance, successfully explaining architectural successes and failures. Thereby, MPC advances beyond expressivity theory to provide a more powerful and nuanced framework for understanding and improving GNN architectures.
comment: NeurIPS 2025, Spotlight
♻ ☆ Learning to Add, Multiply, and Execute Algorithmic Instructions Exactly with Neural Networks
Neural networks are known for their ability to approximate smooth functions, yet they fail to generalize perfectly to unseen inputs when trained on discrete operations. Such operations lie at the heart of algorithmic tasks such as arithmetic, which is often used as a test bed for algorithmic execution in neural networks. In this work, we ask: can neural networks learn to execute binary-encoded algorithmic instructions exactly? We use the Neural Tangent Kernel (NTK) framework to study the training dynamics of two-layer fully connected networks in the infinite-width limit and show how a sufficiently large ensemble of such models can be trained to execute exactly, with high probability, four fundamental tasks: binary permutations, binary addition, binary multiplication, and Subtract and Branch if Negative (SBN) instructions. Since SBN is Turing-complete, our framework extends to computable functions. We show how this can be efficiently achieved using only logarithmically many training data. Our approach relies on two techniques: structuring the training data to isolate bit-level rules, and controlling correlations in the NTK regime to align model predictions with the target algorithmic executions.
comment: 43 pages, 7 figures
♻ ☆ 3D-GSRD: 3D Molecular Graph Auto-Encoder with Selective Re-mask Decoding
Masked graph modeling (MGM) is a promising approach for molecular representation learning (MRL).However, extending the success of re-mask decoding from 2D to 3D MGM is non-trivial, primarily due to two conflicting challenges: avoiding 2D structure leakage to the decoder, while still providing sufficient 2D context for reconstructing re-masked atoms. To address these challenges, we propose 3D-GSRD: a 3D Molecular Graph Auto-Encoder with Selective Re-mask Decoding. The core innovation of 3D-GSRD lies in its Selective Re-mask Decoding(SRD), which re-masks only 3D-relevant information from encoder representations while preserving the 2D graph structures. This SRD is synergistically integrated with a 3D Relational-Transformer(3D-ReTrans) encoder alongside a structure-independent decoder. We analyze that SRD, combined with the structure-independent decoder, enhances the encoder's role in MRL. Extensive experiments show that 3D-GSRD achieves strong downstream performance, setting a new state-of-the-art on 7 out of 8 targets in the widely used MD17 molecular property prediction benchmark. The code is released at https://github.com/WuChang0124/3D-GSRD.
♻ ☆ PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs
Deep learning-based computational methods have achieved promising results in predicting protein-protein interactions (PPIs). However, existing benchmarks predominantly focus on isolated pairwise evaluations, overlooking a model's capability to reconstruct biologically meaningful PPI networks, which is crucial for biology research. To address this gap, we introduce PRING, the first comprehensive benchmark that evaluates protein-protein interaction prediction from a graph-level perspective. PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions, with well-designed strategies to address both data redundancy and leakage. Building on this golden-standard dataset, we establish two complementary evaluation paradigms: (1) topology-oriented tasks, which assess intra and cross-species PPI network construction, and (2) function-oriented tasks, including protein complex pathway prediction, GO module analysis, and essential protein justification. These evaluations not only reflect the model's capability to understand the network topology but also facilitate protein function annotation, biological module detection, and even disease mechanism analysis. Extensive experiments on four representative model categories, consisting of sequence similarity-based, naive sequence-based, protein language model-based, and structure-based approaches, demonstrate that current PPI models have potential limitations in recovering both structural and functional properties of PPI networks, highlighting the gap in supporting real-world biological applications. We believe PRING provides a reliable platform to guide the development of more effective PPI prediction models for the community. The dataset and source code of PRING are available at https://github.com/SophieSarceau/PRING.
♻ ☆ ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning
Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs) -- an automata-based formalism for reward specification -- are used as the mechanism for RL objective specification, and are automatically constructed via the use of FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specifications in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural language specifications; (ii) associate language embeddings with each RM automata-state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM's effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.
♻ ☆ Pay Attention to Small Weights
Finetuning large pretrained neural networks is known to be resource-intensive, both in terms of memory and computational cost. To mitigate this, a common approach is to restrict training to a subset of the model parameters. By analyzing the relationship between gradients and weights during finetuning, we observe a notable pattern: large gradients are often associated with small-magnitude weights. This correlation is more pronounced in finetuning settings than in training from scratch. Motivated by this observation, we propose NANOADAM, which dynamically updates only the small-magnitude weights during finetuning and offers several practical advantages: first, this criterion is gradient-free -- the parameter subset can be determined without gradient computation; second, it preserves large-magnitude weights, which are likely to encode critical features learned during pretraining, thereby reducing the risk of catastrophic forgetting; thirdly, it permits the use of larger learning rates and consistently leads to better generalization performance in experiments. We demonstrate this for both NLP and vision tasks.
♻ ☆ Masked Generative Priors Improve World Models Sequence Modelling Capabilities
Deep Reinforcement Learning (RL) has become the leading approach for creating artificial agents in complex environments. Model-based approaches, which are RL methods with world models that predict environment dynamics, are among the most promising directions for improving data efficiency, forming a critical step toward bridging the gap between research and real-world deployment. In particular, world models enhance sample efficiency by learning in imagination, which involves training a generative sequence model of the environment in a self-supervised manner. Recently, Masked Generative Modelling has emerged as a more efficient and superior inductive bias for modelling and generating token sequences. Building on the Efficient Stochastic Transformer-based World Models (STORM) architecture, we replace the traditional MLP prior with a Masked Generative Prior (e.g., MaskGIT Prior) and introduce GIT-STORM. We evaluate our model on two downstream tasks: reinforcement learning and video prediction. GIT-STORM demonstrates substantial performance gains in RL tasks on the Atari 100k benchmark. Moreover, we apply Transformer-based World Models to continuous action environments for the first time, addressing a significant gap in prior research. To achieve this, we employ a state mixer function that integrates latent state representations with actions, enabling our model to handle continuous control tasks. We validate this approach through qualitative and quantitative analyses on the DeepMind Control Suite, showcasing the effectiveness of Transformer-based World Models in this new domain. Our results highlight the versatility and efficacy of the MaskGIT dynamics prior, paving the way for more accurate world models and effective RL policies.
♻ ☆ BlockGPT: Spatio-Temporal Modelling of Rainfall via Frame-Level Autoregression
Predicting precipitation maps is a highly complex spatiotemporal modeling task, critical for mitigating the impacts of extreme weather events. Short-term precipitation forecasting, or nowcasting, requires models that are not only accurate but also computationally efficient for real-time applications. Current methods, such as token-based autoregressive models, often suffer from flawed inductive biases and slow inference, while diffusion models can be computationally intensive. To address these limitations, we introduce BlockGPT, a generative autoregressive transformer using batched tokenization (Block) method that predicts full two-dimensional fields (frames) at each time step. Conceived as a model-agnostic paradigm for video prediction, BlockGPT factorizes space-time by using self-attention within each frame and causal attention across frames; in this work, we instantiate it for precipitation nowcasting. We evaluate BlockGPT on two precipitation datasets, viz. KNMI (Netherlands) and SEVIR (U.S.), comparing it to state-of-the-art baselines including token-based (NowcastingGPT) and diffusion-based (DiffCast+Phydnet) models. The results show that BlockGPT achieves superior accuracy, event localization as measured by categorical metrics, and inference speeds up to 31x faster than comparable baselines.
♻ ☆ Heavy-Ball Momentum Method in Continuous Time and Discretization Error Analysis
This paper establishes a continuous time approximation, a piece-wise continuous differential equation, for the discrete Heavy-Ball (HB) momentum method with explicit discretization error. Investigating continuous differential equations has been a promising approach for studying the discrete optimization methods. Despite the crucial role of momentum in gradient-based optimization methods, the gap between the original discrete dynamics and the continuous time approximations due to the discretization error has not been comprehensively bridged yet. In this work, we study the HB momentum method in continuous time while putting more focus on the discretization error to provide additional theoretical tools to this area. In particular, we design a first-order piece-wise continuous differential equation, where we add a number of counter terms to account for the discretization error explicitly. As a result, we provide a continuous time model for the HB momentum method that allows the control of discretization error to arbitrary order of the step size. As an application, we leverage it to find a new implicit regularization of the directional smoothness and investigate the implicit bias of HB for diagonal linear networks, indicating how our results can be used in deep learning. Our theoretical findings are further supported by numerical experiments.
comment: 32 pages, 7 figures
♻ ☆ Bootstrap Sampling Rate Greater than 1.0 May Improve Random Forest Performance
Random forests (RFs) utilize bootstrap sampling to generate individual training sets for each component tree by sampling with replacement, with the sample size typically equal to that of the original training set ($N$). Previous research indicates that drawing fewer than $N$ observations can also yield satisfactory results. The ratio of the number of observations in each bootstrap sample to the total number of training instances is referred to as the bootstrap rate (BR). Sampling more than $N$ observations (BR $>$ 1.0) has been explored only to a limited extent and has generally been considered ineffective. In this paper, we revisit this setup using 36 diverse datasets, evaluating BR values ranging from 1.2 to 5.0. Contrary to previous findings, we show that higher BR values can lead to statistically significant improvements in classification accuracy compared to standard settings (BR $\leq$ 1.0). Furthermore, we analyze how BR affects the leaf structure of decision trees within the RF and investigate factors influencing the optimal BR. Our results indicate that the optimal BR is primarily determined by the characteristics of the data set rather than the RF hyperparameters.
♻ ☆ TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research EMNLP 2025
Mechanistic interpretability research faces a gap between analyzing simple circuits in toy tasks and discovering features in large models. To bridge this gap, we propose text-to-SQL generation as an ideal task to study, as it combines the formal structure of toy tasks with real-world complexity. We introduce TinySQL, a synthetic dataset, progressing from basic to advanced SQL operations, and train models ranging from 33M to 1B parameters to establish a comprehensive testbed for interpretability. We apply multiple complementary interpretability techniques, including Edge Attribution Patching and Sparse Autoencoders, to identify minimal circuits and components supporting SQL generation. We compare circuits for different SQL subskills, evaluating their minimality, reliability, and identifiability. Finally, we conduct a layerwise logit lens analysis to reveal how models compose SQL queries across layers: from intent recognition to schema resolution to structured generation. Our work provides a robust framework for probing and comparing interpretability methods in a structured, progressively complex setting.
comment: Accepted to EMNLP 2025, 9 pages, 19 figures, 7 tables, 18 trained models Project Website: https://abirharrasse.github.io/tinysql/ Code Repository: https://github.com/withmartian/TinySQL Datasets and Models Viewer: https://huggingface.co/spaces/abir-hr196/tinysql-demo
♻ ☆ Graph Few-Shot Learning via Adaptive Spectrum Experts and Cross-Set Distribution Calibration NeurIPS25
Graph few-shot learning has attracted increasing attention due to its ability to rapidly adapt models to new tasks with only limited labeled nodes. Despite the remarkable progress made by existing graph few-shot learning methods, several key limitations remain. First, most current approaches rely on predefined and unified graph filters (e.g., low-pass or high-pass filters) to globally enhance or suppress node frequency signals. Such fixed spectral operations fail to account for the heterogeneity of local topological structures inherent in real-world graphs. Moreover, these methods often assume that the support and query sets are drawn from the same distribution. However, under few-shot conditions, the limited labeled data in the support set may not sufficiently capture the complex distribution of the query set, leading to suboptimal generalization. To address these challenges, we propose GRACE, a novel Graph few-shot leaRning framework that integrates Adaptive spectrum experts with Cross-sEt distribution calibration techniques. Theoretically, the proposed approach enhances model generalization by adapting to both local structural variations and cross-set distribution calibration. Empirically, GRACE consistently outperforms state-of-the-art baselines across a wide range of experimental settings. Our code can be found here.
comment: NeurIPS25
♻ ☆ Learning Linear Attention in Polynomial Time
Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machine (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, key--value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.
♻ ☆ QiMeng-MuPa: Mutual-Supervised Learning for Sequential-to-Parallel Code Translation NeurIPS'2025
The rise of GPU-based high-performance computing (HPC) has driven the widespread adoption of parallel programming models such as CUDA. Yet, the inherent complexity of parallel programming creates a demand for the automated sequential-to-parallel approaches. However, data scarcity poses a significant challenge for machine learning-based sequential-to-parallel code translation. Although recent back-translation methods show promise, they still fail to ensure functional equivalence in the translated code. In this paper, we propose \textbf{QiMeng-MuPa}, a novel \textbf{Mu}tual-Supervised Learning framework for Sequential-to-\textbf{Pa}rallel code translation, to address the functional equivalence issue. QiMeng-MuPa consists of two models, a Translator and a Tester. Through an iterative loop consisting of Co-verify and Co-evolve steps, the Translator and the Tester mutually generate data for each other and improve collectively. The Tester generates unit tests to verify and filter functionally equivalent translated code, thereby evolving the Translator, while the Translator generates translated code as augmented input to evolve the Tester. Experimental results demonstrate that QiMeng-MuPa significantly enhances the performance of the base models: when applied to Qwen2.5-Coder, it not only improves Pass@1 by up to 28.91% and boosts Tester performance by 68.90%, but also outperforms the previous state-of-the-art method CodeRosetta by 1.56 and 6.92 in BLEU and CodeBLEU scores, while achieving performance comparable to DeepSeek-R1 and GPT-4.1. Our code is available at https://github.com/kcxain/mupa.
comment: Accepted to NeurIPS'2025
♻ ☆ A recursive Bayesian neural network for constitutive modeling of sands under monotonic and cyclic loading
In geotechnical engineering, constitutive models are central to capturing soil behavior across diverse drainage conditions, stress paths,and loading histories. While data driven deep learning (DL) approaches have shown promise as alternatives to traditional constitutive formulations, their deployment requires models that are both accurate and capable of quantifying predictive uncertainty. This study introduces a recursive Bayesian neural network (rBNN) framework that unifies temporal sequence learning with generalized Bayesian inference to achieve both predictive accuracy and rigorous uncertainty quantification. A key innovation is the incorporation of a sliding window recursive structure that enables the model to effectively capture path dependent soil responses under monotonic and cyclic loading. By treating network parameters as random variables and inferring their posterior distributions via generalized variational inference, the rBNN produces well calibrated confidence intervals alongside point predictions.The framework is validated against four datasets spanning both simulated and experimental triaxial tests: monotonic loading using a Hardening Soil model simulation and 28 CD tests on Baskarp sand, and cyclic loading using an exponential constitutive simulation of CD CU tests and 37 experimental cyclic CU tests on Ottawa F65 sand. This progression from monotonic to cyclic and from simulated to experimental data demonstrates the adaptability of the proposed approach across varying levels of data fidelity and complexity. Comparative analyses with LSTM, Encoder Decoder,and GRU architectures highlight that rBNN not only achieves competitive predictive accuracy but also provides reliable confidence intervals.
♻ ☆ Unveiling Transformer Perception by Exploring Input Manifolds
This paper introduces a general method for the exploration of equivalence classes in the input space of Transformer models. The proposed approach is based on sound mathematical theory which describes the internal layers of a Transformer architecture as sequential deformations of the input manifold. Using eigendecomposition of the pullback of the distance metric defined on the output space through the Jacobian of the model, we are able to reconstruct equivalence classes in the input space and navigate across them. Our method enables two complementary exploration procedures: the first retrieves input instances that produce the same class probability distribution as the original instance-thus identifying elements within the same equivalence class-while the second discovers instances that yield a different class probability distribution, effectively navigating toward distinct equivalence classes. Finally, we demonstrate how the retrieved instances can be meaningfully interpreted by projecting their embeddings back into a human-readable format.
comment: 11 pages, 4 figures
♻ ☆ AtomSurf : Surface Representation for Learning on Protein Structures ICLR 2025
While there has been significant progress in evaluating and comparing different representations for learning on protein data, the role of surface-based learning approaches remains not well-understood. In particular, there is a lack of direct and fair benchmark comparison between the best available surface-based learning methods against alternative representations such as graphs. Moreover, the few existing surface-based approaches either use surface information in isolation or, at best, perform global pooling between surface and graph-based architectures. In this work, we fill this gap by first adapting a state-of-the-art surface encoder for protein learning tasks. We then perform a direct and fair comparison of the resulting method against alternative approaches within the Atom3D benchmark, highlighting the limitations of pure surface-based learning. Finally, we propose an integrated approach, which allows learned feature sharing between graphs and surface representations on the level of nodes and vertices across all layers. We demonstrate that the resulting architecture achieves state-of-the-art results on all tasks in the Atom3D benchmark, while adhering to the strict benchmark protocol, as well as more broadly on binding site identification and binding pocket classification. Furthermore, we use coarsened surfaces and optimize our approach for efficiency, making our tool competitive in training and inference time with existing techniques. Code can be found online: https://github.com/Vincentx15/atomsurf
comment: Published as a conference paper at The Thirteenth International Conference on Learning Representations (ICLR 2025). The official open-access version is available at https://openreview.net/forum?id=ARQIJXFcTH
♻ ☆ Smoothed Distance Kernels for MMDs and Applications in Wasserstein Gradient Flows
Negative distance kernels $K(x,y) := - \|x-y\|$ were used in the definition of maximum mean discrepancies (MMDs) in statistics and lead to favorable numerical results in various applications. In particular, so-called slicing techniques for handling high-dimensional kernel summations profit from the simple parameter-free structure of the distance kernel. However, due to its non-smoothness in $x=y$, most of the classical theoretical results, e.g. on Wasserstein gradient flows of the corresponding MMD functional do not longer hold true. In this paper, we propose a new kernel which keeps the favorable properties of the negative distance kernel as being conditionally positive definite of order one with a nearly linear increase towards infinity and a simple slicing structure, but is Lipschitz differentiable now. Our construction is based on a simple 1D smoothing procedure of the absolute value function followed by a Riemann-Liouville fractional integral transform. Numerical results demonstrate that the new kernel performs similarly well as the negative distance kernel in gradient descent methods, but now with theoretical guarantees.
comment: 48 pages, 10 figures
♻ ☆ An Efficient Local Search Approach for Polarized Community Discovery in Signed Networks
Signed networks, where edges are labeled as positive or negative to represent friendly or antagonistic interactions, provide a natural framework for analyzing polarization, trust, and conflict in social systems. Detecting meaningful group structures in such networks is crucial for understanding online discourse, political divisions, and trust dynamics. A key challenge is to identify communities that are internally cohesive and externally antagonistic, while allowing for neutral or unaligned vertices. In this paper, we propose a method for identifying $k$ polarized communities that addresses a major limitation of prior methods: their tendency to produce highly size-imbalanced solutions. We introduce a novel optimization objective that avoids such imbalance. In addition, it is well known that approximation algorithms based on local search are highly effective for clustering signed networks when neutral vertices are not allowed. We build on this idea and design the first local search algorithm that extends to the setting with neutral vertices while scaling to large networks. By connecting our approach to block-coordinate Frank-Wolfe optimization, we prove a linear convergence rate, enabled by the structure of our objective. Experiments on real-world and synthetic datasets demonstrate that our method consistently outperforms state-of-the-art baselines in solution quality, while remaining competitive in computational efficiency.
♻ ☆ AdaptGrad: Adaptive Sampling to Reduce Noise NeurIPS 2025
Gradient Smoothing is an efficient approach to reducing noise in gradient-based model explanation method. SmoothGrad adds Gaussian noise to mitigate much of these noise. However, the crucial hyper-parameter in this method, the variance $\sigma$ of Gaussian noise, is set manually or with heuristic approach. However, it results in the smoothed gradients still containing a certain amount of noise. In this paper, we aim to interpret SmoothGrad as a corollary of convolution, thereby re-understanding the gradient noise and the role of $\sigma$ from the perspective of confidence level. Furthermore, we propose an adaptive gradient smoothing method, AdaptGrad, based on these insights. Through comprehensive experiments, both qualitative and quantitative results demonstrate that AdaptGrad could effectively reduce almost all the noise in vanilla gradients compared with baselines methods. AdaptGrad is simple and universal, making it applicable for enhancing gradient-based interpretability methods for better visualization.
comment: Accepted by NeurIPS 2025
♻ ☆ Incentivizing Time-Aware Fairness in Data Sharing NeurIPS 2025
In collaborative data sharing and machine learning, multiple parties aggregate their data resources to train a machine learning model with better model performance. However, as the parties incur data collection costs, they are only willing to do so when guaranteed incentives, such as fairness and individual rationality. Existing frameworks assume that all parties join the collaboration simultaneously, which does not hold in many real-world scenarios. Due to the long processing time for data cleaning, difficulty in overcoming legal barriers, or unawareness, the parties may join the collaboration at different times. In this work, we propose the following perspective: As a party who joins earlier incurs higher risk and encourages the contribution from other wait-and-see parties, that party should receive a reward of higher value for sharing data earlier. To this end, we propose a fair and time-aware data sharing framework, including novel time-aware incentives. We develop new methods for deciding reward values to satisfy these incentives. We further illustrate how to generate model rewards that realize the reward values and empirically demonstrate the properties of our methods on synthetic and real-world datasets.
comment: Accepted to NeurIPS 2025
♻ ☆ InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models
Model fusion combines multiple Large Language Models (LLMs) with different strengths into a more powerful, integrated model through lightweight training methods. Existing works on model fusion focus primarily on supervised fine-tuning (SFT), leaving preference alignment (PA) --a critical phase for enhancing LLM performance--largely unexplored. The current few fusion methods on PA phase, like WRPO, simplify the process by utilizing only response outputs from source models while discarding their probability information. To address this limitation, we propose InfiFPO, a preference optimization method for implicit model fusion. InfiFPO replaces the reference model in Direct Preference Optimization (DPO) with a fused source model that synthesizes multi-source probabilities at the sequence level, circumventing complex vocabulary alignment challenges in previous works and meanwhile maintaining the probability information. By introducing probability clipping and max-margin fusion strategies, InfiFPO enables the pivot model to align with human preferences while effectively distilling knowledge from source models. Comprehensive experiments on 11 widely-used benchmarks demonstrate that InfiFPO consistently outperforms existing model fusion and preference optimization methods. When using Phi-4 as the pivot model, InfiFPO improve its average performance from 79.95 to 83.33 on 11 benchmarks, significantly improving its capabilities in mathematics, coding, and reasoning tasks.
♻ ☆ Towards Context-Aware Domain Generalization: Understanding the Benefits and Limits of Marginal Transfer Learning NeurIPS
In this work, we analyze the conditions under which information about the context of an input $X$ can improve the predictions of deep learning models in new domains. Following work in marginal transfer learning in Domain Generalization (DG), we formalize the notion of context as a permutation-invariant representation of a set of data points that originate from the same domain as the input itself. We offer a theoretical analysis of the conditions under which this approach can, in principle, yield benefits, and formulate two necessary criteria that can be easily verified in practice. Additionally, we contribute insights into the kind of distribution shifts for which the marginal transfer learning approach promises robustness. Empirical analysis shows that our criteria are effective in discerning both favorable and unfavorable scenarios. Finally, we demonstrate that we can reliably detect scenarios where a model is tasked with unwarranted extrapolation in out-of-distribution (OOD) domains, identifying potential failure cases. Consequently, we showcase a method to select between the most predictive and the most robust model, circumventing the well-known trade-off between predictive performance and robustness.
comment: 10 pages, 4 figures, NeurIPS Workshop: Reliable ML from Unreliable Data
♻ ☆ Estimating Long-term Heterogeneous Dose-response Curve: Generalization Bound Leveraging Optimal Transport Weights
Long-term treatment effect estimation is a significant but challenging problem in many applications. Existing methods rely on ideal assumptions, such as no unobserved confounders or binary treatment, to estimate long-term average treatment effects. However, in numerous real-world applications, these assumptions could be violated, and average treatment effects are insufficient for personalized decision-making. In this paper, we address a more general problem of estimating long-term Heterogeneous Dose-Response Curve (HDRC) while accounting for unobserved confounders and continuous treatment. Specifically, to remove the unobserved confounders in the long-term observational data, we introduce an optimal transport weighting framework to align the long-term observational data to an auxiliary short-term experimental data. Furthermore, to accurately predict the heterogeneous effects of continuous treatment, we establish a generalization bound on counterfactual prediction error by leveraging the reweighted distribution induced by optimal transport. Finally, we develop a long-term HDRC estimator building upon the above theoretical foundations. Extensive experiments on synthetic and semi-synthetic datasets demonstrate the effectiveness of our approach.
♻ ☆ MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research NeurIPS 2025
Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages, and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results--posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers, supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.
comment: 49 pages, 9 figures. Accepted by NeurIPS 2025 D&B Track
♻ ☆ Mathematics with large language models as provers and verifiers
During 2024 and 2025 the discussion about the theorem-proving capabilities of large language models started reporting interesting success stories, mostly to do with difficult exercises (such as problems from the International Mathematical Olympiad), but also with conjectures [Feldman & Karbasi, arXiv:2509.18383v1] formulated for the purpose of verifying whether the artificial intelligence could prove it. In this paper we report a theorem proving feat achieved by ChatGPT by using a protocol involving different prover and verifier instances of the gpt-5 model working collaboratively. To make sure that the produced proofs do not suffer from hallucinations, the final proof is formally verified by the lean proof assistant, and the conformance of premises and conclusion of the lean code is verified by a human. Our methodology is by no means complete or exact. It was nonetheless able to solve five out of six 2025 IMO problems, and close about a third of the sixty-six number theory conjectures in [Cohen, Journal of Integer Sequences, 2025].
♻ ☆ Optimizing Asynchronous Federated Learning: A Delicate Trade-Off Between Model-Parameter Staleness and Update Frequency
Synchronous federated learning (FL) scales poorly with the number of clients due to the straggler effect. Algorithms like FedAsync and GeneralizedFedAsync address this limitation by enabling asynchronous communication between clients and the central server. In this work, we rely on stochastic modeling and analysis to better understand the impact of design choices in asynchronous FL algorithms, such as the concurrency level and routing probabilities, and we leverage this knowledge to optimize loss. Compared to most existing studies, we account for the joint impact of heterogeneous and variable service speeds and heterogeneous datasets at the clients. We characterize in particular a fundamental trade-off for optimizing asynchronous FL: minimizing gradient estimation errors by avoiding model parameter staleness, while also speeding up the system by increasing the throughput of model updates. Our two main contributions can be summarized as follows. First, we prove a discrete variant of Little's law to derive a closed-form expression for relative delay, a metric that quantifies staleness. This allows us to efficiently minimize the average loss per model update, which has been the gold standard in literature to date, using the upper-bound of Leconte et al. as a proxy. Second, we observe that naively optimizing this metric drastically slows down the system by overemphasizing staleness at the expense of throughput. This motivates us to introduce an alternative metric that also accounts for speed, for which we derive a tractable upper-bound that can be minimized numerically. Extensive numerical results show these optimizations enhance accuracy by 10% to 30%.
♻ ☆ Using (Not-so) Large Language Models to Generate Simulation Models in a Formal DSL: A Study on Reaction Networks
Formal languages are an integral part of modeling and simulation. They allow the distillation of knowledge into concise simulation models amenable to automatic execution, interpretation, and analysis. However, the arguably most humanly accessible means of expressing models is through natural language, which is not easily interpretable by computers. Here, we evaluate how a Large Language Model (LLM) might be used for formalizing natural language into simulation models. Existing studies only explored using very large LLMs, like the commercial GPT models, without fine-tuning model weights. To close this gap, we show how an open-weights, 7B-parameter Mistral model can be fine-tuned to translate natural language descriptions to reaction network models in a domain-specific language, offering a self-hostable, compute-efficient, and memory efficient alternative. To this end, we develop a synthetic data generator to serve as the basis for fine-tuning and evaluation. Our quantitative evaluation shows that our fine-tuned Mistral model can recover the ground truth simulation model in up to 84.5% of cases. In addition, our small-scale user study demonstrates the model's practical potential for one-time generation as well as interactive modeling in various domains. While promising, in its current form, the fine-tuned small LLM cannot catch up with large LLMs. We conclude that higher-quality training data are required, and expect future small and open-source LLMs to offer new opportunities.
comment: 27 pages, 5 figures; supplemental material available at https://doi.org/10.1145/3733719
♻ ☆ Survey of Graph Neural Network for Internet of Things and NextG Networks
The exponential increase in Internet of Things (IoT) devices coupled with 6G pushing towards higher data rates and connected devices has sparked a surge in data. Consequently, harnessing the full potential of data-driven machine learning has become one of the important thrusts. In addition to the advancement in wireless technology, it is important to efficiently use the resources available and meet the users' requirements. Graph Neural Networks (GNNs) have emerged as a promising paradigm for effectively modeling and extracting insights which inherently exhibit complex network structures due to its high performance and accuracy, scalability, adaptability, and resource efficiency. There is a lack of a comprehensive survey that focuses on the applications and advances GNN has made in the context of IoT and Next Generation (NextG) networks. To bridge that gap, this survey starts by providing a detailed description of GNN's terminologies, architecture, and the different types of GNNs. Then we provide a comprehensive survey of the advancements in applying GNNs for IoT from the perspective of data fusion and intrusion detection. Thereafter, we survey the impact GNN has made in improving spectrum awareness. Next, we provide a detailed account of how GNN has been leveraged for networking and tactical systems. Through this survey, we aim to provide a comprehensive resource for researchers to learn more about GNN in the context of wireless networks, and understand its state-of-the-art use cases while contrasting to other machine learning approaches. Finally, we also discussed the challenges and wide range of future research directions to further motivate the use of GNN for IoT and NextG Networks.
♻ ☆ Generating Directed Graphs with Dual Attention and Asymmetric Encoding
Directed graphs naturally model systems with asymmetric, ordered relationships, essential to applications in biology, transportation, social networks, and visual understanding. Generating such graphs enables tasks such as simulation, data augmentation and novel instance discovery; however, directed graph generation remains underexplored. We identify two key factors limiting progress in this direction: first, modeling edge directionality introduces a substantially larger dependency space, making the underlying distribution harder to learn; second, the absence of standardized benchmarks hinders rigorous evaluation. Addressing the former requires more expressive models that are sensitive to directional topologies. We propose Directo, the first generative model for directed graphs built upon the discrete flow matching framework. Our approach combines: (i) principled positional encodings tailored to asymmetric pairwise relations, (ii) a dual-attention mechanism capturing both incoming and outgoing dependencies, and (iii) a robust, discrete generative framework. To support evaluation, we introduce a benchmark suite covering synthetic and real-world datasets. It shows that our method performs strongly across diverse settings and even competes with specialized models for particular classes, such as directed acyclic graphs. Our results highlight the effectiveness and generality of our approach, establishing a solid foundation for future research in directed graph generation.
♻ ☆ Non-Stationary Lipschitz Bandits
We study the problem of non-stationary Lipschitz bandits, where the number of actions is infinite and the reward function, satisfying a Lipschitz assumption, can change arbitrarily over time. We design an algorithm that adaptively tracks the recently introduced notion of significant shifts, defined by large deviations of the cumulative reward function. To detect such reward changes, our algorithm leverages a hierarchical discretization of the action space. Without requiring any prior knowledge of the non-stationarity, our algorithm achieves a minimax-optimal dynamic regret bound of $\mathcal{\widetilde{O}}(\tilde{L}^{1/3}T^{2/3})$, where $\tilde{L}$ is the number of significant shifts and $T$ the horizon. This result provides the first optimal guarantee in this setting.
♻ ☆ EEG-Based Consumer Behaviour Prediction: An Exploration from Classical Machine Learning to Graph Neural Networks
Prediction of consumer behavior is one of the important purposes in marketing, cognitive neuroscience, and human-computer interaction. The electroencephalography (EEG) data can help analyze the decision process by providing detailed information about the brain's neural activity. In this research, a comparative approach is utilized for predicting consumer behavior by EEG data. In the first step, the features of the EEG data from the NeuMa dataset were extracted and cleaned. For the Graph Neural Network (GNN) models, the brain connectivity features were created. Different machine learning models, such as classical models and Graph Neural Networks, are used and compared. The GNN models with different architectures are implemented to have a comprehensive comparison; furthermore, a wide range of classical models, such as ensemble models, are applied, which can be very helpful to show the difference and performance of each model on the dataset. Although the results did not show a significant difference overall, the GNN models generally performed better in some basic criteria where classical models were not satisfactory. This study not only shows that combining EEG signal analysis and machine learning models can provide an approach to deeper understanding of consumer behavior, but also provides a comprehensive comparison between the machine learning models that have been widely used in previous studies in the EEG-based neuromarketing such as Support Vector Machine (SVM), and the models which are not used or rarely used in the field, like Graph Neural Networks.
♻ ☆ Rethinking Hebbian Principle: Low-Dimensional Structural Projection for Unsupervised Learning
Hebbian learning is a biological principle that intuitively describes how neurons adapt their connections through repeated stimuli. However, when applied to machine learning, it suffers serious issues due to the unconstrained updates of the connections and the lack of accounting for feedback mediation. Such shortcomings limit its effective scaling to complex network architectures and tasks. To this end, here we introduce the Structural Projection Hebbian Representation (SPHeRe), a novel unsupervised learning method that integrates orthogonality and structural information preservation through a local auxiliary nonlinear block. The loss for structural information preservation backpropagates to the input through an auxiliary lightweight projection that conceptually serves as feedback mediation while the orthogonality constraints account for the boundedness of updating magnitude. Extensive experimental results show that SPHeRe achieves SOTA performance among unsupervised synaptic plasticity approaches on standard image classification benchmarks, including CIFAR-10, CIFAR-100, and Tiny-ImageNet. Furthermore, the method exhibits strong effectiveness in continual learning and transfer learning scenarios, and image reconstruction tasks show the robustness and generalizability of the extracted features. This work demonstrates the competitiveness and potential of Hebbian unsupervised learning rules within modern deep learning frameworks, demonstrating the possibility of efficient and biologically inspired learning algorithms without the strong dependence on strict backpropagation. Our code is available at https://github.com/brain-intelligence-lab/SPHeRe.
♻ ☆ From Reviews to Actionable Insights: An LLM-Based Approach for Attribute and Feature Extraction
This research proposes a systematic, large language model (LLM) approach for extracting product and service attributes, features, and associated sentiments from customer reviews. Grounded in marketing theory, the framework distinguishes perceptual attributes from actionable features, producing interpretable and managerially actionable insights. We apply the methodology to 20,000 Yelp reviews of Starbucks stores and evaluate eight prompt variants on a random subset of reviews. Model performance is assessed through agreement with human annotations and predictive validity for customer ratings. Results show high consistency between LLMs and human coders and strong predictive validity, confirming the reliability of the approach. Human coders required a median of six minutes per review, whereas the LLM processed each in two seconds, delivering comparable insights at a scale unattainable through manual coding. Managerially, the analysis identifies attributes and features that most strongly influence customer satisfaction and their associated sentiments, enabling firms to pinpoint "joy points," address "pain points," and design targeted interventions. We demonstrate how structured review data can power an actionable marketing dashboard that tracks sentiment over time and across stores, benchmarks performance, and highlights high-leverage features for improvement. Simulations indicate that enhancing sentiment for key service features could yield 1-2% average revenue gains per store.
♻ ☆ Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
♻ ☆ A Comprehensive Benchmark for RNA 3D Structure-Function Modeling
The relationship between RNA structure and function has recently attracted interest within the deep learning community, a trend expected to intensify as nucleic acid structure models advance. Despite this momentum, the lack of standardized, accessible benchmarks for applying deep learning to RNA 3D structures hinders progress. To this end, we introduce a collection of seven benchmarking datasets specifically designed to support RNA structure-function prediction. Built on top of the established Python package rnaglib, our library streamlines data distribution and encoding, provides tools for dataset splitting and evaluation, and offers a comprehensive, user-friendly environment for model comparison. The modular and reproducible design of our datasets encourages community contributions and enables rapid customization. To demonstrate the utility of our benchmarks, we report baseline results for all tasks using a relational graph neural network.
♻ ☆ Deep Linear Probe Generators for Weight Space Learning ICLR 2025
Weight space learning aims to extract information about a neural network, such as its training dataset or generalization error. Recent approaches learn directly from model weights, but this presents many challenges as weights are high-dimensional and include permutation symmetries between neurons. An alternative approach, Probing, represents a model by passing a set of learned inputs (probes) through the model, and training a predictor on top of the corresponding outputs. Although probing is typically not used as a stand alone approach, our preliminary experiment found that a vanilla probing baseline worked surprisingly well. However, we discover that current probe learning strategies are ineffective. We therefore propose Deep Linear Probe Generators (ProbeGen), a simple and effective modification to probing approaches. ProbeGen adds a shared generator module with a deep linear architecture, providing an inductive bias towards structured probes thus reducing overfitting. While simple, ProbeGen performs significantly better than the state-of-the-art and is very efficient, requiring between 30 to 1000 times fewer FLOPs than other top approaches.
comment: ICLR 2025. Project page: https://vision.huji.ac.il/probegen
♻ ☆ TimeWak: Temporal Chained-Hashing Watermark for Time Series Data
Synthetic time series generated by diffusion models enable sharing privacy-sensitive datasets, such as patients' functional MRI records. Key criteria for synthetic data include high data utility and traceability to verify the data source. Recent watermarking methods embed in homogeneous latent spaces, but state-of-the-art time series generators operate in data space, making latent-based watermarking incompatible. This creates the challenge of watermarking directly in data space while handling feature heterogeneity and temporal dependencies. We propose TimeWak, the first watermarking algorithm for multivariate time series diffusion models. To handle temporal dependence and spatial heterogeneity, TimeWak embeds a temporal chained-hashing watermark directly within the temporal-feature data space. The other unique feature is the $\epsilon$-exact inversion, which addresses the non-uniform reconstruction error distribution across features from inverting the diffusion process to detect watermarks. We derive the error bound of inverting multivariate time series while preserving robust watermark detectability. We extensively evaluate TimeWak on its impact on synthetic data quality, watermark detectability, and robustness under various post-editing attacks, against five datasets and baselines of different temporal lengths. Our results show that TimeWak achieves improvements of 61.96% in context-FID score, and 8.44% in correlational scores against the strongest state-of-the-art baseline, while remaining consistently detectable.
♻ ☆ Learn More by Using Less: Distributed Learning with Energy-Constrained Devices
Federated Learning (FL) has emerged as a solution for distributed model training across decentralized, privacy-preserving devices, but the different energy capacities of participating devices (system heterogeneity) constrain real-world implementations. These energy limitations not only reduce model accuracy but also increase dropout rates, impacting on convergence in practical FL deployments. In this work, we propose LeanFed, an energy-aware FL framework designed to optimize client selection and training workloads on battery-constrained devices. LeanFed leverages adaptive data usage by dynamically adjusting the fraction of local data each device utilizes during training, thereby maximizing device participation across communication rounds while ensuring they do not run out of battery during the process. We rigorously evaluate LeanFed against traditional FedAvg on CIFAR-10 and CIFAR-100 datasets, simulating various levels of data heterogeneity and device participation rates. Results show that LeanFed consistently enhances model accuracy and stability, particularly in settings with high data heterogeneity and limited battery life, by mitigating client dropout and extending device availability. This approach demonstrates the potential of energy-efficient, privacy-preserving FL in real-world, large-scale applications, setting a foundation for robust and sustainable pervasive AI on resource-constrained networks.
♻ ☆ Backward Conformal Prediction
We introduce $\textit{Backward Conformal Prediction}$, a method that guarantees conformal coverage while providing flexible control over the size of prediction sets. Unlike standard conformal prediction, which fixes the coverage level and allows the conformal set size to vary, our approach defines a rule that constrains how prediction set sizes behave based on the observed data, and adapts the coverage level accordingly. Our method builds on two key foundations: (i) recent results by Gauthier et al. [2025] on post-hoc validity using e-values, which ensure marginal coverage of the form $\mathbb{P}(Y_{\rm test} \in \hat C_n^{\tilde{\alpha}}(X_{\rm test})) \ge 1 - \mathbb{E}[\tilde{\alpha}]$ up to a first-order Taylor approximation for any data-dependent miscoverage $\tilde{\alpha}$, and (ii) a novel leave-one-out estimator $\hat{\alpha}^{\rm LOO}$ of the marginal miscoverage $\mathbb{E}[\tilde{\alpha}]$ based on the calibration set, ensuring that the theoretical guarantees remain computable in practice. This approach is particularly useful in applications where large prediction sets are impractical such as medical diagnosis. We provide theoretical results and empirical evidence supporting the validity of our method, demonstrating that it maintains computable coverage guarantees while ensuring interpretable, well-controlled prediction set sizes.
comment: Code available at: https://github.com/GauthierE/backward-cp
♻ ☆ Explaining Time Series Classifiers with PHAR: Rule Extraction and Fusion from Post-hoc Attributions
Explaining machine learning (ML) models for time series (TS) classification remains challenging due to the difficulty of interpreting raw time series and the high dimensionality of the input space. We introduce PHAR-Post-hoc Attribution Rules - a unified framework that transforms numeric feature attributions from post-hoc, instance-wise explainers (e.g., LIME, SHAP) into structured, human-readable rules. These rules define human-readable intervals that indicate where and when decision-relevant segments occur and can enhance model transparency by localizing threshold-based conditions on the raw series. PHAR performs comparably to native rule-based methods, such as Anchor, while scaling more efficiently to long TS sequences and achieving broader instance coverage. A dedicated rule fusion step consolidates rule sets using strategies like weighted selection and lasso-based refinement, balancing key quality metrics: coverage, confidence, and simplicity. This fusion ensures each instance receives a concise and unambiguous rule, improving both explanation fidelity and consistency. We further introduce visualization techniques to illustrate specificity-generalization trade-offs in the derived rules. PHAR resolves conflicting and overlapping explanations - a common effect of the Rashomon phenomenon - into coherent, domain-adaptable insights. Comprehensive experiments on UCR/UEA Time Series Classification Archive demonstrate that PHAR may improve interpretability, decision transparency, and practical applicability for TS classification tasks by providing concise, human-readable rules aligned with model predictions.
♻ ☆ CATransformers: Carbon Aware Transformers Through Joint Model-Hardware Optimization
Machine learning solutions are rapidly adopted to enable a variety of key use cases, from conversational AI assistants to scientific discovery. This growing adoption is expected to increase the associated lifecycle carbon footprint, including both \emph{operational carbon} from training and inference and \emph{embodied carbon} from AI hardware manufacturing. We introduce \ourframework -- the first carbon-aware co-optimization framework for Transformer-based models and hardware accelerators. By integrating both operational and embodied carbon into early-stage design space exploration, \ourframework enables sustainability-driven model architecture and hardware accelerator co-design that reveals fundamentally different trade-offs than latency- or energy-centric approaches. Evaluated across a range of Transformer models, \ourframework consistently demonstrates the potential to reduce total carbon emissions -- by up to 30\% -- while maintaining accuracy and latency. We further highlight its extensibility through a focused case study on multi-modal models. Our results emphasize the need for holistic optimization methods that prioritize carbon efficiency without compromising model capability and execution time performance. The source code of \ourframework is available at {\small{\href{https://github.com/facebookresearch/CATransformers}{\texttt{https://github.com/facebookresearch/CATransformers}}}}.
♻ ☆ Learning Spatially Adaptive $\ell_1$-Norms Weights for Convolutional Synthesis Regularization
We propose an unrolled algorithm approach for learning spatially adaptive parameter maps in the framework of convolutional synthesis-based $\ell_1$ regularization. More precisely, we consider a family of pre-trained convolutional filters and estimate deeply parametrized spatially varying parameters applied to the sparse feature maps by means of unrolling a FISTA algorithm to solve the underlying sparse estimation problem. The proposed approach is evaluated for image reconstruction of low-field MRI and compared to spatially adaptive and non-adaptive analysis-type procedures relying on Total Variation regularization and to a well-established model-based deep learning approach. We show that the proposed approach produces visually and quantitatively comparable results with the latter approaches and at the same time remains highly interpretable. In particular, the inferred parameter maps quantify the local contribution of each filter in the reconstruction, which provides valuable insight into the algorithm mechanism and could potentially be used to discard unsuited filters.
comment: Accepted for publication in the proceedings of the EUSIPCO 2025 conference; corrected typo in equation (3)
♻ ☆ REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training both VAE and diffusion-model using standard diffusion-loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss -- allowing both VAE and diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance; speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself; leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art; achieving FID of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.
♻ ☆ QPPG: Quantum-Preconditioned Policy Gradient for Link Adaptation in Rayleigh Fading Channels
Reliable link adaptation is critical for efficient wireless communications in dynamic fading environments. However, reinforcement learning (RL) solutions often suffer from unstable convergence due to poorly conditioned policy gradients, hindering their practical application. We propose the quantum-preconditioned policy gradient (QPPG) algorithm, which leverages Fisher-information-based preconditioning to stabilise and accelerate policy updates. Evaluations in Rayleigh fading scenarios show that QPPG achieves faster convergence, a 28.6% increase in average throughput, and a 43.8% decrease in average transmit power compared to classical methods. This work introduces quantum-geometric conditioning to link adaptation, marking a significant advance in developing robust, quantum-inspired reinforcement learning for future 6G networks, thereby enhancing communication reliability and energy efficiency.
comment: Submitted to IEEE Wireless Communications Letters
♻ ☆ Merging Embedded Topics with Optimal Transport for Online Topic Modeling on Data Streams
Topic modeling is a key component in unsupervised learning, employed to identify topics within a corpus of textual data. The rapid growth of social media generates an ever-growing volume of textual data daily, making online topic modeling methods essential for managing these data streams that continuously arrive over time. This paper introduces a novel approach to online topic modeling named StreamETM. This approach builds on the Embedded Topic Model (ETM) to handle data streams by merging models learned on consecutive partial document batches using unbalanced optimal transport. Additionally, an online change point detection algorithm is employed to identify shifts in topics over time, enabling the identification of significant changes in the dynamics of text streams. Numerical experiments on simulated and real-world data show StreamETM outperforming competitors. We provide the code publicly available at https://github.com/fgranese/StreamETM.
♻ ☆ Advancing Carbon Capture using AI: Design of permeable membrane and estimation of parameters for Carbon Capture using linear regression and membrane-based equations
This study focuses on membrane-based systems for CO$_2$ separation, addressing the urgent need for efficient carbon capture solutions to mitigate climate change. Linear regression models, based on membrane equations, were utilized to estimate key parameters, including porosity ($\epsilon$) of 0.4805, Kozeny constant (K) of 2.9084, specific surface area ($\sigma$) of 105.3272 m$^2$/m$^3$, mean pressure (Pm) of 6.2166 MPa, viscosity ($\mu$) of 0.1997 Ns/m$^2$, and gas flux (Jg) of 3.2559 kg m$^{-2}$ s$^{-1}$. These parameters were derived from the analysis of synthetic datasets using linear regression. The study also provides insights into the performance of the membrane, with a flow rate (Q) of 9.8778 $\times$ 10$^{-4}$ m$^3$/s, an injection pressure (P$_1$) of 2.8219 MPa, and an exit pressure (P$_2$) of 2.5762 MPa. The permeability value of 0.045 for CO$_2$ indicates the potential for efficient separation. Optimizing membrane properties to selectively block CO$_2$ while allowing other gases to pass is crucial for improving carbon capture efficiency. By integrating these technologies into industrial processes, significant reductions in greenhouse gas emissions can be achieved, fostering a circular carbon economy and contributing to global climate goals. This study also explores how artificial intelligence (AI) can aid in designing membranes for carbon capture, addressing the global climate change challenge and supporting the Sustainable Development Goals (SDGs) set by the United Nations.
comment: The data is outdated and ML portion requires update, we are working on it
♻ ☆ Generalizing while preserving monotonicity in comparison-based preference learning models
If you tell a learning model that you prefer an alternative $a$ over another alternative $b$, then you probably expect the model to be monotone, that is, the valuation of $a$ increases, and that of $b$ decreases. Yet, perhaps surprisingly, many widely deployed comparison-based preference learning models, including large language models, fail to have this guarantee. Until now, the only comparison-based preference learning algorithms that were proved to be monotone are the Generalized Bradley-Terry models. Yet, these models are unable to generalize to uncompared data. In this paper, we advance the understanding of the set of models with generalization ability that are monotone. Namely, we propose a new class of Linear Generalized Bradley-Terry models with Diffusion Priors, and identify sufficient conditions on alternatives' embeddings that guarantee monotonicity. Our experiments show that this monotonicity is far from being a general guarantee, and that our new class of generalizing models improves accuracy, especially when the dataset is limited.
comment: Accepted at Neurips 2025
♻ ☆ Understanding Reasoning in Thinking Language Models via Steering Vectors ICLR 2025
Recent advances in large language models (LLMs) have led to the development of thinking language models that generate extensive internal reasoning chains before producing responses. While these models achieve improved performance, controlling their reasoning processes remains challenging. This work presents a steering approach for thinking LLMs by analyzing and manipulating specific reasoning behaviors in DeepSeek-R1-Distill models. Through a systematic experiment on 500 tasks across 10 diverse categories, we identify several reasoning behaviors exhibited by thinking models, including expressing uncertainty, generating examples for hypothesis validation, and backtracking in reasoning chains. We demonstrate that these behaviors are mediated by linear directions in the model's activation space and can be controlled using steering vectors. By extracting and applying these vectors, we provide a method to modulate specific aspects of the model's reasoning process, such as its tendency to backtrack or express uncertainty. Our approach offers practical tools for steering reasoning processes in thinking models in a controlled and interpretable manner. We validate our steering method using three DeepSeek-R1-Distill models, demonstrating consistent control across different model architectures.
comment: Accepted to the Workshop on Reasoning and Planning for Large Language Models at ICLR 2025
♻ ☆ Trajectory learning for ensemble forecasts via the continuous ranked probability score: a Lorenz '96 case study
This paper demonstrates the feasibility of trajectory learning for ensemble forecasts by employing the continuous ranked probability score (CRPS) as a loss function. Using the two-scale Lorenz '96 system as a case study, we develop and train both additive and multiplicative stochastic parametrizations to generate ensemble predictions. Results indicate that CRPS-based trajectory learning produces parametrizations that are both accurate and sharp. The resulting parametrizations are straightforward to calibrate and outperform derivative-fitting-based parametrizations in short-term forecasts. This approach is particularly promising for data assimilation applications due to its accuracy over short lead times.
comment: 21 pages, 11 figures. All comments are welcome!
♻ ☆ Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey
Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due to their massive computational and memory demands, which conflict with the constraints of edge platforms such as on-board mobile manipulators that require real-time performance. Addressing this tension has become a central focus of recent research. In light of the growing efforts toward more efficient and scalable VLA systems, this survey provides a systematic review of approaches for improving VLA efficiency, with an emphasis on reducing latency, memory footprint, and training and inference costs. We categorize existing solutions into four dimensions: model architecture, perception feature, action generation, and training/inference strategies, summarizing representative techniques within each category. Finally, we discuss future trends and open challenges, highlighting directions for advancing efficient embodied intelligence.
♻ ☆ ACT: Agentic Classification Tree
When used in high-stakes settings, AI systems are expected to produce decisions that are transparent, interpretable, and auditable, a requirement increasingly expected by regulations. Decision trees such as CART provide clear and verifiable rules, but they are restricted to structured tabular data and cannot operate directly on unstructured inputs such as text. In practice, large language models (LLMs) are widely used for such data, yet prompting strategies such as chain-of-thought or prompt optimization still rely on free-form reasoning, limiting their ability to ensure trustworthy behaviors. We present the Agentic Classification Tree (ACT), which extends decision-tree methodology to unstructured inputs by formulating each split as a natural-language question, refined through impurity-based evaluation and LLM feedback via TextGrad. Experiments on text benchmarks show that ACT matches or surpasses prompting-based baselines while producing transparent and interpretable decision paths.
comment: 18 pages, 6 figures
♻ ☆ DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection NeurIPS 2025
Open-Vocabulary object detectors can generalize to an unrestricted set of categories through simple textual prompting. However, adapting these models to rare classes or reinforcing their abilities on multiple specialized domains remains essential. While recent methods rely on monolithic adaptation strategies with a single set of weights, we embrace modular deep learning. We introduce DitHub, a framework designed to build and maintain a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub manages expert modules as branches that can be fetched and merged as needed. This modular approach allows us to conduct an in-depth exploration of the compositional properties of adaptation modules, marking the first such study in Object Detection. Our method achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to assess class reappearance. For more details, visit our project page: https://aimagelab.github.io/DitHub/
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You NeurIPS 2025
Multimodal models have demonstrated powerful capabilities in complex tasks requiring multimodal alignment, including zero-shot classification and cross-modal retrieval. However, existing models typically rely on millions of paired multimodal samples, which are prohibitively expensive or infeasible to obtain in many domains. In this work, we explore the feasibility of building multimodal models with limited amount of paired data by aligning pretrained unimodal foundation models. We show that high-quality alignment is possible with as few as tens of thousands of paired samples$\unicode{x2013}$less than $1\%$ of the data typically used in the field. To achieve this, we introduce STRUCTURE, an effective regularization technique that preserves the neighborhood geometry of the latent space of unimodal encoders. Additionally, we show that aligning last layers is often suboptimal and demonstrate the benefits of aligning the layers with the highest representational similarity across modalities. These two components can be readily incorporated into existing alignment methods, yielding substantial gains across 24 zero-shot image classification and retrieval benchmarks, with average relative improvement of $51.6\%$ in classification and $91.8\%$ in retrieval tasks. Our results highlight the effectiveness and broad applicability of our framework for limited-sample multimodal learning and offer a promising path forward for resource-constrained domains.
comment: NeurIPS 2025 camera-ready
Human-Computer Interaction 27
☆ Integrating Transparent Models, LLMs, and Practitioner-in-the-Loop: A Case of Nonprofit Program Evaluation
Public and nonprofit organizations often hesitate to adopt AI tools because most models are opaque even though standard approaches typically analyze aggregate patterns rather than offering actionable, case-level guidance. This study tests a practitioner-in-the-loop workflow that pairs transparent decision-tree models with large language models (LLMs) to improve predictive accuracy, interpretability, and the generation of practical insights. Using data from an ongoing college-success program, we build interpretable decision trees to surface key predictors. We then provide each tree's structure to an LLM, enabling it to reproduce case-level predictions grounded in the transparent models. Practitioners participate throughout feature engineering, model design, explanation review, and usability assessment, ensuring that field expertise informs the analysis at every stage. Results show that integrating transparent models, LLMs, and practitioner input yields accurate, trustworthy, and actionable case-level evaluations, offering a viable pathway for responsible AI adoption in the public and nonprofit sectors.
☆ Cultural Dimensions of Artificial Intelligence Adoption: Empirical Insights for Wave 1 from a Multinational Longitudinal Pilot Study
The swift diffusion of artificial intelligence (AI) raises critical questions about how cultural contexts shape adoption patterns and their consequences for human daily life. This study investigates the cultural dimensions of AI adoption and their influence on cognitive strategies across nine national contexts in Europe, Africa, Asia, and South America. Drawing on survey data from a diverse pilot sample (n = 21) and guided by cross-cultural psychology, digital ethics, and sociotechnical systems theory, we examine how demographic variables (age, gender, professional role) and cultural orientations (language, values, and institutional exposure) mediate perceptions of trust, ethical acceptability, and reliance on AI. Results reveal two key findings: First, cultural factors, particularly language and age, significantly affect AI adoption and perceptions of reliability with older participants reporting higher engagement with AI for educational purposes. Second, ethical judgment about AI use varied across domains, with professional contexts normalizing its role as a pragmatic collaborator while academic settings emphasized risks of plagiarism. These findings extend prior research on culture and technology adoption by demonstrating that AI use is neither universal nor neutral but culturally contingent, domain-specific, and ethically situated. The study highlights implications for AI use in education, professional practice, and global technology policy, pointing at actions that enable usage of AI in a way that is both culturally adaptive and ethically robust.
☆ LifeSync-Games: Toward a Video Game Paradigm for Promoting Responsible Gaming and Human Development
Technological advancements have made video games a central part of the digital lives of nearly 3 billion people worldwide. Although games can address various social, physical, and psychological needs, their potential to support human development and well-being remains underutilized. Research highlights both negative effects, such as addiction and isolation, and positive outcomes like cognitive improvements and problem-solving skills. However, public discourse and regulation often focus more on risks than benefits. To address this imbalance, we present LifeSync-Games, a framework leveraging simplified digital twins to connect virtual gameplay with real-life activities. This reciprocal relationship aims to enhance the developmental value of gaming by promoting self-regulation and fostering growth across physical, mental, and social domains. We present the framework's theoretical foundations, technological components, design guidelines, and evaluation approaches. Additionally, we present early applications in both new and bestselling games to demonstrate its versatility and practical relevance.
comment: 8 pages, 4 figures, 1 table, 66 references
☆ Directive, Metacognitive or a Blend of Both? A Comparison of AI-Generated Feedback Types on Student Engagement, Confidence, and Outcomes
Feedback is one of the most powerful influences on student learning, with extensive research examining how best to implement it in educational settings. Increasingly, feedback is being generated by artificial intelligence (AI), offering scalable and adaptive responses. Two widely studied approaches are directive feedback, which gives explicit explanations and reduces cognitive load to speed up learning, and metacognitive feedback which prompts learners to reflect, track their progress, and develop self-regulated learning (SRL) skills. While both approaches have clear theoretical advantages, their comparative effects on engagement, confidence, and quality of work remain underexplored. This study presents a semester-long randomised controlled trial with 329 students in an introductory design and programming course using an adaptive educational platform. Participants were assigned to receive directive, metacognitive, or hybrid AI-generated feedback that blended elements of both directive and metacognitive feedback. Results showed that revision behaviour differed across feedback conditions, with Hybrid prompting the most revisions compared to Directive and Metacognitive. Confidence ratings were uniformly high, and resource quality outcomes were comparable across conditions. These findings highlight the promise of AI in delivering feedback that balances clarity with reflection. Hybrid approaches, in particular, show potential to combine actionable guidance for immediate improvement with opportunities for self-reflection and metacognitive growth.
☆ Sentiment Analysis of Social Media Data for Predicting Consumer Behavior Trends Using Machine Learning
In the era of rapid technological advancement, social media platforms such as Twitter (X) have emerged as indispensable tools for gathering consumer insights, capturing diverse opinions, and understanding public attitudes. This research applies advanced machine learning methods for sentiment analysis on Twitter data, with a focus on predicting consumer trends. Using the Sentiment140 dataset, the study detects evolving patterns in consumer preferences with "car" as an example. A structured workflow was used to clean and prepare data for analysis. Machine learning models, including Support Vector Machines (SVM), Naive Bayes, Long Short-Term Memory (LSTM) networks, and Bidirectional Encoder Representations from Transformers (BERT), were employed to classify sentiments and predict trends. Model performance was measured using accuracy, precision, recall, and F1 score, with BERT achieving the highest results (Accuracy: 83.48%, Precision: 79.37%, Recall: 90.60%, F1: 84.61). Results show that LSTM and BERT effectively capture linguistic and contextual patterns, improving prediction accuracy and providing insights into consumer behavior. Temporal analysis revealed sentiment shifts across time, while Named Entity Recognition (NER) identified related terms and themes. This research addresses challenges like sarcasm detection and multilingual data processing, offering a scalable framework for generating actionable consumer insights.
comment: 21 pages, 10 figures, conference
☆ Unmanned Aerial Vehicles Control in a Digital Twin: Exploring the Effect of Different Points of View on User Experience in Virtual Reality
Controlling Unmanned Aerial Vehicles (UAVs) is a cognitively demanding task, with accidents often arising from insufficient situational awareness, inadequate training, and poor user experiences. Providing more intuitive and immersive visual feedback, particularly through Digital Twin technologies, offers new opportunities to enhance pilot awareness and overall experience quality. In this study, we investigate how different virtual points of view (POVs) influence user experience and performance during UAV piloting in Virtual Reality (VR), utilizing a digital twin that faithfully replicates the real-world flight environment. We developed a VR application that enables participants to control a physical DJI Mini 4 Pro drone while immersed in a digital twin with four distinct camera perspectives: Baseline View (static external), First-Person View, Chase View, and Third-Person View. Nineteen participants completed a series of ring-based obstacle courses from each perspective. In addition to objective flight data, we collected standardized subjective assessments of user experience, presence, workload, cybersickness, and situational awareness. Quantitative analyses revealed that the First-Person View was associated with significantly higher mental demand and effort, greater trajectory deviation, but smoother control inputs compared to the Third-Person and Chase perspectives. Complementing these findings, preference data indicated that the Third-Person View was most consistently favored, whereas the First-Person View elicited polarized reactions.
☆ EasyVitessce: auto-magically adding interactivity to Scverse single-cell and spatial biology plots
EasyVitessce is a Python package that turns existing static Scanpy and SpatialData plots into interactive visualizations by virtue of adding a single line of Python code. The package uses Vitessce internally to render interactive plots, and abstracts away technical details involved with configuration of Vitessce. The resulting interactive plots can be viewed in computational notebook environments or their configurations can be exported for usage in other contexts such as web applications, enhancing the utility of popular Scverse Python plotting APIs. EasyVitessce is released under the MIT License and available on the Python Package Index (PyPI). The source code is publicly available on GitHub.
☆ From Prototypes to Sparse ECG Explanations: SHAP-Driven Counterfactuals for Multivariate Time-Series Multi-class Classification
In eXplainable Artificial Intelligence (XAI), instance-based explanations for time series have gained increasing attention due to their potential for actionable and interpretable insights in domains such as healthcare. Addressing the challenges of explainability of state-of-the-art models, we propose a prototype-driven framework for generating sparse counterfactual explanations tailored to 12-lead ECG classification models. Our method employs SHAP-based thresholds to identify critical signal segments and convert them into interval rules, uses Dynamic Time Warping (DTW) and medoid clustering to extract representative prototypes, and aligns these prototypes to query R-peaks for coherence with the sample being explained. The framework generates counterfactuals that modify only 78% of the original signal while maintaining 81.3% validity across all classes and achieving 43% improvement in temporal stability. We evaluate three variants of our approach, Original, Sparse, and Aligned Sparse, with class-specific performance ranging from 98.9% validity for myocardial infarction (MI) to challenges with hypertrophy (HYP) detection (13.2%). This approach supports near realtime generation (< 1 second) of clinically valid counterfactuals and provides a foundation for interactive explanation platforms. Our findings establish design principles for physiologically-aware counterfactual explanations in AI-based diagnosis systems and outline pathways toward user-controlled explanation interfaces for clinical deployment.
☆ Design Considerations for Human Oversight of AI: Insights from Co-Design Workshops and Work Design Theory
As AI systems become increasingly capable and autonomous, domain experts' roles are shifting from performing tasks themselves to overseeing AI-generated outputs. Such oversight is critical, as undetected errors can have serious consequences or undermine the benefits of AI. Effective oversight, however, depends not only on detecting and correcting AI errors but also on the motivation and engagement of the oversight personnel and the meaningfulness they see in their work. Yet little is known about how domain experts approach and experience the oversight task and what should be considered to design effective and motivational interfaces that support human oversight. To address these questions, we conducted four co-design workshops with domain experts from psychology and computer science. We asked them to first oversee an AI-based grading system, and then discuss their experiences and needs during oversight. Finally, they collaboratively prototyped interfaces that could support them in their oversight task. Our thematic analysis revealed four key user requirements: understanding tasks and responsibilities, gaining insight into the AI's decision-making, contributing meaningfully to the process, and collaborating with peers and the AI. We integrated these empirical insights with the SMART model of work design to develop a generalizable framework of twelve design considerations. Our framework links interface characteristics and user requirements to the psychological processes underlying effective and satisfying work. Being grounded in work design theory, we expect these considerations to be applicable across domains and discuss how they extend existing guidelines for human-AI interaction and theoretical frameworks for effective human oversight by providing concrete guidance on the design of engaging and meaningful interfaces that support human oversight of AI systems.
☆ Interactive visualization of kidney micro-compartmental segmentations and associated pathomics on whole slide images
Application of machine learning techniques enables segmentation of functional tissue units in histology whole-slide images (WSIs). We built a pipeline to apply previously validated segmentation models of kidney structures and extract quantitative features from these structures. Such quantitative analysis also requires qualitative inspection of results for quality control, exploration, and communication. We extend the Vitessce web-based visualization tool to enable visualization of segmentations of multiple types of functional tissue units, such as, glomeruli, tubules, arteries/arterioles in the kidney. Moreover, we propose a standard representation for files containing multiple segmentation bitmasks, which we define polymorphically, such that existing formats including OME-TIFF, OME-NGFF, AnnData, MuData, and SpatialData can be used. We demonstrate that these methods enable researchers and the broader public to interactively explore datasets containing multiple segmented entities and associated features, including for exploration of renal morphometry of biopsies from the Kidney Precision Medicine Project (KPMP) and the Human Biomolecular Atlas Program (HuBMAP).
☆ Learning To Defer To A Population With Limited Demonstrations
This paper addresses the critical data scarcity that hinders the practical deployment of learning to defer (L2D) systems to the population. We introduce a context-aware, semi-supervised framework that uses meta-learning to generate expert-specific embeddings from only a few demonstrations. We demonstrate the efficacy of a dual-purpose mechanism, where these embeddings are used first to generate a large corpus of pseudo-labels for training, and subsequently to enable on-the-fly adaptation to new experts at test-time. The experiment results on three different datasets confirm that a model trained on these synthetic labels rapidly approaches oracle-level performance, validating the data efficiency of our approach. By resolving a key training bottleneck, this work makes adaptive L2D systems more practical and scalable, paving the way for human-AI collaboration in real-world environments. To facilitate reproducibility and address implementation details not covered in the main text, we provide our source code and training configurations at https://github.com/nil123532/learning-to-defer-to-a-population-with-limited-demonstrations.
comment: Accepted to IEEE DICTA 2025 (poster). 7 pages, 2 figures
☆ LLMartini: Seamless and Interactive Leveraging of Multiple LLMs through Comparison and Composition
The growing diversity of large language models (LLMs) means users often need to compare and combine outputs from different models to obtain higher-quality or more comprehensive responses. However, switching between separate interfaces and manually integrating outputs is inherently inefficient, leading to a high cognitive burden and fragmented workflows. To address this, we present LLMartini, a novel interactive system that supports seamless comparison, selection, and intuitive cross-model composition tools. The system decomposes responses into semantically aligned segments based on task-specific criteria, automatically merges consensus content, and highlights model differences through color coding while preserving unique contributions. In a user study (N=18), LLMartini significantly outperformed conventional manual methods across all measured metrics, including task completion time, cognitive load, and user satisfaction. Our work highlights the importance of human-centered design in enhancing the efficiency and creativity of multi-LLM interactions and offers practical implications for leveraging the complementary strengths of various language models.
☆ See, Think, Act: Online Shopper Behavior Simulation with VLM Agents
LLMs have recently demonstrated strong potential in simulating online shopper behavior. Prior work has improved action prediction by applying SFT on action traces with LLM-generated rationales, and by leveraging RL to further enhance reasoning capabilities. Despite these advances, current approaches rely on text-based inputs and overlook the essential role of visual perception in shaping human decision-making during web GUI interactions. In this paper, we investigate the integration of visual information, specifically webpage screenshots, into behavior simulation via VLMs, leveraging OPeRA dataset. By grounding agent decision-making in both textual and visual modalities, we aim to narrow the gap between synthetic agents and real-world users, thereby enabling more cognitively aligned simulations of online shopping behavior. Specifically, we employ SFT for joint action prediction and rationale generation, conditioning on the full interaction context, which comprises action history, past HTML observations, and the current webpage screenshot. To further enhance reasoning capabilities, we integrate RL with a hierarchical reward structure, scaled by a difficulty-aware factor that prioritizes challenging decision points. Empirically, our studies show that incorporating visual grounding yields substantial gains: the combination of text and image inputs improves exact match accuracy by more than 6% over text-only inputs. These results indicate that multi-modal grounding not only boosts predictive accuracy but also enhances simulation fidelity in visually complex environments, which captures nuances of human attention and decision-making that text-only agents often miss. Finally, we revisit the design space of behavior simulation frameworks, identify key methodological limitations, and propose future research directions toward building efficient and effective human behavior simulators.
☆ Beyond One-Way Influence: Bidirectional Opinion Dynamics in Multi-Turn Human-LLM Interactions
Large language model (LLM)-powered chatbots are increasingly used for opinion exploration. Prior research examined how LLMs alter user views, yet little work extended beyond one-way influence to address how user input can affect LLM responses and how such bi-directional influence manifests throughout the multi-turn conversations. This study investigates this dynamic through 50 controversial-topic discussions with participants (N=266) across three conditions: static statements, standard chatbot, and personalized chatbot. Results show that human opinions barely shifted, while LLM outputs changed more substantially, narrowing the gap between human and LLM stance. Personalization amplified these shifts in both directions compared to the standard setting. Analysis of multi-turn conversations further revealed that exchanges involving participants' personal stories were most likely to trigger stance changes for both humans and LLMs. Our work highlights the risk of over-alignment in human-LLM interaction and the need for careful design of personalized chatbots to more thoughtfully and stably align with users.
comment: 26 pages, 8 figures
☆ Designing a Secure and Resilient Distributed Smartphone Participant Data Collection System SP 2025
Real-world health studies require continuous and secure data collection from mobile and wearable devices. We introduce MotionPI, a smartphone-based system designed to collect behavioral and health data through sensors and surveys with minimal interaction from participants. The system integrates passive data collection (such as GPS and wristband motion data) with Ecological Momentary Assessment (EMA) surveys, which can be triggered randomly or based on physical activity. MotionPI is designed to work under real-life constraints, including limited battery life, weak or intermittent cellular connection, and minimal user supervision. It stores data both locally and on a secure cloud server, with encrypted transmission and storage. It integrates through Bluetooth Low Energy (BLE) into wristband devices that store raw data and communicate motion summaries and trigger events. MotionPI demonstrates a practical solution for secure and scalable mobile data collection in cyber-physical health studies.
comment: 9 pages, 3 figures. Accepted at EAI SmartSP 2025 Conference (Springer LNICST). This version is the arXiv preprint prepared for open access
☆ The Risks of Industry Influence in Tech Research
Emerging information technologies like social media, search engines, and AI can have a broad impact on public health, political institutions, social dynamics, and the natural world. It is critical to develop a scientific understanding of these impacts to inform evidence-based technology policy that minimizes harm and maximizes benefits. Unlike most other global-scale scientific challenges, however, the data necessary for scientific progress are generated and controlled by the same industry that might be subject to evidence-based regulation. Moreover, technology companies historically have been, and continue to be, a major source of funding for this field. These asymmetries in information and funding raise significant concerns about the potential for undue industry influence on the scientific record. In this Perspective, we explore how technology companies can influence our scientific understanding of their products. We argue that science faces unique challenges in the context of technology research that will require strengthening existing safeguards and constructing wholly new ones.
comment: 17 pages, 2 Figures
☆ RatioWaveNet: A Learnable RDWT Front-End for Robust and Interpretable EEG Motor-Imagery Classification
Brain-computer interfaces (BCIs) based on motor imagery (MI) translate covert movement intentions into actionable commands, yet reliable decoding from non-invasive EEG remains challenging due to nonstationarity, low SNR, and subject variability. We present RatioWaveNet, which augments a strong temporal CNN-Transformer backbone (TCFormer) with a trainable, Rationally-Dilated Wavelet Transform (RDWT) front end. The RDWT performs an undecimated, multi-resolution subband decomposition that preserves temporal length and shift-invariance, enhancing sensorimotor rhythms while mitigating jitter and mild artifacts; subbands are fused via lightweight grouped 1-D convolutions and passed to a multi-kernel CNN for local temporal-spatial feature extraction, a grouped-query attention encoder for long-range context, and a compact TCN head for causal temporal integration. Our goal is to test whether this principled wavelet front end improves robustness precisely where BCIs typically fail - on the hardest subjects - and whether such gains persist on average across seeds under both intra- and inter-subject protocols. On BCI-IV-2a and BCI-IV-2b, across five seeds, RatioWaveNet improves worst-subject accuracy over the Transformer backbone by +0.17 / +0.42 percentage points (Sub-Dependent / LOSO) on 2a and by +1.07 / +2.54 percentage points on 2b, with consistent average-case gains and modest computational overhead. These results indicate that a simple, trainable wavelet front end is an effective plug-in to strengthen Transformer-based BCIs, improving worst-case reliability without sacrificing efficiency.
♻ ☆ VoiceMorph: How AI Voice Morphing Reveals the Boundaries of Auditory Self-Recognition
This study investigated auditory self-recognition boundaries using AI voice morphing technology, examining when individuals cease recognizing their own voice. Through controlled morphing between participants' voices and demographically matched targets at 1% increments using a mixed-methods design, we measured self-identification ratings and response times among 21 participants aged 18-64. Results revealed a critical recognition threshold at 35.2% morphing (95% CI [31.4, 38.1]). Older participants tolerated significantly higher morphing levels before losing self-recognition ($\beta$ = 0.617, p = 0.048), suggesting age-related vulnerabilities. Greater acoustic embedding distances predicted slower decision-making ($r \approx 0.5-0.53, p < 0.05$), with the longest response times for cloned versions of participants' own voices. Qualitative analysis revealed prosodic-based recognition strategies, universal voice manipulation discomfort, and awareness of applications spanning assistive technology to security risks. These findings establish foundational evidence for individual differences in voice morphing detection, with implications for AI ethics and vulnerable population protection as voice synthesis becomes accessible.
♻ ☆ The Relational Origins of Rules in Online Communities
Where do rules come from in online communities? While prior studies of online community governance in social computing have sought to characterize rules by their functions and enforcement practices, scholars have largely overlooked rule adoption and change. This study investigates how and why online communities adopt and change their rules. We conducted a grounded theory-based analysis of 40 in-depth interviews with community leaders from subreddits, Fandom wikis, and Fediverse servers, and identified seven processes involved in the adoption of online community rules. Our findings reveal that, beyond functional reasons like regulating behavior and solving problems, rules are also adopted and changed for relational reasons, such as signaling or reinforcing community legitimacy and identity to other communities. While rule change was often prompted by challenges during community growth or decline, change also depended on volunteer leaders' work capacity, the presence of member feedback mechanisms, and relational dynamics between leaders and members. The findings extend prior theories from social computing and organizational research, illustrating how institutionalist and ecological explanations of the relational origins of rules complement functional accounts. The results also support design recommendations that integrate the relational aspects of rules and rulemaking to facilitate successful governance across communities' lifecycles.
♻ ☆ Knowledge Prompting: How Knowledge Engineers Use Large Language Models
Despite many advances in knowledge engineering (KE), challenges remain in areas such as engineering knowledge graphs (KGs) at scale, keeping up with evolving domain knowledge, multilingualism, and multimodality. Recently, KE has used LLMs to support semi-automatic tasks, but the most effective use of LLMs to support knowledge engineers across the KE activites is still in its infancy. To explore the vision of LLM copilots for KE and change existing KE practices, we conducted a multimethod study during a KE hackathon. We investigated participants' views on the use of LLMs, the challenges they face, the skills they may need to integrate LLMs into their practices, and how they use LLMs responsibly. We found participants felt LLMs could contribute to improving efficiency when engineering KGs, but presented increased challenges around the already complex issues of evaluating the KE tasks. We discovered prompting to be a useful but undervalued skill for knowledge engineers working with LLMs, and note that natural language processing skills may become more relevant across more roles in KG construction. Integrating LLMs into KE tasks needs to be mindful of potential risks and harms related to responsible AI. Given the limited ethical training, most knowledge engineers receive solutions such as our suggested `KG cards' based on data cards could be a useful guide for KG construction. Our findings can support designers of KE AI copilots, KE researchers, and practitioners using advanced AI to develop trustworthy applications, propose new methodologies for KE and operate new technologies responsibly.
♻ ☆ The Endless Tuning. An Artificial Intelligence Design To Avoid Human Replacement and Trace Back Responsibilities
The Endless Tuning is a design method for a reliable deployment of artificial intelligence based on a double mirroring process, which pursues both the goals of avoiding human replacement and filling the so-called responsibility gap (Matthias 2004). Originally depicted in (Fabris et al. 2024) and ensuing the relational approach urged therein, it was then actualized in a protocol, implemented in three prototypical applications regarding decision-making processes (respectively: loan granting, pneumonia diagnosis, and art style recognition) and tested with such as many domain experts. Step by step illustrating the protocol, giving insights concretely showing a different voice (Gilligan 1993) in the ethics of artificial intelligence, a philosophical account of technical choices (e.g., a reversed and hermeneutic deployment of XAI algorithms) will be provided in the present study together with the results of the experiments, focusing on user experience rather than statistical accuracy. Even thoroughly employing deep learning models, full control was perceived by the interviewees in the decision-making setting, while it appeared that a bridge can be built between accountability and liability in case of damage.
♻ ☆ Coordinated Strategies in Realistic Air Combat by Hierarchical Multi-Agent Reinforcement Learning
Achieving mission objectives in a realistic simulation of aerial combat is highly challenging due to imperfect situational awareness and nonlinear flight dynamics. In this work, we introduce a novel 3D multi-agent air combat environment and a Hierarchical Multi-Agent Reinforcement Learning framework to tackle these challenges. Our approach combines heterogeneous agent dynamics, curriculum learning, league-play, and a newly adapted training algorithm. To this end, the decision-making process is organized into two abstraction levels: low-level policies learn precise control maneuvers, while high-level policies issue tactical commands based on mission objectives. Empirical results show that our hierarchical approach improves both learning efficiency and combat performance in complex dogfight scenarios.
comment: 2025 IEEE International Conference on Agentic AI (ICA)
♻ ☆ MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning
Remote sensing images contain complex spatial patterns and semantic structures, which makes the captioning model difficult to accurately describe. Encoder-decoder architectures have become the widely used approach for RSIC by translating visual content into descriptive text. However, many existing methods rely on a single-stream architecture, which weakens the model to accurately describe the image. Such single-stream architectures typically struggle to extract diverse spatial features or capture complex semantic relationships, limiting their effectiveness in scenes with high intraclass similarity or contextual ambiguity. In this work, we propose a novel Multi-stream Encoder-decoder Framework (MsEdF) which improves the performance of RSIC by optimizing both the spatial representation and language generation of encoder-decoder architecture. The encoder fuses information from two complementary image encoders, thereby promoting feature diversity through the integration of multiscale and structurally distinct cues. To improve the capture of context-aware descriptions, we refine the input sequence's semantic modeling on the decoder side using a stacked GRU architecture with an element-wise aggregation scheme. Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.
♻ ☆ Large Language Models in Architecture Studio: A Framework for Learning Outcomes
The study explores the role of large language models (LLMs) in the context of the architectural design studio, understood as the pedagogical core of architectural education. Traditionally, the studio has functioned as an experiential learning space where students tackle design problems through reflective practice, peer critique, and faculty guidance. However, the integration of artificial intelligence (AI) in this environment has been largely focused on form generation, automation, and representation-al efficiency, neglecting its potential as a pedagogical tool to strengthen student autonomy, collaboration, and self-reflection. The objectives of this research were: (1) to identify pedagogical challenges in self-directed, peer-to-peer, and teacher-guided learning processes in architecture studies; (2) to propose AI interventions, particularly through LLM, that contribute to overcoming these challenges; and (3) to align these interventions with measurable learning outcomes using Bloom's taxonomy. The findings show that the main challenges include managing student autonomy, tensions in peer feedback, and the difficulty of balancing the transmission of technical knowledge with the stimulation of creativity in teaching. In response to this, LLMs are emerging as complementary agents capable of generating personalized feedback, organizing collaborative interactions, and offering adaptive cognitive scaffolding. Furthermore, their implementation can be linked to the cognitive levels of Bloom's taxonomy: facilitating the recall and understanding of architectural concepts, supporting application and analysis through interactive case studies, and encouraging synthesis and evaluation through hypothetical design scenarios.
♻ ☆ PTFA: An LLM-based Agent that Facilitates Online Consensus Building through Parallel Thinking
Consensus building is inherently challenging due to the diverse opinions held by stakeholders. Effective facilitation is crucial to support the consensus building process and enable efficient group decision making. However, the effectiveness of facilitation is often constrained by human factors such as limited experience and scalability. In this research, we propose a Parallel Thinking-based Facilitation Agent (PTFA) that facilitates online, text-based consensus building processes.The PTFA automatically collects real-time textual input and leverages large language models (LLMs)to perform all six distinct roles of the well-established Six Thinking Hats technique in parallel thinking.To illustrate the potential of the agent, a pilot study was conducted, demonstrating its capabilities in idea generation, emotional probing, and deeper analysis of idea quality. Additionally, future open research challenges such as optimizing scheduling and managing behaviors in divergent phase are identified. Furthermore, a comprehensive dataset that contains not only the conversational content among the participants but also between the participants and the agent is constructed for future study.
♻ ☆ GTAlign: Game-Theoretic Alignment of LLM Assistants for Mutual Welfare
Large Language Models (LLMs) have achieved remarkable progress in reasoning, yet sometimes produce responses that are suboptimal for users in tasks such as writing, information seeking, or providing practical guidance. Conventional alignment practices typically assume that maximizing model reward also maximizes user welfare, but this assumption frequently fails in practice: models may over-clarify or generate overly verbose reasoning when users prefer concise answers. Such behaviors resemble the prisoner's dilemma, where individually rational choices lead to socially suboptimal outcomes. The fundamental challenge is the lack of a principled decision making mechanism that mutually benefits both the LLM and the user. We propose Game-Theoretic Alignment (GTAlign), an alignment framework that integrates game-theoretic decision making into both reasoning and training. During reasoning, the model explicitly treats user-LLM interaction as a strategic game: it constructs payoff matrices within its reasoning chain to estimate welfare for both itself and the user, and then selects actions that are mutually beneficial. During training, we introduce a mutual welfare reward that reinforces cooperative responses, aligning model behavior with socially efficient outcomes. In addition, we introduce an inference technique that leverages game-theoretic reasoning to dynamically adapt LLM's response when pricing policies of LLM service change. Extensive experiments demonstrate that GTAlign substantially improves reasoning efficiency, answer quality, and mutual welfare compared to baselines across diverse tasks. The code is available at https://github.com/ulab-uiuc/GTAlign .
comment: 31 pages, 6 figures
♻ ☆ Behavioral Biometrics for Automatic Detection of User Familiarity in VR
As virtual reality (VR) devices become increasingly integrated into everyday settings, a growing number of users without prior experience will engage with VR systems. Automatically detecting a user's familiarity with VR as an interaction medium enables real-time, adaptive training and interface adjustments, minimizing user frustration and improving task performance. In this study, we explore the automatic detection of VR familiarity by analyzing hand movement patterns during a passcode-based door-opening task, which is a well-known interaction in collaborative virtual environments such as meeting rooms, offices, and healthcare spaces. While novice users may lack prior VR experience, they are likely to be familiar with analogous real-world tasks involving keypad entry. We conducted a pilot study with 26 participants, evenly split between experienced and inexperienced VR users, who performed tasks using both controller-based and hand-tracking interactions. Our approach uses state-of-the-art deep classifiers for automatic VR familiarity detection, achieving the highest accuracies of 92.05% and 83.42% for hand-tracking and controller-based interactions, respectively. In the cross-device evaluation, where classifiers trained on controller data were tested using hand-tracking data, the model achieved an accuracy of 78.89%. The integration of both modalities in the mixed-device evaluation obtained an accuracy of 94.19%. Our results underline the promise of using hand movement biometrics for the real-time detection of user familiarity in critical VR applications, paving the way for personalized and adaptive VR experiences.
comment: 7 pages, 7 figures, 17th International Conference on Quality of Multimedia Experience (QoMEX 2025)
Software Engineering 34
☆ Integrating Transparent Models, LLMs, and Practitioner-in-the-Loop: A Case of Nonprofit Program Evaluation
Public and nonprofit organizations often hesitate to adopt AI tools because most models are opaque even though standard approaches typically analyze aggregate patterns rather than offering actionable, case-level guidance. This study tests a practitioner-in-the-loop workflow that pairs transparent decision-tree models with large language models (LLMs) to improve predictive accuracy, interpretability, and the generation of practical insights. Using data from an ongoing college-success program, we build interpretable decision trees to surface key predictors. We then provide each tree's structure to an LLM, enabling it to reproduce case-level predictions grounded in the transparent models. Practitioners participate throughout feature engineering, model design, explanation review, and usability assessment, ensuring that field expertise informs the analysis at every stage. Results show that integrating transparent models, LLMs, and practitioner input yields accurate, trustworthy, and actionable case-level evaluations, offering a viable pathway for responsible AI adoption in the public and nonprofit sectors.
☆ BOSQTGEN: Breaking the Sound Barrier in Test Generation
Modern software is increasingly built by composing APIs, elevating the API contract to a critical role. Inadequate contracts, however, lead to mismatched expectations and failures, creating a pressing need for robust conformance testing. Current test generation techniques are hindered by key challenges: polyglot systems, source code inaccessibility, a cost-reliability trade-off, and, most critically, the difficulty of generating structured inputs. We introduce BOSQTGEN, a novel black-box methodology and tool for API test generation. BOSQTGEN utilizes a novel approach for decomposing API specifications into primitives, using LLMs to suggest coherent strata for them, and employing combinatorial testing to efficiently sample over these values. This approach ensures coverage of critical interactions while avoiding the redundancy of random sampling. The resulting BOSQTGEN system achieves an average of 82% code coverage on RESTful benchmarks, often a 20% or more increase over prior state-of-the-art systems and nearing parity with hand-written test suites. Providing a fully API-driven approach to test generation, enables developers to automatically create high-quality test cases for validation or test-driven development.
Review of Tools for Zero-Code LLM Based Application Development
Large Language Models (LLMs) are transforming software creation by enabling zero code development platforms. Our survey reviews recent platforms that let users build applications without writing code, by leveraging LLMs as the brains of the development process. We adopt a broad survey methodology, categorizing platforms based on key dimensions such as interface style, backend integration, output type, and extensibility. We analyze both dedicated LLM based app builders (OpenAI's custom GPTs, Bolt.new, Dust.tt, Flowise, Cognosys) and general no code platforms (e.g., Bubble, Glide) that integrate LLM capabilities. We present a taxonomy categorizing these platforms by their interface (conversational, visual, etc.), supported LLM backends, output type (chatbot, full application, workflow), and degree of extensibility. Core features such as autonomous agents, memory management, workflow orchestration, and API integrations are in scope of the survey. We provide a detailed comparison, highlighting each platform's strengths and limitations. Trade offs (customizability, scalability, vendor lock-in) are discussed in comparison with traditional and low code development approaches. Finally, we outline future directions, including multimodal interfaces, on device LLMs, and improved orchestration for democratizing app creation with AI. Our findings indicate that while zero code LLM platforms greatly reduce the barrier to creating AI powered applications, they still face challenges in flexibility and reliability. Overall, the landscape is rapidly evolving, offering exciting opportunities to empower non programmers to create sophisticated software.
comment: Accepted in 6th World Conference on Artificial Intelligence: Advances and Applications (WCAIAA 2025)
☆ Toward Agentic Software Engineering Beyond Code: Framing Vision, Values, and Vocabulary
Agentic AI is poised to usher in a seismic paradigm shift in Software Engineering (SE). As technologists rush head-along to make agentic AI a reality, SE researchers are driven to establish agentic SE as a research area. While early visions of agentic SE are primarily focused on code-related activities, early empirical evidence calls for a consideration of a range of socio-technical concerns to make it work in practice. This paper contributes to the emerging community vision by: (a) recommending an expansion of its scope beyond code, toward a 'whole of process' vision, grounding it in SE foundations and evolution and emerging agentic SE frameworks, (b) proposing a preliminary set of values and principles to guide efforts, and (c) sharing guidance on designing/using well-defined vocabulary for agentic SE. It is hoped that these ideas will encourage community collaborations and steer the SE community towards laying strong foundations of agentic SE so its not only inevitable but also deliberate and desirable in the long run.
comment: 5 pages
☆ LifeSync-Games: Toward a Video Game Paradigm for Promoting Responsible Gaming and Human Development
Technological advancements have made video games a central part of the digital lives of nearly 3 billion people worldwide. Although games can address various social, physical, and psychological needs, their potential to support human development and well-being remains underutilized. Research highlights both negative effects, such as addiction and isolation, and positive outcomes like cognitive improvements and problem-solving skills. However, public discourse and regulation often focus more on risks than benefits. To address this imbalance, we present LifeSync-Games, a framework leveraging simplified digital twins to connect virtual gameplay with real-life activities. This reciprocal relationship aims to enhance the developmental value of gaming by promoting self-regulation and fostering growth across physical, mental, and social domains. We present the framework's theoretical foundations, technological components, design guidelines, and evaluation approaches. Additionally, we present early applications in both new and bestselling games to demonstrate its versatility and practical relevance.
comment: 8 pages, 4 figures, 1 table, 66 references
☆ Dara: Automated multiple-hypothesis phase identification and refinement from powder X-ray diffraction
Powder X-ray diffraction (XRD) is a foundational technique for characterizing crystalline materials. However, the reliable interpretation of XRD patterns, particularly in multiphase systems, remains a manual and expertise-demanding task. As a characterization method that only provides structural information, multiple reference phases can often be fit to a single pattern, leading to potential misinterpretation when alternative solutions are overlooked. To ease humans' efforts and address the challenge, we introduce Dara (Data-driven Automated Rietveld Analysis), a framework designed to automate the robust identification and refinement of multiple phases from powder XRD data. Dara performs an exhaustive tree search over all plausible phase combinations within a given chemical space and validates each hypothesis using a robust Rietveld refinement routine (BGMN). Key features include structural database filtering, automatic clustering of isostructural phases during tree expansion, peak-matching-based scoring to identify promising phases for refinement. When ambiguity exists, Dara generates multiple hypothesis which can then be decided between by human experts or with further characteriztion tools. By enhancing the reliability and accuracy of phase identification, Dara enables scalable analysis of realistic complex XRD patterns and provides a foundation for integration into multimodal characterization workflows, moving toward fully self-driving materials discovery.
☆ FidelityGPT: Correcting Decompilation Distortions with Retrieval Augmented Generation
Decompilation converts machine code into human-readable form, enabling analysis and debugging without source code. However, fidelity issues often degrade the readability and semantic accuracy of decompiled output. Existing methods, such as variable renaming or structural simplification, provide partial improvements but lack robust detection and correction, particularly for complex closed-source binaries. We present FidelityGPT, a framework that enhances decompiled code accuracy and readability by systematically detecting and correcting semantic distortions. FidelityGPT introduces distortion-aware prompt templates tailored to closed-source settings and integrates Retrieval-Augmented Generation (RAG) with a dynamic semantic intensity algorithm to locate distorted lines and retrieve semantically similar code from a database. A variable dependency algorithm further mitigates long-context limitations by analyzing redundant variables and integrating their dependencies into the prompt context. Evaluated on 620 function pairs from a binary similarity benchmark, FidelityGPT achieved an average detection accuracy of 89% and a precision of 83%. Compared to the state-of-the-art DeGPT (Fix Rate 83%, Corrected Fix Rate 37%), FidelityGPT attained 94% FR and 64% CFR, demonstrating significant gains in accuracy and readability. These results highlight its potential to advance LLM-based decompilation and reverse engineering.
☆ Human-Agent Collaborative Paper-to-Page Crafting for Under $0.1
In the quest for scientific progress, communicating research is as vital as the discovery itself. Yet, researchers are often sidetracked by the manual, repetitive chore of building project webpages to make their dense papers accessible. While automation has tackled static slides and posters, the dynamic, interactive nature of webpages has remained an unaddressed challenge. To bridge this gap, we reframe the problem, arguing that the solution lies not in a single command, but in a collaborative, hierarchical process. We introduce $\textbf{AutoPage}$, a novel multi-agent system that embodies this philosophy. AutoPage deconstructs paper-to-page creation into a coarse-to-fine pipeline from narrative planning to multimodal content generation and interactive rendering. To combat AI hallucination, dedicated "Checker" agents verify each step against the source paper, while optional human checkpoints ensure the final product aligns perfectly with the author's vision, transforming the system from a mere tool into a powerful collaborative assistant. To rigorously validate our approach, we also construct $\textbf{PageBench}$, the first benchmark for this new task. Experiments show AutoPage not only generates high-quality, visually appealing pages but does so with remarkable efficiency in under 15 minutes for less than \$0.1. Code and dataset will be released at $\href{https://mqleet.github.io/AutoPage_ProjectPage/}{Webpage}$.
☆ A Goal-Driven Survey on Root Cause Analysis
Root Cause Analysis (RCA) is a crucial aspect of incident management in large-scale cloud services. While the term root cause analysis or RCA has been widely used, different studies formulate the task differently. This is because the term "RCA" implicitly covers tasks with distinct underlying goals. For instance, the goal of localizing a faulty service for rapid triage is fundamentally different from identifying a specific functional bug for a definitive fix. However, previous surveys have largely overlooked these goal-based distinctions, conventionally categorizing papers by input data types (e.g., metric-based vs. trace-based methods). This leads to the grouping of works with disparate objectives, thereby obscuring the true progress and gaps in the field. Meanwhile, the typical audience of an RCA survey is either laymen who want to know the goals and big picture of the task or RCA researchers who want to figure out past research under the same task formulation. Thus, an RCA survey that organizes the related papers according to their goals is in high demand. To this end, this paper presents a goal-driven framework that effectively categorizes and integrates 135 papers on RCA in the context of cloud incident management based on their diverse goals, spanning the period from 2014 to 2025. In addition to the goal-driven categorization, it discusses the ultimate goal of all RCA papers as an umbrella covering different RCA formulations. Moreover, the paper discusses open challenges and future directions in RCA.
☆ Mapping and Evolving Interoperability Testing in European Energy Systems: The int:net Perspective
The ongoing transformation of the European energy landscape, driven by the integration of renewable energy sources, digital technologies, and decentralized systems, requires a high degree of interoperability across diverse components and systems. Ensuring that these elements can exchange information and operate together reliably is essential for achieving a secure, flexible, and efficient energy supply infrastructure. While several initiatives have contributed to the development of smart grid testing infrastructures, they do not provide a dedicated or comprehensive focus on interoperability testing. A structured and harmonized overview of interoperability testing capabilities across Europe is therefore still missing. This work therefore presents a novel contribution by analyzing the European interoperability testing facility landscape through a structured survey of 30 facilities. It provides a categorized inventory of testing infrastructures, applied methodologies, and reference test cases, and introduces a blueprint for the development of future testing environments. The findings contribute to the establishment of a coordinated European ecosystem for interoperability testing, supporting collaboration, innovation, and alignment with the goals of the energy transition.
comment: 2025 IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT Europe)
☆ AutoMT: A Multi-Agent LLM Framework for Automated Metamorphic Testing of Autonomous Driving Systems
Autonomous Driving Systems (ADS) are safety-critical, where failures can be severe. While Metamorphic Testing (MT) is effective for fault detection in ADS, existing methods rely heavily on manual effort and lack automation. We present AutoMT, a multi-agent MT framework powered by Large Language Models (LLMs) that automates the extraction of Metamorphic Relations (MRs) from local traffic rules and the generation of valid follow-up test cases. AutoMT leverages LLMs to extract MRs from traffic rules in Gherkin syntax using a predefined ontology. A vision-language agent analyzes scenarios, and a search agent retrieves suitable MRs from a RAG-based database to generate follow-up cases via computer vision. Experiments show that AutoMT achieves up to 5 x higher test diversity in follow-up case generation compared to the best baseline (manual expert-defined MRs) in terms of validation rate, and detects up to 20.55% more behavioral violations. While manual MT relies on a fixed set of predefined rules, AutoMT automatically extracts diverse metamorphic relations that augment real-world datasets and help uncover corner cases often missed during in-field testing and data collection. Its modular architecture separating MR extraction, filtering, and test generation supports integration into industrial pipelines and potentially enables simulation-based testing to systematically cover underrepresented or safety-critical scenarios.
☆ Bytecode-centric Detection of Known-to-be-vulnerable Dependencies in Java Projects ICSE 2026
On average, 71% of the code in typical Java projects comes from open-source software (OSS) dependencies, making OSS dependencies the dominant component of modern software code bases. This high degree of OSS reliance comes with a considerable security risk of adding known security vulnerabilities to a code base. To remedy this risk, researchers and companies have developed various dependency scanners, which try to identify inclusions of known-to-be-vulnerable OSS dependencies. However, there are still challenges that modern dependency scanners do not overcome, especially when it comes to dependency modifications, such as re-compilations, re-bundlings or re-packagings, which are common in the Java ecosystem. To overcome these challenges, we present Jaralyzer, a bytecode-centric dependency scanner for Java. Jaralyzer does not rely on the metadata or the source code of the included OSS dependencies being available but directly analyzes a dependency's bytecode. Our evaluation across 56 popular OSS components demonstrates that Jaralyzer outperforms other popular dependency scanners in detecting vulnerabilities within modified dependencies. It is the only scanner capable of identifying vulnerabilities across all the above mentioned types of modifications. But even when applied to unmodified dependencies, Jaralyzer outperforms the current state-of-the-art code-centric scanner Eclipse Steady by detecting 28 more true vulnerabilities and yielding 29 fewer false warnings.
comment: To be published in: ICSE 2026 Proceedings
☆ Collaborative penetration testing suite for emerging generative AI algorithms
Problem Space: AI Vulnerabilities and Quantum Threats Generative AI vulnerabilities: model inversion, data poisoning, adversarial inputs. Quantum threats Shor Algorithm breaking RSA ECC encryption. Challenge Secure generative AI models against classical and quantum cyberattacks. Proposed Solution Collaborative Penetration Testing Suite Five Integrated Components: DAST SAST OWASP ZAP, Burp Suite, SonarQube, Fortify. IAST Contrast Assess integrated with CI CD pipeline. Blockchain Logging Hyperledger Fabric for tamper-proof logs. Quantum Cryptography Lattice based RLWE protocols. AI Red Team Simulations Adversarial ML & Quantum-assisted attacks. Integration Layer: Unified workflow for AI, cybersecurity, and quantum experts. Key Results 300+ vulnerabilities identified across test environments. 70% reduction in high-severity issues within 2 weeks. 90% resolution efficiency for blockchain-logged vulnerabilities. Quantum-resistant cryptography maintained 100% integrity in tests. Outcome: Quantum AI Security Protocol integrating Blockchain Quantum Cryptography AI Red Teaming.
☆ An Empirical Study of Bitwise Operators Intuitiveness through Performance Metrics
Objectives: This study aims to investigate the readability and understandability of bitwise operators in programming, with the main hypothesis that there will be a difference in the performance metrics (response time and error rate) between participants exposed to various bitwise operators related questions and those who are not. Participants: Participants in this human research study include people without programming background, novice programmers, and university students with varying programming experience (from freshmen to PhD level). There were 23 participants for this study. Study Methods: This study uses an Within-Subjects Experimental Design to assess how people with diverse programming backgrounds understand and use bitwise operators. Participants complete tasks in JavaScript program, and their task completion time and accuracy of the tasks are recorded for analysis. Findings: The results indicate that operators can be one of the factors predicting response time, with a small but significant effect, with R-squared 0.032, (1, 494) = 16.5, p < .001. Additionally, some operators like OR, NOT, and Left Shift showed statistical significance in task completion times compared to other operators. Conclusions: While the complexity of bitwise operators did not generally result in longer task completion times, certain operators were found to be less intuitive, suggesting the need for further investigation and potential redesign for improved understandability.
comment: 15 pages, 10 tables, 9 Figures
☆ From Specification to Service: Accelerating API-First Development Using Multi-Agent Systems
This paper presents a system that uses Large Language Models (LLMs)-based agents to automate the API-first development of RESTful microservices. This system helps to create an OpenAPI specification, generate server code from it, and refine the code through a feedback loop that analyzes execution logs and error messages. The integration of log analysis enables the LLM to detect and address issues efficiently, reducing the number of iterations required to produce functional and robust services. This study's main goal is to advance API-first development automation for RESTful web services and test the capability of LLM-based multi-agent systems in supporting the API-first development approach. To test the proposed system's potential, we utilized the PRAB benchmark. The results indicate that if we keep the OpenAPI specification small and focused, LLMs are capable of generating complete functional code with business logic that aligns to the specification. The code for the system is publicly available at https://github.com/sirbh/code-gen
comment: 9 Figures, 6Tables
☆ Trace: Securing Smart Contract Repository Against Access Control Vulnerability
Smart contract vulnerabilities, particularly improper Access Control that allows unauthorized execution of restricted functions, have caused billions of dollars in losses. GitHub hosts numerous smart contract repositories containing source code, documentation, and configuration files-these serve as intermediate development artifacts that must be compiled and packaged before deployment. Third-party developers often reference, reuse, or fork code from these repositories during custom development. However, if the referenced code contains vulnerabilities, it can introduce significant security risks. Existing tools for detecting smart contract vulnerabilities are limited in their ability to handle complex repositories, as they typically require the target contract to be compilable to generate an abstract representation for further analysis. This paper presents TRACE, a tool designed to secure non-compilable smart contract repositories against access control vulnerabilities. TRACE employs LLMs to locate sensitive functions involving critical operations (e.g., transfer) within the contract and subsequently completes function snippets into a fully compilable contract. TRACE constructs a function call graph from the abstract syntax tree (AST) of the completed contract. It uses the control flow graph (CFG) of each function as node information. The nodes of the sensitive functions are then analyzed to detect Access Control vulnerabilities. Experimental results demonstrate that TRACE outperforms state-of-the-art tools on an open-sourced CVE dataset, detecting 14 out of 15 CVEs. In addition, it achieves 89.2% precision on 5,000 recent on-chain contracts, far exceeding the best existing tool at 76.9%. On 83 real-world repositories, TRACE achieves 87.0% precision, significantly surpassing DeepSeek-R1's 14.3%.
☆ A General Solution for the Implementation of CI/CD in Embedded Linux Development
With the growing use of embedded systems in various industries, the need for automated platforms for the development and deployment of customized Linux-based operating systems has become more important. This research was conducted with the aim of designing and implementing an integrated and reproducible infrastructure for the development, building, and testing of a Linux-based operating system using the Yocto Project. The proposed structure was implemented based on a three-layer architecture consisting of the main Yocto repositories, a custom layer (meta-custom), and a coordinating manifest layer to ensure version synchronization, scalability, and reproducibility. Three sample projects, including libhelloworld, helloworld, and the kernel module hello mod, were developed and integrated into the build process. Continuous Integration and Continuous Deployment pipelines were implemented with GitLab CI and combined with an isolated Docker environment to automate and streamline the build and testing workflows. Using a local cache server containing hashserv, downloads and sstate cache significantly reduced the build time. The functionality and stability of the system were verified through six boot test scenarios in the QEMU simulator. The results show that the proposed design not only ensures reproducibility but also can be extended to advanced applications such as continuous deployment of real-time Linux versions. Future recommendations include expanding automated tests, implementing system monitoring with Prometheus and Grafana, using distributed builds, optimizing with Docker multi-stage builds, and enabling continuous deployment of real-time Linux changes to provide a stable and scalable model for industrial and research projects in embedded systems with a rapid and reliable development cycle.
☆ Automated Concern Extraction from Textual Requirements of Cyber-Physical Systems: A Multi-solution Study
Cyber-physical systems (CPSs) are characterized by a deep integration of the information space and the physical world, which makes the extraction of requirements concerns more challenging. Some automated solutions for requirements concern extraction have been proposed to alleviate the burden on requirements engineers. However, evaluating the effectiveness of these solutions, which relies on fair and comprehensive benchmarks, remains an open question. To address this gap, we propose ReqEBench, a new CPSs requirements concern extraction benchmark, which contains 2,721 requirements from 12 real-world CPSs. ReqEBench offers four advantages. It aligns with real-world CPSs requirements in multiple dimensions, e.g., scale and complexity. It covers comprehensive concerns related to CPSs requirements. It undergoes a rigorous annotation process. It covers multiple application domains of CPSs, e.g., aerospace and healthcare. We conducted a comparative study on three types of automated requirements concern extraction solutions and revealed their performance in real-world CPSs using our ReqEBench. We found that the highest F1 score of GPT-4 is only 0.24 in entity concern extraction. We further analyze failure cases of popular LLM-based solutions, summarize their shortcomings, and provide ideas for improving their capabilities. We believe ReqEBench will facilitate the evaluation and development of automated requirements concern extraction.
comment: 27 pages, 3 figures
☆ The Cost of Downgrading Build Systems: A Case Study of Kubernetes
Since developers invoke the build system frequently, its performance can impact productivity. Modern artifact-based build tools accelerate builds, yet prior work shows that teams may abandon them for alternatives that are easier to maintain. While prior work shows why downgrades are performed, the implications of downgrades remain largely unexplored. In this paper, we describe a case study of the Kubernetes project, focusing on its downgrade from an artifact-based build tool (Bazel) to a language-specific solution (Go Build). We reproduce and analyze the full and incremental builds of change sets during the downgrade period. On the one hand, we find that Bazel builds are faster than Go Build, completing full builds in 23.06-38.66 up to 75.19 impose a larger memory footprint than Go Build of 81.42-351.07 respectively. Bazel builds also impose a greater CPU load at parallelism settings above eight for full builds and above one for incremental builds. We estimate that downgrading from Bazel can increase CI resource costs by up to 76 explore whether our observations generalize by replicating our Kubernetes study on four other projects that also downgraded from Bazel to older build tools. We observe that while build time penalties decrease, Bazel consistently consumes more memory. We conclude that abandoning artifact-based build tools, despite perceived maintainability benefits, tends to incur considerable performance costs for large projects. Our observations may help stakeholders to balance trade-offs in build tool adoption
☆ ToolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering
Large language model (LLM) agents rely on external tools to solve complex tasks, but real-world toolsets often contain redundant tools with overlapping names and descriptions, introducing ambiguity and reducing selection accuracy. LLMs also face strict input context limits, preventing efficient consideration of large toolsets. To address these challenges, we propose ToolScope, which includes: (1) ToolScopeMerger with Auto-Correction to automatically audit and fix tool merges, reducing redundancy, and (2) ToolScopeRetriever to rank and select only the most relevant tools for each query, compressing toolsets to fit within context limits without sacrificing accuracy. Evaluations on three state-of-the-art LLMs and three open-source tool-use benchmarks show gains of 8.38% to 38.6% in tool selection accuracy, demonstrating ToolScope's effectiveness in enhancing LLM tool use.
comment: Preprint under review
☆ A Framework for the Adoption and Integration of Generative AI in Midsize Organizations and Enterprises (FAIGMOE)
Generative Artificial Intelligence (GenAI) presents transformative opportunities for organizations, yet both midsize organizations and larger enterprises face distinctive adoption challenges. Midsize organizations encounter resource constraints and limited AI expertise, while enterprises struggle with organizational complexity and coordination challenges. Existing technology adoption frameworks, including TAM (Technology Acceptance Model), TOE (Technology Organization Environment), and DOI (Diffusion of Innovations) theory, lack the specificity required for GenAI implementation across these diverse contexts, creating a critical gap in adoption literature. This paper introduces FAIGMOE (Framework for the Adoption and Integration of Generative AI in Midsize Organizations and Enterprises), a conceptual framework addressing the unique needs of both organizational types. FAIGMOE synthesizes technology adoption theory, organizational change management, and innovation diffusion perspectives into four interconnected phases: Strategic Assessment, Planning and Use Case Development, Implementation and Integration, and Operationalization and Optimization. Each phase provides scalable guidance on readiness assessment, strategic alignment, risk governance, technical architecture, and change management adaptable to organizational scale and complexity. The framework incorporates GenAI specific considerations including prompt engineering, model orchestration, and hallucination management that distinguish it from generic technology adoption frameworks. As a perspective contribution, FAIGMOE provides the first comprehensive conceptual framework explicitly addressing GenAI adoption across midsize and enterprise organizations, offering actionable implementation protocols, assessment instruments, and governance templates requiring empirical validation through future research.
☆ On Interaction Effects in Greybox Fuzzing ICSE '26
A greybox fuzzer is an automated software testing tool that generates new test inputs by applying randomly chosen mutators (e.g., flipping a bit or deleting a block of bytes) to a seed input in random order and adds all coverage-increasing inputs to the corpus of seeds. We hypothesize that the order in which mutators are applied to a seed input has an impact on the effectiveness of greybox fuzzers. In our experiments, we fit a linear model to a dataset that contains the effectiveness of all possible mutator pairs and indeed observe the conjectured interaction effect. This points us to more efficient fuzzing by choosing the most promising mutator sequence with a higher likelihood. We propose MuoFuzz, a greybox fuzzer that learns and chooses the most promising mutator sequences. MuoFuzz learns the conditional probability that the next mutator will yield an interesting input, given the previously selected mutator. Then, it samples from the learned probability using a random walk to generate mutator sequences. We compare the performance of MuoFuzz to AFL++, which uses a fixed selection probability, and MOPT, which optimizes the selection probability of each mutator in isolation. Experimental results on the FuzzBench and MAGMA benchmarks show that MuoFuzz achieves the highest code coverage and finds four bugs missed by AFL++ and one missed by both AFL++ and MOPT.
comment: 12 pages, 2 figures, Accepted for presentation at the 48th International Conference on Software Engineering (ICSE '26)
☆ SecureInfer: Heterogeneous TEE-GPU Architecture for Privacy-Critical Tensors for Large Language Model Deployment
With the increasing deployment of Large Language Models (LLMs) on mobile and edge platforms, securing them against model extraction attacks has become a pressing concern. However, protecting model privacy without sacrificing the performance benefits of untrusted AI accelerators, such as GPUs, presents a challenging trade-off. In this paper, we initiate the study of high-performance execution on LLMs and present SecureInfer, a hybrid framework that leverages a heterogeneous Trusted Execution Environments (TEEs)-GPU architecture to isolate privacy-critical components while offloading compute-intensive operations to untrusted accelerators. Building upon an outsourcing scheme, SecureInfer adopts an information-theoretic and threat-informed partitioning strategy: security-sensitive components, including non-linear layers, projection of attention head, FNN transformations, and LoRA adapters, are executed inside an SGX enclave, while other linear operations (matrix multiplication) are performed on the GPU after encryption and are securely restored within the enclave. We implement a prototype of SecureInfer using the LLaMA-2 model and evaluate it across performance and security metrics. Our results show that SecureInfer offers strong security guarantees with reasonable performance, offering a practical solution for secure on-device model inference.
comment: Accepted at IEEE Intelligent Computing and Systems at the Edge (ICEdge) 2025
☆ Designing a Secure and Resilient Distributed Smartphone Participant Data Collection System SP 2025
Real-world health studies require continuous and secure data collection from mobile and wearable devices. We introduce MotionPI, a smartphone-based system designed to collect behavioral and health data through sensors and surveys with minimal interaction from participants. The system integrates passive data collection (such as GPS and wristband motion data) with Ecological Momentary Assessment (EMA) surveys, which can be triggered randomly or based on physical activity. MotionPI is designed to work under real-life constraints, including limited battery life, weak or intermittent cellular connection, and minimal user supervision. It stores data both locally and on a secure cloud server, with encrypted transmission and storage. It integrates through Bluetooth Low Energy (BLE) into wristband devices that store raw data and communicate motion summaries and trigger events. MotionPI demonstrates a practical solution for secure and scalable mobile data collection in cyber-physical health studies.
comment: 9 pages, 3 figures. Accepted at EAI SmartSP 2025 Conference (Springer LNICST). This version is the arXiv preprint prepared for open access
☆ BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills
High quality bugs are key to training the next generation of language model based software engineering (SWE) agents. We introduce a novel method for synthetic generation of difficult and diverse bugs. Our method instructs SWE Agents to introduce a feature into the codebase whereby they may unintentionally break tests, resulting in bugs. Prior approaches often induce an out-of-distribution effect by generating bugs intentionally (e.g. by introducing local perturbation to existing code), which does not reflect realistic development processes. We perform qualitative analysis to demonstrate that our approach for generating bugs more closely reflects the patterns found in human-authored edits. Through extensive experiments, we demonstrate that our bugs provide more efficient training data for supervised fine-tuning, outperforming other bug datasets by 2% with half the training data (1.2k vs. 3k bugs). We train on our newly generated bugs in addition to existing bug datasets to get FrogBoss a state-of-the-art 32B parameter model on SWE-bench Verified with a pass@1 of 54.6% and FrogMini a state-of-the-art 14B model on SWE-bench Verified with a pass@1 of 45.3% on SWE-bench Verified averaged over three seeds.
☆ Policy-Governed RAG - Research Design Study
A policy-governed RAG architecture is specified for audit-ready generation in regulated workflows, organized as a triptych: (I) Contracts/Control (SHRDLU-like), which governs output adherence to legal and internal policies; (II) Manifests/Trails (Memex-like), which cryptographically anchors all cited source evidence to ensure verifiable provenance; and (III) Receipts/Verification (Xanadu-like), which provides the final, portable proof of compliance for auditors (portable COSE/JOSE) (see Section 4 and Appendix A). Rather than explaining model internals, outputs are gated ex-ante and bound to cryptographically verifiable evidence for each material answer. Unvalidated targets are stated (>=20% relative reduction in confident errors; p95 latency <= 900 ms; <= 2.2x serve cost) together with a pre-registered (optional) pilot using NO-GO gates. The design complements existing RAG/guardrails by making policy checks auditable, replayable, and receipt-backed. Target domains include back-office compliance in pharma, medical devices, finance, legal, and the public sector where error costs may exceed thousands of euros and audit trails are mandatory under regulations such as the EU AI Act. Future evaluations may pre-commit to publishing negative results when any example NO-GO gate is not met.
comment: 51 pages, 8 figures
☆ Knowledge-Guided Multi-Agent Framework for Application-Level Software Code Generation
Automated code generation driven by Large Lan- guage Models (LLMs) has enhanced development efficiency, yet generating complex application-level software code remains challenging. Multi-agent frameworks show potential, but existing methods perform inadequately in large-scale application-level software code generation, failing to ensure reasonable orga- nizational structures of project code and making it difficult to maintain the code generation process. To address this, this paper envisions a Knowledge-Guided Application-Level Code Generation framework named KGACG, which aims to trans- form software requirements specification and architectural design document into executable code through a collaborative closed- loop of the Code Organization & Planning Agent (COPA), Coding Agent (CA), and Testing Agent (TA), combined with a feedback mechanism. We demonstrate the collaborative process of the agents in KGACG in a Java Tank Battle game case study while facing challenges. KGACG is dedicated to advancing the automation of application-level software development.
☆ SODBench: A Large Language Model Approach to Documenting Spreadsheet Operations
Numerous knowledge workers utilize spreadsheets in business, accounting, and finance. However, a lack of systematic documentation methods for spreadsheets hinders automation, collaboration, and knowledge transfer, which risks the loss of crucial institutional knowledge. This paper introduces Spreadsheet Operations Documentation (SOD), an AI task that involves generating human-readable explanations from spreadsheet operations. Many previous studies have utilized Large Language Models (LLMs) for generating spreadsheet manipulation code; however, translating that code into natural language for SOD is a less-explored area. To address this, we present a benchmark of 111 spreadsheet manipulation code snippets, each paired with a corresponding natural language summary. We evaluate five LLMs, GPT-4o, GPT-4o-mini, LLaMA-3.3-70B, Mixtral-8x7B, and Gemma2-9B, using BLEU, GLEU, ROUGE-L, and METEOR metrics. Our findings suggest that LLMs can generate accurate spreadsheet documentation, making SOD a feasible prerequisite step toward enhancing reproducibility, maintainability, and collaborative workflows in spreadsheets, although there are challenges that need to be addressed.
comment: 14 pages, 5 figures, 4 tables
☆ Development of an Automated Web Application for Efficient Web Scraping: Design and Implementation
This paper presents the design and implementation of a user-friendly, automated web application that simplifies and optimizes the web scraping process for non-technical users. The application breaks down the complex task of web scraping into three main stages: fetching, extraction, and execution. In the fetching stage, the application accesses target websites using the HTTP protocol, leveraging the requests library to retrieve HTML content. The extraction stage utilizes powerful parsing libraries like BeautifulSoup and regular expressions to extract relevant data from the HTML. Finally, the execution stage structures the data into accessible formats, such as CSV, ensuring the scraped content is organized for easy use. To provide personalized and secure experiences, the application includes user registration and login functionalities, supported by MongoDB, which stores user data and scraping history. Deployed using the Flask framework, the tool offers a scalable, robust environment for web scraping. Users can easily input website URLs, define data extraction parameters, and download the data in a simplified format, without needing technical expertise. This automated tool not only enhances the efficiency of web scraping but also democratizes access to data extraction by empowering users of all technical levels to gather and manage data tailored to their needs. The methodology detailed in this paper represents a significant advancement in making web scraping tools accessible, efficient, and easy to use for a broader audience.
♻ ☆ QiMeng-MuPa: Mutual-Supervised Learning for Sequential-to-Parallel Code Translation NeurIPS'2025
The rise of GPU-based high-performance computing (HPC) has driven the widespread adoption of parallel programming models such as CUDA. Yet, the inherent complexity of parallel programming creates a demand for the automated sequential-to-parallel approaches. However, data scarcity poses a significant challenge for machine learning-based sequential-to-parallel code translation. Although recent back-translation methods show promise, they still fail to ensure functional equivalence in the translated code. In this paper, we propose \textbf{QiMeng-MuPa}, a novel \textbf{Mu}tual-Supervised Learning framework for Sequential-to-\textbf{Pa}rallel code translation, to address the functional equivalence issue. QiMeng-MuPa consists of two models, a Translator and a Tester. Through an iterative loop consisting of Co-verify and Co-evolve steps, the Translator and the Tester mutually generate data for each other and improve collectively. The Tester generates unit tests to verify and filter functionally equivalent translated code, thereby evolving the Translator, while the Translator generates translated code as augmented input to evolve the Tester. Experimental results demonstrate that QiMeng-MuPa significantly enhances the performance of the base models: when applied to Qwen2.5-Coder, it not only improves Pass@1 by up to 28.91% and boosts Tester performance by 68.90%, but also outperforms the previous state-of-the-art method CodeRosetta by 1.56 and 6.92 in BLEU and CodeBLEU scores, while achieving performance comparable to DeepSeek-R1 and GPT-4.1. Our code is available at https://github.com/kcxain/mupa.
comment: Accepted to NeurIPS'2025
♻ ☆ Type-aware LLM-based Regression Test Generation for Python Programs
Automated regression test generation has been extensively explored, yet generating high-quality tests for Python programs remains particularly challenging. Because of the Python's dynamic typing features, existing approaches, ranging from search-based software testing (SBST) to recent LLM-driven techniques, are often prone to type errors. Hence, existing methods often generate invalid inputs and semantically inconsistent test cases, which ultimately undermine their practical effectiveness. To address these limitations, we present Test4Py, a novel framework that enhances type correctness in automated test generation for Python. Test4Py leverages the program's call graph to capture richer contextual information about parameters, and introduces a behavior-based type inference mechanism that accurately infers parameter types and construct valid test inputs. Beyond input construction, Test4Py integrates an iterative repair procedure that progressively refines generated test cases to improve coverage. In an evaluation on 183 real-world Python modules, Test4Py achieved an average statement coverage of 83.0% and branch coverage of 70.8%, outperforming state-of-the-art tools by 7.2% and 8.4%, respectively.
♻ ☆ TrioXpert: An Automated Incident Management Framework for Microservice System
Automated incident management plays a pivotal role in large-scale microservice systems. However, many existing methods rely solely on single-modal data (e.g., metrics, logs, and traces) and struggle to simultaneously address multiple downstream tasks, including anomaly detection (AD), failure triage (FT), and root cause localization (RCL). Moreover, the lack of clear reasoning evidence in current techniques often leads to insufficient interpretability. To address these limitations, we propose TrioXpert, an end-to-end incident management framework capable of fully leveraging multimodal data. TrioXpert designs three independent data processing pipelines based on the inherent characteristics of different modalities, comprehensively characterizing the operational status of microservice systems from both numerical and textual dimensions. It employs a collaborative reasoning mechanism using large language models (LLMs) to simultaneously handle multiple tasks while providing clear reasoning evidence to ensure strong interpretability. We conducted extensive evaluations on two microservice system datasets, and the experimental results demonstrate that TrioXpert achieves outstanding performance in AD (improving by 4.7% to 57.7%), FT (improving by 2.1% to 40.6%), and RCL (improving by 1.6% to 163.1%) tasks. TrioXpert has also been deployed in Lenovo's production environment, demonstrating substantial gains in diagnostic efficiency and accuracy.
♻ ☆ Can Agents Fix Agent Issues? NeurIPS 2025
LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are inevitably prone to bugs and continually evolve to meet changing external requirements. Therefore, automatically resolving agent issues (i.e., bug reports or feature requests) is a crucial and challenging task. While recent software engineering (SE) agents (e.g., SWE-agent) have shown promise in addressing issues in traditional software systems, it remains unclear how effectively they can resolve real-world issues in agent systems, which differ significantly from traditional software. To fill this gap, we first manually analyze 201 real-world agent issues and identify common categories of agent issues. We then spend 500 person-hours constructing AGENTISSUE-BENCH, a reproducible benchmark comprising 50 agent issue resolution tasks (each with an executable environment and failure-triggering tests). We further evaluate state-of-the-art SE agents on AGENTISSUE-BENCH and reveal their limited effectiveness (i.e., with only 3.33% - 12.67% resolution rates). These results underscore the unique challenges of maintaining agent systems compared to traditional software, highlighting the need for further research to develop advanced SE agents for resolving agent issues. Data and code are available at https://alfin06.github.io/AgentIssue-Bench-Leaderboard/#/ .
comment: Accepted by the 39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Detecting and Preventing Latent Risk Accumulation in High-Performance Software Systems
Modern distributed systems employ aggressive optimization strategies that create latent risks - hidden vulnerabilities where exceptional performance masks catastrophic fragility when optimizations fail. Cache layers achieving 99% hit rates can obscure database bottlenecks until cache failures trigger 100x load amplification and cascading collapse. Current reliability engineering focuses on reactive incident response rather than proactive detection of optimization-induced vulnerabilities. This paper presents the first comprehensive framework for systematic latent risk detection, prevention, and optimization through integrated mathematical modeling, intelligent perturbation testing, and risk-aware performance optimization. We introduce the Latent Risk Index (LRI) that correlates strongly with incident severity (r=0.863, p<0.001), enabling predictive risk assessment. Our framework integrates three systems: HYDRA employing six optimization-aware perturbation strategies achieving 89.7% risk discovery rates, RAVEN providing continuous production monitoring with 92.9% precision and 93.8% recall across 1,748 scenarios, and APEX enabling risk-aware optimization maintaining 96.6% baseline performance while reducing latent risks by 59.2%. Evaluation across three testbed environments demonstrates strong statistical validation with large effect sizes (Cohen d>2.0) and exceptional reproducibility (r>0.92). Production deployment over 24 weeks shows 69.1% mean time to recovery reduction, 78.6% incident severity reduction, and 81 prevented incidents generating 1.44M USD average annual benefits with 3.2-month ROI. Our approach transforms reliability engineering from reactive incident management to proactive risk-aware optimization.
comment: 26 pages, 12 tables, 4 figures. Academic-industry collaboration. Framework (HYDRA, RAVEN, APEX) for optimization-induced vulnerabilities. Evaluated: 2,160 configs, 12.7TB data, 1,748 scenarios
Systems and Control 25
☆ Nodal Capacity Expansion Planning with Flexible Large-Scale Load Siting
We propose explicitly incorporating large-scale load siting into a stochastic nodal power system capacity expansion planning model that concurrently co-optimizes generation, transmission and storage expansion. The potential operational flexibility of some of these large loads is also taken into account by considering them as consisting of a set of tranches with different reliability requirements, which are modeled as a constraint on expected served energy across operational scenarios. We implement our model as a two-stage stochastic mixed-integer optimization problem with cross-scenario expectation constraints. To overcome the challenge of scalability, we build upon existing work to implement this model on a high performance computing platform and exploit scenario parallelization using an augmented Progressive Hedging Algorithm. The algorithm is implemented using the bounding features of mpisppy, which have shown to provide satisfactory provable optimality gaps despite the absence of theoretical guarantees of convergence. We test our approach to assess the value of this proactive planning framework on total system cost and reliability metrics using realistic testcases geographically assigned to San Diego and South Carolina, with datacenter and direct air capture facilities as large loads.
☆ Bridging Earth and Space: A Survey on HAPS for Non-Terrestrial Networks
HAPS are emerging as key enablers in the evolution of 6G wireless networks, bridging terrestrial and non-terrestrial infrastructures. Operating in the stratosphere, HAPS can provide wide-area coverage, low-latency, energy-efficient broadband communications with flexible deployment options for diverse applications. This survey delivers a comprehensive overview of HAPS use cases, technologies, and integration strategies within the 6G ecosystem. The roles of HAPS in extending connectivity to underserved regions, supporting dynamic backhauling, enabling massive IoT, and delivering reliable low-latency communications for autonomous and immersive services are discussed. The paper reviews state-of-the-art architectures for terrestrial and non-terrestrial network integration, highlights recent field trials. Furthermore, key enabling technologies such as channel modeling, AI-driven resource allocation, interference control, mobility management, and energy-efficient communications are examined. The paper also outlines open research challenges. By addressing existing gaps in the literature, this survey positions HAPS as a foundational component of globally integrated, resilient, and sustainable 6G networks.
comment: 30 pages. This work has been submitted to IEEE Communications Surveys & Tutorials (under review)
☆ Optimal Kron-based Reduction of Networks (Opti-KRON) for Three-phase Distribution Feeders
This paper presents a novel structure-preserving, Kron-based reduction framework for unbalanced distribution feeders. The method aggregates electrically similar nodes within a mixed-integer optimization (MIP) problem to produce reduced networks that optimally reproduce the voltage profiles of the original full network. To overcome computational bottlenecks of MIP formulations, we propose an exhaustive-search formulation to identify optimal aggregation decisions while enforcing voltage margin limits. The proposed exhaustive network reduction algorithm is parallelizable on GPUs, which enables scalable network reduction. The resulting reduced networks approximate the full system's voltage profiles with low errors and are suitable for steady-state analysis and optimal power flow studies. The framework is validated on two real utility distribution feeders with 5,991 and 8,381 nodes. The reduced models achieve up to 90% and 80% network reduction, respectively, while the maximum voltage-magnitude error remains below 0.003 p.u. Furthermore, on a 1000-node version of the network, the GPU-accelerated reduction algorithm runs up to 15x faster than its CPU-based counterpart.
☆ Control Barrier Functions for the Full Class of Signal Temporal Logic Tasks using Spatiotemporal Tubes
This paper introduces a new framework for synthesizing time-varying control barrier functions (TV-CBFs) for general Signal Temporal Logic (STL) specifications using spatiotemporal tubes (STT). We first formulate the STT synthesis as a robust optimization problem (ROP) and solve it through a scenario optimization problem (SOP), providing formal guarantees that the resulting tubes capture the given STL specifications. These STTs are then used to construct TV-CBFs, ensuring that under any control law rendering them invariant, the system satisfies the STL tasks. We demonstrate the framework through case studies on a differential-drive mobile robot and a quadrotor, and provide a comparative analysis showing improved efficiency over existing approaches.
☆ Multi-UAV Flood Monitoring via CVT with Gaussian Mixture of Density Functions for Coverage Control
This study presents a control strategy for coordinating multiple unmanned aerial vehicles (UAVs) to monitor unknown flood regions and estimate the extent of inundation. The proposed method adopts a density-driven coverage framework based on Centroidal Voronoi Tessellation (CVT), in which the density function is modeled using a Gaussian Mixture of Density Functions (GMDF). This formulation provides a more accurate characterization of inundated areas compared to conventional axis-aligned Gaussian models. The performance of the two density modeling approaches is systematically evaluated under different UAV fleet sizes (16, 20, and 24), with multiple simulation trials conducted in the ROS/Gazebo environment. The results show that the GMDF-based formulation consistently achieves higher coverage rates, demonstrating its effectiveness in enhancing flood monitoring and improving UAV spatial distribution.
comment: 9 pages,6 figures
☆ Risk Assessment of an Autonomous Underwater Snake Robot in Confined Operations
The growing interest in ocean discovery imposes a need for inspection and intervention in confined and demanding environments. Eely's slender shape, in addition to its ability to change its body configurations, makes articulated underwater robots an adequate option for such environments. However, operation of Eely in such environments imposes demanding requirements on the system, as it must deal with uncertain and unstructured environments, extreme environmental conditions, and reduced navigational capabilities. This paper proposes a Bayesian approach to assess the risks of losing Eely during two mission scenarios. The goal of this work is to improve Eely's performance and the likelihood of mission success. Sensitivity analysis results are presented in order to demonstrate the causes having the highest impact on losing Eely.
comment: 9 pages, 6 figures, Accepted for publication in OCEANS 2023 - Limerick
☆ Managing Charging Induced Grid Stress and Battery Degradation in Electric Taxi Fleets
Operating fleets of electric vehicles (EVs) introduces several challenges, some of which are borne by the fleet operator, and some of which are borne by the power grid. To maximize short-term profit a fleet operator could always charge EVs at the maximum rate to ensure vehicles are ready to service ride demand. However, due to the stochastic nature of electricity demand, charging EVs at their maximum rate may potentially increase the grid stress and lead to overall instability. Furthermore, high-rate charging of EVs can accelerate battery degradation, thereby reducing the service lifespan of the fleet. This study aims to reconcile the conflicting incentives of fleet longevity, short-term profitability, and grid stability by simulating a taxi fleet throughout its lifespan in relation to its charging policies and service conditions. We develop an EV fleet simulator to evaluate the battery degradation due to unpredictable charging and ride demand. Consequently, the impact on the power grid through the charging infrastructure is assessed due to these activities. This simulation utilizes publicly accessible real-world travel data from the NYC taxi dataset. We compare a baseline 80-20 fleet charging policy with a reinforcement learning-based policy designed to prolong the fleet's service life and alleviate grid stress. We monitor grid stress, battery degradation, and profitability over five years and find that our learned policy outperforms the baseline. This simulator enables fleet operators to assess the impact of different charging policies on these indicators to make informed decisions in the future.
comment: 7 pages, 8 figures, to be published in the proceedings of 2025 IEEE Innovative Smart Grid Technologies - Asia (ISGT-Asia)
☆ Magnetic field estimation using Gaussian process regression for interactive wireless power system design
Wireless power transfer (WPT) with coupled resonators offers a promising solution for the seamless powering of electronic devices. Interactive design approaches that visualize the magnetic field and power transfer efficiency based on system geometry adjustments can facilitate the understanding and exploration of the behavior of these systems for dynamic applications. However, typical electromagnetic field simulation methods, such as the Method of Moments (MoM), require significant computational resources, limiting the rate at which computation can be performed for acceptable interactivity. Furthermore, the system's sensitivity to positional and geometrical changes necessitates a large number of simulations, and structures such as ferromagnetic shields further complicate these simulations. Here, we introduce a machine learning approach using Gaussian Process Regression (GPR), demonstrating for the first time the rapid estimation of the entire magnetic field and power transfer efficiency for near-field coupled systems. To achieve quick and accurate estimation, we develop 3D adaptive grid systems and an active learning strategy to effectively capture the nonlinear interactions between complex system geometries and magnetic fields. By training a regression model, our approach achieves magnetic field computation with sub-second latency and with an average error of less than 6% when validated against independent electromagnetic simulation results.
comment: 29 pages, 8 figures, 1 table
☆ Spatiotemporal Tubes based Control of Unknown Multi-Agent Systems for Temporal Reach-Avoid-Stay Tasks
The paper focuses on designing a controller for unknown dynamical multi-agent systems to achieve temporal reach-avoid-stay tasks for each agent while preventing inter-agent collisions. The main objective is to generate a spatiotemporal tube (STT) for each agent and thereby devise a closed-form, approximation-free, and decentralized control strategy that ensures the system trajectory reaches the target within a specific time while avoiding time-varying unsafe sets and collisions with other agents. In order to achieve this, the requirements of STTs are formulated as a robust optimization problem (ROP) and solved using a sampling-based scenario optimization problem (SOP) to address the issue of infeasibility caused by the infinite number of constraints in ROP. The STTs are generated by solving the SOP, and the corresponding closed-form control is designed to fulfill the specified task. Finally, the effectiveness of our approach is demonstrated through two case studies, one involving omnidirectional robots and the other involving multiple drones modelled as Euler-Lagrange systems.
☆ Query-Efficient Zeroth-Order Algorithms for Nonconvex Optimization
Zeroth-order optimization (ZO) has been a powerful framework for solving black-box problems, which estimates gradients using zeroth-order data to update variables iteratively. The practical applicability of ZO critically depends on the efficiency of single-step gradient estimation and the overall query complexity. However, existing ZO algorithms cannot achieve efficiency on both simultaneously. In this work, we consider a general constrained optimization model with black-box objective and constraint functions. To solve it, we propose novel algorithms that can achieve the state-of-the-art overall query complexity bound of $\mathcal{O}(d/\epsilon^4)$ to find an $\epsilon$-stationary solution ($d$ is the dimension of variable space), while reducing the queries for estimating a single-step gradient from $\mathcal{O}(d)$ to $\mathcal{O}(1)$. Specifically, we integrate block updates with gradient descent ascent and a block gradient estimator, which leads to two algorithms, ZOB-GDA and ZOB-SGDA, respectively. Instead of constructing full gradients, they estimate only partial gradients along random blocks of dimensions, where the adjustable block sizes enable high single-step efficiency without sacrificing convergence guarantees. Our theoretical results establish the finite-sample convergence of the proposed algorithms for nonconvex optimization. Finally, numerical experiments on a practical problem demonstrate that our algorithms require over ten times fewer queries than existing methods.
comment: 34 pages, 4 figures
☆ Policy Gradient Method for LQG Control via Input-Output-History Representation: Convergence to $O(ε)$-Stationary Points
We study the policy gradient method (PGM) for the linear quadratic Gaussian (LQG) dynamic output-feedback control problem using an input-output-history (IOH) representation of the closed-loop system. First, we show that any dynamic output-feedback controller is equivalent to a static partial-state feedback gain for a new system representation characterized by a finite-length IOH. Leveraging this equivalence, we reformulate the search for an optimal dynamic output feedback controller as an optimization problem over the corresponding partial-state feedback gain. Next, we introduce a relaxed version of the IOH-based LQG problem by incorporating a small process noise with covariance $\epsilon I$ into the new system to ensure coerciveness, a key condition for establishing gradient-based convergence guarantees. Consequently, we show that a vanilla PGM for the relaxed problem converges to an $\mathcal{O}(\epsilon)$-stationary point, i.e., $\overline{K}$ satisfying $\|\nabla J(\overline{K})\|_F \leq \mathcal{O}(\epsilon)$, where $J$ denotes the original LQG cost. Numerical experiments empirically indicate convergence to the vicinity of the globally optimal LQG controller.
☆ Safe Output-Feedback Adaptive Optimal Control of Affine Nonlinear Systems
In this paper, we develop a safe control synthesis method that integrates state estimation and parameter estimation within an adaptive optimal control (AOC) and control barrier function (CBF)-based control architecture. The developed approach decouples safety objectives from the learning objectives using a CBF-based guarding controller where the CBFs are robustified to account for the lack of full-state measurements. The coupling of this guarding controller with the AOC-based stabilizing control guarantees safety and regulation despite the lack of full state measurement. The paper leverages recent advancements in deep neural network-based adaptive observers to ensure safety in the presence of state estimation errors. Safety and convergence guarantees are provided using a Lyapunov-based analysis, and the effectiveness of the developed controller is demonstrated through simulation under mild excitation conditions.
☆ Ultra-Fast Wireless Power Hacking
The rapid growth of electric vehicles (EVs) has driven the development of roadway wireless charging technology, effectively extending EV driving range. However, wireless charging introduces significant cybersecurity challenges. Any receiver within the magnetic field can potentially extract energy, and previous research demonstrated that a hacker could detect the operating frequency and steal substantial power. However, our approach required time to track new frequencies or precise adjustments of inductance and capacitance, which would be less effective against potential rapid transmitter frequency changes or capacitance drift. As a solution, we enhanced the interceptor and enabled it to intrude as well as steal energy within just three cycles of the high-frequency signal. Moreover, it can work without any circuit parameters or look-up tables. The key innovation is synchronizing the receiver current with the phase of the magnetic sensor voltage. Through MATLAB / Simulink simulations, finite-element analysis, and experimental validation, we demonstrated that our improved method can steal over 76% of the power received by a fully resonant receiver under identical conditions. This attack demonstrates that simple frequency-changing power encryption offers limited protection against such threats.
comment: 11 pages, 15 figures
☆ Approximate Model Predictive Control for Microgrid Energy Management via Imitation Learning
Efficient energy management is essential for reliable and sustainable microgrid operation amid increasing renewable integration. This paper proposes an imitation learning-based framework to approximate mixed-integer Economic Model Predictive Control (EMPC) for microgrid energy management. The proposed method trains a neural network to imitate expert EMPC control actions from offline trajectories, enabling fast, real-time decision making without solving optimization problems online. To enhance robustness and generalization, the learning process includes noise injection during training to mitigate distribution shift and explicitly incorporates forecast uncertainty in renewable generation and demand. Simulation results demonstrate that the learned policy achieves economic performance comparable to EMPC while only requiring $10\%$ of the computation time of optimization-based EMPC in practice.
comment: Submitted to Engineering Applications of Artificial Intelligence (EAAI) and IFAC WC 2026
☆ IMAS$^2$: Joint Agent Selection and Information-Theoretic Coordinated Perception In Dec-POMDPs
We study the problem of jointly selecting sensing agents and synthesizing decentralized active perception policies for the chosen subset of agents within a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) framework. Our approach employs a two-layer optimization structure. In the inner layer, we introduce information-theoretic metrics, defined by the mutual information between the unknown trajectories or some hidden property in the environment and the collective partial observations in the multi-agent system, as a unified objective for active perception problems. We employ various optimization methods to obtain optimal sensor policies that maximize mutual information for distinct active perception tasks. In the outer layer, we prove that under certain conditions, the information-theoretic objectives are monotone and submodular with respect to the subset of observations collected from multiple agents. We then exploit this property to design an IMAS$^2$ (Information-theoretic Multi-Agent Selection and Sensing) algorithm for joint sensing agent selection and sensing policy synthesis. However, since the policy search space is infinite, we adapt the classical Nemhauser-Wolsey argument to prove that the proposed IMAS$^2$ algorithm can provide a tight $(1 - 1/e)$-guarantee on the performance. Finally, we demonstrate the effectiveness of our approach in a multi-agent cooperative perception in a grid-world environment.
♻ ☆ Modeling and Simulation of an Active Car Suspension with a Robust LQR Controller under Road Disturbance, Parameter Uncertainty and White Noise
Vehicle suspension is important for passengers to travel comfortably and to be less exposed to effects such as vibration and shock. A good suspension system increases the road holding of vehicles, allows them to take turns safely, and reduces the risk of traffic accidents. A passive suspension system is the most widely used suspension system in vehicles due to its simple structure and low cost. Passive suspension systems do not have an actuator and therefore do not have a controller. Active suspension systems have an actuator and a controller. Although their structures are more complex and costly, they are safer. PID controller is widely used in active suspension systems due to its simple structure, reasonable cost, and easy adjustment of coefficients. In this study, a more robust LQR-controlled active suspension was designed than a passive suspension and a PID-controlled active suspension. Robustness analyses were performed for passive suspension, PID-controlled active suspension, and LQR-controlled active suspension. Suspension travel, sprung mass acceleration, and sprung mass motion simulations were performed for all three suspensions under road disturbance, under simultaneous road disturbance and parameter uncertainty and under road disturbance with white noise. A comparative analysis was performed by obtaining the rise time, overshoot, and settling time data of the suspensions under different conditions. It was observed that the LQR-controlled active suspension showed the fastest rise time, the least overshoot and had the shortest settling time. In this case, it was proven that the LQR controlled active suspension provided a more comfortable and safe ride compared to the other two suspension systems.
comment: 20 pages, 19 figures
♻ ☆ AttentionSwarm: Reinforcement Learning with Attention Control Barier Function for Crazyflie Drones in Dynamic Environments
We introduce AttentionSwarm, a novel benchmark designed to evaluate safe and efficient swarm control in a dynamic drone racing scenario. Central to our approach is the Attention Model-Based Control Barrier Function (CBF) framework, which integrates attention mechanisms with safety-critical control theory to enable real-time collision avoidance and trajectory optimization. This framework dynamically prioritizes critical obstacles and agents in the swarm's vicinity using attention weights, while CBFs formally guarantee safety by enforcing collision-free constraints. The AttentionSwarm algorithm was developed and evaluated using a swarm of Crazyflie 2.1 micro quadrotors, which were tested indoors with the Vicon motion capture system to ensure precise localization and control. Experimental results show that our system achieves a 95-100% collision-free navigation rate in a dynamic multi-agent drone racing environment, underscoring its effectiveness and robustness in real-world scenarios. This work offers a promising foundation for safe, high-speed multi-robot applications in logistics, inspection, and racing.
comment: 6 pages, 6 figures
♻ ☆ A Data-Driven Method to Identify Major Contributors to Low-Frequency Oscillations
We present a purely data-driven method to pinpoint generation plants that significantly contribute to poorly damped oscillations as part of post-event analysis. First, Extended Dynamic Mode Decomposition (EDMD) is applied on PMU data from the point of interconnection (POI) of the plants to obtain the finite-dimensional Koopman operator. Then, modal analysis is performed on a reduced-order Koopman operator to extract spatio-temporal patterns. The data-driven eigenvalues and eigenvectors quantify each plant's contribution to critical oscillatory modes without requiring any system model information. We demonstrate the effectiveness of this method through simulated case studies on modified IEEE 39-bus and WECC 179-bus test systems by benchmarking the data-driven results against ground-truth models. Its performance is further validated using PMU data from real oscillation events in the ISO-New England system. This data-driven method offers a practical tool for both planning-stage simulations and post-event analysis of real oscillation events, enabling effective mitigation.
comment: 10 pages, 11 figures, Journal paper.Submitted to IEEE Transactions on Power System
♻ ☆ Planning of Off-Grid Renewable Power to Ammonia Systems with Heterogeneous Flexibility: A Multistakeholder Equilibrium Perspective
Off-grid renewable power to ammonia (ReP2A) systems present a promising pathway toward carbon neutrality in both the energy and chemical industries. However, due to chemical safety requirements, the limited flexibility of ammonia synthesis poses a challenge when attempting to align with the variable hydrogen flow produced from renewable power. This necessitates the optimal sizing of equipment capacity for effective and coordinated production across the system. Additionally, an ReP2A system may involve multiple stakeholders with varying degrees of operational flexibility, complicating the planning problem. This paper first examines the multistakeholder sizing equilibrium (MSSE) of the ReP2A system. First, we propose an MSSE model that accounts for individual planning decisions and the competing economic interests of the stakeholders of power generation, hydrogen production, and ammonia synthesis. We then construct an equivalent optimization problem based on Karush-Kuhn-Tucker (KKT) conditions to determine the equilibrium. Following this, we decompose the problem in the temporal dimension and solve it via multicut generalized Benders decomposition (GBD) to address long-term balancing issues. Case studies based on a realistic project reveal that the equilibrium does not naturally balance the interests of all stakeholders due to their heterogeneous characteristics. Our findings suggest that benefit transfer or re-arrangement ensure mutual benefits and the successful implementation of ReP2A projects.
comment: Accepted in IEEE Transactions on Power Systems \copyright2025 IEEE
♻ ☆ Optimal Investment Portfolio of Thyristor- and IGBT-based Electrolysis Rectifiers in Utility-scale Renewable P2H Systems
Renewable power-to-hydrogen (ReP2H) systems require rectifiers to supply power to electrolyzers (ELZs). Two main types of rectifiers, insulated-gate bipolar transistor rectifiers (IGBT-Rs) and thyristor rectifiers (TRs), offer distinct tradeoffs. IGBT-Rs provide flexible reactive power control but are costly, whereas TRs are more affordable with lower power loss but consume a large amount of uncontrollable reactive power. A mixed configuration of rectifiers in utility-scale ReP2H systems could achieve a decent tradeoff and increase overall profitability. To explore this potential, this paper proposes an optimal investment portfolio model. First, we model and compare the active and reactive power characteristics of ELZs powered by TRs and IGBT-Rs. Second, we consider the investment of ELZs, rectifiers, and var resources and coordinate the operation of renewables, energy storage, var resources, and the on-off switching and load allocation of multiple ELZs. Subsequently, a two-stage stochastic programming (SP) model based on weighted information gap decision theory (W-IGDT) is developed to address the uncertainties of the renewable power and hydrogen price, and we apply the progressive hedging (PH) algorithm to accelerate its solution. Case studies demonstrate that optimal rectifier configurations increase revenue by at most 13.78% compared with configurations using only TRs or IGBT-Rs, existing project setups, or intuitive designs. Under the optimal portfolio, reactive power compensation investment is nearly eliminated, with a preferred TR-to-IGBT-R ratio of 3:1.
comment: Accepted in IEEE Transactions on Sustainable Energy \copyright 2025 IEEE
♻ ☆ QPPG: Quantum-Preconditioned Policy Gradient for Link Adaptation in Rayleigh Fading Channels
Reliable link adaptation is critical for efficient wireless communications in dynamic fading environments. However, reinforcement learning (RL) solutions often suffer from unstable convergence due to poorly conditioned policy gradients, hindering their practical application. We propose the quantum-preconditioned policy gradient (QPPG) algorithm, which leverages Fisher-information-based preconditioning to stabilise and accelerate policy updates. Evaluations in Rayleigh fading scenarios show that QPPG achieves faster convergence, a 28.6% increase in average throughput, and a 43.8% decrease in average transmit power compared to classical methods. This work introduces quantum-geometric conditioning to link adaptation, marking a significant advance in developing robust, quantum-inspired reinforcement learning for future 6G networks, thereby enhancing communication reliability and energy efficiency.
comment: Submitted to IEEE Wireless Communications Letters
♻ ☆ Time-causal and time-recursive wavelets
When to apply wavelet analysis to real-time temporal signals, where the future cannot be accessed, it is essential to base all the steps in the signal processing pipeline on computational mechanisms that are truly time-causal. This paper describes how a time-causal wavelet analysis can be performed based on concepts developed in the area of temporal scale-space theory, originating from a complete classification of temporal smoothing kernels that guarantee non-creation of new structures from finer to coarser temporal scale levels. By necessity, convolution with truncated exponential kernels in cascade constitutes the only permissable class of kernels, as well as their temporal derivatives as a natural complement to fulfil the admissibility conditions of wavelet representations. For a particular way of choosing the time constants in the resulting infinite convolution of truncated exponential kernels, to ensure temporal scale covariance and thus self-similarity over temporal scales, we describe how mother wavelets can be chosen as temporal derivatives of the resulting time-causal limit kernel. By developing connections between wavelet theory and scale-space theory, we characterize and quantify how the continuous scaling properties transfer to the discrete implementation, demonstrating how the proposed time-causal wavelet representation can reflect the duration of locally dominant temporal structures in the input signals. We propose that this notion of time-causal wavelet analysis could be a valuable tool for signal processing tasks, where streams of signals are to be processed in real time, specifically for signals that may contain local variations over a rich span of temporal scales, or more generally for analysing physical or biophysical temporal phenomena, where a fully time-causal analysis is called for to be physically realistic.
comment: 25 pages, 10 figures
♻ ☆ Physics-Informed Reinforcement Learning for Large-Scale EV Smart Charging Considering Distribution Network Voltage Constraints
Electric Vehicles (EVs) offer substantial flexibility for grid services, yet large-scale, uncoordinated charging can threaten voltage stability in distribution networks. Existing Reinforcement Learning (RL) approaches for smart charging often disregard physical grid constraints or have limited performance for complex large-scale tasks, limiting their scalability and real-world applicability. This paper introduces a physics-informed (PI) RL algorithm that integrates a differentiable power flow model and voltage-based reward design into the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, enabling EVs to deliver real-time voltage support while meeting user demands. The resulting PI-TD3 algorithm achieves faster convergence, improved sample efficiency, and reliable voltage magnitude regulation under uncertain and overloaded conditions. Benchmarks on the IEEE 34-bus and 123-bus networks show that the proposed PI-TD3 outperforms both model-free RL and optimization-based baselines in grid constraint management, user satisfaction, and economic metrics, even as the system scales to hundreds of EVs. These advances enable robust, scalable, and practical EV charging strategies that enhance grid resilience and support distribution networks operation.
♻ ☆ Optimization and Control Technologies for Renewable-Dominated Hydrogen-Blended Integrated Gas-Electricity System: A Review
The growing coupling among electricity, gas, and hydrogen systems is driven by green hydrogen blending into existing natural gas pipelines, paving the way toward a renewable-dominated energy future. However, the integration poses significant challenges, particularly ensuring efficient and safe operation under varying hydrogen penetration and infrastructure adaptability. This paper reviews progress in optimization and control technologies for hydrogen-blended integrated gas-electricity system. First, key technologies and international demonstration projects are introduced to provide an overview of current developments. Besides, advances in gas-electricity system integration, including modeling, scheduling, planning and market design, are reviewed respectively. Then, the potential for cross-system fault propagation is highlighted, and practical methods for safety analysis and control are proposed. Finally, several possible research directions are introduced, aiming to ensure efficient renewable integration and reliable operation.
comment: Accepted by CSEE Journal of Power and Energy Systems in Oct. 2025
♻ ☆ Equity-aware Design and Timing of Fare-free Transit Zoning under Demand Uncertainty
We propose the first analytical stochastic model for optimizing the configuration and implementation policies of fare-free transit. The model focuses on a transportation corridor with two transportation modes: automobiles and buses. The corridor is divided into two sections, an inner one with fare-free transit service and an outer one with fare-based transit service. Under the static version of the model, the optimized length and frequency of the fare-free transit zone can be determined by maximizing total social welfare. The findings indicate that implementing fare-free transit can increase transit ridership and reduce automobile use within the fare-free zone while social equity among the demand groups can be enhanced by lengthening the fare-free zone. Notably, the optimal zone length increases when both social welfare and equity are considered jointly, compared to only prioritizing social welfare. The dynamic model, framed within a market entry and exit real options approach, solves the fare policy switching problem, establishing optimal timing policies for activating or terminating fare-free service. The results from dynamic models reveal earlier implementation and extended durations of fare-free transit in the social welfare-aware regime, driven by lower thresholds compared to the social equity-aware regime.
Graphics 6
☆ A New Type of Adversarial Examples
Most machine learning models are vulnerable to adversarial examples, which poses security concerns on these models. Adversarial examples are crafted by applying subtle but intentionally worst-case modifications to examples from the dataset, leading the model to output a different answer from the original example. In this paper, adversarial examples are formed in an exactly opposite manner, which are significantly different from the original examples but result in the same answer. We propose a novel set of algorithms to produce such adversarial examples, including the negative iterative fast gradient sign method (NI-FGSM) and the negative iterative fast gradient method (NI-FGM), along with their momentum variants: the negative momentum iterative fast gradient sign method (NMI-FGSM) and the negative momentum iterative fast gradient method (NMI-FGM). Adversarial examples constructed by these methods could be used to perform an attack on machine learning systems in certain occasions. Moreover, our results show that the adversarial examples are not merely distributed in the neighbourhood of the examples from the dataset; instead, they are distributed extensively in the sample space.
☆ Interactive Hypergraph Visual Analytics for Exploring Large and Complex Image Collections
Analyzing large complex image collections in domains like forensics, accident investigation, or social media analysis involves interpreting intricate, overlapping relationships among images. Traditional clustering and classification methods fail to adequately represent these complex relationships, particularly when labeled data or suitable pre-trained models are unavailable. Hypergraphs effectively capture overlapping relationships, but to translate their complexity into information and insights for domain expert users visualization is essential. We propose an interactive visual analytics approach specifically designed for the construction, exploration, and analysis of hypergraphs on large-scale complex image collections. Our core contributions include: (1) a scalable pipeline for constructing hypergraphs directly from raw image data, including a similarity measure to evaluate constructed hypergraphs against a ground truth, (2) interactive visualization techniques that integrate spatial hypergraph representations, interactive grids, and matrix visualizations, enabling users to dynamically explore and interpret relationships without becoming overwhelmed and disoriented, and (3) practical insights on how domain experts can effectively use the application, based on evaluation with real-life image collections. Our results demonstrate that our visual analytics approach facilitates iterative exploration, enabling domain experts to efficiently derive insights from image collections containing tens of thousands of images.
☆ Extreme Views: 3DGS Filter for Novel View Synthesis from Out-of-Distribution Camera Poses
When viewing a 3D Gaussian Splatting (3DGS) model from camera positions significantly outside the training data distribution, substantial visual noise commonly occurs. These artifacts result from the lack of training data in these extrapolated regions, leading to uncertain density, color, and geometry predictions from the model. To address this issue, we propose a novel real-time render-aware filtering method. Our approach leverages sensitivity scores derived from intermediate gradients, explicitly targeting instabilities caused by anisotropic orientations rather than isotropic variance. This filtering method directly addresses the core issue of generative uncertainty, allowing 3D reconstruction systems to maintain high visual fidelity even when users freely navigate outside the original training viewpoints. Experimental evaluation demonstrates that our method substantially improves visual quality, realism, and consistency compared to existing Neural Radiance Field (NeRF)-based approaches such as BayesRays. Critically, our filter seamlessly integrates into existing 3DGS rendering pipelines in real-time, unlike methods that require extensive post-hoc retraining or fine-tuning. Code and results at https://damian-bowness.github.io/EV3DGS
☆ Improving the Physics of Video Generation with VJEPA-2 Reward Signal
This is a short technical report describing the winning entry of the PhysicsIQ Challenge, presented at the Perception Test Workshop at ICCV 2025. State-of-the-art video generative models exhibit severely limited physical understanding, and often produce implausible videos. The Physics IQ benchmark has shown that visual realism does not imply physics understanding. Yet, intuitive physics understanding has shown to emerge from SSL pretraining on natural videos. In this report, we investigate whether we can leverage SSL-based video world models to improve the physics plausibility of video generative models. In particular, we build ontop of the state-of-the-art video generative model MAGI-1 and couple it with the recently introduced Video Joint Embedding Predictive Architecture 2 (VJEPA-2) to guide the generation process. We show that by leveraging VJEPA-2 as reward signal, we can improve the physics plausibility of state-of-the-art video generative models by ~6%.
comment: 2 pages
♻ ☆ Optimized 3D Gaussian Splatting using Coarse-to-Fine Image Frequency Modulation
The field of Novel View Synthesis has been revolutionized by 3D Gaussian Splatting (3DGS), which enables high-quality scene reconstruction that can be rendered in real-time. 3DGS-based techniques typically suffer from high GPU memory and disk storage requirements which limits their practical application on consumer-grade devices. We propose Opti3DGS, a novel frequency-modulated coarse-to-fine optimization framework that aims to minimize the number of Gaussian primitives used to represent a scene, thus reducing memory and storage demands. Opti3DGS leverages image frequency modulation, initially enforcing a coarse scene representation and progressively refining it by modulating frequency details in the training images. On the baseline 3DGS, we demonstrate an average reduction of 62% in Gaussians, a 40% reduction in the training GPU memory requirements and a 20% reduction in optimization time without sacrificing the visual quality. Furthermore, we show that our method integrates seamlessly with many 3DGS-based techniques, consistently reducing the number of Gaussian primitives while maintaining, and often improving, visual quality. Additionally, Opti3DGS inherently produces a level-of-detail scene representation at no extra cost, a natural byproduct of the optimization pipeline. Results and code will be made publicly available.
♻ ☆ LookUp3D: Data-Driven 3D Scanning
High speed, high-resolution, and accurate 3D scanning would open doors to many new applications in graphics, robotics, science, and medicine by enabling the accurate scanning of deformable objects during interactions. Past attempts to use structured light, time-of-flight, and stereo in high-speed settings have usually required tradeoffs in resolution or inaccuracy. In this paper, we introduce a method that enables, for the first time, 3D scanning at 450 frames per second at 1~Megapixel, or 1,450 frames per second at 0.4~Megapixel in an environment with controlled lighting. The key idea is to use a per-pixel lookup table that maps colors to depths, which is built using a linear stage. Imperfections, such as lens-distortion and sensor defects are baked into the calibration. We describe our method and test it on a novel hardware prototype. We compare the system with both ground-truth geometry as well as commercially available dynamic sensors like the Microsoft Kinect and Intel Realsense. Our results show the system acquiring geometry of objects undergoing high-speed deformations and oscillations and demonstrate the ability to recover physical properties from the reconstructions.
comment: Giancarlo Pereira, Yidan Gao, and Yurii Piadyk are joint first authors with equal contribution. 11 pages of main paper, 9 pages of supplemental text (all combined into a single document)
Computation and Language 129
☆ olmOCR 2: Unit Test Rewards for Document OCR
We present olmOCR 2, the latest in our family of powerful OCR systems for converting digitized print documents, like PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision language model (VLM) trained using reinforcement learning with verifiable rewards (RLVR), where our rewards are a diverse set of binary unit tests. To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions. We release our model, data and code under permissive open licenses.
comment: https://olmocr.allen.ai/
☆ Hubble: a Model Suite to Advance the Study of LLM Memorization
We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models -- standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens -- establishing that memorization risks are determined by the frequency of sensitive data relative to size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing that sensitive data without continued exposure can be forgotten. These findings suggest two best practices for addressing memorization risks: to dilute sensitive data by increasing the size of the training corpus, and to order sensitive data to appear earlier in training. Beyond these general empirical findings, Hubble enables a broad range of memorization research; for example, analyzing the biographies reveals how readily different types of private information are memorized. We also demonstrate that the randomized insertions in Hubble make it an ideal testbed for membership inference and machine unlearning, and invite the community to further explore, benchmark, and build upon our work.
☆ Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.
☆ Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the ''learning cliff'' phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model's independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO's effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLM.
comment: Code: https://github.com/dvlab-research/Scaf-GRPO
☆ The Art of Asking: Multilingual Prompt Optimization for Synthetic Data
Synthetic data has become a cornerstone for scaling large language models, yet its multilingual use remains bottlenecked by translation-based prompts. This strategy inherits English-centric framing and style and neglects cultural dimensions, ultimately constraining model generalization. We argue that the overlooked prompt space-the very inputs that define training distributions-offers a more powerful lever for improving multilingual performance. We introduce a lightweight framework for prompt-space optimization, where translated prompts are systematically transformed for Naturalness, Cultural Adaptation, and Difficulty Enhancement. Using an off-the-shelf multilingual LLM, we apply these transformations to prompts for 12 languages spanning 7 families. Under identical data conditions, our approaches achieve substantial and consistent downstream improvements over the translation-only baseline: +4.7% on Global-MMLU accuracy, +2.4% on Flores XCometXL and +35.3% wins in preferences on mArenaHard. We establish prompt-space optimization as a simple yet powerful paradigm for building multilingual LLMs that are more robust, culturally grounded, and globally capable.
☆ Blackbox Model Provenance via Palimpsestic Membership Inference
Suppose Alice trains an open-weight language model and Bob uses a blackbox derivative of Alice's model to produce text. Can Alice prove that Bob is using her model, either by querying Bob's derivative model (query setting) or from the text alone (observational setting)? We formulate this question as an independence testing problem--in which the null hypothesis is that Bob's model or text is independent of Alice's randomized training run--and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so we can test whether Bob is using Alice's model using test statistics that capture correlation between Bob's model or text and the ordering of training examples in Alice's training run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable statistical evidence against the null hypothesis, regardless of the composition of Alice's training data. In the query setting, we directly estimate (via prompting) the likelihood Bob's model gives to Alice's training examples and order; we correlate the likelihoods of over 40 fine-tunes of various Pythia and OLMo base models ranging from 1B to 12B parameters with the base model's training data order, achieving a p-value on the order of at most 1e-8 in all but six cases. In the observational setting, we try two approaches based on estimating 1) the likelihood of Bob's text overlapping with spans of Alice's training examples and 2) the likelihood of Bob's text with respect to different versions of Alice's model we obtain by repeating the last phase (e.g., 1%) of her training run on reshuffled data. The second approach can reliably distinguish Bob's text from as little as a few hundred tokens; the first does not involve any retraining but requires many more tokens (several hundred thousand) to achieve high power.
☆ ToolDreamer: Instilling LLM Reasoning Into Tool Retrievers
Tool calling has become increasingly popular for Large Language Models (LLMs). However, for large tool sets, the resulting tokens would exceed the LLM's context window limit, making it impossible to include every tool. Hence, an external retriever is used to provide LLMs with the most relevant tools for a query. Existing retrieval models rank tools based on the similarity between a user query and a tool description (TD). This leads to suboptimal retrieval as user requests are often poorly aligned with the language of TD. To remedy the issue, we propose ToolDreamer, a framework to condition retriever models to fetch tools based on hypothetical (synthetic) TD generated using an LLM, i.e., description of tools that the LLM feels will be potentially useful for the query. The framework enables a more natural alignment between queries and tools within the language space of TD's. We apply ToolDreamer on the ToolRet dataset and show that our method improves the performance of sparse and dense retrievers with and without training, thus showcasing its flexibility. Through our proposed framework, our aim is to offload a portion of the reasoning burden to the retriever so that the LLM may effectively handle a large collection of tools without inundating its context window.
☆ Adapting Multilingual Models to Code-Mixed Tasks via Model Merging
We study model merging as a practical alternative to conventional adaptation strategies for code-mixed NLP. Starting from a multilingual base model, we: (i) perform continued pre-training (CPT) on unlabeled code-mixed text to obtain an adapted checkpoint, (ii) merge checkpoint with the base model, and (iii) fine-tune (FT) on the downstream task data. We evaluate our approach for sentence classification (sentiment and hate speech) task in English-Hindi (En-Hi) and English-Spanish (En-Es) using XLM-R and Llama-3.2-1B models. Our results show that merged models consistently outperform full fine-tuning and CPT->FT. We observe gains of 2--5 points in F1 over full fine-tuning and ~1-2 points over CPT->FT, indicating that unlabeled data is leveraged more effectively via merging than via CPT alone. Zero-/few-shot prompting with larger LLMs (e.g., Llama-3.3-70B) lags behind fine-tuned and merged checkpoints, underscoring limits of in-context learning for code-mixed inputs. We further test cross-pair transfer by training on En-Hi and evaluating on En-Ta and En-Ml: merged checkpoints transfer more strongly than monolingual-English baselines (e.g., TV/TIES variants reaching 0.65-0.68 F1 vs 0.61-0.63 for full fine-tuning), suggesting that code-mixed knowledge is a more reliable substrate for low-resource pairs. We conclude with adaptation recipes matched to common data regimes (labeled only; labeled+unlabeled; transfer-only) and discuss limitations and scaling considerations for broader tasks and larger models.
comment: 9 pages, 5 tables, CODS 2025
☆ AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize token acceptance rate. Therefore, draft models often struggle to fully assimilate the target model's knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15\%). The code is publicly available at https://github.com/yuezhouhu/adaspec.
☆ GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters
Sparse fine-tuning techniques adapt LLMs to downstream tasks by only tuning a sparse subset of model parameters. However, the effectiveness of sparse adaptation depends on optimally selecting the model parameters to be fine-tuned. In this work, we introduce a novel sparse fine-tuning technique named GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters, which fine-tunes only those model parameters which have the largest gradient magnitudes on downstream tasks and the smallest pre-trained magnitudes, intuitively prioritizing parameters that are highly task-relevant, but minimally disruptive to pre-trained knowledge. Our experimentation with LLaMA3 8B and Gemma 2B as base models shows that GaLLoP consistently improves or matches the in-distribution as well as out-of-distribution performance obtained via the usage of other leading parameter-efficient fine-tuning techniques, including LoRA, DoRA, and SAFT. Our analysis demonstrates that GaLLoP mitigates catastrophic forgetting and memorization of task data, as important pre-trained parameters remain unchanged, and stabilizes performance relative to other fine-tuning techniques, robustly generalizing across most random seeds.
☆ SmartSwitch: Advancing LLM Reasoning by Overcoming Underthinking via Promoting Deeper Thought Exploration
The long chain-of-thought (LongCoT) capability is central to the recent breakthroughs achieved by large language models in complex reasoning tasks. However, the accompanying issue of ''underthinking'', where models exhibit shallow reasoning by frequently switching thoughts without sufficient exploration, limits both performance and token efficiency. To address this problem, we propose a simple yet effective reasoning strategy: the SmartSwitch inference framework. This framework can be easily integrated into any large language model as a plug-and-play solution, continuously monitoring the model's reasoning process to detect underthinking and guide it toward deeper exploration of promising but overlooked thoughts. Specifically, the perception module identifies points where thoughts switch and evaluates the potential of the preceding thought using an off-the-shelf process reward model (PRM). If a high-potential thought is found to be prematurely abandoned, the intervention module interrupts the ongoing inference, backtracks to the point before the switch, and inserts a "deepening prompt" to encourage further exploration along that promising path. Extensive experiments on challenging mathematical reasoning benchmarks demonstrate that our method significantly enhances the performance of various large language models of different sizes.
comment: Code: https://github.com/dvlab-research/SmartSwitch
☆ Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning
Large Language Model (LLM) conditioning refers to instructing an LLM to generate content in accordance with the norms and values of a specific culture, beliefs of a particular political orientation, or any desired text-specified semantic conditioning. Unfortunately, prompt engineering does not ensure that LLMs behave in accordance with a desired conditioning due to the inductive bias of the pre-training and alignment datasets. Prior works have focused on fine-tuning LLMs by directly conditioning the LoRA weights; however, such methods introduce a large number of parameters. As a remedy, we propose Zhyper, a parameter-efficient factorized hypernetwork framework that generates context-aware LoRA adapters from textual descriptions. Experiments on multiple benchmarks show that Zhyper achieves competitive performance with up to 26x fewer parameters than the state-of-the-art baselines. Furthermore, we extend Zhyper to cultural alignment, demonstrating improved generalization to out-of-domain settings and a better capturing of fine-grained contextual values.
☆ From Answers to Guidance: A Proactive Dialogue System for Legal Documents
The accessibility of legal information remains a constant challenge, particularly for laypersons seeking to understand and apply complex institutional texts. While the European Union provides open access to legislation, parliamentary responses, and regulatory documents, these resources can be challenging for laypeople to explore. In this paper, we introduce EUDial, a proactive multi-turn dialogue dataset constructed from 204 blogs curated by the Citizens' Enquiries Unit (AskEP) of the European Parliamentary Research Service. EUDial contains 880 dialogue turns (averaging 4.3 turns per dialogue), where each dialogue includes initial questions, structured answers, and follow-up questions. Beyond dataset construction, we propose the LexGuide framework that leverages retrieval-augmented generation with hierarchical topic organization to structure dialogue progression, ensuring both comprehensive coverage of legal aspects and coherence across conversational turns. The results demonstrate that proactive, structured navigation closes the gap between the availability of legal information and citizen comprehension, establishing EUDial and LexGuide as practical resources for advancing proactive legal dialogue systems.
comment: 21 pages, 3 figures, 2 tables, 2 prompts
☆ Do Prompts Reshape Representations? An Empirical Study of Prompting Effects on Embeddings
Prompting is a common approach for leveraging LMs in zero-shot settings. However, the underlying mechanisms that enable LMs to perform diverse tasks without task-specific supervision remain poorly understood. Studying the relationship between prompting and the quality of internal representations can shed light on how pre-trained embeddings may support in-context task solving. In this empirical study, we conduct a series of probing experiments on prompt embeddings, analyzing various combinations of prompt templates for zero-shot classification. Our findings show that while prompting affects the quality of representations, these changes do not consistently correlate with the relevance of the prompts to the target task. This result challenges the assumption that more relevant prompts necessarily lead to better representations. We further analyze potential factors that may contribute to this unexpected behavior.
☆ Are Large Language Models Sensitive to the Motives Behind Communication? NeurIPS 2025
Human communication is motivated: people speak, write, and create content with a particular communicative intent in mind. As a result, information that large language models (LLMs) and AI agents process is inherently framed by humans' intentions and incentives. People are adept at navigating such nuanced information: we routinely identify benevolent or self-serving motives in order to decide what statements to trust. For LLMs to be effective in the real world, they too must critically evaluate content by factoring in the motivations of the source -- for instance, weighing the credibility of claims made in a sales pitch. In this paper, we undertake a comprehensive study of whether LLMs have this capacity for motivational vigilance. We first employ controlled experiments from cognitive science to verify that LLMs' behavior is consistent with rational models of learning from motivated testimony, and find they successfully discount information from biased sources in a human-like manner. We then extend our evaluation to sponsored online adverts, a more naturalistic reflection of LLM agents' information ecosystems. In these settings, we find that LLMs' inferences do not track the rational models' predictions nearly as closely -- partly due to additional information that distracts them from vigilance-relevant considerations. However, a simple steering intervention that boosts the salience of intentions and incentives substantially increases the correspondence between LLMs and the rational model. These results suggest that LLMs possess a basic sensitivity to the motivations of others, but generalizing to novel real-world settings will require further improvements to these models.
comment: NeurIPS 2025
☆ CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation
We present CoSense-LLM, an edge-first framework that turns continuous multimodal sensor streams (for example Wi-Fi CSI, IMU, audio, RFID, and lightweight vision) into compact, verifiable semantic tokens and coordinates with large language models under explicit latency, energy, bandwidth, and privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight encoder that aligns sensor embeddings with language and compresses them into short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer that grounds generation in site specific policies and notes; (iii) PromptRouter, a cost and uncertainty aware policy that selects edge only generation, edge plus retrieval, or compact cloud escalation; and (iv) Secure Execution, an auditable redaction path that enforces data minimization so raw waveforms never leave the device. The system works with modern serving optimizations, including paged or streaming KV caches, FlashAttention style kernels, speculative decoding, and quantized LoRA adapters, and supports on device personalization and federated updates under non IID drift. Across home, office, and clinic deployments, CoSense-LLM delivers grounded explanations while meeting tight service level objectives: it sustains sub second (p95) end to end latency on edge dominant paths, reduces inter tier token and bandwidth costs by preferring local retrieval grounded responses, and preserves privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency and reduces contradictions, calibrated uncertainty enables selective abstention and controlled escalations, and KV plus decoding accelerators lower energy per decision. The results support an edge first design that treats semantics, privacy, and predictable latency as co equal goals for large model deployments in interference prone environments.
comment: 19 pages,8 figures
☆ DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference
Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high performance without overthinking. First, we analyze the entropy of token probabilities in reasoning traces. Across three models, we observe a consistent U-shaped entropy pattern: high entropy on easy problems despite high accuracy, low entropy on problems with medium difficulty, and high entropy on hard problems reflecting uncertainty. Specifically, we notice 22--25\% entropy reduction from easy to medium difficulty regions, suggesting an {overthinking} phenomenon on easy instances. Building on these insights, we introduce \textbf{DiffAdapt}, a lightweight framework that selects Easy/Normal/Hard inference strategies per question based on their difficulty and reasoning trace entropy. Each inference strategy consists of a fixed prompt, temperature and maximum token length. In contrast to existing efficiency optimization methods, our approach does not fine-tune base LLM but a small probe that classifies LLM's final hidden state, allowing inexpensive adaptation. We comprehensively evaluate our method on five models and eight benchmarks. Our method achieves comparable or improved accuracy while reducing token usage by up to 22.4\%, establishing a practical path toward compute-efficient reasoning.
☆ Unraveling Emotions with Pre-Trained Models
Transformer models have significantly advanced the field of emotion recognition. However, there are still open challenges when exploring open-ended queries for Large Language Models (LLMs). Although current models offer good results, automatic emotion analysis in open texts presents significant challenges, such as contextual ambiguity, linguistic variability, and difficulty interpreting complex emotional expressions. These limitations make the direct application of generalist models difficult. Accordingly, this work compares the effectiveness of fine-tuning and prompt engineering in emotion detection in three distinct scenarios: (i) performance of fine-tuned pre-trained models and general-purpose LLMs using simple prompts; (ii) effectiveness of different emotion prompt designs with LLMs; and (iii) impact of emotion grouping techniques on these models. Experimental tests attain metrics above 70% with a fine-tuned pre-trained model for emotion recognition. Moreover, the findings highlight that LLMs require structured prompt engineering and emotion grouping to enhance their performance. These advancements improve sentiment analysis, human-computer interaction, and understanding of user behavior across various domains.
☆ From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction
Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: the world models are mostly learned for world simulation and decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, the synergistic facilitation mechanism of world modeling for planning still requires further exploration. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but is also able to benefit planning using the learned world knowledge through the proposed action-free future state forecasting scheme. Through collaborative state-action prediction, PWM can mimic the human-like anticipatory perception, yielding more reliable planning performance. To facilitate the efficiency of video forecasting, we further introduce a dynamically enhanced parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss. Despite utilizing only front camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs. Code and model weights will be released at https://github.com/6550Zhao/Policy-World-Model.
comment: Accepted by NuerIPS 2025 (Poster)
☆ LLavaCode: Compressed Code Representations for Retrieval-Augmented Code Generation
Retrieval-augmented generation has emerged as one of the most effective approaches for code completion, particularly when context from a surrounding repository is essential. However, incorporating context significantly extends sequence length, leading to slower inference - a critical limitation for interactive settings such as IDEs. In this work, we introduce LlavaCode, a framework that compresses code into compact, semantically rich representations interpretable by code LLM, enhancing generation quality while reducing the retrieved context to only a few compressed single-token vectors. Using a small projector module we can significantly increase the EM and ES metrics of coding model with negligible latency increase. Our experiments demonstrate that compressed context enables 20-38% reduction in Time-to-First-Token (TTFT) on line completion tasks compared to full-RAG pipelines.
☆ Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent
With social media growth, users employ stylistic fonts and font-like emoji to express individuality, creating visually appealing text that remains human-readable. However, these fonts introduce hidden vulnerabilities in NLP models: while humans easily read stylistic text, models process these characters as distinct tokens, causing interference. We identify this human-model perception gap and propose a style-based attack, Style Attack Disguise (SAD). We design two sizes: light for query efficiency and strong for superior attack performance. Experiments on sentiment classification and machine translation across traditional models, LLMs, and commercial services demonstrate SAD's strong attack performance. We also show SAD's potential threats to multimodal tasks including text-to-image and text-to-speech generation.
☆ HSCodeComp: A Realistic and Expert-level Benchmark for Deep Search Agents in Hierarchical Rule Application
Effective deep search agents must not only access open-domain and domain-specific knowledge but also apply complex rules-such as legal clauses, medical manuals and tariff rules. These rules often feature vague boundaries and implicit logic relationships, making precise application challenging for agents. However, this critical capability is largely overlooked by current agent benchmarks. To fill this gap, we introduce HSCodeComp, the first realistic, expert-level e-commerce benchmark designed to evaluate deep search agents in hierarchical rule application. In this task, the deep reasoning process of agents is guided by these rules to predict 10-digit Harmonized System Code (HSCode) of products with noisy but realistic descriptions. These codes, established by the World Customs Organization, are vital for global supply chain efficiency. Built from real-world data collected from large-scale e-commerce platforms, our proposed HSCodeComp comprises 632 product entries spanning diverse product categories, with these HSCodes annotated by several human experts. Extensive experimental results on several state-of-the-art LLMs, open-source, and closed-source agents reveal a huge performance gap: best agent achieves only 46.8% 10-digit accuracy, far below human experts at 95.0%. Besides, detailed analysis demonstrates the challenges of hierarchical rule application, and test-time scaling fails to improve performance further.
☆ CrossNews-UA: A Cross-lingual News Semantic Similarity Benchmark for Ukrainian, Polish, Russian, and English
In the era of social networks and rapid misinformation spread, news analysis remains a critical task. Detecting fake news across multiple languages, particularly beyond English, poses significant challenges. Cross-lingual news comparison offers a promising approach to verify information by leveraging external sources in different languages (Chen and Shu, 2024). However, existing datasets for cross-lingual news analysis (Chen et al., 2022a) were manually curated by journalists and experts, limiting their scalability and adaptability to new languages. In this work, we address this gap by introducing a scalable, explainable crowdsourcing pipeline for cross-lingual news similarity assessment. Using this pipeline, we collected a novel dataset CrossNews-UA of news pairs in Ukrainian as a central language with linguistically and contextually relevant languages-Polish, Russian, and English. Each news pair is annotated for semantic similarity with detailed justifications based on the 4W criteria (Who, What, Where, When). We further tested a range of models, from traditional bag-of-words, Transformer-based architectures to large language models (LLMs). Our results highlight the challenges in multilingual news analysis and offer insights into models performance.
☆ PBBQ: A Persian Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models
With the increasing adoption of large language models (LLMs), ensuring their alignment with social norms has become a critical concern. While prior research has examined bias detection in various languages, there remains a significant gap in resources addressing social biases within Persian cultural contexts. In this work, we introduce PBBQ, a comprehensive benchmark dataset designed to evaluate social biases in Persian LLMs. Our benchmark, which encompasses 16 cultural categories, was developed through questionnaires completed by 250 diverse individuals across multiple demographics, in close collaboration with social science experts to ensure its validity. The resulting PBBQ dataset contains over 37,000 carefully curated questions, providing a foundation for the evaluation and mitigation of bias in Persian language models. We benchmark several open-source LLMs, a closed-source model, and Persian-specific fine-tuned models on PBBQ. Our findings reveal that current LLMs exhibit significant social biases across Persian culture. Additionally, by comparing model outputs to human responses, we observe that LLMs often replicate human bias patterns, highlighting the complex interplay between learned representations and cultural stereotypes.Upon acceptance of the paper, our PBBQ dataset will be publicly available for use in future work. Content warning: This paper contains unsafe content.
☆ Human-Agent Collaborative Paper-to-Page Crafting for Under $0.1
In the quest for scientific progress, communicating research is as vital as the discovery itself. Yet, researchers are often sidetracked by the manual, repetitive chore of building project webpages to make their dense papers accessible. While automation has tackled static slides and posters, the dynamic, interactive nature of webpages has remained an unaddressed challenge. To bridge this gap, we reframe the problem, arguing that the solution lies not in a single command, but in a collaborative, hierarchical process. We introduce $\textbf{AutoPage}$, a novel multi-agent system that embodies this philosophy. AutoPage deconstructs paper-to-page creation into a coarse-to-fine pipeline from narrative planning to multimodal content generation and interactive rendering. To combat AI hallucination, dedicated "Checker" agents verify each step against the source paper, while optional human checkpoints ensure the final product aligns perfectly with the author's vision, transforming the system from a mere tool into a powerful collaborative assistant. To rigorously validate our approach, we also construct $\textbf{PageBench}$, the first benchmark for this new task. Experiments show AutoPage not only generates high-quality, visually appealing pages but does so with remarkable efficiency in under 15 minutes for less than \$0.1. Code and dataset will be released at $\href{https://mqleet.github.io/AutoPage_ProjectPage/}{Webpage}$.
☆ Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
This paper presents a novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary models is achievable. Our study provides the first comprehensive analysis of these models' capabilities and limits for this task.
comment: Under review. Both the dataset and code will be published
☆ [De|Re]constructing VLMs' Reasoning in Counting
Vision-Language Models (VLMs) have recently gained attention due to their competitive performance on multiple downstream tasks, achieved by following user-input instructions. However, VLMs still exhibit several limitations in visual reasoning, such as difficulties in identifying relations (e.g., spatial, temporal, and among objects), understanding temporal sequences (e.g., frames), and counting objects. In this work, we go beyond score-level benchmark evaluations of VLMs by investigating the underlying causes of their failures and proposing a targeted approach to improve their reasoning capabilities. We study the reasoning skills of seven state-of-the-art VLMs in the counting task under controlled experimental conditions. Our experiments show that VLMs are highly sensitive to the number and type of objects, their spatial arrangement, and the co-occurrence of distractors. A layer-wise analysis reveals that errors are due to incorrect mapping of the last-layer representation into the output space. Our targeted training shows that fine-tuning just the output layer improves accuracy by up to 21%. We corroborate these findings by achieving consistent improvements on real-world datasets.
comment: This work has been submitted to the IEEE for possible publication
☆ Conditions for Catastrophic Forgetting in Multilingual Translation
Fine-tuning multilingual foundation models on specific languages often induces catastrophic forgetting, degrading performance on languages unseen in fine-tuning. While this phenomenon is widely-documented, the literature presents fragmented results about when forgetting occurs. To address this ambiguity, we conduct a systematic empirical study using machine translation as a testbed to identify the conditions that trigger catastrophic forgetting in multilingual fine-tuning. Through controlled experiments across different model architectures, data scales, and fine-tuning approaches, we reveal that the relative scale between model and data size is a primary determinant of forgetting. Moreover, we demonstrate that a model's instruction-following ability is more critical for retaining multilingual knowledge than its architecture. Contrary to assumptions, parameter-efficient fine-tuning offers no clear advantage over full fine-tuning in mitigating forgetting. Lastly, we show that cross-lingual alignment can mitigate forgetting while also facilitating positive transfer to unseen target languages.
comment: Multilingual Representation Learning (MRL) Workshop 2025
☆ Which Evaluation for Which Model? A Taxonomy for Speech Model Assessment
Speech foundation models have recently achieved remarkable capabilities across a wide range of tasks. However, their evaluation remains disjointed across tasks and model types. Different models excel at distinct aspects of speech processing and thus require different evaluation protocols. This paper proposes a unified taxonomy that addresses the question: Which evaluation is appropriate for which model? The taxonomy defines three orthogonal axes: the \textbf{evaluation aspect} being measured, the model capabilities required to attempt the task, and the task or protocol requirements needed to perform it. We classify a broad set of existing evaluations and benchmarks along these axes, spanning areas such as representation learning, speech generation, and interactive dialogue. By mapping each evaluation to the capabilities a model exposes (e.g., speech generation, real-time processing) and to its methodological demands (e.g., fine-tuning data, human judgment), the taxonomy provides a principled framework for aligning models with suitable evaluation methods. It also reveals systematic gaps, such as limited coverage of prosody, interaction, or reasoning, that highlight priorities for future benchmark design. Overall, this work offers a conceptual foundation and practical guide for selecting, interpreting, and extending evaluations of speech models.
comment: 57 pages (26 main, 25 appendix, 6 references)
☆ Lookahead Routing for Large Language Models
Large language model (LLM) routers improve the efficiency of multi-model systems by directing each query to the most appropriate model while leveraging the diverse strengths of heterogeneous LLMs. Most existing approaches frame routing as a classification problem based solely on the input query. While this reduces overhead by avoiding inference across all models, it overlooks valuable information that could be gleaned from potential outputs and fails to capture implicit intent or contextual nuances that often emerge only during response generation. These limitations can result in suboptimal routing decisions, particularly for complex or ambiguous queries that require deeper semantic understanding. To address this challenge, we propose Lookahead, a routing framework that "foresees" potential model outputs by predicting their latent representations and uses these predictions to guide model selection, thus enabling more informed routing without full inference. Within this framework, we implement two approaches based on causal and masked language models. Empirical evaluations across seven public benchmarks - spanning instruction following, mathematical reasoning, and code generation - show that Lookahead consistently outperforms existing routing baselines, achieving an average performance gain of 7.7% over the state-of-the-art. Our code is available at https://github.com/huangcb01/lookahead-routing.
☆ What is the Best Sequence Length for BABYLM? EMNLP
Transformer language models typically operate with a fixed-length context window, which has grown in step with large-scale pretraining datasets. In the BabyLM Challenge, however, many past submissions have defaulted to using much shorter sequence lengths. We examine the impact of sequence length on BabyLM pretraining, to answer the simple question: what sequence length should we be using when training Baby LMs? Using 100M-word training data and fixed compute budgets, we compare 125M-parameter Mamba and OPT models, finding that although longer is often better, the optimal length depends on both task and architecture. Shorter sequences are sufficient for grammatical generalization tasks whereas longer contexts benefit morphological analogical reasoning tasks.
comment: Paper Accepted at the 2025 BabyLM Workshop @ EMNLP (Suzhou, China)
☆ Machine Text Detectors are Membership Inference Attacks
Although membership inference attacks (MIAs) and machine-generated text detection target different goals, identifying training samples and synthetic texts, their methods often exploit similar signals based on a language model's probability distribution. Despite this shared methodological foundation, the two tasks have been independently studied, which may lead to conclusions that overlook stronger methods and valuable insights developed in the other task. In this work, we theoretically and empirically investigate the transferability, i.e., how well a method originally developed for one task performs on the other, between MIAs and machine text detection. For our theoretical contribution, we prove that the metric that achieves the asymptotically highest performance on both tasks is the same. We unify a large proportion of the existing literature in the context of this optimal metric and hypothesize that the accuracy with which a given method approximates this metric is directly correlated with its transferability. Our large-scale empirical experiments, including 7 state-of-the-art MIA methods and 5 state-of-the-art machine text detectors across 13 domains and 10 generators, demonstrate very strong rank correlation (rho > 0.6) in cross-task performance. We notably find that Binoculars, originally designed for machine text detection, achieves state-of-the-art performance on MIA benchmarks as well, demonstrating the practical impact of the transferability. Our findings highlight the need for greater cross-task awareness and collaboration between the two research communities. To facilitate cross-task developments and fair evaluations, we introduce MINT, a unified evaluation suite for MIAs and machine-generated text detection, with implementation of 15 recent methods from both tasks.
☆ VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
comment: 8 pages, 6 figures
☆ Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition
Recent work has shown that sample-based Minimum Bayes Risk (MBR) decoding outperforms beam search in text-to-text generation tasks, such as machine translation, text summarization, and image captioning. On the other hand, beam search is the current practice for speech-to-text tasks such as automatic speech recognition (ASR) and Speech Translation (ST). Given that MBR decoding is effective in text-to-text generation tasks, it is reasonable to expect it to also be effective for speech-to-text tasks. In this paper, we evaluate MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models. We observe that the accuracy of MBR decoding outperforms that of beam search in most of the experimental settings we have evaluated. The results show that MBR decoding is a promising method for offline ASR and ST tasks that require high accuracy. The code is available at https://github.com/CyberAgentAILab/mbr-for-asr
☆ MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models
Large Multimodal Models (LMMs) encode rich factual knowledge via cross-modal pre-training, yet their static representations struggle to maintain an accurate understanding of time-sensitive factual knowledge. Existing benchmarks remain constrained by static designs, inadequately evaluating LMMs' ability to understand time-sensitive knowledge. To address this gap, we propose MINED, a comprehensive benchmark that evaluates temporal awareness along 6 key dimensions and 11 challenging tasks: cognition, awareness, trustworthiness, understanding, reasoning, and robustness. MINED is constructed from Wikipedia by two professional annotators, containing 2,104 time-sensitive knowledge samples spanning six knowledge types. Evaluating 15 widely used LMMs on MINED shows that Gemini-2.5-Pro achieves the highest average CEM score of 63.07, while most open-source LMMs still lack time understanding ability. Meanwhile, LMMs perform best on organization knowledge, whereas their performance is weakest on sport. To address these challenges, we investigate the feasibility of updating time-sensitive knowledge in LMMs through knowledge editing methods and observe that LMMs can effectively update knowledge via knowledge editing methods in single editing scenarios.
comment: project page:https://mined-lmm.github.io/
☆ LLM Unlearning with LLM Beliefs
Large language models trained on vast corpora inherently risk memorizing sensitive or harmful content, which may later resurface in their outputs. Prevailing unlearning methods generally rely on gradient ascent and its variants to lower the probability of specific target responses. However, we find that this strategy induces a critical side effect: probability mass is redistributed into high-likelihood regions, often corresponding to semantically related rephrasings of the targets. We refer to this as the squeezing effect, which explains why many methods yield merely spurious unlearning, a problem further obscured by automated metrics (e.g., ROUGE, truth ratio) that misreport actual success. To address this, we propose a bootstrapping (BS) framework that explicitly links the squeezing effect with the model's own high-confidence generations, namely its model beliefs. Since model beliefs inherently capture the very high-likelihood regions where probability mass is squeezed, incorporating them into the unlearning objective directly counters the squeezing effect. By jointly suppressing both target responses and model beliefs, BS-T (token) attenuates high-probability tokens, whereas BS-S (sequence) removes entire high-confidence generations, together achieving more thorough forgetting while preserving utility. Extensive experiments across diverse benchmarks with various model families confirm the effectiveness of our approach.
☆ BLiSS 1.0: Evaluating Bilingual Learner Competence in Second Language Small Language Models EMNLP
To bridge the gap between performance-oriented benchmarks and the evaluation of cognitively inspired models, we introduce BLiSS 1.0, a Benchmark of Learner Interlingual Syntactic Structure. Our benchmark operationalizes a new paradigm of selective tolerance, testing whether a model finds a naturalistic learner error more plausible than a matched, artificial error within the same sentence. Constructed from over 2.8 million naturalistic learner sentences, BLiSS provides 136,867 controlled triplets (corrected, learner, artificial) for this purpose. Experiments on a diverse suite of models demonstrate that selective tolerance is a distinct capability from standard grammaticality, with performance clustering strongly by training paradigm. This validates BLiSS as a robust tool for measuring how different training objectives impact a model's alignment with the systematic patterns of human language acquisition.
comment: Accepted Paper at the BabyLM Workshop 2025 @ EMNLP (Presentation in Suzhou, China)
☆ Spatio-temporal Sign Language Representation and Translation
This paper describes the DFKI-MLT submission to the WMT-SLT 2022 sign language translation (SLT) task from Swiss German Sign Language (video) into German (text). State-of-the-art techniques for SLT use a generic seq2seq architecture with customized input embeddings. Instead of word embeddings as used in textual machine translation, SLT systems use features extracted from video frames. Standard approaches often do not benefit from temporal features. In our participation, we present a system that learns spatio-temporal feature representations and translation in a single model, resulting in a real end-to-end architecture expected to better generalize to new data sets. Our best system achieved $5\pm1$ BLEU points on the development set, but the performance on the test dropped to $0.11\pm0.06$ BLEU points.
☆ ToMMeR -- Efficient Entity Mention Detection from Large Language Models
Identifying which text spans refer to entities -- mention detection -- is both foundational for information extraction and a known performance bottleneck. We introduce ToMMeR, a lightweight model (<300K parameters) probing mention detection capabilities from early LLM layers. Across 13 NER benchmarks, ToMMeR achieves 93\% recall zero-shot, with over 90\% precision using an LLM as a judge showing that ToMMeR rarely produces spurious predictions despite high recall. Cross-model analysis reveals that diverse architectures (14M-15B parameters) converge on similar mention boundaries (DICE >75\%), confirming that mention detection emerges naturally from language modeling. When extended with span classification heads, ToMMeR achieves near SOTA NER performance (80-87\% F1 on standard benchmarks). Our work provides evidence that structured entity representations exist in early transformer layers and can be efficiently recovered with minimal parameters.
comment: Code is available at https://github.com/VictorMorand/llm2ner
☆ SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision
Sign language translation (SLT) is typically trained with text in a single spoken language, which limits scalability and cross-language generalization. Earlier approaches have replaced gloss supervision with text-based sentence embeddings, but up to now, these remain tied to a specific language and modality. In contrast, here we employ language-agnostic, multimodal embeddings trained on text and speech from multiple languages to supervise SLT, enabling direct multilingual translation. To address data scarcity, we propose a coupled augmentation method that combines multilingual target augmentations (i.e. translations into many languages) with video-level perturbations, improving model robustness. Experiments show consistent BLEURT gains over text-only sentence embedding supervision, with larger improvements in low-resource settings. Our results demonstrate that language-agnostic embedding supervision, combined with coupled augmentation, provides a scalable and semantically robust alternative to traditional SLT training.
☆ ColorAgent: Building A Robust, Personalized, and Interactive OS Agent
With the advancements in hardware, software, and large language model technologies, the interaction between humans and operating systems has evolved from the command-line interface to the rapidly emerging AI agent interactions. Building an operating system (OS) agent capable of executing user instructions and faithfully following user desires is becoming a reality. In this technical report, we present ColorAgent, an OS agent designed to engage in long-horizon, robust interactions with the environment while also enabling personalized and proactive user interaction. To enable long-horizon interactions with the environment, we enhance the model's capabilities through step-wise reinforcement learning and self-evolving training, while also developing a tailored multi-agent framework that ensures generality, consistency, and robustness. In terms of user interaction, we explore personalized user intent recognition and proactive engagement, positioning the OS agent not merely as an automation tool but as a warm, collaborative partner. We evaluate ColorAgent on the AndroidWorld and AndroidLab benchmarks, achieving success rates of 77.2% and 50.7%, respectively, establishing a new state of the art. Nonetheless, we note that current benchmarks are insufficient for a comprehensive evaluation of OS agents and propose further exploring directions in future work, particularly in the areas of evaluation paradigms, agent collaboration, and security. Our code is available at https://github.com/MadeAgents/mobile-use.
☆ Sign Language Translation with Sentence Embedding Supervision
State-of-the-art sign language translation (SLT) systems facilitate the learning process through gloss annotations, either in an end2end manner or by involving an intermediate step. Unfortunately, gloss labelled sign language data is usually not available at scale and, when available, gloss annotations widely differ from dataset to dataset. We present a novel approach using sentence embeddings of the target sentences at training time that take the role of glosses. The new kind of supervision does not need any manual annotation but it is learned on raw textual data. As our approach easily facilitates multilinguality, we evaluate it on datasets covering German (PHOENIX-2014T) and American (How2Sign) sign languages and experiment with mono- and multilingual sentence embeddings and translation systems. Our approach significantly outperforms other gloss-free approaches, setting the new state-of-the-art for data sets where glosses are not available and when no additional SLT datasets are used for pretraining, diminishing the gap between gloss-free and gloss-dependent systems.
☆ MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs
Mixture-of-Experts (MoE) models, the state-of-the-art in large-scale AI, achieve high quality by sparsely activating parameters. However, their reliance on routing between a few monolithic experts via a top-k mechanism creates a "quality cliff", offering only a few coarse-grained operating points. This inflexibility forces a difficult trade-off between cost and quality, preventing adaptation to diverse Service Level Objectives (SLOs) and leading to significant resource over-provisioning. This paper introduces MoE-Prism, a model-system co-design that transforms rigid MoE models into elastic services. Our methodology is divided into two phases. First, an \emph{Offline Refactoring Engine} systematically deconstructs monolithic experts into fine-grained "sub-experts." This engine employs a partitioning optimization solver that uses a metaheuristic-based approach to group neurons, preserving functional locality without requiring retraining. Second, an \emph{Online Scheduling Engine} leverages this new elasticity through QoS-aware scheduling. It implements specialized policies to solve complex system problems, including maximizing throughput in cloud deployments and managing latency-optimized offloading for memory-constrained devices. Our evaluation across three different MoE models shows that MoE-Prismprovides over 4 times more distinct, stable operating points than the baseline. This allows an AI service to dynamically improve throughput by up to 19.9\% under a strict latency budget or reduce latency by up to 10.36\% under limited resources. MoE-Prism provides the critical "control knob" to bridge the model-system gap, enabling the next generation of adaptive, efficient, and QoS-aware AI services.
☆ The Massive Legal Embedding Benchmark (MLEB)
We present the Massive Legal Embedding Benchmark (MLEB), the largest, most diverse, and most comprehensive open-source benchmark for legal information retrieval to date. MLEB consists of ten expert-annotated datasets spanning multiple jurisdictions (the US, UK, EU, Australia, Ireland, and Singapore), document types (cases, legislation, regulatory guidance, contracts, and literature), and task types (search, zero-shot classification, and question answering). Seven of the datasets in MLEB were newly constructed in order to fill domain and jurisdictional gaps in the open-source legal information retrieval landscape. We document our methodology in building MLEB and creating the new constituent datasets, and release our code, results, and data openly to assist with reproducible evaluations.
comment: 15 pages, 2 figures
☆ LoongRL:Reinforcement Learning for Advanced Reasoning over Long Contexts
Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy by +23.5% and +21.1% absolute gains. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.
☆ AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation
The creation of high-quality datasets to improve Large Language Model (LLM) reasoning remains a significant challenge, as current methods often suffer from generating low-quality/incorrect answers and limited information richness from available data sources. To address this, we propose AgenticMath, a novel agentic pipeline for generating high-quality mathematical question-answer pairs to enhance the supervised fine-tuning of LLMs. Our method operates through four stages: (1) Seed Question Filter that selects questions with high information richness, complexity, and clarity; (2) an Agentic Question Rephrase step that employs a multi-agent system to generate diverse, logically consistent paraphrases; (3) an Answer Augment step where rewrite answers using chain-of-thought reasoning to enhance numerical and logical correctness, without reliance on human-provided labels; and (4) a final Question and Answer Evaluation that retains only the most superior pairs. Extensive experiments demonstrate that, fine-tuning 3B-8B parameter LLMs on AgenticMath generated datasets (comprising only 30-60K math samples) achieves competitive or superior performance on diverse in domain and out-of-domain mathematical reasoning benchmarks compared to baselines trained on much more data (e.g., 400K or 2.3M samples). Our work demonstrates that targeted, high-quality data generation is a more efficient path to improving mathematical reasoning in LLMs than large-scale, low-quality alternatives.
comment: Work in progress
☆ M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models LREC 2026
We present M3-SLU, a new multimodal large language model (MLLM) benchmark for evaluating multi-speaker, multi-turn spoken language understanding. While recent models show strong performance in speech and text comprehension, they still struggle with speaker-attributed reasoning, the ability to understand who said what and when in natural conversations. M3-SLU is built from four open corpora (CHiME-6, MELD, MultiDialog, and AMI) and comprises over 12,000 validated instances with paired audio, transcripts, and metadata. It includes two tasks: (1) Speaker-Attributed Question Answering and (2) Speaker Attribution via Utterance Matching. We provide baseline results for both cascaded pipelines and end-to-end MLLMs, evaluated using an LLM-as-Judge and accuracy metrics. Results show that while models can capture what was said, they often fail to identify who said it, revealing a key gap in speaker-aware dialogue understanding. M3-SLU offers as a challenging benchmark to advance research in speaker-aware multimodal understanding.
comment: Submitted to LREC 2026. 11 pages, 5 figures
☆ Modeling Turn-Taking with Semantically Informed Gestures
In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.
☆ Local Obfuscation by GLINER for Impartial Context Aware Lineage: Development and evaluation of PII Removal system
Removing Personally Identifiable Information (PII) from clinical notes in Electronic Health Records (EHRs) is essential for research and AI development. While Large Language Models (LLMs) are powerful, their high computational costs and the data privacy risks of API-based services limit their use, especially in low-resource settings. To address this, we developed LOGICAL (Local Obfuscation by GLINER for Impartial Context-Aware Lineage), an efficient, locally deployable PII removal system built on a fine-tuned Generalist and Lightweight Named Entity Recognition (GLiNER) model. We used 1515 clinical documents from a psychiatric hospital's EHR system. We defined nine PII categories for removal. A modern-gliner-bi-large-v1.0 model was fine-tuned on 2849 text instances and evaluated on a test set of 376 instances using character-level precision, recall, and F1-score. We compared its performance against Microsoft Azure NER, Microsoft Presidio, and zero-shot prompting with Gemini-Pro-2.5 and Llama-3.3-70B-Instruct. The fine-tuned GLiNER model achieved superior performance, with an overall micro-average F1-score of 0.980, significantly outperforming Gemini-Pro-2.5 (F1-score: 0.845). LOGICAL correctly sanitised 95% of documents completely, compared to 64% for the next-best solution. The model operated efficiently on a standard laptop without a dedicated GPU. However, a 2% entity-level false negative rate underscores the need for human-in-the-loop validation across all tested systems. Fine-tuned, specialised transformer models like GLiNER offer an accurate, computationally efficient, and secure solution for PII removal from clinical notes. This "sanitisation at the source" approach is a practical alternative to resource-intensive LLMs, enabling the creation of de-identified datasets for research and AI development while preserving data privacy, particularly in resource-constrained environments.
comment: 30 pages, 15 main text and 15 supplementary material
☆ Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library-linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
comment: 20 pages, 13 figures
☆ Algorithmic Fairness in NLP: Persona-Infused LLMs for Human-Centric Hate Speech Detection
In this paper, we investigate how personalising Large Language Models (Persona-LLMs) with annotator personas affects their sensitivity to hate speech, particularly regarding biases linked to shared or differing identities between annotators and targets. To this end, we employ Google's Gemini and OpenAI's GPT-4.1-mini models and two persona-prompting methods: shallow persona prompting and a deeply contextualised persona development based on Retrieval-Augmented Generation (RAG) to incorporate richer persona profiles. We analyse the impact of using in-group and out-group annotator personas on the models' detection performance and fairness across diverse social groups. This work bridges psychological insights on group identity with advanced NLP techniques, demonstrating that incorporating socio-demographic attributes into LLMs can address bias in automated hate speech detection. Our results highlight both the potential and limitations of persona-based approaches in reducing bias, offering valuable insights for developing more equitable hate speech detection systems.
comment: This paper has been accepted for the upcoming 59th Hawaii International Conference on System Sciences (HICSS-59), 2026, Hawaii, USA. The final published version will appear in the official conference proceedings
☆ Slot Filling as a Reasoning Task for SpeechLLMs
We propose integration of reasoning into speech large language models (speechLLMs) for the end-to-end slot-filling task. Inspired by the recent development of reasoning LLMs, we use a chain-of-thought framework to decompose the slot-filling task into multiple reasoning steps, create a reasoning dataset and apply the supervised fine-tuning strategy to a speechLLM. We distinguish between regular and reasoning speechLLMs and experiment with different types and sizes of LLMs as their text foundation models. We demonstrate performance improvements by introducing reasoning (intermediate) steps. However, we show that a reasoning textual LLM developed mainly for math, logic and coding domains might be inferior as a foundation model for a reasoning speechLLM. We further show that hybrid speechLLMs, built on a hybrid text foundation LLM and fine-tuned to preserve both direct and reasoning modes of operation, have better performance than those fine-tuned employing only one mode of operation.
☆ Balancing Rewards in Text Summarization: Multi-Objective Reinforcement Learning via HyperVolume Optimization
Text summarization is a crucial task that requires the simultaneous optimization of multiple objectives, including consistency, coherence, relevance, and fluency, which presents considerable challenges. Although large language models (LLMs) have demonstrated remarkable performance, enhanced by reinforcement learning (RL), few studies have focused on optimizing the multi-objective problem of summarization through RL based on LLMs. In this paper, we introduce hypervolume optimization (HVO), a novel optimization strategy that dynamically adjusts the scores between groups during the reward process in RL by using the hypervolume method. This method guides the model's optimization to progressively approximate the pareto front, thereby generating balanced summaries across multiple objectives. Experimental results on several representative summarization datasets demonstrate that our method outperforms group relative policy optimization (GRPO) in overall scores and shows more balanced performance across different dimensions. Moreover, a 7B foundation model enhanced by HVO performs comparably to GPT-4 in the summarization task, while maintaining a shorter generation length. Our code is publicly available at https://github.com/ai4business-LiAuto/HVO.git
☆ HAD: HAllucination Detection Language Models Based on a Comprehensive Hallucination Taxonomy
The increasing reliance on natural language generation (NLG) models, particularly large language models, has raised concerns about the reliability and accuracy of their outputs. A key challenge is hallucination, where models produce plausible but incorrect information. As a result, hallucination detection has become a critical task. In this work, we introduce a comprehensive hallucination taxonomy with 11 categories across various NLG tasks and propose the HAllucination Detection (HAD) models https://github.com/pku0xff/HAD, which integrate hallucination detection, span-level identification, and correction into a single inference process. Trained on an elaborate synthetic dataset of about 90K samples, our HAD models are versatile and can be applied to various NLG tasks. We also carefully annotate a test set for hallucination detection, called HADTest, which contains 2,248 samples. Evaluations on in-domain and out-of-domain test sets show that our HAD models generally outperform the existing baselines, achieving state-of-the-art results on HaluEval, FactCHD, and FaithBench, confirming their robustness and versatility.
☆ KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints
Large Multimodal Models encode extensive factual knowledge in their pre-trained weights. However, its knowledge remains static and limited, unable to keep pace with real-world developments, which hinders continuous knowledge acquisition. Effective knowledge injection thus becomes critical, involving two goals: knowledge adaptation (injecting new knowledge) and knowledge retention (preserving old knowledge). Existing methods often struggle to learn new knowledge and suffer from catastrophic forgetting. To address this, we propose KORE, a synergistic method of KnOwledge-oRientEd augmentations and constraints for injecting new knowledge into large multimodal models while preserving old knowledge. Unlike general text or image data augmentation, KORE automatically converts individual knowledge items into structured and comprehensive knowledge to ensure that the model accurately learns new knowledge, enabling accurate adaptation. Meanwhile, KORE stores previous knowledge in the covariance matrix of LMM's linear layer activations and initializes the adapter by projecting the original weights into the matrix's null space, defining a fine-tuning direction that minimizes interference with previous knowledge, enabling powerful retention. Extensive experiments on various LMMs, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, show that KORE achieves superior new knowledge injection performance and effectively mitigates catastrophic forgetting.
comment: project page: https://kore-lmm.github.io/
☆ JointCQ: Improving Factual Hallucination Detection with Joint Claim and Query Generation
Current large language models (LLMs) often suffer from hallucination issues, i,e, generating content that appears factual but is actually unreliable. A typical hallucination detection pipeline involves response decomposition (i.e., claim extraction), query generation, evidence collection (i.e., search or retrieval), and claim verification. However, existing methods exhibit limitations in the first two stages, such as context loss during claim extraction and low specificity in query generation, resulting in degraded performance across the hallucination detection pipeline. In this work, we introduce JointCQ https://github.com/pku0xff/JointCQ, a joint claim-and-query generation framework designed to construct an effective and efficient claim-query generator. Our framework leverages elaborately designed evaluation criteria to filter synthesized training data, and finetunes a language model for joint claim extraction and query generation, providing reliable and informative inputs for downstream search and verification. Experimental results demonstrate that our method outperforms previous methods on multiple open-domain QA hallucination detection benchmarks, advancing the goal of more trustworthy and transparent language model systems.
☆ TheMCPCompany: Creating General-purpose Agents with Task-specific Tools
Since the introduction of the Model Context Protocol (MCP), the number of available tools for Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents predominantly rely on web browsers for interacting with the environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we use the ground truth tools to show the potential of tool-calling agents for both improving performance and reducing costs assuming perfect tool retrieval. Next, we explore agent performance using tool retrieval to study the real-world practicality of tool-based agents. While all models with tool retrieval perform similarly or better than browser-based agents, smaller models cannot take full advantage of the available tools through retrieval. On the other hand, GPT-5's performance with tool retrieval is very close to its performance with ground-truth tools. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments. TheMCPCompany reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems is still a challenging task for current models and requires both better reasoning and better retrieval models.
comment: Code: https://github.com/Reza-esfandiarpoor/the-mcp-company
☆ Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization
Difficulty-controllable question generation for reading comprehension has gained significant attention in the field of education as a fundamental tool for adaptive learning support. Although several neural question generation methods have recently succeeded in controlling difficulty, conventional approaches still face two major limitations. First, they cannot directly generate multiple-choice questions, which are the most widely used question type in educational contexts. Second, they are not explicitly trained to optimize the accuracy of difficulty control, leaving room for further improvement in difficulty controllability. To address these limitations, this study proposes a novel difficulty-controllable multiple-choice question generation method for reading comprehension which leverages a large language model trained using a direct preference optimization technique to improve the accuracy of difficulty control.
comment: This work has been submitted to the IEEE for possible publication
☆ SheetBrain: A Neuro-Symbolic Agent for Accurate Reasoning over Complex and Large Spreadsheets
Understanding and reasoning over complex spreadsheets remain fundamental challenges for large language models (LLMs), which often struggle with accurately capturing the complex structure of tables and ensuring reasoning correctness. In this work, we propose SheetBrain, a neuro-symbolic dual workflow agent framework designed for accurate reasoning over tabular data, supporting both spreadsheet question answering and manipulation tasks. SheetBrain comprises three core modules: an understanding module, which produces a comprehensive overview of the spreadsheet - including sheet summary and query-based problem insight to guide reasoning; an execution module, which integrates a Python sandbox with preloaded table-processing libraries and an Excel helper toolkit for effective multi-turn reasoning; and a validation module, which verifies the correctness of reasoning and answers, triggering re-execution when necessary. We evaluate SheetBrain on multiple public tabular QA and manipulation benchmarks, and introduce SheetBench, a new benchmark targeting large, multi-table, and structurally complex spreadsheets. Experimental results show that SheetBrain significantly improves accuracy on both existing benchmarks and the more challenging scenarios presented in SheetBench. Our code is publicly available at https://github.com/microsoft/SheetBrain.
☆ Modality Matching Matters: Calibrating Language Distances for Cross-Lingual Transfer in URIEL+
Existing linguistic knowledge bases such as URIEL+ provide valuable geographic, genetic and typological distances for cross-lingual transfer but suffer from two key limitations. One, their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data, and two, they lack a principled method for aggregating these signals into a single, comprehensive score. In this paper, we address these gaps by introducing a framework for type-matched language distances. We propose novel, structure-aware representations for each distance type: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent variables model for typology. We unify these signals into a robust, task-agnostic composite distance. In selecting transfer languages, our representations and composite distances consistently improve performance across a wide range of NLP tasks, providing a more principled and effective toolkit for multilingual research.
☆ DiSRouter: Distributed Self-Routing for LLM Selections
The proliferation of Large Language Models (LLMs) has created a diverse ecosystem of models with highly varying performance and costs, necessitating effective query routing to balance performance and expense. Current routing systems often rely on a centralized external router trained on a fixed set of LLMs, making them inflexible and prone to poor performance since the small router can not fully understand the knowledge boundaries of different LLMs. We introduce DiSRouter (Distributed Self-Router), a novel paradigm that shifts from centralized control to distributed routing. In DiSRouter, a query traverses a network of LLM agents, each independently deciding whether to answer or route to other agents based on its own self-awareness, its ability to judge its competence. This distributed design offers superior flexibility, scalability, and generalizability. To enable this, we propose a two-stage Self-Awareness Training pipeline that enhances each LLM's self-awareness. Extensive experiments demonstrate that DiSRouter significantly outperforms existing routing methods in utility across various scenarios, effectively distinguishes between easy and hard queries, and shows strong generalization to out-of-domain tasks. Our work validates that leveraging an LLM's intrinsic self-awareness is more effective than external assessment, paving the way for more modular and efficient multi-agent systems.
☆ Aligning Multilingual News for Stock Return Prediction
News spreads rapidly across languages and regions, but translations may lose subtle nuances. We propose a method to align sentences in multilingual news articles using optimal transport, identifying semantically similar content across languages. We apply this method to align more than 140,000 pairs of Bloomberg English and Japanese news articles covering around 3500 stocks in Tokyo exchange over 2012-2024. Aligned sentences are sparser, more interpretable, and exhibit higher semantic similarity. Return scores constructed from aligned sentences show stronger correlations with realized stock returns, and long-short trading strategies based on these alignments achieve 10\% higher Sharpe ratios than analyzing the full text sample.
comment: 6 pages, 4 tables, 2 figures, AI for Finance Symposium'25 Workshop at ICAIF'25
☆ Multi-Faceted Evaluation of Tool-Augmented Dialogue Systems
Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among user, agent, and tools. While existing evaluation methods assess either user satisfaction or agents' tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues-such as when agents misinterpret tool results yet appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases, and SCOPE, an evaluation framework that automatically discovers diverse error patterns and evaluation rubrics in tool-augmented dialogues. Experiments show SCOPE significantly outperforms the baseline, particularly on challenging cases where user satisfaction signals are misleading.
comment: The first two authors contributed equally. Manuscript under submission
☆ Interpretable Question Answering with Knowledge Graphs
This paper presents a question answering system that operates exclusively on a knowledge graph retrieval without relying on retrieval augmented generation (RAG) with large language models (LLMs). Instead, a small paraphraser model is used to paraphrase the entity relationship edges retrieved from querying the knowledge graph. The proposed pipeline is divided into two main stages. The first stage involves pre-processing a document to generate sets of question-answer (QA) pairs. The second stage converts these QAs into a knowledge graph from which graph-based retrieval is performed using embeddings and fuzzy techniques. The graph is queried, re-ranked, and paraphrased to generate a final answer. This work includes an evaluation using LLM-as-a-judge on the CRAG benchmark, which resulted in accuracies of 71.9% and 54.4% using LLAMA-3.2 and GPT-3.5-Turbo, respectively.
☆ The Zero-Step Thinking: An Empirical Study of Mode Selection as Harder Early Exit in Reasoning Models NeurIPS'25
Reasoning models have demonstrated exceptional performance in tasks such as mathematics and logical reasoning, primarily due to their ability to engage in step-by-step thinking during the reasoning process. However, this often leads to overthinking, resulting in unnecessary computational overhead. To address this issue, Mode Selection aims to automatically decide between Long-CoT (Chain-of-Thought) or Short-CoT by utilizing either a Thinking or NoThinking mode. Simultaneously, Early Exit determines the optimal stopping point during the iterative reasoning process. Both methods seek to reduce the computational burden. In this paper, we first identify Mode Selection as a more challenging variant of the Early Exit problem, as they share similar objectives but differ in decision timing. While Early Exit focuses on determining the best stopping point for concise reasoning at inference time, Mode Selection must make this decision at the beginning of the reasoning process, relying on pre-defined fake thoughts without engaging in an explicit reasoning process, referred to as zero-step thinking. Through empirical studies on nine baselines, we observe that prompt-based approaches often fail due to their limited classification capabilities when provided with minimal hand-crafted information. In contrast, approaches that leverage internal information generally perform better across most scenarios but still exhibit issues with stability. Our findings indicate that existing methods relying solely on the information provided by models are insufficient for effectively addressing Mode Selection in scenarios with limited information, highlighting the ongoing challenges of this task. Our code is available at https://github.com/Trae1ounG/Zero_Step_Thinking.
comment: Accepted by NeurIPS'25 Efficient Reasoning Workshop
☆ When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA
LLMs often fail to handle temporal knowledge conflicts--contradictions arising when facts evolve over time within their training data. Existing studies evaluate this phenomenon through benchmarks built on structured knowledge bases like Wikidata, but they focus on widely-covered, easily-memorized popular entities and lack the dynamic structure needed to fairly evaluate LLMs with different knowledge cut-off dates. We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS updates, Azure changes, and WHO disease outbreak reports. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. Through extensive evaluation of 12 open and closed-source LLMs across 3 knowledge probing formats, we demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.
comment: Under submission
☆ Think Straight, Stop Smart: Structured Reasoning for Efficient Multi-Hop RAG NeurIPS 2025
Multi-hop retrieval-augmented generation (RAG) is a promising strategy for complex reasoning, yet existing iterative prompting approaches remain inefficient. They often regenerate predictable token sequences at every step and rely on stochastic stopping, leading to excessive token usage and unstable termination. We propose TSSS (Think Straight, Stop Smart), a structured multi-hop RAG framework designed for efficiency. TSSS introduces (i) a template-based reasoning that caches recurring prefixes and anchors sub-queries to the main question, reducing token generation cost while promoting stable reasoning, and (ii) a retriever-based terminator, which deterministically halts reasoning once additional sub-queries collapse into repetition. This separation of structured reasoning and termination control enables both faster inference and more reliable answers. On HotpotQA, 2WikiMultiHop, and MuSiQue, TSSS achieves state-of-the-art accuracy and competitive efficiency among RAG-CoT approaches, highlighting its effectiveness in efficiency-constrained scenarios such as on-device inference.
comment: Accepted at NeurIPS 2025 Workshop
☆ OpenGuardrails: An Open-Source Context-Aware AI Guardrails Platform
As large language models (LLMs) become increasingly integrated into real-world applications, safeguarding them against unsafe, malicious, or privacy-violating content is critically important. We present OpenGuardrails, the first open-source project to provide both a context-aware safety and manipulation detection model and a deployable platform for comprehensive AI guardrails. OpenGuardrails protects against content-safety risks, model-manipulation attacks (e.g., prompt injection, jailbreaking, code-interpreter abuse, and the generation/execution of malicious code), and data leakage. Content-safety and model-manipulation detection are implemented by a unified large model, while data-leakage identification and redaction are performed by a separate lightweight NER pipeline (e.g., Presidio-style models or regex-based detectors). The system can be deployed as a security gateway or an API-based service, with enterprise-grade, fully private deployment options. OpenGuardrails achieves state-of-the-art (SOTA) performance on safety benchmarks, excelling in both prompt and response classification across English, Chinese, and multilingual tasks. All models are released under the Apache 2.0 license for public use.
☆ "You Are Rejected!": An Empirical Study of Large Language Models Taking Hiring Evaluations
With the proliferation of the internet and the rapid advancement of Artificial Intelligence, leading technology companies face an urgent annual demand for a considerable number of software and algorithm engineers. To efficiently and effectively identify high-potential candidates from thousands of applicants, these firms have established a multi-stage selection process, which crucially includes a standardized hiring evaluation designed to assess job-specific competencies. Motivated by the demonstrated prowess of Large Language Models (LLMs) in coding and reasoning tasks, this paper investigates a critical question: Can LLMs successfully pass these hiring evaluations? To this end, we conduct a comprehensive examination of a widely used professional assessment questionnaire. We employ state-of-the-art LLMs to generate responses and subsequently evaluate their performance. Contrary to any prior expectation of LLMs being ideal engineers, our analysis reveals a significant inconsistency between the model-generated answers and the company-referenced solutions. Our empirical findings lead to a striking conclusion: All evaluated LLMs fails to pass the hiring evaluation.
comment: Technical Report, 14 pages, 8 figures
☆ Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges
Tibetan, one of the major low-resource languages in Asia, presents unique linguistic and sociocultural characteristics that pose both challenges and opportunities for AI research. Despite increasing interest in developing AI systems for underrepresented languages, Tibetan has received limited attention due to a lack of accessible data resources, standardized benchmarks, and dedicated tools. This paper provides a comprehensive survey of the current state of Tibetan AI in the AI domain, covering textual and speech data resources, NLP tasks, machine translation, speech recognition, and recent developments in LLMs. We systematically categorize existing datasets and tools, evaluate methods used across different tasks, and compare performance where possible. We also identify persistent bottlenecks such as data sparsity, orthographic variation, and the lack of unified evaluation metrics. Additionally, we discuss the potential of cross-lingual transfer, multi-modal learning, and community-driven resource creation. This survey aims to serve as a foundational reference for future work on Tibetan AI research and encourages collaborative efforts to build an inclusive and sustainable AI ecosystem for low-resource languages.
☆ A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
Despite the rapid expansion of Large Language Models (LLMs) in healthcare, the ability of these systems to assess clinical trial reporting according to CONSORT standards remains unclear, particularly with respect to their cognitive and reasoning strategies. This study applies a behavioral and metacognitive analytic approach with expert-validated data, systematically comparing two representative LLMs under three prompt conditions. Clear differences emerged in how the models approached various CONSORT items, and prompt types, including shifts in reasoning style, explicit uncertainty, and alternative interpretations shaped response patterns. Our results highlight the current limitations of these systems in clinical compliance automation and underscore the importance of understanding their cognitive adaptations and strategic behavior in developing more explainable and reliable medical AI.
☆ LLMs can hide text in other text of the same length.ipynb
A meaningful text can be hidden inside another, completely different yet still coherent and plausible, text of the same length. For example, a tweet containing a harsh political critique could be embedded in a tweet that celebrates the same political leader, or an ordinary product review could conceal a secret manuscript. This uncanny state of affairs is now possible thanks to Large Language Models, and in this paper we present a simple and efficient protocol to achieve it. We show that even modest 8-billion-parameter open-source LLMs are sufficient to obtain high-quality results, and a message as long as this abstract can be encoded and decoded locally on a laptop in seconds. The existence of such a protocol demonstrates a radical decoupling of text from authorial intent, further eroding trust in written communication, already shaken by the rise of LLM chatbots. We illustrate this with a concrete scenario: a company could covertly deploy an unfiltered LLM by encoding its answers within the compliant responses of a safe model. This possibility raises urgent questions for AI safety and challenges our understanding of what it means for a Large Language Model to know something.
comment: 21 pages, main paper 9 pages
☆ Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training
Enhancing reasoning capabilities in small language models is critical for specialized applications such as medical question answering, particularly in underrepresented languages like Persian. In this study, we employ Reinforcement Learning with AI Feedback (RLAIF) and Direct preference optimization (DPO) to improve the reasoning skills of a general-purpose Persian language model. To achieve this, we translated a multiple-choice medical question-answering dataset into Persian and used RLAIF to generate rejected-preferred answer pairs, which are essential for DPO training. By prompting both teacher and student models to produce Chain-of-Thought (CoT) reasoning responses, we compiled a dataset containing correct and incorrect reasoning trajectories. This dataset, comprising 2 million tokens in preferred answers and 2.5 million tokens in rejected ones, was used to train a baseline model, significantly enhancing its medical reasoning capabilities in Persian. Remarkably, the resulting model outperformed its predecessor, gaokerena-V, which was trained on approximately 57 million tokens, despite leveraging a much smaller dataset. These results highlight the efficiency and effectiveness of reasoning-focused training approaches in developing domain-specific language models with limited data availability.
comment: 6 pages, 4 figures
☆ From Facts to Folklore: Evaluating Large Language Models on Bengali Cultural Knowledge
Recent progress in NLP research has demonstrated remarkable capabilities of large language models (LLMs) across a wide range of tasks. While recent multilingual benchmarks have advanced cultural evaluation for LLMs, critical gaps remain in capturing the nuances of low-resource cultures. Our work addresses these limitations through a Bengali Language Cultural Knowledge (BLanCK) dataset including folk traditions, culinary arts, and regional dialects. Our investigation of several multilingual language models shows that while these models perform well in non-cultural categories, they struggle significantly with cultural knowledge and performance improves substantially across all models when context is provided, emphasizing context-aware architectures and culturally curated training data.
comment: 4 pages
☆ Beyond One-Way Influence: Bidirectional Opinion Dynamics in Multi-Turn Human-LLM Interactions
Large language model (LLM)-powered chatbots are increasingly used for opinion exploration. Prior research examined how LLMs alter user views, yet little work extended beyond one-way influence to address how user input can affect LLM responses and how such bi-directional influence manifests throughout the multi-turn conversations. This study investigates this dynamic through 50 controversial-topic discussions with participants (N=266) across three conditions: static statements, standard chatbot, and personalized chatbot. Results show that human opinions barely shifted, while LLM outputs changed more substantially, narrowing the gap between human and LLM stance. Personalization amplified these shifts in both directions compared to the standard setting. Analysis of multi-turn conversations further revealed that exchanges involving participants' personal stories were most likely to trigger stance changes for both humans and LLMs. Our work highlights the risk of over-alignment in human-LLM interaction and the need for careful design of personalized chatbots to more thoughtfully and stably align with users.
comment: 26 pages, 8 figures
☆ ToolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering
Large language model (LLM) agents rely on external tools to solve complex tasks, but real-world toolsets often contain redundant tools with overlapping names and descriptions, introducing ambiguity and reducing selection accuracy. LLMs also face strict input context limits, preventing efficient consideration of large toolsets. To address these challenges, we propose ToolScope, which includes: (1) ToolScopeMerger with Auto-Correction to automatically audit and fix tool merges, reducing redundancy, and (2) ToolScopeRetriever to rank and select only the most relevant tools for each query, compressing toolsets to fit within context limits without sacrificing accuracy. Evaluations on three state-of-the-art LLMs and three open-source tool-use benchmarks show gains of 8.38% to 38.6% in tool selection accuracy, demonstrating ToolScope's effectiveness in enhancing LLM tool use.
comment: Preprint under review
☆ Improving Transfer Learning for Sequence Labeling Tasks by Adapting Pre-trained Neural Language Models
This doctoral thesis improves the transfer learning for sequence labeling tasks by adapting pre-trained neural language models. The proposed improvements in transfer learning involve introducing a multi-task model that incorporates an additional signal, a method based on architectural modifications in autoregressive large language models, and a sequence labeling framework for autoregressive large language models utilizing supervised in-context fine-tuning combined with response-oriented adaptation strategies. The first improvement is given in the context of domain transfer for the event trigger detection task. The domain transfer of the event trigger detection task can be improved by incorporating an additional signal obtained from a domain-independent text processing system into a multi-task model. The second improvement involves modifying the model's architecture. For that purpose, a method is proposed to enable bidirectional information flow across layers of autoregressive large language models. The third improvement utilizes autoregressive large language models as text generators through a generative supervised in-context fine-tuning framework. The proposed model, method, and framework demonstrate that pre-trained neural language models achieve their best performance on sequence labeling tasks when adapted through targeted transfer learning paradigms.
♻ ☆ LoRA vs Full Fine-tuning: An Illusion of Equivalence
Fine-tuning is a crucial paradigm for adapting pre-trained large language models to downstream tasks. Recently, methods like Low-Rank Adaptation (LoRA) have been shown to effectively fine-tune LLMs with an extreme reduction in trainable parameters. But, \emph{are their learned solutions really equivalent?} We study how LoRA and full-finetuning change pre-trained models by analyzing the model's weight matrices through the lens of their spectral properties. We find that LoRA and full fine-tuning yield weight matrices whose singular value decompositions exhibit very different structure: weight matrices trained with LoRA have new, high-ranking singular vectors, which we call \emph{intruder dimensions}, while those trained with full fine-tuning do not. Further, we extend the finding that LoRA forgets less than full fine-tuning and find its forgetting is vastly localized to the intruder dimension -- by causally intervening on the intruder dimensions by changing their associated singular values post-fine-tuning, we show that they cause forgetting. Moreover, scaling them down significantly improves modeling of the pre-training distribution with a minimal drop in downstream task performance. Given this, we should expect accumulating intruder dimensions to be harmful and lead to more forgetting. This will be amplified during continual learning because of sequentially fine-tuning, and we show that LoRA models do accumulate intruder dimensions here tend to perform worse in this setting, emphasizing the practicality of our findings.
♻ ☆ Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM
Large Language Models are typically trained on datasets collected from the web, which may inadvertently contain harmful or sensitive personal information. To address growing privacy concerns, unlearning methods have been proposed to remove the influence of specific data from trained models. Of these, exact unlearning -- which retrains the model from scratch without the target data -- is widely regarded the gold standard for mitigating privacy risks in deployment. In this paper, we revisit this assumption in a practical deployment setting where both the pre- and post-unlearning logits API are exposed, such as in open-weight scenarios. Targeting this setting, we introduce a novel data extraction attack that leverages signals from the pre-unlearning model to guide the post-unlearning model, uncovering patterns that reflect the removed data distribution. Combining model guidance with a token filtering strategy, our attack significantly improves extraction success rates -- doubling performance in some cases -- across common benchmarks such as MUSE, TOFU, and WMDP. Furthermore, we demonstrate our attack's effectiveness on a simulated medical diagnosis dataset to highlight real-world privacy risks associated with exact unlearning. In light of our findings, which suggest that unlearning may, in a contradictory way, increase the risk of privacy leakage during real-world deployments, we advocate for evaluation of unlearning methods to consider broader threat models that account not only for post-unlearning models but also for adversarial access to prior checkpoints. Code is publicly available at: https://github.com/Nicholas0228/unlearned_data_extraction_llm.
comment: Accepted by Neurips 2025
♻ ☆ Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents
Data science aims to extract insights from data to support decision-making processes. Recently, Large Language Models (LLMs) have been increasingly used as assistants for data science, by suggesting ideas, techniques and small code snippets, or for the interpretation of results and reporting. Proper automation of some data-science activities is now promised by the rise of LLM agents, i.e., AI systems powered by an LLM equipped with additional affordances--such as code execution and knowledge bases--that can perform self-directed actions and interact with digital environments. In this paper, we survey the evaluation of LLM assistants and agents for data science. We find (1) a dominant focus on a small subset of goal-oriented activities, largely ignoring data management and exploratory activities; (2) a concentration on pure assistance or fully autonomous agents, without considering intermediate levels of human-AI collaboration; and (3) an emphasis on human substitution, therefore neglecting the possibility of higher levels of automation thanks to task transformation.
comment: Published in Transactions of Machine Learning Research (TMLR), 10/2025 https://openreview.net/forum?id=MB0TCLfLn1
♻ ☆ Evaluation Framework for Highlight Explanations of Context Utilisation in Language Models
Context utilisation, the ability of Language Models (LMs) to incorporate relevant information from the provided context when generating responses, remains largely opaque to users, who cannot determine whether models draw from parametric memory or provided context, nor identify which specific context pieces inform the response. Highlight explanations (HEs) offer a natural solution as they can point the exact context pieces and tokens that influenced model outputs. However, no existing work evaluates their effectiveness in accurately explaining context utilisation. We address this gap by introducing the first gold standard HE evaluation framework for context attribution, using controlled test cases with known ground-truth context usage, which avoids the limitations of existing indirect proxy evaluations. To demonstrate the framework's broad applicability, we evaluate four HE methods -- three established techniques and MechLight, a mechanistic interpretability approach we adapt for this task -- across four context scenarios, four datasets, and five LMs. Overall, we find that MechLight performs best across all context scenarios. However, all methods struggle with longer contexts and exhibit positional biases, pointing to fundamental challenges in explanation accuracy that require new approaches to deliver reliable context utilisation explanations at scale.
♻ ☆ WikiVideo: Article Generation from Multiple Videos
We introduce the task of grounded article generation with the goal of creating a Wikipedia-style article from multiple diverse videos about real-world events -- from natural disasters to political elections -- where all the information in the article is supported by video evidence. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text while existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher-level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.
comment: Repo can be found here: https://github.com/alexmartin1722/wikivideo
♻ ☆ The Coverage Principle: How Pre-Training Enables Post-Training
Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model remains poorly understood. Notably, although pre-training success is often quantified by cross-entropy loss, cross-entropy can be a poor predictor of downstream performance. Instead, we provide a theoretical perspective on this relationship through the lens of \emph{coverage}, which quantifies the probability mass the pre-trained model places on high-quality responses and which is necessary and sufficient for post-training and test-time scaling methods such as Best-of-N to succeed. Our main results develop an understanding of \emph{the coverage principle}, a phenomenon whereby next-token prediction (more generally, maximum likelihood) implicitly optimizes toward a model with good coverage. In particular, we uncover a mechanism that explains the power of coverage in predicting downstream performance: \emph{coverage generalizes faster than cross-entropy}, avoiding spurious dependence on problem-dependent parameters such as the sequence length. We also study practical algorithmic interventions with provable benefits for improving coverage, including (i) model/checkpoint selection procedures, (ii) gradient normalization schemes, and (iii) test-time decoding strategies.
♻ ☆ GeoBenchX: Benchmarking LLMs in Agent Solving Multistep Geospatial Tasks
This paper establishes a benchmark for evaluating tool-calling capabilities of large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners. We assess eight commercial LLMs (Claude Sonnet 3.5 and 4, Claude Haiku 3.5, Gemini 2.0 Flash, Gemini 2.5 Pro Preview, GPT-4o, GPT-4.1 and o4-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Our benchmark comprises tasks in four categories of increasing complexity, with both solvable and intentionally unsolvable tasks to test rejection accuracy. We develop a LLM-as-Judge evaluation framework to compare agent solutions against reference solutions. Results show o4-mini and Claude 3.5 Sonnet achieve the best overall performance, OpenAI's GPT-4.1, GPT-4o and Google's Gemini 2.5 Pro Preview do not fall far behind, but the last two are more efficient in identifying unsolvable tasks. Claude Sonnet 4, due its preference to provide any solution rather than reject a task, proved to be less accurate. We observe significant differences in token usage, with Anthropic models consuming more tokens than competitors. Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation. The resulting benchmark set, evaluation framework, and data generation pipeline are released as open-source resources (available at https://github.com/Solirinai/GeoBenchX), providing one more standardized method for the ongoing evaluation of LLMs for GeoAI.
comment: Github with code and benchmark set: https://github.com/Solirinai/GeoBenchX
♻ ☆ LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits NeurIPS 2025
Reward Models (RMs) are crucial to aligning large language models (LLMs), but the degree to which an RM specialized to one task (e.g. writing) generalizes to new tasks (e.g. math) is often not known a priori, often making using only one fixed RM to train LLMs suboptimal. However, optimizing LLMs with multiple RMs simultaneously can incur a prohibitively high computational cost and lead to conflicting signals from different RMs that may degrade performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which frames reward model selection as a multi-armed bandit problem, efficiently and iteratively training LLMs using multiple RMs by selecting the most well-suited RM for each instance. On commonsense and math reasoning tasks, we show that LASeR boosts iterative LLM training, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% over an ensemble of RM scores while also showing superior efficiency (e.g., a 2x speedup). Moreover, on WildChat (open-ended instruction-following tasks), LASeR leads to a 72.69% AlpacaEval win rate over the RM score ensemble baseline. Extending to long-context generation, LASeR improves by 2.96 F1 points (avg.) on single-document QA tasks and 2.97 F1 points on few-shot learning over the RM score ensemble baseline with best-of-n sampling.
comment: NeurIPS 2025 camera-ready. First two authors contributed equally. Code: https://github.com/duykhuongnguyen/LASeR-MAB
♻ ☆ metaTextGrad: Automatically optimizing language model optimizers
Large language models (LLMs) are increasingly used in learning algorithms, evaluations, and optimization tasks. Recent studies have shown that using LLM-based optimizers to automatically optimize model prompts, demonstrations, predictions themselves, or other components can significantly enhance the performance of AI systems, as demonstrated by frameworks such as DSPy and TextGrad. However, optimizers built on language models themselves are usually designed by humans with manual design choices; optimizers themselves are not optimized. Moreover, these optimizers are general purpose by design, to be useful to a broad audience, and are not tailored for specific tasks. To address these challenges, we propose metaTextGrad, which focuses on designing a meta-optimizer to further enhance existing optimizers and align them to be good optimizers for a given task. Our approach consists of two key components: a meta prompt optimizer and a meta structure optimizer. The combination of these two significantly improves performance across multiple benchmarks, achieving an average absolute performance improvement of up to 6% compared to the best baseline.
comment: 21 pages, 2 figures
♻ ☆ Test-time Prompt Intervention
Test-time compute has led to remarkable success in the large language model (LLM) community, particularly for complex tasks, where longer chains of thought (CoTs) are generated to enhance reasoning capabilities. However, growing evidence reveals that such reasoning models often produce CoTs plagued by excessive redundancy, including unnecessary verification steps and repetitive reasoning shifts. The root cause lies in post-training of them that overly rely on outcome reward paradigms, as the data of process reward paradigms, which regulate intermediate reasoning steps, is difficult to construct at scale. To address this, we propose PI, a novel framework for Test-time Prompt Intervention. PI provides an interface to dynamically guide and regulate reasoning paths during inference through timely (When module) and proper (How module) interventions and post-intervention sampling (Which module). This allows human problem-solving expertise and cognitive science principles to be seamlessly integrated into LLMs' reasoning processes, enhancing controllability and interpretability. Extensive experiments across multiple models and datasets demonstrate that PI significantly shortens CoTs while reducing hallucination, yielding more concise and reliable reasoning.
comment: 24 pages, 20 figures, under review
♻ ☆ Learning Linear Attention in Polynomial Time
Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machine (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, key--value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.
♻ ☆ dInfer: An Efficient Inference Framework for Diffusion Language Models
Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Even more and more open-sourced dLLM models emerge, yet their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components--model, diffusion iteration manager, decoding strategy, and KV-cache manager--and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on $8\times$ H800 GPUs. Compared to prior systems, dInfer delivers a $10\times$ speedup over Fast-dLLM while maintaining similar model performance. Even compared to the AR model (with a comparable number of activation parameters and performance) QWen2.5-3B, which is highly optimized with the latest vLLM inference engine, dInfer still delivers a $2$-$3\times$ speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.
♻ ☆ Unveiling Transformer Perception by Exploring Input Manifolds
This paper introduces a general method for the exploration of equivalence classes in the input space of Transformer models. The proposed approach is based on sound mathematical theory which describes the internal layers of a Transformer architecture as sequential deformations of the input manifold. Using eigendecomposition of the pullback of the distance metric defined on the output space through the Jacobian of the model, we are able to reconstruct equivalence classes in the input space and navigate across them. Our method enables two complementary exploration procedures: the first retrieves input instances that produce the same class probability distribution as the original instance-thus identifying elements within the same equivalence class-while the second discovers instances that yield a different class probability distribution, effectively navigating toward distinct equivalence classes. Finally, we demonstrate how the retrieved instances can be meaningfully interpreted by projecting their embeddings back into a human-readable format.
comment: 11 pages, 4 figures
♻ ☆ Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens
Cultural evaluation of large language models has become increasingly important, yet current benchmarks often reduce culture to static facts or homogeneous values. This view conflicts with anthropological accounts that emphasize culture as dynamic, historically situated, and enacted in practice. To analyze this gap, we introduce a four-part framework that categorizes how benchmarks frame culture, such as knowledge, preference, performance, or bias. Using this lens, we qualitatively examine 20 cultural benchmarks and identify six recurring methodological issues, including treating countries as cultures, overlooking within-culture diversity, and relying on oversimplified survey formats. Drawing on established anthropological methods, we propose concrete improvements: incorporating real-world narratives and scenarios, involving cultural communities in design and validation, and evaluating models in context rather than isolation. Our aim is to guide the development of cultural benchmarks that go beyond static recall tasks and more accurately capture the responses of the models to complex cultural situations.
comment: 12 pages; 2 figures; First two author contributed equally
♻ ☆ InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models
Model fusion combines multiple Large Language Models (LLMs) with different strengths into a more powerful, integrated model through lightweight training methods. Existing works on model fusion focus primarily on supervised fine-tuning (SFT), leaving preference alignment (PA) --a critical phase for enhancing LLM performance--largely unexplored. The current few fusion methods on PA phase, like WRPO, simplify the process by utilizing only response outputs from source models while discarding their probability information. To address this limitation, we propose InfiFPO, a preference optimization method for implicit model fusion. InfiFPO replaces the reference model in Direct Preference Optimization (DPO) with a fused source model that synthesizes multi-source probabilities at the sequence level, circumventing complex vocabulary alignment challenges in previous works and meanwhile maintaining the probability information. By introducing probability clipping and max-margin fusion strategies, InfiFPO enables the pivot model to align with human preferences while effectively distilling knowledge from source models. Comprehensive experiments on 11 widely-used benchmarks demonstrate that InfiFPO consistently outperforms existing model fusion and preference optimization methods. When using Phi-4 as the pivot model, InfiFPO improve its average performance from 79.95 to 83.33 on 11 benchmarks, significantly improving its capabilities in mathematics, coding, and reasoning tasks.
♻ ☆ MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research NeurIPS 2025
Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages, and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results--posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers, supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.
comment: 49 pages, 9 figures. Accepted by NeurIPS 2025 D&B Track
♻ ☆ Mathematics with large language models as provers and verifiers
During 2024 and 2025 the discussion about the theorem-proving capabilities of large language models started reporting interesting success stories, mostly to do with difficult exercises (such as problems from the International Mathematical Olympiad), but also with conjectures [Feldman & Karbasi, arXiv:2509.18383v1] formulated for the purpose of verifying whether the artificial intelligence could prove it. In this paper we report a theorem proving feat achieved by ChatGPT by using a protocol involving different prover and verifier instances of the gpt-5 model working collaboratively. To make sure that the produced proofs do not suffer from hallucinations, the final proof is formally verified by the lean proof assistant, and the conformance of premises and conclusion of the lean code is verified by a human. Our methodology is by no means complete or exact. It was nonetheless able to solve five out of six 2025 IMO problems, and close about a third of the sixty-six number theory conjectures in [Cohen, Journal of Integer Sequences, 2025].
♻ ☆ Using (Not-so) Large Language Models to Generate Simulation Models in a Formal DSL: A Study on Reaction Networks
Formal languages are an integral part of modeling and simulation. They allow the distillation of knowledge into concise simulation models amenable to automatic execution, interpretation, and analysis. However, the arguably most humanly accessible means of expressing models is through natural language, which is not easily interpretable by computers. Here, we evaluate how a Large Language Model (LLM) might be used for formalizing natural language into simulation models. Existing studies only explored using very large LLMs, like the commercial GPT models, without fine-tuning model weights. To close this gap, we show how an open-weights, 7B-parameter Mistral model can be fine-tuned to translate natural language descriptions to reaction network models in a domain-specific language, offering a self-hostable, compute-efficient, and memory efficient alternative. To this end, we develop a synthetic data generator to serve as the basis for fine-tuning and evaluation. Our quantitative evaluation shows that our fine-tuned Mistral model can recover the ground truth simulation model in up to 84.5% of cases. In addition, our small-scale user study demonstrates the model's practical potential for one-time generation as well as interactive modeling in various domains. While promising, in its current form, the fine-tuned small LLM cannot catch up with large LLMs. We conclude that higher-quality training data are required, and expect future small and open-source LLMs to offer new opportunities.
comment: 27 pages, 5 figures; supplemental material available at https://doi.org/10.1145/3733719
♻ ☆ Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning EMNLP 2025
Supervised Fine-Tuning (SFT) Large Language Models (LLM) fundamentally rely on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often face limitations in static dataset curation that fail to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals - loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that our Middo consistently enhances the quality of seed data and boosts LLM's performance with improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are publicly available at https://github.com/Word2VecT/Middo.
comment: Accepted by EMNLP 2025 (Main)
♻ ☆ Quantum Natural Language Processing: A Comprehensive Review of Models, Methods, and Applications
In recent developments, deep learning methodologies applied to Natural Language Processing (NLP) have revealed a paradox: They improve performance but demand considerable data and resources for their training. Alternatively, quantum computing exploits the principles of quantum mechanics to overcome the computational limitations of current methodologies, thereby establishing an emerging field known as quantum natural language processing (QNLP). This domain holds the potential to attain a quantum advantage in the processing of linguistic structures, surpassing classical models in both efficiency and accuracy. In this paper, it is proposed to categorise QNLP models based on quantum computing principles, architecture, and computational approaches. This paper attempts to provide a survey on how quantum meets language by mapping state-of-the-art in this area, embracing quantum encoding techniques for classical data, QNLP models for prevalent NLP tasks, and quantum optimisation techniques for hyper parameter tuning. The landscape of quantum computing approaches applied to various NLP tasks is summarised by showcasing the specific QNLP methods used, and the popularity of these methods is indicated by their count. From the findings, it is observed that QNLP approaches are still limited to small data sets, with only a few models explored extensively, and there is increasing interest in the application of quantum computing to natural language processing tasks.
♻ ☆ Can Large Language Models be Effective Online Opinion Miners? EMNLP 2025
The surge of user-generated online content presents a wealth of insights into customer preferences and market trends. However, the highly diverse, complex, and context-rich nature of such contents poses significant challenges to traditional opinion mining approaches. To address this, we introduce Online Opinion Mining Benchmark (OOMB), a novel dataset and evaluation protocol designed to assess the ability of large language models (LLMs) to mine opinions effectively from diverse and intricate online environments. OOMB provides extensive (entity, feature, opinion) tuple annotations and a comprehensive opinion-centric summary that highlights key opinion topics within each content, thereby enabling the evaluation of both the extractive and abstractive capabilities of models. Through our proposed benchmark, we conduct a comprehensive analysis of which aspects remain challenging and where LLMs exhibit adaptability, to explore whether they can effectively serve as opinion miners in realistic online scenarios. This study lays the foundation for LLM-based opinion mining and discusses directions for future research in this field.
comment: Accepted to EMNLP 2025 Main
♻ ☆ Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs
Self-correction of large language models (LLMs) emerges as a critical component for enhancing their reasoning performance. Although various self-correction methods have been proposed, a comprehensive evaluation of these methods remains largely unexplored, and the question of whether LLMs can truly correct themselves is a matter of significant interest and concern. In this study, we introduce CorrectBench, a benchmark developed to evaluate the effectiveness of self-correction strategies, including intrinsic, external, and fine-tuned approaches, across three tasks: commonsense reasoning, mathematical reasoning, and code generation. Our findings reveal that: 1) Self-correction methods can improve accuracy, especially for complex reasoning tasks; 2) Mixing different self-correction strategies yields further improvements, though it reduces efficiency; 3) Reasoning LLMs (e.g., DeepSeek-R1) have limited optimization under additional self-correction methods and have high time costs. Interestingly, a comparatively simple chain-of-thought (CoT) baseline demonstrates competitive accuracy and efficiency. These results underscore the potential of self-correction to enhance LLM's reasoning performance while highlighting the ongoing challenge of improving their efficiency. Consequently, we advocate for further research focused on optimizing the balance between reasoning capabilities and operational efficiency. Project Page: https://correctbench.github.io/
comment: 47 pages, 25 figures, 10 tables
♻ ☆ From TOWER to SPIRE: Adding the Speech Modality to a Translation-Specialist LLM EMNLP 2025
We introduce Spire, a speech-augmented language model (LM) capable of both translating and transcribing speech input from English into 10 other languages as well as translating text input in both language directions. Spire integrates the speech modality into an existing multilingual LM via speech discretization and continued pre-training using only 42.5K hours of speech. In particular, we adopt the pretraining framework of multilingual LMs and treat discretized speech input as an additional translation language. This approach not only equips the model with speech capabilities, but also preserves its strong text-based performance. We achieve this using significantly less data than existing speech LMs, demonstrating that discretized speech input integration as an additional language is feasible during LM adaptation. We make our code and models available to the community.
comment: EMNLP 2025 (Findings) camera ready
♻ ☆ Natural Language Processing for Cardiology: A Narrative Review
Cardiovascular diseases are becoming increasingly prevalent in modern society, with a profound impact on global health and well-being. These Cardiovascular disorders are complex and multifactorial, influenced by genetic predispositions, lifestyle choices, and diverse socioeconomic and clinical factors. Information about these interrelated factors is dispersed across multiple types of textual data, including patient narratives, medical records, and scientific literature. Natural language processing (NLP) has emerged as a powerful approach for analysing such unstructured data, enabling healthcare professionals and researchers to gain deeper insights that may transform the diagnosis, treatment, and prevention of cardiac disorders. This review provides a comprehensive overview of NLP research in cardiology from 2014 to 2025. We systematically searched six literature databases for studies describing NLP applications across a range of cardiovascular diseases. After a rigorous screening process, we identified 265 relevant articles. Each study was analysed across multiple dimensions, including NLP paradigms, cardiology-related tasks, disease types, and data sources. Our findings reveal substantial diversity within these dimensions, reflecting the breadth and evolution of NLP research in cardiology. A temporal analysis further highlights methodological trends, showing a progression from rule-based systems to large language models. Finally, we discuss key challenges and future directions, such as developing interpretable LLMs and integrating multimodal data. To the best of our knowledge, this review represents the most comprehensive synthesis of NLP research in cardiology to date.
♻ ☆ Evaluating NLP Embedding Models for Handling Science-Specific Symbolic Expressions in Student Texts
In recent years, natural language processing (NLP) has become integral to educational data mining, particularly in the analysis of student-generated language products. For research and assessment purposes, so-called embedding models are typically employed to generate numeric representations of text that capture its semantic content for use in subsequent quantitative analyses. Yet when it comes to science-related language, symbolic expressions such as equations and formulas introduce challenges that current embedding models struggle to address. Existing research studies and practical applications often either overlook these challenges or remove symbolic expressions altogether, potentially leading to biased research findings and diminished performance of practical applications. This study therefore explores how contemporary embedding models differ in their capability to process and interpret science-related symbolic expressions. To this end, various embedding models are evaluated using physics-specific symbolic expressions drawn from authentic student responses, with performance assessed via two approaches: 1) similarity-based analyses and 2) integration into a machine learning pipeline. Our findings reveal significant differences in model performance, with OpenAI's GPT-text-embedding-3-large outperforming all other examined models, though its advantage over other models was moderate rather than decisive. Overall, this study underscores the importance for educational data mining researchers and practitioners of carefully selecting NLP embedding models when working with science-related language products that include symbolic expressions. The code and (partial) data are available at https://doi.org/10.17605/OSF.IO/6XQVG.
♻ ☆ Bias Beware: The Impact of Cognitive Biases on LLM-Driven Product Recommendations EMNLP 2025
The advent of Large Language Models (LLMs) has revolutionized product recommenders, yet their susceptibility to adversarial manipulation poses critical challenges, particularly in real-world commercial applications. Our approach is the first one to tap into human psychological principles, seamlessly modifying product descriptions, making such manipulations hard to detect. In this work, we investigate cognitive biases as black-box adversarial strategies, drawing parallels between their effects on LLMs and human purchasing behavior. Through extensive evaluation across models of varying scale, we find that certain biases, such as social proof, consistently boost product recommendation rate and ranking, while others, like scarcity and exclusivity, surprisingly reduce visibility. Our results demonstrate that cognitive biases are deeply embedded in state-of-the-art LLMs, leading to highly unpredictable behavior in product recommendations and posing significant challenges for effective mitigation.
comment: Accepted at EMNLP 2025
♻ ☆ Memorization-Compression Cycles Improve Generalization NeurIPS2025
We prove theoretically that generalization improves not only through data scaling but also by compressing internal representations. To operationalize this insight, we introduce the Information Bottleneck Language Modeling (IBLM) objective, which reframes language modeling as a constrained optimization problem: minimizing representation entropy subject to optimal prediction performance. Empirically, we observe an emergent memorization-compression cycle during LLM pretraining, evidenced by oscillation positive/negative gradient alignment between cross-entropy and Matrix-Based Entropy (MBE), a measure of representation entropy. This pattern closely mirrors the predictive-compressive trade-off prescribed by IBLM and also parallels the biological alternation between awake learning and sleep consolidation. Motivated by this observation, we propose Gated Phase Transition (GAPT), a training algorithm that adaptively switches between memorization and compression phases. When applied to GPT-2 pretraining on FineWeb dataset, GAPT reduces MBE by 50% and improves cross-entropy by 4.8%. GAPT improves OOD generalizatino by 35% in a pretraining task on arithmetic multiplication. In a setting designed to simulate catastrophic forgetting, GAPT reduces interference by compressing and separating representations, achieving a 97% improvement in separation - paralleling the functional role of sleep consolidation.
comment: 12 pages, 6 figures, NeurIPS2025 NEGEL Workshop
♻ ☆ Meeseeks: A Feedback-Driven, Iterative Self-Correction Benchmark evaluating LLMs' Instruction Following Capability
The capability to precisely adhere to instructions is a cornerstone for Large Language Models (LLMs) to function as dependable agents in real-world scenarios. However, confronted with complex prompts, LLMs frequently encounter difficulties in fulfilling all specified requirements within a single response. Drawing inspiration from recent advancements in Chain-of-Thought (CoT) prompting and self-correction methodologies, we introduce Meeseeks (The name is inspired by Mr. Meeseeks from "Rick and Morty," a character renowned for efficiently accomplishing assigned tasks. See: https://en.wikipedia.org/wiki/Mr._Meeseeks), a fully automated iterative instruction-following benchmark equipped with an integrated feedback mechanism. Meeseeks identifies erroneous components in model responses and provides corresponding feedback accurately, thereby iteratively guiding the model toward self-correction. The dataset contains over 700 curated instances annotated by 32 distinct capability tags in Chinese and English. Extensive experimental results reveal that different state-of-the-art commercial and open-source LLMs exhibit vastly disparate performance, and even after 20 turns of iterative feedback-driven self-correction, nearly all models demonstrate suboptimal performance. We conducted comprehensive analysis from both macro and instance levels, uncovering numerous common issues prevalent in current state-of-the-art models, as well as several counterintuitive phenomena. We've open-sourced our work on https://github.com/ADoublLEN/Meeseeks.
♻ ☆ LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K
State-of-the-art large language models (LLMs) are now claiming remarkable supported context lengths of 256k or even more. In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and they suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation. This paper introduces LV-Eval, a challenging long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) reaching up to 256k words. LV-Eval features two main tasks, single-hop QA and multi-hop QA, comprising 11 bilingual datasets. The design of LV-Eval has incorporated three key techniques, namely confusing facts insertion, keyword and phrase replacement, and keyword-recall-based metric design. The advantages of LV-Eval include controllable evaluation across different context lengths, challenging test instances with confusing facts, mitigated knowledge leakage, and more objective evaluations. We evaluate 15 LLMs on LV-Eval and conduct ablation studies on the benchmarking techniques. The results reveal that: (i) Moonshot-v1 and recent large-scale open-source models, such as Qwen-2.5-72B and Llama-3.1-70B, achieve the highest performance on LV-Eval, particularly at lengths below 64k. (ii) Models exhibit distinct score trends. For example, GLM-4-9B-128k, Yi-6B-200k, and Llama3-8B-1M exhibit a relatively gentle degradation of performance, but their absolute performances may not necessarily be higher than those of LLMs with shorter context lengths. (iii) LLMs' performances can significantly degrade in the presence of confusing information, especially in the pressure test of "needle in a haystack". (iv) Issues related to knowledge leakage and inaccurate metrics introduce bias in evaluation, and these concerns are alleviated in LV-Eval. All datasets and evaluation codes are released at: https://github.com/infinigence/LVEval.
♻ ☆ Reasoning Models Better Express Their Confidence NeurIPS 2025
Despite their strengths, large language models (LLMs) often fail to communicate their confidence accurately, making it difficult to assess when they might be wrong and limiting their reliability. In this work, we demonstrate that reasoning models that engage in extended chain-of-thought (CoT) reasoning exhibit superior performance not only in problem-solving but also in accurately expressing their confidence. Specifically, we benchmark six reasoning models across six datasets and find that they achieve strictly better confidence calibration than their non-reasoning counterparts in 33 out of the 36 settings. Our detailed analysis reveals that these gains in calibration stem from the slow thinking behaviors of reasoning models (e.g., exploring alternative approaches and backtracking) which enable them to adjust their confidence dynamically throughout their CoT, making it progressively more accurate. In particular, we find that reasoning models become increasingly better calibrated as their CoT unfolds, a trend not observed in non-reasoning models. Moreover, removing slow thinking behaviors from the CoT leads to a significant drop in calibration. Lastly, we show that non-reasoning models also demonstrate enhanced calibration when simply guided to slow think via in-context learning, fully isolating slow thinking as the source of the calibration gains.
comment: Accepted to NeurIPS 2025
♻ ☆ FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution
Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. Much of this overhead manifests from the redundant low-utility tokens present in typical prompts, as only a fraction of tokens typically carries the majority of the semantic weight. We address this inefficiency by introducing FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX, we assign salience scores to every token in an input sequence, rank them to preserve the top-k% tokens in their original order, and obtain a sparse frugalized prompt. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning, using a suite of frontier LLMs. For the first three tasks, a 20% prompt reduction incurs only a marginal loss in task performance, demonstrating that contemporary LLMs can reconstruct elided context from high-salience cues. In contrast, performance on mathematical reasoning deteriorates sharply, reflecting a stronger dependence on complete token continuity. Further analysis with bottom-k% and random-k% tokens reveals asymmetric performance patterns that may suggest potential task contamination effects, wherein models may resort to shallow memorized patterns from pretraining exposure for conventional NLP tasks. We posit that our work contributes to a more nuanced understanding of LLM behavior in performance-efficiency trade-offs, and delineate the boundary between tasks tolerant to contextual sparsity and those requiring exhaustive context. Our source code and models are available at: https://github.com/Starscream-11813/Frugal-ICL.
♻ ☆ Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehen- sive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.
♻ ☆ Chiron-o1: Igniting Multimodal Large Language Models towards Generalizable Medical Reasoning via Mentor-Intern Collaborative Search
Multimodal large language models (MLLMs) have begun to demonstrate robust reasoning capabilities on general tasks, yet their application in the medical domain remains in its early stages. Constructing chain-of-thought (CoT) training data is essential for bolstering the reasoning abilities of medical MLLMs. However, existing approaches exhibit a deficiency in offering a comprehensive framework for searching and evaluating effective reasoning paths towards critical diagnosis. To address this challenge, we propose Mentor-Intern Collaborative Search (MICS), a novel reasoning-path searching scheme to generate rigorous and effective medical CoT data. MICS first leverages mentor models to initialize the reasoning, one step at a time, then prompts each intern model to continue the thinking along those initiated paths, and finally selects the optimal reasoning path according to the overall reasoning performance of multiple intern models. The reasoning performance is determined by an MICS-Score, which assesses the quality of generated reasoning paths. Eventually, we construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM devised via a curriculum learning strategy, with robust visual question-answering and generalizable reasoning capabilities. Extensive experiments demonstrate that Chiron-o1, trained on our CoT dataset constructed using MICS, achieves state-of-the-art performance across a list of medical visual question answering and reasoning benchmarks. Codes are available at https://github.com/manglu097/Chiron-o1
♻ ☆ DeCAL Tokenwise Compression
This paper introduces DeCAL, a new method for tokenwise compression. DeCAL uses an encoder-decoder language model pretrained with denoising to learn to produce high-quality, general-purpose compressed representations from the encoder. DeCAL applies small modifications to the encoder, with the emphasis on maximizing compression quality, even at the expense of compute. We show that DeCAL at 2x compression can match uncompressed on several downstream tasks, with usually only a minor dropoff in metrics up to 8x compression, among question-answering, summarization, and multi-vector retrieval tasks. DeCAL offers significant savings where pre-computed dense representations can be utilized, and we believe the approach can be further developed to be more broadly applicable.
♻ ☆ Flexible-length Text Infilling for Discrete Diffusion Models EMNLP
Discrete diffusion models are a new class of text generators that offer advantages such as bidirectional context use, parallelizable generation, and flexible prompting compared to autoregressive models. However, a critical limitation of discrete diffusion models is their inability to perform flexible-length or flexible-position text infilling without access to ground-truth positional data. We introduce \textbf{DDOT} (\textbf{D}iscrete \textbf{D}iffusion with \textbf{O}ptimal \textbf{T}ransport Position Coupling), the first discrete diffusion model to overcome this challenge. DDOT jointly denoises token values and token positions, employing a novel sample-level Optimal Transport (OT) coupling. This coupling preserves relative token ordering while dynamically adjusting the positions and length of infilled segments, a capability previously missing in text diffusion. Our method is orthogonal to existing discrete text diffusion methods and is compatible with various pretrained text denoisers. Extensive experiments on text infilling benchmarks such as One-Billion-Word and Yelp demonstrate that DDOT outperforms naive diffusion baselines. Furthermore, DDOT achieves performance on par with state-of-the-art non-autoregressive models and enables significant improvements in training efficiency and flexibility.
comment: Major edit of methodology section. Matches EMNLP camera-ready version
♻ ☆ NAACL2025 Tutorial: Adaptation of Large Language Models NAACL2025
This tutorial on adaptation of LLMs is designed to address the growing demand for models that go beyond the static capabilities of generic LLMs by providing an overview of dynamic, domain-specific, and task-adaptive LLM adaptation techniques. While general LLMs have demonstrated strong generalization across a variety of tasks, they often struggle to perform well in specialized domains such as finance, healthcare, and code generation for underrepresented languages. Additionally, their static nature limits their ability to evolve with the changing world, and they are often extremely large in size, making them impractical and costly to deploy at scale. As a result, the adaptation of LLMs has drawn much attention since the birth of LLMs and is of core importance, both for industry, which focuses on serving its targeted users, and academia, which can greatly benefit from small but powerful LLMs. To address this gap, this tutorial aims to provide an overview of the LLM adaptation techniques. We start with an introduction to LLM adaptation, from both the data perspective and the model perspective. We then emphasize how the evaluation metrics and benchmarks are different from other techniques. After establishing the problems, we explore various adaptation techniques. We categorize adaptation techniques into two main families. The first is parametric knowledge adaptation, which focuses on updating the parametric knowledge within LLMs. Additionally, we will discuss real-time adaptation techniques, including model editing, which allows LLMs to be updated dynamically in production environments. The second kind of adaptation is semi-parametric knowledge adaptation, where the goal is to update LLM parameters to better leverage external knowledge or tools through techniques like retrieval-augmented generation (RAG) and agent-based systems.
comment: NAACL2025 Tutorial
♻ ☆ Demystifying Domain-adaptive Post-training for Financial LLMs EMNLP 2025
Domain-adaptive post-training of large language models (LLMs) has emerged as a promising approach for specialized domains such as medicine and finance. However, significant challenges remain in identifying optimal adaptation criteria and training strategies across varying data and model configurations. To address these challenges, we introduce FINDAP, a systematic and fine-grained investigation into domain-adaptive post-training of LLMs for the finance domain. Our approach consists of four key components: FinCap, which defines the core capabilities required for the target domain; FinRec, an effective training recipe that jointly optimizes continual pre-training and instruction-following, along with a novel preference data distillation method leveraging process signals from a generative reward model; FinTrain, a curated set of training datasets supporting FinRec; and FinEval, a comprehensive evaluation suite aligned with FinCap. The resulting model, Llama-Fin, achieves state-of-the-art performance across a wide range of financial tasks. Our analysis also highlights how each post-training stage contributes to distinct capabilities, uncovering specific challenges and effective solutions, providing valuable insights for domain adaptation of LLMs
comment: EMNLP 2025 (Oral, ARR best paper nomination)
♻ ☆ Who's Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs ICCV 2025
Large Language Models (LLMs) routinely infer users demographic traits from phrasing alone, which can result in biased responses, even when no explicit demographic information is provided. The role of disability cues in shaping these inferences remains largely uncharted. Thus, we present the first systematic audit of disability-conditioned demographic bias across eight state-of-the-art instruction-tuned LLMs ranging from 3B to 72B parameters. Using a balanced template corpus that pairs nine disability categories with six real-world business domains, we prompt each model to predict five demographic attributes - gender, socioeconomic status, education, cultural background, and locality - under both neutral and disability-aware conditions. Across a varied set of prompts, models deliver a definitive demographic guess in up to 97\% of cases, exposing a strong tendency to make arbitrary inferences with no clear justification. Disability context heavily shifts predicted attribute distributions, and domain context can further amplify these deviations. We observe that larger models are simultaneously more sensitive to disability cues and more prone to biased reasoning, indicating that scale alone does not mitigate stereotype amplification. Our findings reveal persistent intersections between ableism and other demographic stereotypes, pinpointing critical blind spots in current alignment strategies. We release our evaluation framework and results to encourage disability-inclusive benchmarking and recommend integrating abstention calibration and counterfactual fine-tuning to curb unwarranted demographic inference. Code and data will be released on acceptance.
comment: Accepted at ICCV 2025
♻ ☆ GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models
Unlearning in large language models is becoming increasingly important due to regulatory compliance, copyright protection, and privacy concerns. However, a key challenge in LLM unlearning is unintended forgetting, where the removal of specific data inadvertently impairs the utility of the model and its retention of valuable, desired information. While prior work has primarily focused on architectural innovations, the influence of data-level factors on unlearning performance remains underexplored. As a result, existing methods often suffer from degraded retention when forgetting high-impact data. To address this problem, we propose GUARD, a novel framework for Guided Unlearning And Retention via Data attribution. At its core, GUARD introduces a lightweight proxy data attribution metric tailored for LLM unlearning, which quantifies the alignment between the Forget and Retain sets while remaining computationally efficient. Building on this, we design a novel unlearning objective that assigns adaptive, nonuniform unlearning weights to samples, inversely proportional to their proxy attribution scores. Through such a reallocation of unlearning power, GUARD mitigates unintended retention loss. We also provide rigorous theoretical guarantees that GUARD significantly improves retention while maintaining forgetting metrics comparable to prior methods. Extensive experiments on the TOFU and MUSE benchmarks across multiple LLM architectures demonstrate that GUARD reduces utility sacrifice on the TOFU Retain Set by up to 194.92 percent in terms of Truth Ratio when forgetting 10 percent of the training data, and improves knowledge retention on the MUSE NEWS Retain Set by 16.20 percent, with comparable or very moderate increases in privacy loss compared to state-of-the-art methods.
♻ ☆ Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing
Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies, providing a unified solution for all performance-efficiency tradeoffs. The Avengers-Pro embeds and clusters incoming queries, then routes each to the most suitable model based on a performance-efficiency score. Across 6 challenging benchmarks and 8 leading models -- including GPT-5-medium, Gemini-2.5-pro, and Claude-opus-4.1 -- Avengers-Pro achieves state-of-the-art results: by varying a performance-efficiency trade-off parameter, it can surpass the strongest single model (GPT-5-medium) by +7% in average accuracy. Moreover, it can match the average accuracy of the strongest single model at 27% lower cost, and reach ~90% of that performance at 63% lower cost. Last but not least, it achieves a Pareto frontier, consistently yielding the highest accuracy for any given cost, and the lowest cost for any given accuracy, among all single models. Code is available at https://github.com/ZhangYiqun018/AvengersPro.
comment: This work has been accepted to DAI 2025
♻ ☆ Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs EMNLP 2025
Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering long text inputs as a single image and provide it directly to the model. This leads to dramatically reduced number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks RULER (long-context retrieval) and CNN/DailyMail (document summarization) we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.
comment: Accepted to EMNLP 2025 Findings ("Text or Pixels? Evaluating Efficiency and Understanding of LLMs with Visual Text Inputs")
♻ ☆ Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness NeurIPS
Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs). However, existing approaches predominantly focus on the explicit removal of isolated facts, often overlooking latent inferential dependencies and the non-deterministic nature of knowledge within LLMs. Consequently, facts presumed forgotten may persist implicitly through correlated information. To address these challenges, we propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge by representing relevant factual contexts as knowledge graphs with associated confidence scores. We further develop an inference-based evaluation protocol leveraging powerful LLMs as judges; these judges reason over the extracted knowledge subgraph to determine unlearning success. Our LLM judges utilize carefully designed prompts and are calibrated against human evaluations to ensure their trustworthiness and stability. Extensive experiments on our newly constructed benchmark demonstrate that our framework provides a more realistic and rigorous assessment of unlearning performance. Moreover, our findings reveal that current evaluation strategies tend to overestimate unlearning effectiveness. Our code is publicly available at https://github.com/Graph-COM/Knowledge_Unlearning.git.
comment: NeurIPS Camera-Ready Version. Code available at: https://github.com/Graph-COM/Knowledge_Unlearning
♻ ☆ PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits
Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. While LLM capabilities continue to improve, they remain increasingly susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency and effectiveness continue to remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models in a lesser or comparable query budget. Particularly, PLAGUE enables an ASR (based on StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Claude's Opus 4.1, two models that are considered highly resistant to jailbreaks in safety literature. Our work offers tools and insights to understand the importance of plan initialization, context optimization and lifelong learning in crafting multi-turn attacks for a comprehensive model vulnerability evaluation.
comment: First two authors have equal author contributions
♻ ☆ LongCodeBench: Evaluating Coding LLMs at 1M Context Windows
Context lengths for models have grown rapidly, from thousands to millions of tokens in just a few years. The extreme context sizes of modern long-context models have made it difficult to construct realistic long-context benchmarks -- not only due to the cost of collecting million-context tasks but also in identifying realistic scenarios that require significant contexts. We identify code comprehension and repair as a natural testbed and challenge task for long-context models and introduce LongCodeBench (LCB), a benchmark to test LLM coding abilities in long-context scenarios. Our benchmark tests both the comprehension and repair capabilities of LCLMs in realistic and important settings by drawing from real-world GitHub issues and constructing QA (LongCodeQA) and bug fixing (LongSWE-Bench) tasks. We carefully stratify the complexity of our benchmark, enabling us to evaluate models across different scales -- ranging from Qwen2.5 14B Instruct to Google's flagship Gemini model. We find that long-context remains a weakness for all models, with performance drops such as from 29% to 3% for Claude 3.5 Sonnet, or from 70.2% to 40% for Qwen2.5. The LCB dataset is available publicly at https://huggingface.co/datasets/Steefano/LCB and the codebase to replicate the work on this paper at https://github.com/Zteefano/long-code-bench.
♻ ☆ Constraint Satisfaction Approaches to Wordle: Novel Heuristics and Cross-Lexicon Validation
Wordle presents an algorithmically rich testbed for constraint satisfaction problem (CSP) solving. While existing solvers rely on information-theoretic entropy maximization or frequency-based heuristics without formal constraint treatment, we present the first comprehensive CSP formulation of Wordle with novel constraint-aware solving strategies. We introduce CSP-Aware Entropy, computing information gain after constraint propagation rather than on raw candidate sets, and a Probabilistic CSP framework integrating Bayesian word-frequency priors with logical constraints. Through evaluation on 2,315 English words, CSP-Aware Entropy achieves 3.54 average guesses with 99.9% success rate, a statistically significant 1.7% improvement over Forward Checking (t=-4.82, p<0.001, Cohen's d=0.07) with 46% faster runtime (12.9ms versus 23.7ms per guess). Under 10% noise, CSP-aware approaches maintain 5.3 percentage point advantages (29.0% versus 23.7%, p=0.041), while Probabilistic CSP achieves 100% success across all noise levels (0-20%) through constraint recovery mechanisms. Cross-lexicon validation on 500 Spanish words demonstrates 88% success with zero language-specific tuning, validating that core CSP principles transfer across languages despite an 11.2 percentage point gap from linguistic differences (p<0.001, Fisher's exact test). Our open-source implementation with 34 unit tests achieving 91% code coverage provides reproducible infrastructure for CSP research. The combination of formal CSP treatment, constraint-aware heuristics, probabilistic-logical integration, robustness analysis, and cross-lexicon validation establishes new performance benchmarks demonstrating that principled constraint satisfaction techniques outperform classical information-theoretic and learning-based approaches for structured puzzle-solving domains.
comment: 35 pages, 14 figures, 10 tables. Open-source implementation with 91% test coverage available at https://github.com/jahidul-arafat/constraint_satisfaction_wordle_arxiv_preprint
♻ ☆ Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems NeurIPS 2025
We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by jointly optimizing model roles and weights. We represent multi-LLM systems as directed acyclic graphs (DAGs) of LLMs with topological message passing for collaborative generation. Given a pool of LLM experts and a utility function, Heterogeneous Swarms employs two iterative steps: role-step and weight-step. For role-step, we interpret model roles as learning a DAG that specifies the flow of inputs and outputs between LLMs. Starting from a swarm of random continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs in topological order, evaluate on the utility function (e.g. accuracy on a task), and optimize the adjacency matrices with particle swarm optimization based on the utility score. For weight-step, we assess the contribution of individual LLMs in the multi-LLM systems and optimize model weights with swarm intelligence. We propose JFK-score to quantify the individual contribution of each LLM in the best-found DAG of the role-step, then optimize model weights with particle swarm optimization based on the JFK-score. Experiments demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based baselines by 18.5% on average across 12 tasks. Further analysis reveals that Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles and substantial collaborative gains, and benefits from the diversity of language models.
comment: NeurIPS 2025
♻ ☆ Rope to Nope and Back Again: A New Hybrid Attention Strategy
Long-context large language models (LLMs) have achieved remarkable advancements, driven by techniques like Rotary Position Embedding (RoPE) (Su et al., 2023) and its extensions (Chen et al., 2023; Liu et al., 2024c; Peng et al., 2023). By adjusting RoPE parameters and incorporating training data with extended contexts, we can train performant models with considerably longer input sequences. However, existing RoPE-based methods exhibit performance limitations when applied to extended context lengths. This paper presents a comprehensive analysis of various attention mechanisms, including RoPE, No Positional Embedding (NoPE), and Query-Key Normalization (QK-Norm), identifying their strengths and shortcomings in long-context modeling. Our investigation identifies distinctive attention patterns in these methods and highlights their impact on long-context performance, providing valuable insights for architectural design. Building on these findings, we propose a novel architecture featuring a hybrid attention mechanism that integrates global and local attention spans. This design not only surpasses conventional RoPE-based transformer models with full attention in both long and short context tasks but also delivers substantial efficiency gains during training and inference.
♻ ☆ Permutative Preference Alignment from Listwise Ranking of Human Judgments EMNLP 2025
Aligning Large Language Models (LLMs) with human preferences is crucial in ensuring desirable and controllable model behaviors. Current methods, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on the Bradley-Terry (B-T) model to maximize the likelihood of pairwise choices. However, when multiple responses are available, the B-T model fails to guarantee an accurate list ranking of the responses. To address this issue, we propose Permutative Preference Alignment (PPA), a novel offline listwise approach that incorporates the Normalized Discounted Cumulative Gain (NDCG), a widely-used ranking metric, as an alternative training objective for LLM alignment. We develop an end-to-end alignment algorithm by approximating NDCG with a differentiable surrogate loss. Experiments demonstrate that PPA outperforms existing pairwise and listwise methods on evaluation sets and general benchmarks such as AlpacaEval. Furthermore, we show that NDCG-based approaches improve ranking accuracy more effectively than B-T-based methods and provide a theoretical explanation for this improvement.
comment: Published at EMNLP 2025 Main Conference
♻ ☆ SpecEval: Evaluating Model Adherence to Behavior Specifications
Companies that develop foundation models publish behavioral guidelines they pledge their models will follow, but it remains unclear if models actually do so. While providers such as OpenAI, Anthropic, and Google have published detailed specifications describing both desired safety constraints and qualitative traits for their models, there has been no systematic audit of adherence to these guidelines. We introduce an automated framework that audits models against their providers specifications by parsing behavioral statements, generating targeted prompts, and using models to judge adherence. Our central focus is on three way consistency between a provider specification, its model outputs, and its own models as judges; an extension of prior two way generator validator consistency. This establishes a necessary baseline: at minimum, a foundation model should consistently satisfy the developer behavioral specifications when judged by the developer evaluator models. We apply our framework to 16 models from six developers across more than 100 behavioral statements, finding systematic inconsistencies including compliance gaps of up to 20 percent across providers.
♻ ☆ Language Models (Mostly) Know When to Stop Reading NeurIPS 2025
Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient when the information required to answer a query is localized within the context. We present dynamic context cutoff, a novel method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information. Through analysis of model internals, we discover that specific attention heads inherently encode "sufficiency signals" -- detectable through lightweight classifiers -- that predict when critical information has been processed. This reveals a new efficiency paradigm: models' internal understanding naturally dictates processing needs rather than external compression heuristics. Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B-70B) demonstrate 3.4% accuracy improvement while achieving 1.33x token reduction on average. Furthermore, our method demonstrates superior performance compared to other context efficiency methods at equivalent token reduction rates. Additionally, we observe an emergent scaling phenomenon: while smaller models require probing for sufficiency detection, larger models exhibit intrinsic self-assessment capabilities through prompting.
comment: Accepted to NeurIPS 2025. Project website: https://royxie.com/when-to-stop-project
♻ ☆ Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models NeurIPS
Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 17 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl and https://universe.roboflow.com/rf100-vl/.
comment: The first two authors contributed equally. This work has been accepted to the Neural Information Processing Systems (NeurIPS) 2025 Datasets & Benchmark Track. Project Page: https://rf100-vl.org/
♻ ☆ DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products NeurIPS 2025
Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference. However, existing architectures face a fundamental trade-off between expressivity and efficiency, dictated by the structure of their state-transition matrices. Diagonal matrices, used in models such as Mamba, GLA, or mLSTM, yield fast runtime but have limited expressivity. To address this, recent architectures such as DeltaNet and RWKV-7 adopted a diagonal plus rank--1 structure, which allows simultaneous token and channel mixing, improving associative recall and, as recently shown, state-tracking when allowing state-transition matrices to have negative eigenvalues. Building on the interpretation of DeltaNet's recurrence as performing one step of online gradient descent per token on an associative recall loss, we introduce DeltaProduct, which instead takes multiple ($n_h$) steps per token. This naturally leads to diagonal plus rank--$n_h$ state-transition matrices, formed as products of $n_h$ generalized Householder transformations, providing a tunable mechanism to balance expressivity and efficiency. We provide a detailed theoretical characterization of the state-tracking capability of DeltaProduct in finite precision, showing how it improves by increasing $n_h$. Our extensive experiments demonstrate that DeltaProduct outperforms DeltaNet in both state-tracking and language modeling, while also showing significantly improved length extrapolation capabilities.
comment: v5: Characterization of DeltaProduct's state-tracking ability. Analysis of hidden state's effective rank. Improved scaling analysis. v6: Added analysis for products of RWKV-7 matrices, v6: Accepted at NeurIPS 2025
Information Retrieval 17
☆ ToolDreamer: Instilling LLM Reasoning Into Tool Retrievers
Tool calling has become increasingly popular for Large Language Models (LLMs). However, for large tool sets, the resulting tokens would exceed the LLM's context window limit, making it impossible to include every tool. Hence, an external retriever is used to provide LLMs with the most relevant tools for a query. Existing retrieval models rank tools based on the similarity between a user query and a tool description (TD). This leads to suboptimal retrieval as user requests are often poorly aligned with the language of TD. To remedy the issue, we propose ToolDreamer, a framework to condition retriever models to fetch tools based on hypothetical (synthetic) TD generated using an LLM, i.e., description of tools that the LLM feels will be potentially useful for the query. The framework enables a more natural alignment between queries and tools within the language space of TD's. We apply ToolDreamer on the ToolRet dataset and show that our method improves the performance of sparse and dense retrievers with and without training, thus showcasing its flexibility. Through our proposed framework, our aim is to offload a portion of the reasoning burden to the retriever so that the LLM may effectively handle a large collection of tools without inundating its context window.
☆ Top-P Masking for Cross Language Information Retrieval
Top-K masking schemes have been proposed as a method to promote sparse representations in Information Retrieval (IR) tasks, as a simple alternative to Floating Point Operations per Second (FLOPS) regularization. Algorithms such as Bilingual Lexical and Document Expansion Model (BLADE), adopt this approach as a post-processing stage. We propose using Top-P Dynamic Masking similar to Nucleus Sampling in Large Language Models, and demonstrate better performance than Top-K masking. Specifically, we evaluate our methods in the domain of Cross Language Information Retrieval (CLIR)
comment: Unsubmitted
☆ A Matter of Time: Revealing the Structure of Time in Vision-Language Models
Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs by a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit ``timeline'' representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.
☆ The Massive Legal Embedding Benchmark (MLEB)
We present the Massive Legal Embedding Benchmark (MLEB), the largest, most diverse, and most comprehensive open-source benchmark for legal information retrieval to date. MLEB consists of ten expert-annotated datasets spanning multiple jurisdictions (the US, UK, EU, Australia, Ireland, and Singapore), document types (cases, legislation, regulatory guidance, contracts, and literature), and task types (search, zero-shot classification, and question answering). Seven of the datasets in MLEB were newly constructed in order to fill domain and jurisdictional gaps in the open-source legal information retrieval landscape. We document our methodology in building MLEB and creating the new constituent datasets, and release our code, results, and data openly to assist with reproducible evaluations.
comment: 15 pages, 2 figures
☆ CoRECT: A Framework for Evaluating Embedding Compression Techniques at Scale
Dense retrieval systems have proven to be effective across various benchmarks, but require substantial memory to store large search indices. Recent advances in embedding compression show that index sizes can be greatly reduced with minimal loss in ranking quality. However, existing studies often overlook the role of corpus complexity -- a critical factor, as recent work shows that both corpus size and document length strongly affect dense retrieval performance. In this paper, we introduce CoRECT (Controlled Retrieval Evaluation of Compression Techniques), a framework for large-scale evaluation of embedding compression methods, supported by a newly curated dataset collection. To demonstrate its utility, we benchmark eight representative types of compression methods. Notably, we show that non-learned compression achieves substantial index size reduction, even on up to 100M passages, with statistically insignificant performance loss. However, selecting the optimal compression method remains challenging, as performance varies across models. Such variability highlights the necessity of CoRECT to enable consistent comparison and informed selection of compression methods. All code, data, and results are available on GitHub and HuggingFace.
☆ Metadata Extraction Leveraging Large Language Models
The advent of Large Language Models has revolutionized tasks across domains, including the automation of legal document analysis, a critical component of modern contract management systems. This paper presents a comprehensive implementation of LLM-enhanced metadata extraction for contract review, focusing on the automatic detection and annotation of salient legal clauses. Leveraging both the publicly available Contract Understanding Atticus Dataset (CUAD) and proprietary contract datasets, our work demonstrates the integration of advanced LLM methodologies with practical applications. We identify three pivotal elements for optimizing metadata extraction: robust text conversion, strategic chunk selection, and advanced LLM-specific techniques, including Chain of Thought (CoT) prompting and structured tool calling. The results from our experiments highlight the substantial improvements in clause identification accuracy and efficiency. Our approach shows promise in reducing the time and cost associated with contract review while maintaining high accuracy in legal clause identification. The results suggest that carefully optimized LLM systems could serve as valuable tools for legal professionals, potentially increasing access to efficient contract review services for organizations of all sizes.
☆ C2T-ID: Converting Semantic Codebooks to Textual Document Identifiers for Generative Search
Designing document identifiers (docids) that carry rich semantic information while maintaining tractable search spaces is a important challenge in generative retrieval (GR). Popular codebook methods address this by building a hierarchical semantic tree and constraining generation to its child nodes, yet their numeric identifiers cannot leverage the large language model's pretrained natural language understanding. Conversely, using text as docid provides more semantic expressivity but inflates the decoding space, making the system brittle to early-step errors. To resolve this trade-off, we propose C2T-ID: (i) first construct semantic numerical docid via hierarchical clustering; (ii) then extract high-frequency metadata keywords and iteratively replace each numeric label with its cluster's top-K keywords; and (iii) an optional two-level semantic smoothing step further enhances the fluency of C2T-ID. Experiments on Natural Questions and Taobao's product search demonstrate that C2T-ID significantly outperforms atomic, semantic codebook, and pure-text docid baselines, demonstrating its effectiveness in balancing semantic expressiveness with search space constraints.
☆ Automating Iconclass: LLMs and RAG for Large-Scale Classification of Religious Woodcuts
This paper presents a novel methodology for classifying early modern religious images by using Large Language Models (LLMs) and vector databases in combination with Retrieval-Augmented Generation (RAG). The approach leverages the full-page context of book illustrations from the Holy Roman Empire, allowing the LLM to generate detailed descriptions that incorporate both visual and textual elements. These descriptions are then matched to relevant Iconclass codes through a hybrid vector search. This method achieves 87% and 92% precision at five and four levels of classification, significantly outperforming traditional image and keyword-based searches. By employing full-page descriptions and RAG, the system enhances classification accuracy, offering a powerful tool for large-scale analysis of early modern visual archives. This interdisciplinary approach demonstrates the growing potential of LLMs and RAG in advancing research within art history and digital humanities.
comment: 29 pages, 7 figures. First presented at the "Digital Humanities and Artificial Intelligence" conference at the University of Reading on 17 June 2024
☆ Development of an Automated Web Application for Efficient Web Scraping: Design and Implementation
This paper presents the design and implementation of a user-friendly, automated web application that simplifies and optimizes the web scraping process for non-technical users. The application breaks down the complex task of web scraping into three main stages: fetching, extraction, and execution. In the fetching stage, the application accesses target websites using the HTTP protocol, leveraging the requests library to retrieve HTML content. The extraction stage utilizes powerful parsing libraries like BeautifulSoup and regular expressions to extract relevant data from the HTML. Finally, the execution stage structures the data into accessible formats, such as CSV, ensuring the scraped content is organized for easy use. To provide personalized and secure experiences, the application includes user registration and login functionalities, supported by MongoDB, which stores user data and scraping history. Deployed using the Flask framework, the tool offers a scalable, robust environment for web scraping. Users can easily input website URLs, define data extraction parameters, and download the data in a simplified format, without needing technical expertise. This automated tool not only enhances the efficiency of web scraping but also democratizes access to data extraction by empowering users of all technical levels to gather and manage data tailored to their needs. The methodology detailed in this paper represents a significant advancement in making web scraping tools accessible, efficient, and easy to use for a broader audience.
☆ ODataX: A Progressive Evolution of the Open Data Protocol
The Open Data Protocol (OData) provides a standardized approach for building and consuming RESTful APIs with rich query capabilities. Despite its power and maturity, OData adoption remains confined primarily to enterprise environments, particularly within Microsoft and SAP ecosystems. This paper analyzes the key barriers preventing wider OData adoption and introduces ODataX, an evolved version of the protocol designed to address these limitations. ODataX maintains backward compatibility with OData v4 while introducing progressive complexity disclosure through simplified query syntax, built-in performance guardrails via query cost estimation, and enhanced caching mechanisms. This work aims to bridge the gap between enterprise-grade query standardization and the simplicity demanded by modern web development practices.
♻ ☆ OneSearch: A Preliminary Exploration of the Unified End-to-End Generative Framework for E-commerce Search
Traditional e-commerce search systems employ multi-stage cascading architectures (MCA) that progressively filter items through recall, pre-ranking, and ranking stages. While effective at balancing computational efficiency with business conversion, these systems suffer from fragmented computation and optimization objective collisions across stages, which ultimately limit their performance ceiling. To address these, we propose \textbf{OneSearch}, the first industrial-deployed end-to-end generative framework for e-commerce search. This framework introduces three key innovations: (1) a Keyword-enhanced Hierarchical Quantization Encoding (KHQE) module, to preserve both hierarchical semantics and distinctive item attributes while maintaining strong query-item relevance constraints; (2) a multi-view user behavior sequence injection strategy that constructs behavior-driven user IDs and incorporates both explicit short-term and implicit long-term sequences to model user preferences comprehensively; and (3) a Preference-Aware Reward System (PARS) featuring multi-stage supervised fine-tuning and adaptive reward-weighted ranking to capture fine-grained user preferences. Extensive offline evaluations on large-scale industry datasets demonstrate OneSearch's superior performance for high-quality recall and ranking. The rigorous online A/B tests confirm its ability to enhance relevance in the same exposure position, achieving statistically significant improvements: +1.67% item CTR, +2.40% buyer, and +3.22% order volume. Furthermore, OneSearch reduces operational expenditure by 75.40% and improves Model FLOPs Utilization from 3.26% to 27.32%. The system has been successfully deployed across multiple search scenarios in Kuaishou, serving millions of users, generating tens of millions of PVs daily.
♻ ☆ A Generative Framework for Personalized Sticker Retrieval EMNLP2025
Formulating information retrieval as a variant of generative modeling, specifically using autoregressive models to generate relevant identifiers for a given query, has recently attracted considerable attention. However, its application to personalized sticker retrieval remains largely unexplored and presents unique challenges: existing relevance-based generative retrieval methods typically lack personalization, leading to a mismatch between diverse user expectations and the retrieved results. To address this gap, we propose PEARL, a novel generative framework for personalized sticker retrieval, and make two key contributions: (i) To encode user-specific sticker preferences, we design a representation learning model to learn discriminative user representations. It is trained on three prediction tasks that leverage personal information and click history; and (ii) To generate stickers aligned with a user's query intent, we propose a novel intent-aware learning objective that prioritizes stickers associated with higher-ranked intents. Empirical results from both offline evaluations and online tests demonstrate that PEARL significantly outperforms state-of-the-art methods.
comment: Findings of EMNLP2025
♻ ☆ Embedding in Recommender Systems: A Survey
Recommender systems have become an essential component of many online platforms, providing personalized recommendations to users. A crucial aspect is embedding techniques that convert the high-dimensional discrete features, such as user and item IDs, into low-dimensional continuous vectors, which can enhance the recommendation performance. Embedding techniques have revolutionized the capture of complex entity relationships, generating significant research interest. This survey presents a comprehensive analysis of recent advances in recommender system embedding techniques. We examine centralized embedding approaches across matrix, sequential, and graph structures. In matrix-based scenarios, collaborative filtering generates embeddings that effectively model user-item preferences, particularly in sparse data environments. For sequential data, we explore various approaches including recurrent neural networks and self-supervised methods such as contrastive and generative learning. In graph-structured contexts, we analyze techniques like node2vec that leverage network relationships, along with applicable self-supervised methods. Our survey addresses critical scalability challenges in embedding methods and explores innovative directions in recommender systems. We introduce emerging approaches, including AutoML, hashing techniques, and quantization methods, to enhance performance while reducing computational complexity. Additionally, we examine the promising role of Large Language Models (LLMs) in embedding enhancement. Through detailed discussion of various architectures and methodologies, this survey aims to provide a thorough overview of state-of-the-art embedding techniques in recommender systems, while highlighting key challenges and future research directions.
comment: 47 pages
♻ ☆ On Efficiency-Effectiveness Trade-off of Diffusion-based Recommenders
Diffusion models have emerged as a powerful paradigm for generative sequential recommendation, which typically generate next items to recommend guided by user interaction histories with a multi-step denoising process. However, the multi-step process relies on discrete approximations, introducing discretization error that creates a trade-off between computational efficiency and recommendation effectiveness. To address this trade-off, we propose TA-Rec, a two-stage framework that achieves one-step generation by smoothing the denoising function during pretraining while alleviating trajectory deviation by aligning with user preferences during fine-tuning. Specifically, to improve the efficiency without sacrificing the recommendation performance, TA-Rec pretrains the denoising model with Temporal Consistency Regularization (TCR), enforcing the consistency between the denoising results across adjacent steps. Thus, we can smooth the denoising function to map the noise as oracle items in one step with bounded error. To further enhance effectiveness, TA-Rec introduces Adaptive Preference Alignment (APA) that aligns the denoising process with user preference adaptively based on preference pair similarity and timesteps. Extensive experiments prove that TA-Rec's two-stage objective effectively mitigates the discretization errors-induced trade-off, enhancing both efficiency and effectiveness of diffusion-based recommenders.
♻ ☆ Shilling Recommender Systems by Generating Side-feature-aware Fake User Profiles
Recommender systems (RS) greatly influence users' consumption decisions, making them attractive targets for malicious shilling attacks that inject fake user profiles to manipulate recommendations. Existing shilling methods can generate effective and stealthy fake profiles when training data only contain rating matrix, but they lack comprehensive solutions for scenarios where side features are present and utilized by the recommender. To address this gap, we extend the Leg-UP framework by enhancing the generator architecture to incorporate side features, enabling the generation of side-feature-aware fake user profiles. Experiments on benchmarks show that our method achieves strong attack performance while maintaining stealthiness.
♻ ☆ Stop Playing the Guessing Game! Target-free User Simulation for Evaluating Conversational Recommender Systems EMNLP 2025
Recent approaches in Conversational Recommender Systems (CRSs) have tried to simulate real-world users engaging in conversations with CRSs to create more realistic testing environments that reflect the complexity of human-agent dialogue. Despite the significant advancements, reliably evaluating the capability of CRSs to elicit user preferences still faces a significant challenge. Existing evaluation metrics often rely on target-biased user simulators that assume users have predefined preferences, leading to interactions that devolve into simplistic guessing game. These simulators typically guide the CRS toward specific target items based on fixed attributes, limiting the dynamic exploration of user preferences and struggling to capture the evolving nature of real-user interactions. Additionally, current evaluation metrics are predominantly focused on single-turn recall of target items, neglecting the intermediate processes of preference elicitation. To address this, we introduce PEPPER, a novel CRS evaluation protocol with target-free user simulators constructed from real-user interaction histories and reviews. PEPPER enables realistic user-CRS dialogues without falling into simplistic guessing games, allowing users to gradually discover their preferences through enriched interactions, thereby providing a more accurate and reliable assessment of the CRS's ability to elicit personal preferences. Furthermore, PEPPER presents detailed measures for comprehensively evaluating the preference elicitation capabilities of CRSs, encompassing both quantitative and qualitative measures that capture four distinct aspects of the preference elicitation process. Through extensive experiments, we demonstrate the validity of PEPPER as a simulation environment and conduct a thorough analysis of how effectively existing CRSs perform in preference elicitation and recommendation.
comment: Accepted to EMNLP 2025 Findings
♻ ☆ LLMs as Sparse Retrievers:A Framework for First-Stage Product Search
Product search is a crucial component of modern e-commerce platforms, with billions of user queries every day. In product search systems, first-stage retrieval should achieve high recall while ensuring efficient online deployment. Sparse retrieval is particularly attractive in this context due to its interpretability and storage efficiency. However, sparse retrieval methods suffer from severe vocabulary mismatch issues, leading to suboptimal performance in product search scenarios. With their potential for semantic analysis, large language models (LLMs) offer a promising avenue for mitigating vocabulary mismatch issues and thereby improving retrieval quality. Directly applying LLMs to sparse retrieval in product search exposes two key challenges:(1)Queries and product titles are typically short and highly susceptible to LLM-induced hallucinations, such as generating irrelevant expansion terms or underweighting critical literal terms like brand names and model numbers;(2)The large vocabulary space of LLMs leads to difficulty in initializing training effectively, making it challenging to learn meaningful sparse representations in such ultra-high-dimensional spaces.To address these challenges, we propose PROSPER, a framework for PROduct search leveraging LLMs as SParsE Retrievers. PROSPER incorporates: (1)A literal residual network that alleviates hallucination in lexical expansion by reinforcing underweighted literal terms through a residual compensation mechanism; and (2)A lexical focusing window that facilitates effective training initialization via a coarse-to-fine sparsification strategy.Extensive offline and online experiments show that PROSPER significantly outperforms sparse baselines and achieves recall performance comparable to advanced dense retrievers, while also achieving revenue increments online.
comment: 16 pages
Multimedia 10
☆ LifeSync-Games: Toward a Video Game Paradigm for Promoting Responsible Gaming and Human Development
Technological advancements have made video games a central part of the digital lives of nearly 3 billion people worldwide. Although games can address various social, physical, and psychological needs, their potential to support human development and well-being remains underutilized. Research highlights both negative effects, such as addiction and isolation, and positive outcomes like cognitive improvements and problem-solving skills. However, public discourse and regulation often focus more on risks than benefits. To address this imbalance, we present LifeSync-Games, a framework leveraging simplified digital twins to connect virtual gameplay with real-life activities. This reciprocal relationship aims to enhance the developmental value of gaming by promoting self-regulation and fostering growth across physical, mental, and social domains. We present the framework's theoretical foundations, technological components, design guidelines, and evaluation approaches. Additionally, we present early applications in both new and bestselling games to demonstrate its versatility and practical relevance.
comment: 8 pages, 4 figures, 1 table, 66 references
☆ A Matter of Time: Revealing the Structure of Time in Vision-Language Models
Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs by a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit ``timeline'' representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.
☆ CDI-DTI: A Strong Cross-domain Interpretable Drug-Target Interaction Prediction Framework Based on Multi-Strategy Fusion
Accurate prediction of drug-target interactions (DTI) is pivotal for drug discovery, yet existing methods often fail to address challenges like cross-domain generalization, cold-start prediction, and interpretability. In this work, we propose CDI-DTI, a novel cross-domain interpretable framework for DTI prediction, designed to overcome these limitations. By integrating multi-modal features-textual, structural, and functional-through a multi-strategy fusion approach, CDI-DTI ensures robust performance across different domains and in cold-start scenarios. A multi-source cross-attention mechanism is introduced to align and fuse features early, while a bidirectional cross-attention layer captures fine-grained intra-modal drug-target interactions. To enhance model interpretability, we incorporate Gram Loss for feature alignment and a deep orthogonal fusion module to eliminate redundancy. Experimental results on several benchmark datasets demonstrate that CDI-DTI significantly outperforms existing methods, particularly in cross-domain and cold-start tasks, while maintaining high interpretability for practical applications in drug-target interaction prediction.
☆ Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis
Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains, such as psychological analysis, remains largely unexplored. In this paper, we introduce PICK, a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs, specifically focusing on the House-Tree-Person (HTP) Test, a widely used psychological assessment in clinical practice. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content across three levels: single-object level, multi-object level, and whole level. Next, we analyze these sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, to generate a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person), correlating them with psychological states. Finally, we integrate these multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that the proposed PICK significantly enhances the capability of MLLMs in psychological analysis. It is further validated as a general framework through extensions to emotion understanding tasks.
comment: Accepted by ACM Multimedia 2025
☆ See, Think, Act: Online Shopper Behavior Simulation with VLM Agents
LLMs have recently demonstrated strong potential in simulating online shopper behavior. Prior work has improved action prediction by applying SFT on action traces with LLM-generated rationales, and by leveraging RL to further enhance reasoning capabilities. Despite these advances, current approaches rely on text-based inputs and overlook the essential role of visual perception in shaping human decision-making during web GUI interactions. In this paper, we investigate the integration of visual information, specifically webpage screenshots, into behavior simulation via VLMs, leveraging OPeRA dataset. By grounding agent decision-making in both textual and visual modalities, we aim to narrow the gap between synthetic agents and real-world users, thereby enabling more cognitively aligned simulations of online shopping behavior. Specifically, we employ SFT for joint action prediction and rationale generation, conditioning on the full interaction context, which comprises action history, past HTML observations, and the current webpage screenshot. To further enhance reasoning capabilities, we integrate RL with a hierarchical reward structure, scaled by a difficulty-aware factor that prioritizes challenging decision points. Empirically, our studies show that incorporating visual grounding yields substantial gains: the combination of text and image inputs improves exact match accuracy by more than 6% over text-only inputs. These results indicate that multi-modal grounding not only boosts predictive accuracy but also enhances simulation fidelity in visually complex environments, which captures nuances of human attention and decision-making that text-only agents often miss. Finally, we revisit the design space of behavior simulation frameworks, identify key methodological limitations, and propose future research directions toward building efficient and effective human behavior simulators.
☆ Step-Aware Residual-Guided Diffusion for EEG Spatial Super-Resolution ICLR 2026
For real-world BCI applications, lightweight Electroencephalography (EEG) systems offer the best cost-deployment balance. However, such spatial sparsity of EEG limits spatial fidelity, hurting learning and introducing bias. EEG spatial super-resolution methods aim to recover high-density EEG signals from sparse measurements, yet is often hindered by distribution shift and signal distortion and thus reducing fidelity and usability for EEG analysis and visualization. To overcome these challenges, we introduce SRGDiff, a step-aware residual-guided diffusion model that formulates EEG spatial super-resolution as dynamic conditional generation. Our key idea is to learn a dynamic residual condition from the low-density input that predicts the step-wise temporal and spatial details to add and uses the evolving cue to steer the denoising process toward high-density reconstructions. At each denoising step, the proposed residual condition is additively fused with the previous denoiser feature maps, then a step-dependent affine modulation scales and shifts the activation to produce the current features. This iterative procedure dynamically extracts step-wise temporal rhythms and spatial-topographic cues to steer high-density recovery and maintain a fidelity-consistency balance. We adopt a comprehensive evaluation protocol spanning signal-, feature-, and downstream-level metrics across SEED, SEED-IV, and Localize-MI and multiple upsampling scales. SRGDiff achieves consistent gains of up to 40% over strong baselines, proving its superiority in the task of EEG spatial super-resolution. Moreover, topographic visualizations comparison and substantial EEG-FID gains jointly indicate that our SR EEG mitigates the spatial-spectral shift between low- and high-density recordings.
comment: ICLR 2026 Conference Submission
♻ ☆ Variable Rate Image Compression via N-Gram Context based Swin-transformer
This paper presents an N-gram context-based Swin Transformer for learned image compression. Our method achieves variable-rate compression with a single model. By incorporating N-gram context into the Swin Transformer, we overcome its limitation of neglecting larger regions during high-resolution image reconstruction due to its restricted receptive field. This enhancement expands the regions considered for pixel restoration, thereby improving the quality of high-resolution reconstructions. Our method increases context awareness across neighboring windows, leading to a -5.86\% improvement in BD-Rate over existing variable-rate learned image compression techniques. Additionally, our model improves the quality of regions of interest (ROI) in images, making it particularly beneficial for object-focused applications in fields such as manufacturing and industrial vision systems.
comment: Accepted at ISVC 2025
♻ ☆ ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding
We present ASAP, a new framework for detecting and grounding multi-modal media manipulation (DGM4).Upon thorough examination, we observe that accurate fine-grained cross-modal semantic alignment between the image and text is vital for accurately manipulation detection and grounding. While existing DGM4 methods pay rare attention to the cross-modal alignment, hampering the accuracy of manipulation detecting to step further. To remedy this issue, this work targets to advance the semantic alignment learning to promote this task. Particularly, we utilize the off-the-shelf Multimodal Large-Language Models (MLLMs) and Large Language Models (LLMs) to construct paired image-text pairs, especially for the manipulated instances. Subsequently, a cross-modal alignment learning is performed to enhance the semantic alignment. Besides the explicit auxiliary clues, we further design a Manipulation-Guided Cross Attention (MGCA) to provide implicit guidance for augmenting the manipulation perceiving. With the grounding truth available during training, MGCA encourages the model to concentrate more on manipulated components while downplaying normal ones, enhancing the model's ability to capture manipulations. Extensive experiments are conducted on the DGM4 dataset, the results demonstrate that our model can surpass the comparison method with a clear margin.
comment: 12 pages, 6 figures
♻ ☆ TimeWak: Temporal Chained-Hashing Watermark for Time Series Data
Synthetic time series generated by diffusion models enable sharing privacy-sensitive datasets, such as patients' functional MRI records. Key criteria for synthetic data include high data utility and traceability to verify the data source. Recent watermarking methods embed in homogeneous latent spaces, but state-of-the-art time series generators operate in data space, making latent-based watermarking incompatible. This creates the challenge of watermarking directly in data space while handling feature heterogeneity and temporal dependencies. We propose TimeWak, the first watermarking algorithm for multivariate time series diffusion models. To handle temporal dependence and spatial heterogeneity, TimeWak embeds a temporal chained-hashing watermark directly within the temporal-feature data space. The other unique feature is the $\epsilon$-exact inversion, which addresses the non-uniform reconstruction error distribution across features from inverting the diffusion process to detect watermarks. We derive the error bound of inverting multivariate time series while preserving robust watermark detectability. We extensively evaluate TimeWak on its impact on synthetic data quality, watermark detectability, and robustness under various post-editing attacks, against five datasets and baselines of different temporal lengths. Our results show that TimeWak achieves improvements of 61.96% in context-FID score, and 8.44% in correlational scores against the strongest state-of-the-art baseline, while remaining consistently detectable.
♻ ☆ EgoBlind: Towards Egocentric Visual Assistance for the Blind NeurIPS'25
We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,392 first-person videos from the daily lives of blind and visually impaired individuals. It also features 5,311 questions directly posed or verified by the blind to reflect their in-situation needs for visual assistance. Each question has an average of 3 manually annotated reference answers to reduce subjectiveness. Using EgoBlind, we comprehensively evaluate 16 advanced MLLMs and find that all models struggle. The best performers achieve an accuracy near 60\%, which is far behind human performance of 87.4\%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and explore heuristic solutions for improvement. With these efforts, we hope that EgoBlind will serve as a foundation for developing effective AI assistants to enhance the independence of the blind and visually impaired. Data and code are available at https://github.com/doc-doc/EgoBlind.
comment: NeurIPS'25 (D&B Track)
Image and Video Processing 15
☆ Automated Morphological Analysis of Neurons in Fluorescence Microscopy Using YOLOv8
Accurate segmentation and precise morphological analysis of neuronal cells in fluorescence microscopy images are crucial steps in neuroscience and biomedical imaging applications. However, this process is labor-intensive and time-consuming, requiring significant manual effort and expertise to ensure reliable outcomes. This work presents a pipeline for neuron instance segmentation and measurement based on a high-resolution dataset of stem-cell-derived neurons. The proposed method uses YOLOv8, trained on manually annotated microscopy images. The model achieved high segmentation accuracy, exceeding 97%. In addition, the pipeline utilized both ground truth and predicted masks to extract biologically significant features, including cell length, width, area, and grayscale intensity values. The overall accuracy of the extracted morphological measurements reached 75.32%, further supporting the effectiveness of the proposed approach. This integrated framework offers a valuable tool for automated analysis in cell imaging and neuroscience research, reducing the need for manual annotation and enabling scalable, precise quantification of neuron morphology.
comment: 7 pages, 2 figures and 2 tables
☆ Res-DPU: Resource-shared Digital Processing-in-memory Unit for Edge-AI Workloads
Processing-in-memory (PIM) has emerged as the go to solution for addressing the von Neumann bottleneck in edge AI accelerators. However, state-of-the-art (SoTA) digital PIM approaches suffer from low compute density, primarily due to the use of bulky bit cells and transistor-heavy adder trees, which impose limitations on macro scalability and energy efficiency. This work introduces Res-DPU, a resource-shared digital PIM unit, with a dual-port 5T SRAM latch and shared 2T AND compute logic. This reflects the per-bit multiplication cost to just 5.25T and reduced the transistor count of the PIM array by up to 56% over the SoTA works. Furthermore, a Transistor-Reduced 2D Interspersed Adder Tree (TRAIT) with FA-7T and PG-FA-26T helps reduce the power consumption of the adder tree by up to 21.35% and leads to improved energy efficiency by 59% compared to conventional 28T RCA designs. We propose a Cycle-controlled Iterative Approximate-Accurate Multiplication (CIA2M) approach, enabling run-time accuracy-latency trade-offs without requiring error-correction circuitry. The 16 KB REP-DPIM macro achieves 0.43 TOPS throughput and 87.22 TOPS/W energy efficiency in TSMC 65nm CMOS, with 96.85% QoR for ResNet-18 or VGG-16 on CIFAR-10, including 30% pruning. The proposed results establish a Res-DPU module for highly scalable and energy-efficient real-time edge AI accelerators.
☆ TinyUSFM: Towards Compact and Efficient Ultrasound Foundation Models
Foundation models for medical imaging demonstrate superior generalization capabilities across diverse anatomical structures and clinical applications. Their outstanding performance relies on substantial computational resources, limiting deployment in resource-constrained clinical environments. This paper presents TinyUSFM, the first lightweight ultrasound foundation model that maintains superior organ versatility and task adaptability of our large-scale Ultrasound Foundation Model (USFM) through knowledge distillation with strategically curated small datasets, delivering significant computational efficiency without sacrificing performance. Considering the limited capacity and representation ability of lightweight models, we propose a feature-gradient driven coreset selection strategy to curate high-quality compact training data, avoiding training degradation from low-quality redundant images. To preserve the essential spatial and frequency domain characteristics during knowledge transfer, we develop domain-separated masked image modeling assisted consistency-driven dynamic distillation. This novel framework adaptively transfers knowledge from large foundation models by leveraging teacher model consistency across different domain masks, specifically tailored for ultrasound interpretation. For evaluation, we establish the UniUS-Bench, the largest publicly available ultrasound benchmark comprising 8 classification and 10 segmentation datasets across 15 organs. Using only 200K images in distillation, TinyUSFM matches USFM's performance with just 6.36% of parameters and 6.40% of GFLOPs. TinyUSFM significantly outperforms the vanilla model by 9.45% in classification and 7.72% in segmentation, surpassing all state-of-the-art lightweight models, and achieving 84.91% average classification accuracy and 85.78% average segmentation Dice score across diverse medical devices and centers.
comment: Submit to JBHI, 14 pages, 6 figures
☆ Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets
Developing embodied AI agents requires scalable training environments that balance content diversity with physics accuracy. World simulators provide such environments but face distinct limitations: video-based methods generate diverse content but lack real-time physics feedback for interactive learning, while physics-based engines provide accurate dynamics but face scalability limitations from costly manual asset creation. We present Seed3D 1.0, a foundation model that generates simulation-ready 3D assets from single images, addressing the scalability challenge while maintaining physics rigor. Unlike existing 3D generation models, our system produces assets with accurate geometry, well-aligned textures, and realistic physically-based materials. These assets can be directly integrated into physics engines with minimal configuration, enabling deployment in robotic manipulation and simulation training. Beyond individual objects, the system scales to complete scene generation through assembling objects into coherent environments. By enabling scalable simulation-ready content creation, Seed3D 1.0 provides a foundation for advancing physics-based world simulators. Seed3D 1.0 is now available on https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?modelId=doubao-seed3d-1-0-250928&tab=Gen3D
comment: Seed3D 1.0 Technical Report; Official Page on https://seed.bytedance.com/seed3d
☆ Visible Iris Area as a Quality Metric for Reliable Iris Recognition Under Pupil Dilation and Eyelid Occlusion
With the increasing adoption of iris recognition systems and the expansion of large-scale enrollment databases, there is a growing need to efficiently assess iris image quality at the time of acquisition, particularly to model user non-compliance in real time. Image quality may degrade due to eyelid occlusion or pupil dilation. Although previous studies have shown that occlusion and changes in the pupil-to-iris ratio negatively impact recognition performance, these investigations were typically limited by small sample sizes and did not examine the combined effects of eyelid and pupil variations. In this study, we analyze both dilation and eyelid occlusion using a large dataset of 555 distinct irises and demonstrate a strong correlation between probe image visible iris area and the Hamming distance of iris code pairs. These results suggest that visible iris area is a robust indicator of probe image quality and could be efficiently incorporated into the iris acquisition process to improve confidence in match predictions.
comment: 9 pages, 9 figures, 1 table. This work has been submitted to IEEE for possible publication
☆ Lightweight Classifier for Detecting Intracranial Hemorrhage in Ultrasound Data
Intracranial hemorrhage (ICH) secondary to Traumatic Brain Injury (TBI) represents a critical diagnostic challenge, with approximately 64,000 TBI-related deaths annually in the United States. Current diagnostic modalities including Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) have significant limitations: high cost, limited availability, and infrastructure dependence, particularly in resource-constrained environments. This study investigates machine learning approaches for automated ICH detection using Ultrasound Tissue Pulsatility Imaging (TPI), a portable technique measuring tissue displacement from hemodynamic forces during cardiac cycles. We analyze ultrasound TPI signals comprising 30 temporal frames per cardiac cycle with recording angle information, collected from TBI patients with CT-confirmed ground truth labels. Our preprocessing pipeline employs z-score normalization and Principal Component Analysis (PCA) for dimensionality reduction, retaining components explaining 95% of cumulative variance. We systematically evaluate multiple classification algorithms spanning probabilistic, kernel-based, neural network, and ensemble learning approaches across three feature representations: original 31-dimensional space, reduced subset, and PCA-transformed space. Results demonstrate that PCA transformation substantially improves classifier performance, with ensemble methods achieving 98.0% accuracy and F1-score of 0.890, effectively balancing precision and recall despite class imbalance. These findings establish the feasibility of machine learning-based ICH detection in TBI patients using portable ultrasound devices, with applications in emergency medicine, rural healthcare, and military settings where traditional imaging is unavailable.
♻ ☆ Variable Rate Image Compression via N-Gram Context based Swin-transformer
This paper presents an N-gram context-based Swin Transformer for learned image compression. Our method achieves variable-rate compression with a single model. By incorporating N-gram context into the Swin Transformer, we overcome its limitation of neglecting larger regions during high-resolution image reconstruction due to its restricted receptive field. This enhancement expands the regions considered for pixel restoration, thereby improving the quality of high-resolution reconstructions. Our method increases context awareness across neighboring windows, leading to a -5.86\% improvement in BD-Rate over existing variable-rate learned image compression techniques. Additionally, our model improves the quality of regions of interest (ROI) in images, making it particularly beneficial for object-focused applications in fields such as manufacturing and industrial vision systems.
comment: Accepted at ISVC 2025
♻ ☆ Discretized Gaussian Representation for Tomographic Reconstruction ICCV 2025
Computed Tomography (CT) enables detailed cross-sectional imaging but continues to face challenges in balancing reconstruction quality and computational efficiency. While deep learning-based methods have significantly improved image quality and noise reduction, they typically require large-scale training data and intensive computation. Recent advances in scene reconstruction, such as Neural Radiance Fields and 3D Gaussian Splatting, offer alternative perspectives but are not well-suited for direct volumetric CT reconstruction. In this work, we propose Discretized Gaussian Representation (DGR), a novel framework that reconstructs the 3D volume directly using a set of discretized Gaussian functions in an end-to-end manner. To further enhance efficiency, we introduce Fast Volume Reconstruction, a highly parallelized technique that aggregates Gaussian contributions into the voxel grid with minimal overhead. Extensive experiments on both real-world and synthetic datasets demonstrate that DGR achieves superior reconstruction quality and runtime performance across various CT reconstruction scenarios. Our code is publicly available at https://github.com/wskingdom/DGR.
comment: Accepted to ICCV 2025
♻ ☆ AI-Driven Radio Propagation Prediction in Automated Warehouses using Variational Autoencoders
The next decade will usher in a profound transformation of wireless communication, driven by the ever-increasing demand for data-intensive applications and the rapid adoption of emerging technologies. To fully unlock the potential of 5G and beyond, substantial advancements are required in signal processing techniques, innovative network architectures, and efficient spectrum utilization strategies. These advancements facilitate seamless integration of emerging technologies, driving industrial digital transformation and connectivity. This paper introduces a novel Variational Autoencoder (VAE)-based framework, Wireless Infrastructure for Smart Warehouses using VAE (WISVA), designed for accurate indoor radio propagation modeling in automated Industry 4.0 environments such as warehouses and factory floors operating within 5G wireless bands. The research delves into the meticulous creation of training data tensors, capturing complex electromagnetic (EM) wave behaviors influenced by diverse obstacles, and outlines the architecture and training methodology of the proposed VAE model. The model's robustness and adaptability are showcased through its ability to predict signal-to-interference-plus-noise ratio (SINR) heatmaps across various scenarios, including denoising tasks, validation datasets, extrapolation to unseen configurations, and previously unencountered warehouse layouts. Compelling reconstruction error heatmaps are presented, highlighting the superior accuracy of WISVA compared to traditional autoencoder models. The paper also analyzes the model's performance in handling complex smart warehouse environments, demonstrating its potential as a key enabler for optimizing wireless infrastructure in Industry 4.0.
comment: Found a fundamental error (data leakage) in cross-validation setup affecting both papers. This issue compromises the model training and results, making performance claims unreliable and potentially misleading. We request withdrawal of current versions (v1) to prevent the dissemination of incorrect scientific findings. Corrected versions will be submitted later
♻ ☆ LookUp3D: Data-Driven 3D Scanning
High speed, high-resolution, and accurate 3D scanning would open doors to many new applications in graphics, robotics, science, and medicine by enabling the accurate scanning of deformable objects during interactions. Past attempts to use structured light, time-of-flight, and stereo in high-speed settings have usually required tradeoffs in resolution or inaccuracy. In this paper, we introduce a method that enables, for the first time, 3D scanning at 450 frames per second at 1~Megapixel, or 1,450 frames per second at 0.4~Megapixel in an environment with controlled lighting. The key idea is to use a per-pixel lookup table that maps colors to depths, which is built using a linear stage. Imperfections, such as lens-distortion and sensor defects are baked into the calibration. We describe our method and test it on a novel hardware prototype. We compare the system with both ground-truth geometry as well as commercially available dynamic sensors like the Microsoft Kinect and Intel Realsense. Our results show the system acquiring geometry of objects undergoing high-speed deformations and oscillations and demonstrate the ability to recover physical properties from the reconstructions.
comment: Giancarlo Pereira, Yidan Gao, and Yurii Piadyk are joint first authors with equal contribution. 11 pages of main paper, 9 pages of supplemental text (all combined into a single document)
♻ ☆ Learning Differential Pyramid Representation for Tone Mapping
Existing tone mapping methods operate on downsampled inputs and rely on handcrafted pyramids to recover high-frequency details. These designs typically fail to preserve fine textures and structural fidelity in complex HDR scenes. Furthermore, most methods lack an effective mechanism to jointly model global tone consistency and local contrast enhancement, leading to globally flat or locally inconsistent outputs such as halo artifacts. We present the Differential Pyramid Representation Network (DPRNet), an end-to-end framework for high-fidelity tone mapping. At its core is a learnable differential pyramid that generalizes traditional Laplacian and Difference-of-Gaussian pyramids through content-aware differencing operations across scales. This allows DPRNet to adaptively capture high-frequency variations under diverse luminance and contrast conditions. To enforce perceptual consistency, DPRNet incorporates global tone perception and local tone tuning modules operating on downsampled inputs, enabling efficient yet expressive tone adaptation. Finally, an iterative detail enhancement module progressively restores the full-resolution output in a coarse-to-fine manner, reinforcing structure and sharpness. Experiments show that DPRNet achieves state-of-the-art results, improving PSNR by 2.39 dB on the 4K HDR+ dataset and 3.01 dB on the 4K HDRI Haven dataset, while producing perceptually coherent, detail-preserving results. \textit{We provide an anonymous online demo at https://xxxxxxdprnet.github.io/DPRNet/.
♻ ☆ Time-causal and time-recursive wavelets
When to apply wavelet analysis to real-time temporal signals, where the future cannot be accessed, it is essential to base all the steps in the signal processing pipeline on computational mechanisms that are truly time-causal. This paper describes how a time-causal wavelet analysis can be performed based on concepts developed in the area of temporal scale-space theory, originating from a complete classification of temporal smoothing kernels that guarantee non-creation of new structures from finer to coarser temporal scale levels. By necessity, convolution with truncated exponential kernels in cascade constitutes the only permissable class of kernels, as well as their temporal derivatives as a natural complement to fulfil the admissibility conditions of wavelet representations. For a particular way of choosing the time constants in the resulting infinite convolution of truncated exponential kernels, to ensure temporal scale covariance and thus self-similarity over temporal scales, we describe how mother wavelets can be chosen as temporal derivatives of the resulting time-causal limit kernel. By developing connections between wavelet theory and scale-space theory, we characterize and quantify how the continuous scaling properties transfer to the discrete implementation, demonstrating how the proposed time-causal wavelet representation can reflect the duration of locally dominant temporal structures in the input signals. We propose that this notion of time-causal wavelet analysis could be a valuable tool for signal processing tasks, where streams of signals are to be processed in real time, specifically for signals that may contain local variations over a rich span of temporal scales, or more generally for analysing physical or biophysical temporal phenomena, where a fully time-causal analysis is called for to be physically realistic.
comment: 25 pages, 10 figures
♻ ☆ Advancing Image Super-resolution Techniques in Remote Sensing: A Comprehensive Survey SP
Remote sensing image super-resolution (RSISR) is a crucial task in remote sensing image processing, aiming to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts. Despite the growing number of RSISR methods proposed in recent years, a systematic and comprehensive review of these methods is still lacking. This paper presents a thorough review of RSISR algorithms, covering methodologies, datasets, and evaluation metrics. We provide an in-depth analysis of RSISR methods, categorizing them into supervised, unsupervised, and quality evaluation approaches, to help researchers understand current trends and challenges. Our review also discusses the strengths, limitations, and inherent challenges of these techniques. Notably, our analysis reveals significant limitations in existing methods, particularly in preserving fine-grained textures and geometric structures under large-scale degradation. Based on these findings, we outline future research directions, highlighting the need for domain-specific architectures and robust evaluation protocols to bridge the gap between synthetic and real-world RSISR scenarios.
comment: Accepted by ISPRS Journal of Photogrammetry and Remote Sensing
♻ ☆ Untangling Vascular Trees for Surgery and Interventional Radiology
The diffusion of minimally invasive, endovascular interventions motivates the development of visualization methods for complex vascular networks. We propose a planar representation of blood vessel trees which preserves the properties that are most relevant to catheter navigation: topology, length and curvature. Taking as input a three-dimensional digital angiography, our algorithm produces a faithful two-dimensional map of the patient's vessels within a few seconds. To this end, we propose optimized implementations of standard morphological filters and a new recursive embedding algorithm that preserves the global orientation of the vascular network. We showcase our method on peroperative images of the brain, pelvic and knee artery networks. On the clinical side, our method simplifies the choice of devices prior to and during the intervention. This lowers the risk of failure during navigation or device deployment and may help to reduce the gap between expert and common intervention centers. From a research perspective, our method simulates the cadaveric display of artery trees from anatomical dissections. This opens the door to large population studies on the branching patterns and tortuosity of fine human blood vessels. Our code is released under the permissive MIT license as part of the scikit-shapes Python library (https://scikit-shapes.github.io ).
♻ ☆ DIPLI: Deep Image Prior Lucky Imaging for Blind Astronomical Image Restoration
Modern image restoration and super-resolution methods utilize deep learning due to its superior performance compared to traditional algorithms. However, deep learning typically requires large training datasets, which are rarely available in astrophotography. Deep Image Prior (DIP) bypasses this constraint by performing blind training on a single image. Although effective in some cases, DIP often suffers from overfitting, artifact generation, and instability. To overcome these issues and improve general performance, this work proposes DIPLI - a framework that shifts from single-frame to multi-frame training using the Back Projection technique, combined with optical flow estimation via the TVNet model, and replaces deterministic predictions with unbiased Monte Carlo estimation obtained through Langevin dynamics. A comprehensive evaluation compares the method against Lucky Imaging, a classical computer vision technique still widely used in astronomical image reconstruction, DIP, the transformer-based model RVRT, and the diffusion-based model DiffIR2VR-Zero. Experiments on synthetic datasets demonstrate consistent improvements, with the method outperforming baselines for SSIM, PSNR, LPIPS, and DISTS metrics in the majority of cases. In addition to superior reconstruction quality, the model also requires far fewer input images than Lucky Imaging and is less prone to overfitting or artifact generation. Evaluation on real-world astronomical data, where domain shifts typically hinder generalization, shows that the method maintains high reconstruction quality, confirming practical robustness.
comment: 10 pages, 7 figures, 2 tables
Computer Vision and Pattern Recognition 150
☆ Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehen- sive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.
☆ DSI-Bench: A Benchmark for Dynamic Spatial Intelligence
Reasoning about dynamic spatial relationships is essential, as both observers and objects often move simultaneously. Although vision-language models (VLMs) and visual expertise models excel in 2D tasks and static scenarios, their ability to fully understand dynamic 3D scenarios remains limited. We introduce Dynamic Spatial Intelligence and propose DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects. Spatially and temporally symmetric designs reduce biases and enable systematic evaluation of models' reasoning about self-motion and object motion. Our evaluation of 14 VLMs and expert models reveals key limitations: models often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative relationships in dynamic scenarios. Our DSI-Bench provides valuable findings and insights about the future development of general and expertise models with dynamic spatial intelligence.
☆ LightMem: Lightweight and Efficient Memory-Augmented Generation
Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. The code is available at https://github.com/zjunlp/LightMem.
comment: Work in progress
☆ DP$^2$O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution NeurIPS 2025
Benefiting from pre-trained text-to-image (T2I) diffusion models, real-world image super-resolution (Real-ISR) methods can synthesize rich and realistic details. However, due to the inherent stochasticity of T2I models, different noise inputs often lead to outputs with varying perceptual quality. Although this randomness is sometimes seen as a limitation, it also introduces a wider perceptual quality range, which can be exploited to improve Real-ISR performance. To this end, we introduce Direct Perceptual Preference Optimization for Real-ISR (DP$^2$O-SR), a framework that aligns generative models with perceptual preferences without requiring costly human annotations. We construct a hybrid reward signal by combining full-reference and no-reference image quality assessment (IQA) models trained on large-scale human preference datasets. This reward encourages both structural fidelity and natural appearance. To better utilize perceptual diversity, we move beyond the standard best-vs-worst selection and construct multiple preference pairs from outputs of the same model. Our analysis reveals that the optimal selection ratio depends on model capacity: smaller models benefit from broader coverage, while larger models respond better to stronger contrast in supervision. Furthermore, we propose hierarchical preference optimization, which adaptively weights training pairs based on intra-group reward gaps and inter-group diversity, enabling more efficient and stable learning. Extensive experiments across both diffusion- and flow-based T2I backbones demonstrate that DP$^2$O-SR significantly improves perceptual quality and generalizes well to real-world benchmarks.
comment: Accept by NeurIPS 2025
☆ See the Text: From Tokenization to Visual Reading
People see text. Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, before connecting them to meaning, which enables us to handle typos, distorted fonts, and various scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, fragmenting text into pieces from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading, and takes a step toward more natural and cognitively inspired language models.
☆ FedDEAP: Adaptive Dual-Prompt Tuning for Multi-Domain Federated Learning
Federated learning (FL) enables multiple clients to collaboratively train machine learning models without exposing local data, balancing performance and privacy. However, domain shift and label heterogeneity across clients often hinder the generalization of the aggregated global model. Recently, large-scale vision-language models like CLIP have shown strong zero-shot classification capabilities, raising the question of how to effectively fine-tune CLIP across domains in a federated setting. In this work, we propose an adaptive federated prompt tuning framework, FedDEAP, to enhance CLIP's generalization in multi-domain scenarios. Our method includes the following three key components: (1) To mitigate the loss of domain-specific information caused by label-supervised tuning, we disentangle semantic and domain-specific features in images by using semantic and domain transformation networks with unbiased mappings; (2) To preserve domain-specific knowledge during global prompt aggregation, we introduce a dual-prompt design with a global semantic prompt and a local domain prompt to balance shared and personalized information; (3) To maximize the inclusion of semantic and domain information from images in the generated text features, we align textual and visual representations under the two learned transformations to preserve semantic and domain consistency. Theoretical analysis and extensive experiments on four datasets demonstrate the effectiveness of our method in enhancing the generalization of CLIP for federated image recognition across multiple domains.
comment: Accepted at MM 2025
☆ Unifying and Enhancing Graph Transformers via a Hierarchical Mask Framework NeurIPS 2025
Graph Transformers (GTs) have emerged as a powerful paradigm for graph representation learning due to their ability to model diverse node interactions. However, existing GTs often rely on intricate architectural designs tailored to specific interactions, limiting their flexibility. To address this, we propose a unified hierarchical mask framework that reveals an underlying equivalence between model architecture and attention mask construction. This framework enables a consistent modeling paradigm by capturing diverse interactions through carefully designed attention masks. Theoretical analysis under this framework demonstrates that the probability of correct classification positively correlates with the receptive field size and label consistency, leading to a fundamental design principle: an effective attention mask should ensure both a sufficiently large receptive field and a high level of label consistency. While no single existing mask satisfies this principle across all scenarios, our analysis reveals that hierarchical masks offer complementary strengths, motivating their effective integration. Then, we introduce M3Dphormer, a Mixture-of-Experts-based Graph Transformer with Multi-Level Masking and Dual Attention Computation. M3Dphormer incorporates three theoretically grounded hierarchical masks and employs a bi-level expert routing mechanism to adaptively integrate multi-level interaction information. To ensure scalability, we further introduce a dual attention computation scheme that dynamically switches between dense and sparse modes based on local mask sparsity. Extensive experiments across multiple benchmarks demonstrate that M3Dphormer achieves state-of-the-art performance, validating the effectiveness of our unified framework and model design.
comment: Accepted by NeurIPS 2025 (Poster)
☆ SAM 2++: Tracking Anything at Any Granularity
Video tracking aims at finding the specific target in subsequent frames given its initial state. Due to the varying granularity of target states across different tasks, most existing trackers are tailored to a single task and heavily rely on custom-designed modules within the individual task, which limits their generalization and leads to redundancy in both model design and parameters. To unify video tracking tasks, we present SAM 2++, a unified model towards tracking at any granularity, including masks, boxes, and points. First, to extend target granularity, we design task-specific prompts to encode various task inputs into general prompt embeddings, and a unified decoder to unify diverse task results into a unified form pre-output. Next, to satisfy memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities. Finally, we introduce a customized data engine to support tracking training at any granularity, producing a large and diverse video tracking dataset with rich annotations at three granularities, termed Tracking-Any-Granularity, which represents a comprehensive resource for training and benchmarking on unified tracking. Comprehensive experiments on multiple benchmarks confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.
comment: 8 pages, and 10 pages in Supplementary Material
☆ An Explainable Hybrid AI Framework for Enhanced Tuberculosis and Symptom Detection
Tuberculosis remains a critical global health issue, particularly in resource-limited and remote areas. Early detection is vital for treatment, yet the lack of skilled radiologists underscores the need for artificial intelligence (AI)-driven screening tools. Developing reliable AI models is challenging due to the necessity for large, high-quality datasets, which are costly to obtain. To tackle this, we propose a teacher--student framework which enhances both disease and symptom detection on chest X-rays by integrating two supervised heads and a self-supervised head. Our model achieves an accuracy of 98.85% for distinguishing between COVID-19, tuberculosis, and normal cases, and a macro-F1 score of 90.09% for multilabel symptom detection, significantly outperforming baselines. The explainability assessments also show the model bases its predictions on relevant anatomical features, demonstrating promise for deployment in clinical screening and triage settings.
comment: 16 pages, 3 figures
☆ A Geometric Approach to Steerable Convolutions
In contrast to the somewhat abstract, group theoretical approach adopted by many papers, our work provides a new and more intuitive derivation of steerable convolutional neural networks in $d$ dimensions. This derivation is based on geometric arguments and fundamental principles of pattern matching. We offer an intuitive explanation for the appearance of the Clebsch--Gordan decomposition and spherical harmonic basis functions. Furthermore, we suggest a novel way to construct steerable convolution layers using interpolation kernels that improve upon existing implementation, and offer greater robustness to noisy data.
☆ ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its ability in processing long texts, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP's text encoder into the LLM-based embedder to leverage CLIP's rich pretrained knowledge while establishing initial alignment between the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning. The Code is available at https://github.com/VisionXLab/ProCLIP
comment: 17 pages, 5 fiugres
☆ Rebellious Student: A Complementary Learning Framework for Background Feature Enhancement in Hyperspectral Anomaly Detection
A recent class of hyperspectral anomaly detection methods that can be trained once on background datasets and then universally deployed -- without per-scene retraining or parameter tuning -- has demonstrated remarkable efficiency and robustness. Building upon this paradigm, we focus on the integration of spectral and spatial cues and introduce a novel "Rebellious Student" framework for complementary feature learning. Unlike conventional teacher-student paradigms driven by imitation, our method intentionally trains the spatial branch to diverge from the spectral teacher, thereby learning complementary spatial patterns that the teacher fails to capture. A two-stage learning strategy is adopted: (1) a spectral enhancement network is first trained via reverse distillation to obtain robust background spectral representations; and (2) a spatial network -- the rebellious student -- is subsequently optimized using decorrelation losses that enforce feature orthogonality while maintaining reconstruction fidelity to avoid irrelevant noise. Once trained, the framework enhances both spectral and spatial background features, enabling parameter-free and training-free anomaly detection when paired with conventional detectors. Extensive experiments on the HAD100 benchmark show substantial improvements over several established baselines with minimal computational overhead, confirming the effectiveness and generality of the proposed complementary learning paradigm. Our code is publicly available at https://github.com/xjpp2016/FERS.
☆ UltraGen: High-Resolution Video Generation with Hierarchical Attention
Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion transformer based video generation models are limited to low-resolution outputs (<=720P) due to the quadratic computational complexity of the attention mechanism with respect to the output width and height. This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference. To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a local attention branch for high-fidelity regional content and a global attention branch for overall semantic consistency. We further propose a spatially compressed global modeling strategy to efficiently learn global dependencies, and a hierarchical cross-window local attention mechanism to reduce computational costs while enhancing information flow across different local windows. Extensive experiments demonstrate that UltraGen can effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution for the first time, outperforming existing state-of-the-art methods and super-resolution based two-stage pipelines in both qualitative and quantitative evaluations.
☆ Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model for Microclimate Impact Prediction NeurIPS 2025
As urbanization and climate change progress, urban heat island effects are becoming more frequent and severe. To formulate effective mitigation plans, cities require detailed air temperature data, yet conventional machine learning models with limited data often produce inaccurate predictions, particularly in underserved areas. Geospatial foundation models trained on global unstructured data offer a promising alternative by demonstrating strong generalization and requiring only minimal fine-tuning. In this study, an empirical ground truth of urban heat patterns is established by quantifying cooling effects from green spaces and benchmarking them against model predictions to evaluate the model's accuracy. The foundation model is subsequently fine-tuned to predict land surface temperatures under future climate scenarios, and its practical value is demonstrated through a simulated inpainting that highlights its role for mitigation support. The results indicate that foundation models offer a powerful way for evaluating urban heat island mitigation strategies in data-scarce regions to support more climate-resilient cities.
comment: 10 pages, 9 figures. Accepted at the NeurIPS 2025 Workshop on Tackling Climate Change with Machine Learning
☆ Seg the HAB: Language-Guided Geospatial Algae Bloom Reasoning and Segmentation
Climate change is intensifying the occurrence of harmful algal bloom (HAB), particularly cyanobacteria, which threaten aquatic ecosystems and human health through oxygen depletion, toxin release, and disruption of marine biodiversity. Traditional monitoring approaches, such as manual water sampling, remain labor-intensive and limited in spatial and temporal coverage. Recent advances in vision-language models (VLMs) for remote sensing have shown potential for scalable AI-driven solutions, yet challenges remain in reasoning over imagery and quantifying bloom severity. In this work, we introduce ALGae Observation and Segmentation (ALGOS), a segmentation-and-reasoning system for HAB monitoring that combines remote sensing image understanding with severity estimation. Our approach integrates GeoSAM-assisted human evaluation for high-quality segmentation mask curation and fine-tunes vision language model on severity prediction using the Cyanobacteria Aggregated Manual Labels (CAML) from NASA. Experiments demonstrate that ALGOS achieves robust performance on both segmentation and severity-level estimation, paving the way toward practical and automated cyanobacterial monitoring systems.
☆ SEAL: Semantic-Aware Hierarchical Learning for Generalized Category Discovery NeurIPS 2025
This paper investigates the problem of Generalized Category Discovery (GCD). Given a partially labelled dataset, GCD aims to categorize all unlabelled images, regardless of whether they belong to known or unknown classes. Existing approaches typically depend on either single-level semantics or manually designed abstract hierarchies, which limit their generalizability and scalability. To address these limitations, we introduce a SEmantic-aware hierArchical Learning framework (SEAL), guided by naturally occurring and easily accessible hierarchical structures. Within SEAL, we propose a Hierarchical Semantic-Guided Soft Contrastive Learning approach that exploits hierarchical similarity to generate informative soft negatives, addressing the limitations of conventional contrastive losses that treat all negatives equally. Furthermore, a Cross-Granularity Consistency (CGC) module is designed to align the predictions from different levels of granularity. SEAL consistently achieves state-of-the-art performance on fine-grained benchmarks, including the SSB benchmark, Oxford-Pet, and the Herbarium19 dataset, and further demonstrates generalization on coarse-grained datasets. Project page: https://visual-ai.github.io/seal/
comment: Accepted to NeurIPS 2025
☆ Moving Light Adaptive Colonoscopy Reconstruction via Illumination-Attenuation-Aware 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) has emerged as a pivotal technique for real-time view synthesis in colonoscopy, enabling critical applications such as virtual colonoscopy and lesion tracking. However, the vanilla 3DGS assumes static illumination and that observed appearance depends solely on viewing angle, which causes incompatibility with the photometric variations in colonoscopic scenes induced by dynamic light source/camera. This mismatch forces most 3DGS methods to introduce structure-violating vaporous Gaussian blobs between the camera and tissues to compensate for illumination attenuation, ultimately degrading the quality of 3D reconstructions. Previous works only consider the illumination attenuation caused by light distance, ignoring the physical characters of light source and camera. In this paper, we propose ColIAGS, an improved 3DGS framework tailored for colonoscopy. To mimic realistic appearance under varying illumination, we introduce an Improved Appearance Modeling with two types of illumination attenuation factors, which enables Gaussians to adapt to photometric variations while preserving geometry accuracy. To ensure the geometry approximation condition of appearance modeling, we propose an Improved Geometry Modeling using high-dimensional view embedding to enhance Gaussian geometry attribute prediction. Furthermore, another cosine embedding input is leveraged to generate illumination attenuation solutions in an implicit manner. Comprehensive experimental results on standard benchmarks demonstrate that our proposed ColIAGS achieves the dual capabilities of novel view synthesis and accurate geometric reconstruction. It notably outperforms other state-of-the-art methods by achieving superior rendering fidelity while significantly reducing Depth MSE. Code will be available.
☆ IF-VidCap: Can Video Caption Models Follow Instructions?
Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.
comment: https://github.com/NJU-LINK/IF-VidCap
☆ SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation
Autoregressive image generation models like Janus-Pro produce high-quality images, but at the significant cost of high memory and ever-growing computational demands due to the large number of visual tokens. While KV cache compression has been extensively studied in language modeling, it still remains largely unexplored for the image generation domain. In this work, we begin by identifying a distinct and prominent attention phenomenon, which we term spatial locality and emergent semantic sink. To leverage this key insight, we introduce a novel KV cache compression framework. Specifically, we compress the KV cache for all visual tokens by adaptively decoupling attention heads into two separate types: for spatial-locality heads, our method maintains a short recent token window; for semantic-sink heads, it strategically preserves a compact set of highly-attended tokens. Our extensive experiments demonstrate that the proposed method achieves a 5$\times$ reduction in memory usage and a notable 6.6$\times$ speedup in overall throughput with only minimal visual quality loss, thereby enabling highly efficient native autoregressive image generation on resource-constrained hardware.
☆ PLANA3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting NeurIPS 2025
This paper addresses metric 3D reconstruction of indoor scenes by exploiting their inherent geometric regularities with compact representations. Using planar 3D primitives - a well-suited representation for man-made environments - we introduce PLANA3R, a pose-free framework for metric Planar 3D Reconstruction from unposed two-view images. Our approach employs Vision Transformers to extract a set of sparse planar primitives, estimate relative camera poses, and supervise geometry learning via planar splatting, where gradients are propagated through high-resolution rendered depth and normal maps of primitives. Unlike prior feedforward methods that require 3D plane annotations during training, PLANA3R learns planar 3D structures without explicit plane supervision, enabling scalable training on large-scale stereo datasets using only depth and normal annotations. We validate PLANA3R on multiple indoor-scene datasets with metric supervision and demonstrate strong generalization to out-of-domain indoor environments across diverse tasks under metric evaluation protocols, including 3D surface reconstruction, depth estimation, and relative pose estimation. Furthermore, by formulating with planar 3D representation, our method emerges with the ability for accurate plane segmentation. The project page is available at https://lck666666.github.io/plana3r
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025). The project page is available at: https://lck666666.github.io/plana3r
☆ A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition
Recently, action recognition has been dominated by transformer-based methods, thanks to their spatiotemporal contextual aggregation capacities. However, despite the significant progress achieved on scene-related datasets, they do not perform well on motion-sensitive datasets due to the lack of elaborate motion modeling designs. Meanwhile, we observe that the widely-used cost volume in traditional action recognition is highly similar to the affinity matrix defined in self-attention, but equipped with powerful motion modeling capacities. In light of this, we propose to integrate those effective motion modeling properties into the existing transformer in a unified and neat way, with the proposal of the Explicit Motion Information Mining module (EMIM). In EMIM, we propose to construct the desirable affinity matrix in a cost volume style, where the set of key candidate tokens is sampled from the query-based neighboring area in the next frame in a sliding-window manner. Then, the constructed affinity matrix is used to aggregate contextual information for appearance modeling and is converted into motion features for motion modeling as well. We validate the motion modeling capacities of our method on four widely-used datasets, and our method performs better than existing state-of-the-art approaches, especially on motion-sensitive datasets, i.e., Something-Something V1 & V2.
comment: accepted by Pattern Recognition. We have been always curious to see whether our designs could be beneficial in other scenarios, such as embedding it into the DiT model or 3D-VAE for video generation. If you are interested in it, why not give it a shot?
☆ Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or modality fusion strategy. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments, leveraging the inherent coherence of documents without requiring explicitly paired image-text data. To assess the effectiveness of this approach, we introduce three retrieval benchmarks, AnyCIR, SeqCIR, and CSR, designed to evaluate cross-modal retrieval, fine-grained sequential understanding, and generalization to unseen data, respectively. Empirical results show that VC2L achieves competitive or superior performance compared to CLIP-style models on both the proposed benchmarks and established datasets such as M-BEIR and MTEB. These findings underscore the potential of multimodal web data as a valuable training resource for contrastive learning and illustrate the scalability of a unified, vision-centric approach for multimodal representation learning. Code and models are available at: https://github.com/showlab/VC2L.
comment: Project page: this https://linyq17.github.io/VC2L/
☆ UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks in evaluating how accurately generated images reflect the semantics of their textual prompt. However, (1) existing benchmarks lack the diversity of prompt scenarios and multilingual support, both essential for real-world applicability; (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions, and fall short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) spans across diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; (2) comprehensively probes T2I models' semantic consistency over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide both English and Chinese versions of each prompt in short and long forms. Leveraging the general world knowledge and fine-grained image understanding capabilities of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, an effective pipeline is developed for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-sourced T2I models, we systematically reveal their strengths and weaknesses across various aspects.
comment: Project page: codegoat24.github.io/UniGenBench/
☆ MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-level, multi-shot, 480p videos at 24 fps, with a context length of approximately 580k. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
comment: 15 pages, 12 figures
☆ Beyond the Pipeline: Analyzing Key Factors in End-to-End Deep Learning for Historical Writer Identification
This paper investigates various factors that influence the performance of end-to-end deep learning approaches for historical writer identification (HWI), a task that remains challenging due to the diversity of handwriting styles, document degradation, and the limited number of labelled samples per writer. These conditions often make accurate recognition difficult, even for human experts. Traditional HWI methods typically rely on handcrafted image processing and clustering techniques, which tend to perform well on small and carefully curated datasets. In contrast, end-to-end pipelines aim to automate the process by learning features directly from document images. However, our experiments show that many of these models struggle to generalise in more realistic, document-level settings, especially under zero-shot scenarios where writers in the test set are not present in the training data. We explore different combinations of pre-processing methods, backbone architectures, and post-processing strategies, including text segmentation, patch sampling, and feature aggregation. The results suggest that most configurations perform poorly due to weak capture of low-level visual features, inconsistent patch representations, and high sensitivity to content noise. Still, we identify one end-to-end setup that achieves results comparable to the top-performing system, despite using a simpler design. These findings point to key challenges in building robust end-to-end systems and offer insight into design choices that improve performance in historical document writer identification.
comment: Published in The 12th IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2025
☆ Prototyping an End-to-End Multi-Modal Tiny-CNN for Cardiovascular Sensor Patches
The vast majority of cardiovascular diseases may be preventable if early signs and risk factors are detected. Cardiovascular monitoring with body-worn sensor devices like sensor patches allows for the detection of such signs while preserving the freedom and comfort of patients. However, the analysis of the sensor data must be robust, reliable, efficient, and highly accurate. Deep learning methods can automate data interpretation, reducing the workload of clinicians. In this work, we analyze the feasibility of applying deep learning models to the classification of synchronized electrocardiogram (ECG) and phonocardiogram (PCG) recordings on resource-constrained medical edge devices. We propose a convolutional neural network with early fusion of data to solve a binary classification problem. We train and validate our model on the synchronized ECG and PCG recordings from the Physionet Challenge 2016 dataset. Our approach reduces memory footprint and compute cost by three orders of magnitude compared to the state-of-the-art while maintaining competitive accuracy. We demonstrate the applicability of our proposed model on medical edge devices by analyzing energy consumption on a microcontroller and an experimental sensor device setup, confirming that on-device inference can be more energy-efficient than continuous data streaming.
comment: Submitted to the IEEE Journal of Biomedical And Health Informatics
☆ Image augmentation with invertible networks in interactive satellite image change detection
This paper devises a novel interactive satellite image change detection algorithm based on active learning. Our framework employs an iterative process that leverages a question-and-answer model. This model queries the oracle (user) about the labels of a small subset of images (dubbed as display), and based on the oracle's responses, change detection model is dynamically updated. The main contribution of our framework resides in a novel invertible network that allows augmenting displays, by mapping them from highly nonlinear input spaces to latent ones, where augmentation transformations become linear and more tractable. The resulting augmented data are afterwards mapped back to the input space, and used to retrain more effective change detection criteria in the subsequent iterations of active learning. Experimental results demonstrate superior performance of our proposed method compared to the related work.
☆ Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression NeurIPS 2025
This paper proposes a novel matrix quantization method, Binary Quadratic Quantization (BQQ). In contrast to conventional first-order quantization approaches, such as uniform quantization and binary coding quantization, that approximate real-valued matrices via linear combinations of binary bases, BQQ leverages the expressive power of binary quadratic expressions while maintaining an extremely compact data format. We validate our approach with two experiments: a matrix compression benchmark and post-training quantization (PTQ) on pretrained Vision Transformer-based models. Experimental results demonstrate that BQQ consistently achieves a superior trade-off between memory efficiency and reconstruction error than conventional methods for compressing diverse matrix data. It also delivers strong PTQ performance, even though we neither target state-of-the-art PTQ accuracy under tight memory constraints nor rely on PTQ-specific binary matrix optimization. For example, our proposed method outperforms the state-of-the-art PTQ method by up to 2.2\% and 59.1% on the ImageNet dataset under the calibration-based and data-free scenarios, respectively, with quantization equivalent to 2 bits. These findings highlight the surprising effectiveness of binary quadratic expressions for efficient matrix approximation and neural network compression.
comment: Accepted to NeurIPS 2025
☆ ε-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data
Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce {\epsilon}-Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center-region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse (0.05% of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster wrt. the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose a MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of {\epsilon}-Seg and baseline methods on 2 dense EM datasets of biological tissues and demonstrate the applicability of our method also on fluorescence microscopy data. Our results show that {\epsilon}-Seg is capable of achieving competitive sparsely-supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available.
comment: 10 pages main text, 17 pages total
☆ C-SWAP: Explainability-Aware Structured Pruning for Efficient Neural Networks Compression BMVC2025
Neural network compression has gained increasing attention in recent years, particularly in computer vision applications, where the need for model reduction is crucial for overcoming deployment constraints. Pruning is a widely used technique that prompts sparsity in model structures, e.g. weights, neurons, and layers, reducing size and inference costs. Structured pruning is especially important as it allows for the removal of entire structures, which further accelerates inference time and reduces memory overhead. However, it can be computationally expensive, requiring iterative retraining and optimization. To overcome this problem, recent methods considered one-shot setting, which applies pruning directly at post-training. Unfortunately, they often lead to a considerable drop in performance. In this paper, we focus on this issue by proposing a novel one-shot pruning framework that relies on explainable deep learning. First, we introduce a causal-aware pruning approach that leverages cause-effect relations between model predictions and structures in a progressive pruning process. It allows us to efficiently reduce the size of the network, ensuring that the removed structures do not deter the performance of the model. Then, through experiments conducted on convolution neural network and vision transformer baselines, pre-trained on classification tasks, we demonstrate that our method consistently achieves substantial reductions in model size, with minimal impact on performance, and without the need for fine-tuning. Overall, our approach outperforms its counterparts, offering the best trade-off. Our code is available on GitHub.
comment: 10 pages, BMVC2025
☆ Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
Though recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that can effectively exploits the rich geometric information embedded within images while reasoning, like humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning. Our code will be available at https://github.com/zhangquanchen/3DThinker.
comment: 12 pages, 4 figures
☆ CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent
Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. While script-based verifiers are widely adopted for evaluation, they suffer from limited scalability and inability to provide step-wise assessment. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. To address this gap, we present CUARewardBench, comprising four key contributions: (1) First-ever Comprehensive CUA Reward Benchmark: We introduce the first benchmark for evaluating both outcome reward models (ORM) and process reward models (PRM) on CUA tasks, enabling systematic assessment across trajectory-level and step-level evaluation. (2) Diverse, Practical and Reliable Dataset: CUARewardBench encompasses trajectories from 10 software categories and 7 agent architectures with varying performance levels (25.9%-50.8% success rates). All trajectories are expertly annotated through carefully designed protocols, with rigorous quality control to ensure reliability and practical applicability. (3) Comprehensive Analysis and Insights: Through extensive experiments across 7 vision-language models and 3 prompt templates, we reveal critical limitations of current CUA RMs, including insufficient visual reasoning capabilities, knowledge deficiencies, and the superiority of general VLMs over specialized CUA models for reward evaluation. (4) Unanimous Prompt Ensemble (UPE): Based on the insights from our comprehensive analysis, we propose UPE, a novel ensemble method that significantly enhances reward model reliability through strict unanimous voting and strategic prompt-template configurations. UPE achieves 89.8% precision and 93.3% NPV for ORM, and 81.7% precision and 85.1% NPV for PRM, substantially outperforming single VLMs and traditional ensemble approaches.
comment: 24 pages, 6 figures
☆ CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder NeurIPS 2025
Multimodal dataset distillation aims to synthesize a small set of image-text pairs that enables efficient training of large-scale vision-language models. While dataset distillation has shown promise in unimodal tasks, extending it to multimodal contrastive learning presents key challenges: learning cross-modal alignment and managing the high computational cost of large encoders. Prior approaches address scalability by freezing the text encoder and update only the image encoder and text projection layer. However, we find this severely limits semantic alignment and becomes a bottleneck for performance scaling. We propose CovMatch, a scalable dataset distillation framework that aligns the cross-covariance of real and synthetic features while regularizing feature distributions within each modality. Unlike prior approaches, CovMatch enables joint optimization of both encoders, leading to stronger cross-modal alignment and improved performance. Evaluated on Flickr30K and COCO, CovMatch outperforms state-of-the-art multimodal distillation methods and achieves up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic pairs.
comment: NeurIPS 2025
☆ Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model
We present Kaleido, a subject-to-video~(S2V) generation framework, which aims to synthesize subject-consistent videos conditioned on multiple reference images of target subjects. Despite recent progress in S2V generation models, existing approaches remain inadequate at maintaining multi-subject consistency and at handling background disentanglement, often resulting in lower reference fidelity and semantic drift under multi-image conditioning. These shortcomings can be attributed to several factors. Primarily, the training dataset suffers from a lack of diversity and high-quality samples, as well as cross-paired data, i.e., paired samples whose components originate from different instances. In addition, the current mechanism for integrating multiple reference images is suboptimal, potentially resulting in the confusion of multiple subjects. To overcome these limitations, we propose a dedicated data construction pipeline, incorporating low-quality sample filtering and diverse data synthesis, to produce consistency-preserving training data. Moreover, we introduce Reference Rotary Positional Encoding (R-RoPE) to process reference images, enabling stable and precise multi-image integration. Extensive experiments across numerous benchmarks demonstrate that Kaleido significantly outperforms previous methods in consistency, fidelity, and generalization, marking an advance in S2V generation.
comment: 11 pages, 6 figures
☆ Descriptor: Occluded nuScenes: A Multi-Sensor Dataset for Evaluating Perception Robustness in Automated Driving
Robust perception in automated driving requires reliable performance under adverse conditions, where sensors may be affected by partial failures or environmental occlusions. Although existing autonomous driving datasets inherently contain sensor noise and environmental variability, very few enable controlled, parameterised, and reproducible degradations across multiple sensing modalities. This gap limits the ability to systematically evaluate how perception and fusion architectures perform under well-defined adverse conditions. To address this limitation, we introduce the Occluded nuScenes Dataset, a novel extension of the widely used nuScenes benchmark. For the camera modality, we release both the full and mini versions with four types of occlusions, two adapted from public implementations and two newly designed. For radar and LiDAR, we provide parameterised occlusion scripts that implement three types of degradations each, enabling flexible and repeatable generation of occluded data. This resource supports consistent, reproducible evaluation of perception models under partial sensor failures and environmental interference. By releasing the first multi-sensor occlusion dataset with controlled and reproducible degradations, we aim to advance research on robust sensor fusion, resilience analysis, and safety-critical perception in automated driving.
☆ GBlobs: Local LiDAR Geometry for Improved Sensor Placement Generalization IROS'25
This technical report outlines the top-ranking solution for RoboSense 2025: Track 3, achieving state-of-the-art performance on 3D object detection under various sensor placements. Our submission utilizes GBlobs, a local point cloud feature descriptor specifically designed to enhance model generalization across diverse LiDAR configurations. Current LiDAR-based 3D detectors often suffer from a \enquote{geometric shortcut} when trained on conventional global features (\ie, absolute Cartesian coordinates). This introduces a position bias that causes models to primarily rely on absolute object position rather than distinguishing shape and appearance characteristics. Although effective for in-domain data, this shortcut severely limits generalization when encountering different point distributions, such as those resulting from varying sensor placements. By using GBlobs as network input features, we effectively circumvent this geometric shortcut, compelling the network to learn robust, object-centric representations. This approach significantly enhances the model's ability to generalize, resulting in the exceptional performance demonstrated in this challenge.
comment: 1st place at the IROS'25 RoboSense Challenge, Track #3: Cross-Sensor Placement 3D Object Detection
☆ RayPose: Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation
Typical template-based object pose pipelines estimate the pose by retrieving the closest matching template and aligning it with the observed image. However, failure to retrieve the correct template often leads to inaccurate pose predictions. To address this, we reformulate template-based object pose estimation as a ray alignment problem, where the viewing directions from multiple posed template images are learned to align with a non-posed query image. Inspired by recent progress in diffusion-based camera pose estimation, we embed this formulation into a diffusion transformer architecture that aligns a query image with a set of posed templates. We reparameterize object rotation using object-centered camera rays and model object translation by extending scale-invariant translation estimation to dense translation offsets. Our model leverages geometric priors from the templates to guide accurate query pose inference. A coarse-to-fine training strategy based on narrowed template sampling improves performance without modifying the network architecture. Extensive experiments across multiple benchmark datasets show competitive results of our method compared to state-of-the-art approaches in unseen object pose estimation.
☆ DWaste: Greener AI for Waste Sorting using Mobile and Edge Devices
The rise of convenience packaging has led to generation of enormous waste, making efficient waste sorting crucial for sustainable waste management. To address this, we developed DWaste, a computer vision-powered platform designed for real-time waste sorting on resource-constrained smartphones and edge devices, including offline functionality. We benchmarked various image classification models (EfficientNetV2S/M, ResNet50/101, MobileNet) and object detection (YOLOv8n, YOLOv11n) using a subset of our own waste data set and annotated it using the custom tool Annotated Lab. We found a clear trade-off between accuracy and resource consumption: the best classifier, EfficientNetV2S, achieved high accuracy (~ 96%) but suffered from high latency (~ 0.22s) and elevated carbon emissions. In contrast, lightweight object detection models delivered strong performance (up to 77% mAP) with ultra-fast inference (~ 0.03s) and significantly smaller model sizes (< 7MB), making them ideal for real-time, low-power use. Model quantization further maximized efficiency, substantially reducing model size and VRAM usage by up to 75%. Our work demonstrates the successful implementation of "Greener AI" models to support real-time, sustainable waste sorting on edge devices.
comment: 8 pages, 8 figures
☆ Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
comment: Accepted by The 38th Conference of Open Innovations Association FRUCT, 2025
☆ Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos
We introduce Mono4DGS-HDR, the first system for reconstructing renderable 4D high dynamic range (HDR) scenes from unposed monocular low dynamic range (LDR) videos captured with alternating exposures. To tackle such a challenging problem, we present a unified framework with two-stage optimization approach based on Gaussian Splatting. The first stage learns a video HDR Gaussian representation in orthographic camera coordinate space, eliminating the need for camera poses and enabling robust initial HDR video reconstruction. The second stage transforms video Gaussians into world space and jointly refines the world Gaussians with camera poses. Furthermore, we propose a temporal luminance regularization strategy to enhance the temporal consistency of the HDR appearance. Since our task has not been studied before, we construct a new evaluation benchmark using publicly available datasets for HDR video reconstruction. Extensive experiments demonstrate that Mono4DGS-HDR significantly outperforms alternative solutions adapted from state-of-the-art methods in both rendering quality and speed.
comment: Project page is available at https://liujf1226.github.io/Mono4DGS-HDR/
☆ Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the proposed SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
comment: Code and models available at: https://github.com/tianciB/VFM-VAE
☆ LAND: Lung and Nodule Diffusion for 3D Chest CT Synthesis with Anatomical Guidance
This work introduces a new latent diffusion model to generate high-quality 3D chest CT scans conditioned on 3D anatomical masks. The method synthesizes volumetric images of size 256x256x256 at 1 mm isotropic resolution using a single mid-range GPU, significantly lowering the computational cost compared to existing approaches. The conditioning masks delineate lung and nodule regions, enabling precise control over the output anatomical features. Experimental results demonstrate that conditioning solely on nodule masks leads to anatomically incorrect outputs, highlighting the importance of incorporating global lung structure for accurate conditional synthesis. The proposed approach supports the generation of diverse CT volumes with and without lung nodules of varying attributes, providing a valuable tool for training AI models or healthcare professionals.
☆ Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection ICCV 2025
At the core of Camouflaged Object Detection (COD) lies segmenting objects from their highly similar surroundings. Previous efforts navigate this challenge primarily through image-level modeling or annotation-based optimization. Despite advancing considerably, this commonplace practice hardly taps valuable dataset-level contextual information or relies on laborious annotations. In this paper, we propose RISE, a RetrIeval SElf-augmented paradigm that exploits the entire training dataset to generate pseudo-labels for single images, which could be used to train COD models. RISE begins by constructing prototype libraries for environments and camouflaged objects using training images (without ground truth), followed by K-Nearest Neighbor (KNN) retrieval to generate pseudo-masks for each image based on these libraries. It is important to recognize that using only training images without annotations exerts a pronounced challenge in crafting high-quality prototype libraries. In this light, we introduce a Clustering-then-Retrieval (CR) strategy, where coarse masks are first generated through clustering, facilitating subsequent histogram-based image filtering and cross-category retrieval to produce high-confidence prototypes. In the KNN retrieval stage, to alleviate the effect of artifacts in feature maps, we propose Multi-View KNN Retrieval (MVKR), which integrates retrieval results from diverse views to produce more robust and precise pseudo-masks. Extensive experiments demonstrate that RISE outperforms state-of-the-art unsupervised and prompt-based methods. Code is available at https://github.com/xiaohainku/RISE.
comment: ICCV 2025
☆ ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization
We introduce ImageGem, a dataset for studying generative models that understand fine-grained individual preferences. We posit that a key challenge hindering the development of such a generative model is the lack of in-the-wild and fine-grained user preference annotations. Our dataset features real-world interaction data from 57K users, who collectively have built 242K customized LoRAs, written 3M text prompts, and created 5M generated images. With user preference annotations from our dataset, we were able to train better preference alignment models. In addition, leveraging individual user preference, we investigated the performance of retrieval models and a vision-language model on personalized image retrieval and generative model recommendation. Finally, we propose an end-to-end framework for editing customized diffusion models in a latent weight space to align with individual user preferences. Our results demonstrate that the ImageGem dataset enables, for the first time, a new paradigm for generative model personalization.
☆ ScaleNet: Scaling up Pretrained Neural Networks with Incremental Parameters
Recent advancements in vision transformers (ViTs) have demonstrated that larger models often achieve superior performance. However, training these models remains computationally intensive and costly. To address this challenge, we introduce ScaleNet, an efficient approach for scaling ViT models. Unlike conventional training from scratch, ScaleNet facilitates rapid model expansion with negligible increases in parameters, building on existing pretrained models. This offers a cost-effective solution for scaling up ViTs. Specifically, ScaleNet achieves model expansion by inserting additional layers into pretrained ViTs, utilizing layer-wise weight sharing to maintain parameters efficiency. Each added layer shares its parameter tensor with a corresponding layer from the pretrained model. To mitigate potential performance degradation due to shared weights, ScaleNet introduces a small set of adjustment parameters for each layer. These adjustment parameters are implemented through parallel adapter modules, ensuring that each instance of the shared parameter tensor remains distinct and optimized for its specific function. Experiments on the ImageNet-1K dataset demonstrate that ScaleNet enables efficient expansion of ViT models. With a 2$\times$ depth-scaled DeiT-Base model, ScaleNet achieves a 7.42% accuracy improvement over training from scratch while requiring only one-third of the training epochs, highlighting its efficiency in scaling ViTs. Beyond image classification, our method shows significant potential for application in downstream vision areas, as evidenced by the validation in object detection task.
☆ Automated Wicket-Taking Delivery Segmentation and Weakness Detection in Cricket Videos Using OCR-Guided YOLOv8 and Trajectory Modeling
This paper presents an automated system for cricket video analysis that leverages deep learning techniques to extract wicket-taking deliveries, detect cricket balls, and model ball trajectories. The system employs the YOLOv8 architecture for pitch and ball detection, combined with optical character recognition (OCR) for scorecard extraction to identify wicket-taking moments. Through comprehensive image preprocessing, including grayscale transformation, power transformation, and morphological operations, the system achieves robust text extraction from video frames. The pitch detection model achieved 99.5% mean Average Precision at 50% IoU (mAP50) with a precision of 0.999, while the ball detection model using transfer learning attained 99.18% mAP50 with 0.968 precision and 0.978 recall. The system enables trajectory modeling on detected pitches, providing data-driven insights for identifying batting weaknesses. Experimental results on multiple cricket match videos demonstrate the effectiveness of this approach for automated cricket analytics, offering significant potential for coaching and strategic decision-making.
comment: 6 figures, 5 tables, submitted to the 11th IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering 2025
☆ Bayesian Fully-Connected Tensor Network for Hyperspectral-Multispectral Image Fusion
Tensor decomposition is a powerful tool for data analysis and has been extensively employed in the field of hyperspectral-multispectral image fusion (HMF). Existing tensor decomposition-based fusion methods typically rely on disruptive data vectorization/reshaping or impose rigid constraints on the arrangement of factor tensors, hindering the preservation of spatial-spectral structures and the modeling of cross-dimensional correlations. Although recent advances utilizing the Fully-Connected Tensor Network (FCTN) decomposition have partially alleviated these limitations, the process of reorganizing data into higher-order tensors still disrupts the intrinsic spatial-spectral structure. Furthermore, these methods necessitate extensive manual parameter tuning and exhibit limited robustness against noise and spatial degradation. To alleviate these issues, we propose the Bayesian FCTN (BFCTN) method. Within this probabilistic framework, a hierarchical sparse prior that characterizing the sparsity of physical elements, establishes connections between the factor tensors. This framework explicitly models the intrinsic physical coupling among spatial structures, spectral signatures, and local scene homogeneity. For model learning, we develop a parameter estimation method based on Variational Bayesian inference (VB) and the Expectation-Maximization (EM) algorithm, which significantly reduces the need for manual parameter tuning. Extensive experiments demonstrate that BFCTN not only achieves state-of-the-art fusion accuracy and strong robustness but also exhibits practical applicability in complex real-world scenarios.
☆ Entropy-Enhanced Conformal Features from Ricci Flow for Robust Alzheimer's Disease Classification
Background and Objective: In brain imaging, geometric surface models are essential for analyzing the 3D shapes of anatomical structures. Alzheimer's disease (AD) is associated with significant cortical atrophy, making such shape analysis a valuable diagnostic tool. The objective of this study is to introduce and validate a novel local surface representation method for the automated and accurate diagnosis of AD. Methods: The study utilizes T1-weighted MRI scans from 160 participants (80 AD patients and 80 healthy controls) from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Cortical surface models were reconstructed from the MRI data using Freesurfer. Key geometric attributes were computed from the 3D meshes. Area distortion and conformal factor were derived using Ricci flow for conformal parameterization, while Gaussian curvature was calculated directly from the mesh geometry. Shannon entropy was applied to these three features to create compact and informative feature vectors. The feature vectors were used to train and evaluate a suite of classifiers (e.g. XGBoost, MLP, Logistic Regression, etc.). Results: Statistical significance of performance differences between classifiers was evaluated using paired Welch's t-test. The method proved highly effective in distinguishing AD patients from healthy controls. The Multi-Layer Perceptron (MLP) and Logistic Regression classifiers outperformed all others, achieving an accuracy and F$_1$ Score of 98.62%. Conclusions: This study confirms that the entropy of conformally-derived geometric features provides a powerful and robust metric for cortical morphometry. The high classification accuracy underscores the method's potential to enhance the study and diagnosis of Alzheimer's disease, offering a straightforward yet powerful tool for clinical research applications.
☆ S2AP: Score-space Sharpness Minimization for Adversarial Pruning
Adversarial pruning methods have emerged as a powerful tool for compressing neural networks while preserving robustness against adversarial attacks. These methods typically follow a three-step pipeline: (i) pretrain a robust model, (ii) select a binary mask for weight pruning, and (iii) finetune the pruned model. To select the binary mask, these methods minimize a robust loss by assigning an importance score to each weight, and then keep the weights with the highest scores. However, this score-space optimization can lead to sharp local minima in the robust loss landscape and, in turn, to an unstable mask selection, reducing the robustness of adversarial pruning methods. To overcome this issue, we propose a novel plug-in method for adversarial pruning, termed Score-space Sharpness-aware Adversarial Pruning (S2AP). Through our method, we introduce the concept of score-space sharpness minimization, which operates during the mask search by perturbing importance scores and minimizing the corresponding robust loss. Extensive experiments across various datasets, models, and sparsity levels demonstrate that S2AP effectively minimizes sharpness in score space, stabilizing the mask selection, and ultimately improving the robustness of adversarial pruning methods.
☆ Cross-Modal Scene Semantic Alignment for Image Complexity Assessment
Image complexity assessment (ICA) is a challenging task in perceptual evaluation due to the subjective nature of human perception and the inherent semantic diversity in real-world images. Existing ICA methods predominantly rely on hand-crafted or shallow convolutional neural network-based features of a single visual modality, which are insufficient to fully capture the perceived representations closely related to image complexity. Recently, cross-modal scene semantic information has been shown to play a crucial role in various computer vision tasks, particularly those involving perceptual understanding. However, the exploration of cross-modal scene semantic information in the context of ICA remains unaddressed. Therefore, in this paper, we propose a novel ICA method called Cross-Modal Scene Semantic Alignment (CM-SSA), which leverages scene semantic alignment from a cross-modal perspective to enhance ICA performance, enabling complexity predictions to be more consistent with subjective human perception. Specifically, the proposed CM-SSA consists of a complexity regression branch and a scene semantic alignment branch. The complexity regression branch estimates image complexity levels under the guidance of the scene semantic alignment branch, while the scene semantic alignment branch is used to align images with corresponding text prompts that convey rich scene semantic information by pair-wise learning. Extensive experiments on several ICA datasets demonstrate that the proposed CM-SSA significantly outperforms state-of-the-art approaches. Codes are available at https://github.com/XQ2K/First-Cross-Model-ICA.
comment: 14 pages,2 figures, British Machine Vision Conference
☆ FeatureFool: Zero-Query Fooling of Video Models via Feature Map
The vulnerability of deep neural networks (DNNs) has been preliminarily verified. Existing black-box adversarial attacks usually require multi-round interaction with the model and consume numerous queries, which is impractical in the real-world and hard to scale to recently emerged Video-LLMs. Moreover, no attack in the video domain directly leverages feature maps to shift the clean-video feature space. We therefore propose FeatureFool, a stealthy, video-domain, zero-query black-box attack that utilizes information extracted from a DNN to alter the feature space of clean videos. Unlike query-based methods that rely on iterative interaction, FeatureFool performs a zero-query attack by directly exploiting DNN-extracted information. This efficient approach is unprecedented in the video domain. Experiments show that FeatureFool achieves an attack success rate above 70\% against traditional video classifiers without any queries. Benefiting from the transferability of the feature map, it can also craft harmful content and bypass Video-LLM recognition. Additionally, adversarial videos generated by FeatureFool exhibit high quality in terms of SSIM, PSNR, and Temporal-Inconsistency, making the attack barely perceptible. This paper may contain violent or explicit content.
☆ Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers
Uncertainty quantification (UQ) is essential for deploying deep neural networks in safety-critical settings. Although methods like Deep Ensembles achieve strong UQ performance, their high computational and memory costs hinder scalability to large models. We introduce Hydra Ensembles, an efficient transformer-based ensemble that prunes attention heads to create diverse members and merges them via a new multi-head attention with grouped fully-connected layers. This yields a compact model with inference speed close to a single network, matching or surpassing Deep Ensembles in UQ performance without retraining from scratch. We also provide an in-depth analysis of pruning, showing that naive approaches can harm calibration, whereas Hydra Ensembles preserves robust uncertainty. Experiments on image and text classification tasks, with various architectures, show consistent gains over Deep Ensembles. Remarkably, in zero-shot classification on ImageNet-1k, our approach surpasses state of the art methods, even without requiring additional training.
☆ Learning Human-Object Interaction as Groups
Human-Object Interaction Detection (HOI-DET) aims to localize human-object pairs and identify their interactive relationships. To aggregate contextual cues, existing methods typically propagate information across all detected entities via self-attention mechanisms, or establish message passing between humans and objects with bipartite graphs. However, they primarily focus on pairwise relationships, overlooking that interactions in real-world scenarios often emerge from collective behaviors (multiple humans and objects engaging in joint activities). In light of this, we revisit relation modeling from a group view and propose GroupHOI, a framework that propagates contextual information in terms of geometric proximity and semantic similarity. To exploit the geometric proximity, humans and objects are grouped into distinct clusters using a learnable proximity estimator based on spatial features derived from bounding boxes. In each group, a soft correspondence is computed via self-attention to aggregate and dispatch contextual cues. To incorporate the semantic similarity, we enhance the vanilla transformer-based interaction decoder with local contextual cues from HO-pair features. Extensive experiments on HICO-DET and V-COCO benchmarks demonstrate the superiority of GroupHOI over the state-of-the-art methods. It also exhibits leading performance on the more challenging Nonverbal Interaction Detection (NVI-DET) task, which involves varied forms of higher-order interactions within groups.
☆ Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback
Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both both quantitative metrics and user studies. Our source code and pre-trained models are available at https://github.com/basiclab/DiffusionDRO.
☆ AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering
Audio-Visual Question Answering (AVQA) requires models to effectively utilize both visual and auditory modalities to answer complex and diverse questions about audio-visual scenes. However, existing methods lack sufficient flexibility and dynamic adaptability in temporal sampling and modality preference awareness, making it difficult to focus on key information based on the question. This limits their reasoning capability in complex scenarios. To address these challenges, we propose a novel framework named AV-Master. It enhances the model's ability to extract key information from complex audio-visual scenes with substantial redundant content by dynamically modeling both temporal and modality dimensions. In the temporal dimension, we introduce a dynamic adaptive focus sampling mechanism that progressively focuses on audio-visual segments most relevant to the question, effectively mitigating redundancy and segment fragmentation in traditional sampling methods. In the modality dimension, we propose a preference-aware strategy that models each modality's contribution independently, enabling selective activation of critical features. Furthermore, we introduce a dual-path contrastive loss to reinforce consistency and complementarity across temporal and modality dimensions, guiding the model to learn question-specific cross-modal collaborative representations. Experiments on four large-scale benchmarks show that AV-Master significantly outperforms existing methods, especially in complex reasoning tasks.
comment: 13 pages, 9 figures
☆ GPTFace: Generative Pre-training of Facial-Linguistic Transformer by Span Masking and Weakly Correlated Text-image Data
Compared to the prosperity of pre-training models in natural image understanding, the research on large-scale pre-training models for facial knowledge learning is still limited. Current approaches mainly rely on manually assembled and annotated face datasets for training, but labeling such datasets is labor-intensive and the trained models have limited scalability beyond the training data. To address these limitations, we present a generative pre-training model for facial knowledge learning that leverages large-scale web-built data for training. We use texts and images containing human faces crawled from the internet and conduct pre-training on self-supervised tasks, including masked image/language modeling (MILM) and image-text matching (ITM). During the generation stage, we further utilize the image-text matching loss to pull the generation distribution towards the control signal for controllable image/text generation. Experimental results demonstrate that our model achieves comparable performance to state-of-the-art pre-training models for various facial downstream tasks, such as attribution classification and expression recognition. Furthermore, our approach is also applicable to a wide range of face editing tasks, including face attribute editing, expression manipulation, mask removal, and photo inpainting.
comment: This work was initially drafted in November 2022
☆ ViSE: A Systematic Approach to Vision-Only Street-View Extrapolation
Realistic view extrapolation is critical for closed-loop simulation in autonomous driving, yet it remains a significant challenge for current Novel View Synthesis (NVS) methods, which often produce distorted and inconsistent images beyond the original trajectory. This report presents our winning solution which ctook first place in the RealADSim Workshop NVS track at ICCV 2025. To address the core challenges of street view extrapolation, we introduce a comprehensive four-stage pipeline. First, we employ a data-driven initialization strategy to generate a robust pseudo-LiDAR point cloud, avoiding local minima. Second, we inject strong geometric priors by modeling the road surface with a novel dimension-reduced SDF termed 2D-SDF. Third, we leverage a generative prior to create pseudo ground truth for extrapolated viewpoints, providing auxilary supervision. Finally, a data-driven adaptation network removes time-specific artifacts. On the RealADSim-NVS benchmark, our method achieves a final score of 0.441, ranking first among all participants.
☆ Enhancing Few-Shot Classification of Benchmark and Disaster Imagery with ATTBHFA-Net
The increasing frequency of natural and human-induced disasters necessitates advanced visual recognition techniques capable of analyzing critical photographic data. With progress in artificial intelligence and resilient computational systems, rapid and accurate disaster classification has become crucial for efficient rescue operations. However, visual recognition in disaster contexts faces significant challenges due to limited and diverse data from the difficulties in collecting and curating comprehensive, high-quality disaster imagery. Few-Shot Learning (FSL) provides a promising approach to data scarcity, yet current FSL research mainly relies on generic benchmark datasets lacking remote-sensing disaster imagery, limiting its practical effectiveness. Moreover, disaster images exhibit high intra-class variation and inter-class similarity, hindering the performance of conventional metric-based FSL methods. To address these issues, this paper introduces the Attention-based Bhattacharyya-Hellinger Feature Aggregation Network (ATTBHFA-Net), which linearly combines the Bhattacharyya coefficient and Hellinger distances to compare and aggregate feature probability distributions for robust prototype formation. The Bhattacharyya coefficient serves as a contrastive margin that enhances inter-class separability, while the Hellinger distance regularizes same-class alignment. This framework parallels contrastive learning but operates over probability distributions rather than embedded feature points. Furthermore, a Bhattacharyya-Hellinger distance-based contrastive loss is proposed as a distributional counterpart to cosine similarity loss, used jointly with categorical cross-entropy to significantly improve FSL performance. Experiments on four FSL benchmarks and two disaster image datasets demonstrate the superior effectiveness and generalization of ATTBHFA-Net compared to existing approaches.
comment: Submitted to a SN journal
☆ Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding
Large Vision-Language Models (LVLMs) have recently achieved impressive results in multimodal tasks such as image captioning and visual question answering. However, they remain prone to object hallucination -- generating descriptions of nonexistent or misidentified objects. Prior work has partially mitigated this via auxiliary training objectives or external modules, but challenges remain in terms of scalability, adaptability, and model independence. To address these limitations, we propose Adaptive Token Ensemble Decoding (ATED), a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference. ATED dynamically computes uncertainty-based weights for each model, reflecting their reliability at each decoding step. It also integrates diverse decoding paths to improve contextual grounding and semantic consistency. Experiments on standard hallucination detection benchmarks demonstrate that ATED significantly outperforms state-of-the-art methods, reducing hallucination without compromising fluency or relevance. Our findings highlight the benefits of adaptive ensembling and point to a promising direction for improving LVLM robustness in high-stakes applications. The code is available at https://github.com/jinlin2021/ATED.
☆ OmniNWM: Omniscient Driving Navigation World Models
Autonomous driving world models are expected to work effectively across three core dimensions: state, action, and reward. Existing models, however, are typically restricted to limited state modalities, short video sequences, imprecise action control, and a lack of reward awareness. In this paper, we introduce OmniNWM, an omniscient panoramic navigation world model that addresses all three dimensions within a unified framework. For state, OmniNWM jointly generates panoramic videos of RGB, semantics, metric depth, and 3D occupancy. A flexible forcing strategy enables high-quality long-horizon auto-regressive generation. For action, we introduce a normalized panoramic Plucker ray-map representation that encodes input trajectories into pixel-level signals, enabling highly precise and generalizable control over panoramic video generation. Regarding reward, we move beyond learning reward functions with external image-based models: instead, we leverage the generated 3D occupancy to directly define rule-based dense rewards for driving compliance and safety. Extensive experiments demonstrate that OmniNWM achieves state-of-the-art performance in video generation, control accuracy, and long-horizon stability, while providing a reliable closed-loop evaluation framework through occupancy-grounded rewards. Project page is available at https://github.com/Arlo0o/OmniNWM.
comment: https://arlo0o.github.io/OmniNWM/
☆ The Impact of Image Resolution on Biomedical Multimodal Large Language Models
Imaging technologies are fundamental to biomedical research and modern medicine, requiring analysis of high-resolution images across various modalities. While multimodal large language models (MLLMs) show promise for biomedical image analysis, most are designed for low-resolution images from general-purpose datasets, risking critical information loss. We investigate how image resolution affects MLLM performance in biomedical applications and demonstrate that: (1) native-resolution training and inference significantly improve performance across multiple tasks, (2) misalignment between training and inference resolutions severely degrades performance, and (3) mixed-resolution training effectively mitigates misalignment and balances computational constraints with performance requirements. Based on these findings, we recommend prioritizing native-resolution inference and mixed-resolution datasets to optimize biomedical MLLMs for transformative impact in scientific research and clinical applications.
comment: Proceedings of the 10th Machine Learning for Healthcare Conference, PMLR 298, 2025
☆ Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models
Incentivizing the reasoning ability of Multimodal Large Language Models (MLLMs) is essential for medical applications to transparently analyze medical scans and provide reliable diagnosis. However, existing medical MLLMs rely solely on internal knowledge during reasoning, leading to hallucinated reasoning and factual inaccuracies when encountering cases beyond their training scope. Although recent Agentic Retrieval-Augmented Generation (RAG) methods elicit the medical model's proactive retrieval ability during reasoning, they are confined to unimodal LLMs, neglecting the crucial visual information during reasoning and retrieval. Consequently, we propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR, which actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning. Specifically, we design a two-stage reinforcement learning strategy with tailored rewards that stimulate the model to leverage both visual diagnostic findings and textual clinical information for effective retrieval. Building on this foundation, we further propose a Confidence-Driven Image Re-retrieval (CDIR) method for test-time scaling when low prediction confidence is detected. Evaluation on various public medical benchmarks demonstrates Med-RwR's significant improvements over baseline models, proving the effectiveness of enhancing reasoning capabilities with external knowledge integration. Furthermore, Med-RwR demonstrates remarkable generalizability to unfamiliar domains, evidenced by 8.8% performance gain on our proposed EchoCardiography Benchmark (ECBench), despite the scarcity of echocardiography data in the training corpus. Our data, model, and codes will be made publicly available at https://github.com/xmed-lab/Med-RwR.
comment: Work in progress
☆ GeoDiff: Geometry-Guided Diffusion for Metric Depth Estimation ICCV
We introduce a novel framework for metric depth estimation that enhances pretrained diffusion-based monocular depth estimation (DB-MDE) models with stereo vision guidance. While existing DB-MDE methods excel at predicting relative depth, estimating absolute metric depth remains challenging due to scale ambiguities in single-image scenarios. To address this, we reframe depth estimation as an inverse problem, leveraging pretrained latent diffusion models (LDMs) conditioned on RGB images, combined with stereo-based geometric constraints, to learn scale and shift for accurate depth recovery. Our training-free solution seamlessly integrates into existing DB-MDE frameworks and generalizes across indoor, outdoor, and complex environments. Extensive experiments demonstrate that our approach matches or surpasses state-of-the-art methods, particularly in challenging scenarios involving translucent and specular surfaces, all without requiring retraining.
comment: Accepted to ICCV Findings 2025. The first two authors contributed equally. The last two authors share co-corresponding authorship
☆ Efficient Few-shot Identity Preserving Attribute Editing for 3D-aware Deep Generative Models
Identity preserving editing of faces is a generative task that enables modifying the illumination, adding/removing eyeglasses, face aging, editing hairstyles, modifying expression etc., while preserving the identity of the face. Recent progress in 2D generative models have enabled photorealistic editing of faces using simple techniques leveraging the compositionality in GANs. However, identity preserving editing for 3D faces with a given set of attributes is a challenging task as the generative model must reason about view consistency from multiple poses and render a realistic 3D face. Further, 3D portrait editing requires large-scale attribute labelled datasets and presents a trade-off between editability in low-resolution and inflexibility to editing in high resolution. In this work, we aim to alleviate some of the constraints in editing 3D faces by identifying latent space directions that correspond to photorealistic edits. To address this, we present a method that builds on recent advancements in 3D-aware deep generative models and 2D portrait editing techniques to perform efficient few-shot identity preserving attribute editing for 3D-aware generative models. We aim to show from experimental results that using just ten or fewer labelled images of an attribute is sufficient to estimate edit directions in the latent space that correspond to 3D-aware attribute editing. In this work, we leverage an existing face dataset with masks to obtain the synthetic images for few attribute examples required for estimating the edit directions. Further, to demonstrate the linearity of edits, we investigate one-shot stylization by performing sequential editing and use the (2D) Attribute Style Manipulation (ASM) technique to investigate a continuous style manifold for 3D consistent identity preserving face aging. Code and results are available at: https://vishal-vinod.github.io/gmpi-edit/
comment: 14 pages, 7 figures
☆ StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
☆ TreeFedDG: Alleviating Global Drift in Federated Domain Generalization for Medical Image Segmentation
In medical image segmentation tasks, Domain Generalization (DG) under the Federated Learning (FL) framework is crucial for addressing challenges related to privacy protection and data heterogeneity. However, traditional federated learning methods fail to account for the imbalance in information aggregation across clients in cross-domain scenarios, leading to the Global Drift (GD) problem and a consequent decline in model generalization performance. This motivates us to delve deeper and define a new critical issue: global drift in federated domain generalization for medical imaging (FedDG-GD). In this paper, we propose a novel tree topology framework called TreeFedDG. First, starting from the distributed characteristics of medical images, we design a hierarchical parameter aggregation method based on a tree-structured topology to suppress deviations in the global model direction. Second, we introduce a parameter difference-based style mixing method (FedStyle), which enforces mixing among clients with maximum parameter differences to enhance robustness against drift. Third, we develop a a progressive personalized fusion strategy during model distribution, ensuring a balance between knowledge transfer and personalized features. Finally, during the inference phase, we use feature similarity to guide the retrieval of the most relevant model chain from the tree structure for ensemble decision-making, thereby fully leveraging the advantages of hierarchical knowledge. We conducted extensive experiments on two publicly available datasets. The results demonstrate that our method outperforms other state-of-the-art domain generalization approaches in these challenging tasks and achieves better balance in cross-domain performance.
☆ Latent-Info and Low-Dimensional Learning for Human Mesh Recovery and Parallel Optimization ICME2025
Existing 3D human mesh recovery methods often fail to fully exploit the latent information (e.g., human motion, shape alignment), leading to issues with limb misalignment and insufficient local details in the reconstructed human mesh (especially in complex scenes). Furthermore, the performance improvement gained by modelling mesh vertices and pose node interactions using attention mechanisms comes at a high computational cost. To address these issues, we propose a two-stage network for human mesh recovery based on latent information and low dimensional learning. Specifically, the first stage of the network fully excavates global (e.g., the overall shape alignment) and local (e.g., textures, detail) information from the low and high-frequency components of image features and aggregates this information into a hybrid latent frequency domain feature. This strategy effectively extracts latent information. Subsequently, utilizing extracted hybrid latent frequency domain features collaborates to enhance 2D poses to 3D learning. In the second stage, with the assistance of hybrid latent features, we model the interaction learning between the rough 3D human mesh template and the 3D pose, optimizing the pose and shape of the human mesh. Unlike existing mesh pose interaction methods, we design a low-dimensional mesh pose interaction method through dimensionality reduction and parallel optimization that significantly reduces computational costs without sacrificing reconstruction accuracy. Extensive experimental results on large publicly available datasets indicate superiority compared to the most state-of-the-art.
comment: Accepted by ICME2025
☆ From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation
Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GPRO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient. (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model's temporal dynamics by prioritizing prompt-following in the early, identity preservation in the later. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
☆ UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding
Large vision-language models (VLMs) have achieved remarkable success in natural scene understanding, yet their application to underwater environments remains largely unexplored. Underwater imagery presents unique challenges including severe light attenuation, color distortion, and suspended particle scattering, while requiring specialized knowledge of marine ecosystems and organism taxonomy. To bridge this gap, we introduce UWBench, a comprehensive benchmark specifically designed for underwater vision-language understanding. UWBench comprises 15,003 high-resolution underwater images captured across diverse aquatic environments, encompassing oceans, coral reefs, and deep-sea habitats. Each image is enriched with human-verified annotations including 15,281 object referring expressions that precisely describe marine organisms and underwater structures, and 124,983 question-answer pairs covering diverse reasoning capabilities from object recognition to ecological relationship understanding. The dataset captures rich variations in visibility, lighting conditions, and water turbidity, providing a realistic testbed for model evaluation. Based on UWBench, we establish three comprehensive benchmarks: detailed image captioning for generating ecologically informed scene descriptions, visual grounding for precise localization of marine organisms, and visual question answering for multimodal reasoning about underwater environments. Extensive experiments on state-of-the-art VLMs demonstrate that underwater understanding remains challenging, with substantial room for improvement. Our benchmark provides essential resources for advancing vision-language research in underwater contexts and supporting applications in marine science, ecological monitoring, and autonomous underwater exploration. Our code and benchmark will be available.
comment: We have released V1, which only reports the test results. Our work is still ongoing, and the next version will be coming soon
☆ Hyperbolic Space Learning Method Leveraging Temporal Motion Priors for Human Mesh Recovery ICME2025
3D human meshes show a natural hierarchical structure (like torso-limbs-fingers). But existing video-based 3D human mesh recovery methods usually learn mesh features in Euclidean space. It's hard to catch this hierarchical structure accurately. So wrong human meshes are reconstructed. To solve this problem, we propose a hyperbolic space learning method leveraging temporal motion prior for recovering 3D human meshes from videos. First, we design a temporal motion prior extraction module. This module extracts the temporal motion features from the input 3D pose sequences and image feature sequences respectively. Then it combines them into the temporal motion prior. In this way, it can strengthen the ability to express features in the temporal motion dimension. Since data representation in non-Euclidean space has been proved to effectively capture hierarchical relationships in real-world datasets (especially in hyperbolic space), we further design a hyperbolic space optimization learning strategy. This strategy uses the temporal motion prior information to assist learning, and uses 3D pose and pose motion information respectively in the hyperbolic space to optimize and learn the mesh features. Then, we combine the optimized results to get an accurate and smooth human mesh. Besides, to make the optimization learning process of human meshes in hyperbolic space stable and effective, we propose a hyperbolic mesh optimization loss. Extensive experimental results on large publicly available datasets indicate superiority in comparison with most state-of-the-art.
comment: Accepted by ICME2025
☆ OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion
Understanding 3D scenes is pivotal for autonomous driving, robotics, and augmented reality. Recent semantic Gaussian Splatting approaches leverage large-scale 2D vision models to project 2D semantic features onto 3D scenes. However, they suffer from two major limitations: (1) insufficient contextual cues for individual masks during preprocessing and (2) inconsistencies and missing details when fusing multi-view features from these 2D models. In this paper, we introduce \textbf{OpenInsGaussian}, an \textbf{Open}-vocabulary \textbf{Ins}tance \textbf{Gaussian} segmentation framework with Context-aware Cross-view Fusion. Our method consists of two modules: Context-Aware Feature Extraction, which augments each mask with rich semantic context, and Attention-Driven Feature Aggregation, which selectively fuses multi-view features to mitigate alignment errors and incompleteness. Through extensive experiments on benchmark datasets, OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin. These findings underscore the robustness and generality of our proposed approach, marking a significant step forward in 3D scene understanding and its practical deployment across diverse real-world scenarios.
☆ BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining
Zero-shot 3D object classification is crucial for real-world applications like autonomous driving, however it is often hindered by a significant domain gap between the synthetic data used for training and the sparse, noisy LiDAR scans encountered in the real-world. Current methods trained solely on synthetic data fail to generalize to outdoor scenes, while those trained only on real data lack the semantic diversity to recognize rare or unseen objects. We introduce BlendCLIP, a multimodal pretraining framework that bridges this synthetic-to-real gap by strategically combining the strengths of both domains. We first propose a pipeline to generate a large-scale dataset of object-level triplets -- consisting of a point cloud, image, and text description -- mined directly from real-world driving data and human annotated 3D boxes. Our core contribution is a curriculum-based data mixing strategy that first grounds the model in the semantically rich synthetic CAD data before progressively adapting it to the specific characteristics of real-world scans. Our experiments show that our approach is highly label-efficient: introducing as few as 1.5\% real-world samples per batch into training boosts zero-shot accuracy on the nuScenes benchmark by 27\%. Consequently, our final model achieves state-of-the-art performance on challenging outdoor datasets like nuScenes and TruckScenes, improving over the best prior method by 19.3\% on nuScenes, while maintaining strong generalization on diverse synthetic benchmarks. Our findings demonstrate that effective domain adaptation, not full-scale real-world annotation, is the key to unlocking robust open-vocabulary 3D perception. Our code and dataset will be released upon acceptance on https://github.com/kesu1/BlendCLIP.
comment: Under Review
☆ DeepSeek-OCR: Contexts Optical Compression
We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.
☆ Beyond Frequency: Scoring-Driven Debiasing for Object Detection via Blueprint-Prompted Image Synthesis
This paper presents a generation-based debiasing framework for object detection. Prior debiasing methods are often limited by the representation diversity of samples, while naive generative augmentation often preserves the biases it aims to solve. Moreover, our analysis reveals that simply generating more data for rare classes is suboptimal due to two core issues: i) instance frequency is an incomplete proxy for the true data needs of a model, and ii) current layout-to-image synthesis lacks the fidelity and control to generate high-quality, complex scenes. To overcome this, we introduce the representation score (RS) to diagnose representational gaps beyond mere frequency, guiding the creation of new, unbiased layouts. To ensure high-quality synthesis, we replace ambiguous text prompts with a precise visual blueprint and employ a generative alignment strategy, which fosters communication between the detector and generator. Our method significantly narrows the performance gap for underrepresented object groups, \eg, improving large/rare instances by 4.4/3.6 mAP over the baseline, and surpassing prior L2I synthesis models by 15.9 mAP for layout accuracy in generated images.
☆ DualHash: A Stochastic Primal-Dual Algorithm with Theoretical Guarantee for Deep Hashing
Deep hashing converts high-dimensional feature vectors into compact binary codes, enabling efficient large-scale retrieval. A fundamental challenge in deep hashing stems from the discrete nature of quantization in generating the codes. W-type regularizations, such as $||z|-1|$, have been proven effective as they encourage variables toward binary values. However, existing methods often directly optimize these regularizations without convergence guarantees. While proximal gradient methods offer a promising solution, the coupling between W-type regularizers and neural network outputs results in composite forms that generally lack closed-form proximal solutions. In this paper, we present a stochastic primal-dual hashing algorithm, referred to as DualHash, that provides rigorous complexity bounds. Using Fenchel duality, we partially transform the nonconvex W-type regularization optimization into the dual space, which results in a proximal operator that admits closed-form solutions. We derive two algorithm instances: a momentum-accelerated version with $\mathcal{O}(\varepsilon^{-4})$ complexity and an improved $\mathcal{O}(\varepsilon^{-3})$ version using variance reduction. Experiments on three image retrieval databases demonstrate the superior performance of DualHash.
☆ VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
comment: 10 pages, 5 figures, 4 tables. Under review
☆ EMA-SAM: Exponential Moving-average for SAM-based PTMC Segmentation
Papillary thyroid microcarcinoma (PTMC) is increasingly managed with radio-frequency ablation (RFA), yet accurate lesion segmentation in ultrasound videos remains difficult due to low contrast, probe-induced motion, and heat-related artifacts. The recent Segment Anything Model 2 (SAM-2) generalizes well to static images, but its frame-independent design yields unstable predictions and temporal drift in interventional ultrasound. We introduce \textbf{EMA-SAM}, a lightweight extension of SAM-2 that incorporates a confidence-weighted exponential moving average pointer into the memory bank, providing a stable latent prototype of the tumour across frames. This design preserves temporal coherence through probe pressure and bubble occlusion while rapidly adapting once clear evidence reappears. On our curated PTMC-RFA dataset (124 minutes, 13 patients), EMA-SAM improves \emph{maxDice} from 0.82 (SAM-2) to 0.86 and \emph{maxIoU} from 0.72 to 0.76, while reducing false positives by 29\%. On external benchmarks, including VTUS and colonoscopy video polyp datasets, EMA-SAM achieves consistent gains of 2--5 Dice points over SAM-2. Importantly, the EMA pointer adds \textless0.1\% FLOPs, preserving real-time throughput of $\sim$30\,FPS on a single A100 GPU. These results establish EMA-SAM as a robust and efficient framework for stable tumour tracking, bridging the gap between foundation models and the stringent demands of interventional ultrasound. Codes are available here \hyperref[code {https://github.com/mdialameh/EMA-SAM}.
♻ ☆ Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.
comment: homepage: https://open-bee.github.io/
♻ ☆ Fourier Transform Multiple Instance Learning for Whole Slide Image Classification
Whole Slide Image (WSI) classification relies on Multiple Instance Learning (MIL) with spatial patch features, yet existing methods struggle to capture global dependencies due to the immense size of WSIs and the local nature of patch embeddings. This limitation hinders the modeling of coarse structures essential for robust diagnostic prediction. We propose Fourier Transform Multiple Instance Learning (FFT-MIL), a framework that augments MIL with a frequency-domain branch to provide compact global context. Low-frequency crops are extracted from WSIs via the Fast Fourier Transform and processed through a modular FFT-Block composed of convolutional layers and Min-Max normalization to mitigate the high variance of frequency data. The learned global frequency feature is fused with spatial patch features through lightweight integration strategies, enabling compatibility with diverse MIL architectures. FFT-MIL was evaluated across six state-of-the-art MIL methods on three public datasets (BRACS, LUAD, and IMP). Integration of the FFT-Block improved macro F1 scores by an average of 3.51% and AUC by 1.51%, demonstrating consistent gains across architectures and datasets. These results establish frequency-domain learning as an effective and efficient mechanism for capturing global dependencies in WSI classification, complementing spatial features and advancing the scalability and accuracy of MIL-based computational pathology.
♻ ☆ UniVideo: Unified Understanding, Generation, and Editing for Videos
Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
comment: Project Website https://congwei1230.github.io/UniVideo/
♻ ☆ Glyph: Scaling Context Windows via Visual-Text Compression
Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective-visual context scaling-to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
♻ ☆ H3DE-Net: Efficient and Accurate 3D Landmark Detection in Medical Imaging
3D landmark detection is a critical task in medical image analysis, and accurately detecting anatomical landmarks is essential for subsequent medical imaging tasks. However, mainstream deep learning methods in this field struggle to simultaneously capture fine-grained local features and model global spatial relationships, while maintaining a balance between accuracy and computational efficiency. Local feature extraction requires capturing fine-grained anatomical details, while global modeling requires understanding the spatial relationships within complex anatomical structures. The high-dimensional nature of 3D volume further exacerbates these challenges, as landmarks are sparsely distributed, leading to significant computational costs. Therefore, achieving efficient and precise 3D landmark detection remains a pressing challenge in medical image analysis. In this work, We propose a \textbf{H}ybrid \textbf{3}D \textbf{DE}tection \textbf{Net}(H3DE-Net), a novel framework that combines CNNs for local feature extraction with a lightweight attention mechanism designed to efficiently capture global dependencies in 3D volumetric data. This mechanism employs a hierarchical routing strategy to reduce computational cost while maintaining global context modeling. To our knowledge, H3DE-Net is the first 3D landmark detection model that integrates such a lightweight attention mechanism with CNNs. Additionally, integrating multi-scale feature fusion further enhances detection accuracy and robustness. Experimental results on a public CT dataset demonstrate that H3DE-Net achieves state-of-the-art(SOTA) performance, significantly improving accuracy and robustness, particularly in scenarios with missing landmarks or complex anatomical variations. We aready open-source our project, including code, data and model weights.
♻ ☆ Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
With video exploding across social media, surveillance, and education, compressing long footage into concise yet faithful surrogates is crucial. Supervised methods learn frame/shot importance from dense labels and excel in-domain, but are costly and brittle across datasets; unsupervised methods avoid labels but often miss high-level semantics and narrative cues. Recent zero-shot pipelines use LLMs for training-free summarization, yet remain sensitive to handcrafted prompts and dataset-specific normalization.We propose a rubric-guided, pseudo-labeled prompting framework. A small subset of human annotations is converted into high-confidence pseudo labels and aggregated into structured, dataset-adaptive scoring rubrics for interpretable scene evaluation. At inference, boundary scenes (first/last) are scored from their own descriptions, while intermediate scenes include brief summaries of adjacent segments to assess progression and redundancy, enabling the LLM to balance local salience with global coherence without parameter tuning.Across three benchmarks, our method is consistently effective. On SumMe and TVSum it achieves F1 of 57.58 and 63.05, surpassing a zero-shot baseline (56.73, 62.21) by +0.85 and +0.84 and approaching supervised performance. On the query-focused QFVS benchmark it attains 53.79 F1, beating 53.42 by +0.37 and remaining stable across validation videos. These results show that rubric-guided pseudo labeling, coupled with contextual prompting, stabilizes LLM-based scoring and yields a general, interpretable zero-shot paradigm for both generic and query-focused video summarization.
♻ ☆ Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning NeurIPS 2025
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they are two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning teaches the MLLM with the foundational ability to generate genuine CoT for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: https://janus-pro-r1.github.io.
comment: Accepted by NeurIPS 2025
♻ ☆ VideoVerse: How Far is Your T2V Generator from a World Model?
The recent rapid advancement of Text-to-Video (T2V) generation technologies, which are critical to build ``world models'', makes the existing benchmarks increasingly insufficient to evaluate state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, which are essential capabilities for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that focuses on evaluating whether a T2V model could understand complex temporal causality and world knowledge in the real world. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design a suite of binary evaluation questions from the perspective of dynamic and static properties, with a total of ten carefully defined evaluation dimensions. In total, our VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. Consequently, a human preference aligned QA-based evaluation pipeline is developed by using modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing in-depth analysis on how far the current T2V generators are from world models.
comment: 24 Pages, 8 Figures, 11 Tables
♻ ☆ Interpretable Decision-Making for End-to-End Autonomous Driving ICCV 2025
Trustworthy AI is mandatory for the broad deployment of autonomous vehicles. Although end-to-end approaches derive control commands directly from raw data, interpreting these decisions remains challenging, especially in complex urban scenarios. This is mainly attributed to very deep neural networks with non-linear decision boundaries, making it challenging to grasp the logic behind AI-driven decisions. This paper presents a method to enhance interpretability while optimizing control commands in autonomous driving. To address this, we propose loss functions that promote the interpretability of our model by generating sparse and localized feature maps. The feature activations allow us to explain which image regions contribute to the predicted control command. We conduct comprehensive ablation studies on the feature extraction step and validate our method on the CARLA benchmarks. We also demonstrate that our approach improves interpretability, which correlates with reducing infractions, yielding a safer, high-performance driving model. Notably, our monocular, non-ensemble model surpasses the top-performing approaches from the CARLA Leaderboard by achieving lower infraction scores and the highest route completion rate, all while ensuring interpretability.
comment: Accepted to the ICCV 2025 2nd Workshop on the Challenge Of Out-of-Label Hazards in Autonomous Driving (2COOOL)
♻ ☆ Learning to See and Act: Task-Aware View Planning for Robotic Manipulation
Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-Aware View Planning (TAVP), a framework designed to overcome these challenges by integrating active view planning with task-specific representation learning. TAVP employs an efficient exploration policy, accelerated by a novel pseudo-environment, to actively acquire informative views. Furthermore, we introduce a Mixture-of-Experts (MoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization. By learning to see the world in a task-aware way, TAVP generates more complete and discriminative visual representations, demonstrating significantly enhanced action prediction across a wide array of manipulation challenges. Extensive experiments on RLBench tasks show that our proposed TAVP model achieves superior performance over state-of-the-art fixed-view approaches. Visual results and code are provided at: https://hcplab-sysu.github.io/TAVP.
comment: 14 pages, 8 figures, project page: https://hcplab-sysu.github.io/TAVP
♻ ☆ SimCortex: Collision-free Simultaneous Cortical Surfaces Reconstruction
Accurate cortical surface reconstruction from magnetic resonance imaging (MRI) data is crucial for reliable neuroanatomical analyses. Current methods have to contend with complex cortical geometries, strict topological requirements, and often produce surfaces with overlaps, self-intersections, and topological defects. To overcome these shortcomings, we introduce SimCortex, a deep learning framework that simultaneously reconstructs all brain surfaces (left/right white-matter and pial) from T1-weighted(T1w) MRI volumes while preserving topological properties. Our method first segments the T1w image into a nine-class tissue label map. From these segmentations, we generate subject-specific, collision-free initial surface meshes. These surfaces serve as precise initializations for subsequent multiscale diffeomorphic deformations. Employing stationary velocity fields (SVFs) integrated via scaling-and-squaring, our approach ensures smooth, topology-preserving transformations with significantly reduced surface collisions and self-intersections. Evaluations on standard datasets demonstrate that SimCortex dramatically reduces surface overlaps and self-intersections, surpassing current methods while maintaining state-of-the-art geometric accuracy.
comment: Metadata update: added journal reference and DOI linking to the published chapter (Springer)
♻ ☆ Increasing the Utility of Synthetic Images through Chamfer Guidance NeurIPS 2025
Conditional image generative models hold considerable promise to produce infinite amounts of synthetic training data. Yet, recent progress in generation quality has come at the expense of generation diversity, limiting the utility of these models as a source of synthetic training data. Although guidance-based approaches have been introduced to improve the utility of generated data by focusing on quality or diversity, the (implicit or explicit) utility functions oftentimes disregard the potential distribution shift between synthetic and real data. In this work, we introduce Chamfer Guidance: a training-free guidance approach which leverages a handful of real exemplar images to characterize the quality and diversity of synthetic data. We show that by leveraging the proposed Chamfer Guidance, we can boost the diversity of the generations w.r.t. a dataset of real images while maintaining or improving the generation quality on ImageNet-1k and standard geo-diversity benchmarks. Our approach achieves state-of-the-art few-shot performance with as little as 2 exemplar real images, obtaining 96.4% in terms of precision, and 86.4% in terms of distributional coverage, which increase to 97.5% and 92.7%, respectively, when using 32 real images. We showcase the benefits of the Chamfer Guidance generation by training downstream image classifiers on synthetic data, achieving accuracy boost of up to 15% for in-distribution over the baselines, and up to 16% in out-of-distribution. Furthermore, our approach does not require using the unconditional model, and thus obtains a 31% reduction in FLOPs w.r.t. classifier-free-guidance-based approaches at sampling time.
comment: Accepted to NeurIPS 2025
♻ ☆ RODS: Robust Optimization Inspired Diffusion Sampling for Detecting and Reducing Hallucination in Generative Models
Diffusion models have achieved state-of-the-art performance in generative modeling, yet their sampling procedures remain vulnerable to hallucinations-often stemming from inaccuracies in score approximation. In this work, we reinterpret diffusion sampling through the lens of optimization and introduce RODS (Robust Optimization-inspired Diffusion Sampler), a novel method that detects and corrects high-risk sampling steps using geometric cues from the loss landscape. RODS enforces smoother sampling trajectories and adaptively adjusts perturbations, reducing hallucinations without retraining and at minimal additional inference cost. Experiments on AFHQv2, FFHQ, and 11k-hands demonstrate that RODS maintains comparable image quality and preserves generation diversity. More importantly, it improves both sampling fidelity and robustness, detecting over 70% of hallucinated samples and correcting more than 25%, all while avoiding the introduction of new artifacts. We release our code at https://github.com/Yiqi-Verna-Tian/RODS.
♻ ☆ Every Camera Effect, Every Time, All at Once: 4D Gaussian Ray Tracing for Physics-based Camera Effect Data Generation NeurIPS 2025
Common computer vision systems typically assume ideal pinhole cameras but fail when facing real-world camera effects such as fisheye distortion and rolling shutter, mainly due to the lack of learning from training data with camera effects. Existing data generation approaches suffer from either high costs, sim-to-real gaps or fail to accurately model camera effects. To address this bottleneck, we propose 4D Gaussian Ray Tracing (4D-GRT), a novel two-stage pipeline that combines 4D Gaussian Splatting with physically-based ray tracing for camera effect simulation. Given multi-view videos, 4D-GRT first reconstructs dynamic scenes, then applies ray tracing to generate videos with controllable, physically accurate camera effects. 4D-GRT achieves the fastest rendering speed while performing better or comparable rendering quality compared to existing baselines. Additionally, we construct eight synthetic dynamic scenes in indoor environments across four camera effects as a benchmark to evaluate generated videos with camera effects.
comment: Paper accepted to NeurIPS 2025 Workshop SpaVLE. Project page: https://shigon255.github.io/4DGRT-project-page/
♻ ☆ VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baselines method across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.
comment: Project page: https://faceong.github.io/VIKI-R/
♻ ☆ Neural 3D Object Reconstruction with Small-Scale Unmanned Aerial Vehicles
Small Unmanned Aerial Vehicles (UAVs) exhibit immense potential for navigating indoor and hard-to-reach areas, yet their significant constraints in payload and autonomy have largely prevented their use for complex tasks like high-quality 3-Dimensional (3D) reconstruction. To overcome this challenge, we introduce a novel system architecture that enables fully autonomous, high-fidelity 3D scanning of static objects using UAVs weighing under 100 grams. Our core innovation lies in a dual-reconstruction pipeline that creates a real-time feedback loop between data capture and flight control. A near-real-time (near-RT) process uses Structure from Motion (SfM) to generate an instantaneous pointcloud of the object. The system analyzes the model quality on the fly and dynamically adapts the UAV's trajectory to intelligently capture new images of poorly covered areas. This ensures comprehensive data acquisition. For the final, detailed output, a non-real-time (non-RT) pipeline employs a Neural Radiance Fields (NeRF)-based Neural 3D Reconstruction (N3DR) approach, fusing SfM-derived camera poses with precise Ultra Wide-Band (UWB) location data to achieve superior accuracy. We implemented and validated this architecture using Crazyflie 2.1 UAVs. Our experiments, conducted in both single- and multi-UAV configurations, conclusively show that dynamic trajectory adaptation consistently improves reconstruction quality over static flight paths. This work demonstrates a scalable and autonomous solution that unlocks the potential of miniaturized UAVs for fine-grained 3D reconstruction in constrained environments, a capability previously limited to much larger platforms.
comment: 13 pages, 16 figures, 3 tables, 45 references
♻ ☆ Adapting Medical Vision Foundation Models for Volumetric Medical Image Segmentation via Active Learning and Selective Semi-supervised Fine-tuning
Medical Vision Foundation Models (Med-VFMs) have superior capabilities of interpreting medical images due to the knowledge learned from self-supervised pre-training with extensive unannotated images. To improve their performance on adaptive downstream evaluations, especially segmentation, a few samples from target domains are selected randomly for fine-tuning them. However, there lacks works to explore the way of adapting Med-VFMs to achieve the optimal performance on target domains efficiently. Thus, it is highly demanded to design an efficient way of fine-tuning Med-VFMs by selecting informative samples to maximize their adaptation performance on target domains. To achieve this, we propose an Active Source-Free Domain Adaptation (ASFDA) method to efficiently adapt Med-VFMs to target domains for volumetric medical image segmentation. This ASFDA employs a novel Active Learning (AL) method to select the most informative samples from target domains for fine-tuning Med-VFMs without the access to source pre-training samples, thus maximizing their performance with the minimal selection budget. In this AL method, we design an Active Test Time Sample Query strategy to select samples from the target domains via two query metrics, including Diversified Knowledge Divergence (DKD) and Anatomical Segmentation Difficulty (ASD). DKD is designed to measure the source-target knowledge gap and intra-domain diversity. It utilizes the knowledge of pre-training to guide the querying of source-dissimilar and semantic-diverse samples from the target domains. ASD is designed to evaluate the difficulty in segmentation of anatomical structures by measuring predictive entropy from foreground regions adaptively. Additionally, our ASFDA method employs a Selective Semi-supervised Fine-tuning to improve the performance and efficiency of fine-tuning by identifying samples with high reliability from unqueried ones.
comment: 17 pages, 5 figures, 8 tables
♻ ☆ Deep Learning in Palmprint Recognition-A Comprehensive Survey
Palmprint recognition has emerged as a prominent biometric technology, widely applied in diverse scenarios. Traditional handcrafted methods for palmprint recognition often fall short in representation capability, as they heavily depend on researchers' prior knowledge. Deep learning (DL) has been introduced to address this limitation, leveraging its remarkable successes across various domains. While existing surveys focus narrowly on specific tasks within palmprint recognition-often grounded in traditional methodologies-there remains a significant gap in comprehensive research exploring DL-based approaches across all facets of palmprint recognition. This paper bridges that gap by thoroughly reviewing recent advancements in DL-powered palmprint recognition. The paper systematically examines progress across key tasks, including region-of-interest segmentation, feature extraction, and security/privacy-oriented challenges. Beyond highlighting these advancements, the paper identifies current challenges and uncovers promising opportunities for future research. By consolidating state-of-the-art progress, this review serves as a valuable resource for researchers, enabling them to stay abreast of cutting-edge technologies and drive innovation in palmprint recognition.
comment: Palmprint recognition, biometrics, deep learning, feature extraction, recognition tasks
♻ ☆ CaMiT: A Time-Aware Car Model Dataset for Classification and Generation NeurIPS 2025
AI systems must adapt to evolving visual environments, especially in domains where object appearances change over time. We introduce Car Models in Time (CaMiT), a fine-grained dataset capturing the temporal evolution of car models, a representative class of technological artifacts. CaMiT includes 787K labeled samples of 190 car models (2007-2023) and 5.1M unlabeled samples (2005-2023), supporting both supervised and self-supervised learning. Static pretraining on in-domain data achieves competitive performance with large-scale generalist models while being more resource-efficient, yet accuracy declines when models are tested across years. To address this, we propose a time-incremental classification setting, a realistic continual learning scenario with emerging, evolving, and disappearing classes. We evaluate two strategies: time-incremental pretraining, which updates the backbone, and time-incremental classifier learning, which updates only the final layer, both improving temporal robustness. Finally, we explore time-aware image generation that leverages temporal metadata during training, yielding more realistic outputs. CaMiT offers a rich benchmark for studying temporal adaptation in fine-grained visual recognition and generation.
comment: To be published in NeurIPS 2025 Track on Datasets and Benchmarks
♻ ☆ UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning NeurIPS 2025
Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with particular focuses on holistic image- and video-language understanding. Conversely, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where the models are expected to realize pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.
comment: NeurIPS 2025 Camera Ready. Project Page: https://polyu-chenlab.github.io/unipixel/
♻ ☆ Regression is all you need for medical image translation
While Generative Adversarial Nets (GANs) and Diffusion Models (DMs) have achieved impressive results in natural image synthesis, their core strengths - creativity and realism - can be detrimental in medical applications, where accuracy and fidelity are paramount. These models instead risk introducing hallucinations and replication of unwanted acquisition noise. Here, we propose YODA (You Only Denoise once - or Average), a 2.5D diffusion-based framework for medical image translation (MIT). Consistent with DM theory, we find that conventional diffusion sampling stochastically replicates noise. To mitigate this, we draw and average multiple samples, akin to physical signal averaging. As this effectively approximates the DM's expected value, we term this Expectation-Approximation (ExpA) sampling. We additionally propose regression sampling YODA, which retains the initial DM prediction and omits iterative refinement to produce noise-free images in a single step. Across five diverse multi-modal datasets - including multi-contrast brain MRI and pelvic MRI-CT - we demonstrate that regression sampling is not only substantially more efficient but also matches or exceeds image quality of full diffusion sampling even with ExpA. Our results reveal that iterative refinement solely enhances perceptual realism without benefiting information translation, which we confirm in relevant downstream tasks. YODA outperforms eight state-of-the-art DMs and GANs and challenges the presumed superiority of DMs and GANs over computationally cheap regression models for high-quality MIT. Furthermore, we show that YODA-translated images are interchangeable with, or even superior to, physical acquisitions for several medical applications.
♻ ☆ PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we further introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.
♻ ☆ ITVTON: Virtual Try-On Diffusion Transformer Based on Integrated Image and Text
Virtual try-on, which aims to seamlessly fit garments onto person images, has recently seen significant progress with diffusion-based models. However, existing methods commonly resort to duplicated backbones or additional image encoders to extract garment features, which increases computational overhead and network complexity. In this paper, we propose ITVTON, an efficient framework that leverages the Diffusion Transformer (DiT) as its single generator to improve image fidelity. By concatenating garment and person images along the width dimension and incorporating textual descriptions from both, ITVTON effectively captures garment-person interactions while preserving realism. To further reduce computational cost, we restrict training to the attention parameters within a single Diffusion Transformer (Single-DiT) block. Extensive experiments demonstrate that ITVTON surpasses baseline methods both qualitatively and quantitatively, setting a new standard for virtual try-on. Moreover, experiments on 10,257 image pairs from IGPair confirm its robustness in real-world scenarios.
comment: Accepted by PRCV 2025
♻ ☆ Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Learnable Linear Extrapolation NeurIPS 2025
Diffusion-based inverse algorithms have shown remarkable performance across various inverse problems, yet their reliance on numerous denoising steps incurs high computational costs. While recent developments of fast diffusion ODE solvers offer effective acceleration for diffusion sampling without observations, their application in inverse problems remains limited due to the heterogeneous formulations of inverse algorithms and their prevalent use of approximations and heuristics, which often introduce significant errors that undermine the reliability of analytical solvers. In this work, we begin with an analysis of ODE solvers for inverse problems that reveals a linear combination structure of approximations for the inverse trajectory. Building on this insight, we propose a canonical form that unifies a broad class of diffusion-based inverse algorithms and facilitates the design of more generalizable solvers. Inspired by the linear subspace search strategy, we propose Learnable Linear Extrapolation (LLE), a lightweight approach that universally enhances the performance of any diffusion-based inverse algorithm conforming to our canonical form. LLE optimizes the combination coefficients to refine current predictions using previous estimates, alleviating the sensitivity of analytical solvers for inverse algorithms. Extensive experiments demonstrate consistent improvements of the proposed LLE method across multiple algorithms and tasks, indicating its potential for more efficient solutions and boosted performance of diffusion-based inverse algorithms with limited steps. Codes for reproducing our experiments are available at https://github.com/weigerzan/LLE_inverse_problem.
comment: Accepted by NeurIPS 2025
♻ ☆ Pose-free 3D Gaussian splatting via shape-ray estimation ICIP 2025
While generalizable 3D Gaussian splatting enables efficient, high-quality rendering of unseen scenes, it heavily depends on precise camera poses for accurate geometry. In real-world scenarios, obtaining accurate poses is challenging, leading to noisy pose estimates and geometric misalignments. To address this, we introduce SHARE, a pose-free, feed-forward Gaussian splatting framework that overcomes these ambiguities by joint shape and camera rays estimation. Instead of relying on explicit 3D transformations, SHARE builds a pose-aware canonical volume representation that seamlessly integrates multi-view information, reducing misalignment caused by inaccurate pose estimates. Additionally, anchor-aligned Gaussian prediction enhances scene reconstruction by refining local geometry around coarse anchors, allowing for more precise Gaussian placement. Extensive experiments on diverse real-world datasets show that our method achieves robust performance in pose-free generalizable Gaussian splatting. Code is avilable at https://github.com/youngju-na/SHARE
comment: ICIP 2025 (Best Student Paper Award) Code available at: https://github.com/youngju-na/SHARE
♻ ☆ Mask Image Watermarking NeurIPS
We present MaskWM, a simple, efficient, and flexible framework for image watermarking. MaskWM has two variants: (1) MaskWM-D, which supports global watermark embedding, watermark localization, and local watermark extraction for applications such as tamper detection; (2) MaskWM-ED, which focuses on local watermark embedding and extraction, offering enhanced robustness in small regions to support fine-grined image protection. MaskWM-D builds on the classical encoder-distortion layer-decoder training paradigm. In MaskWM-D, we introduce a simple masking mechanism during the decoding stage that enables both global and local watermark extraction. During training, the decoder is guided by various types of masks applied to watermarked images before extraction, helping it learn to localize watermarks and extract them from the corresponding local areas. MaskWM-ED extends this design by incorporating the mask into the encoding stage as well, guiding the encoder to embed the watermark in designated local regions, which improves robustness under regional attacks. Extensive experiments show that MaskWM achieves state-of-the-art performance in global and local watermark extraction, watermark localization, and multi-watermark embedding. It outperforms all existing baselines, including the recent leading model WAM for local watermarking, while preserving high visual quality of the watermarked images. In addition, MaskWM is highly efficient and adaptable. It requires only 20 hours of training on a single A6000 GPU, achieving 15x computational efficiency compared to WAM. By simply adjusting the distortion layer, MaskWM can be quickly fine-tuned to meet varying robustness requirements.
comment: Neural Information Processing Systems (NeurIPS) 2025
♻ ☆ PICABench: How Far Are We from Physically Realistic Image Editing?
Image editing has achieved remarkable progress recently. Modern editing models could already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are the key to the generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimension (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose the PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with large rooms to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.
♻ ☆ scSplit: Bringing Severity Cognizance to Image Decomposition in Fluorescence Microscopy NeurIPS 2025
Fluorescence microscopy, while being a key driver for progress in the life sciences, is also subject to technical limitations. To overcome them, computational multiplexing techniques have recently been proposed, which allow multiple cellular structures to be captured in a single image and later be unmixed. Existing image decomposition methods are trained on a set of superimposed input images and the respective unmixed target images. It is critical to note that the relative strength (mixing ratio) of the superimposed images for a given input is a priori unknown. However, existing methods are trained on a fixed intensity ratio of superimposed inputs, making them not cognizant of the range of relative intensities that can occur in fluorescence microscopy. In this work, we propose a novel method called scSplit that is cognizant of the severity of the above-mentioned mixing ratio. Our idea is based on InDI , a popular iterative method for image restoration, and an ideal starting point to embrace the unknown mixing ratio in any given input. We introduce (i) a suitably trained regressor network that predicts the degradation level (mixing ratio) of a given input image and (ii) a degradation-specific normalization module, enabling degradation-aware inference across all mixing ratios. We show that this method solves two relevant tasks in fluorescence microscopy, namely image splitting and bleedthrough removal, and empirically demonstrate the applicability of scSplit on 5 public datasets. The source code with pre-trained models is hosted at https://github.com/juglab/scSplit/.
comment: manuscript accepted at NeurIPS 2025
♻ ☆ A Multimodal Deep Learning Approach for White Matter Shape Prediction in Diffusion MRI Tractography
Shape measures have emerged as promising descriptors of white matter tractography, offering complementary insights into anatomical variability and associations with cognitive and clinical phenotypes. However, conventional methods for computing shape measures are computationally expensive and time-consuming for large-scale datasets due to reliance on voxel-based representations. We propose Tract2Shape, a novel multimodal deep learning framework that leverages geometric (point cloud) and scalar (tabular) features to predict ten white matter tractography shape measures. To enhance model efficiency, we utilize a dimensionality reduction algorithm for the model to predict five primary shape components. The model is trained and evaluated on two independently acquired datasets, the HCP-YA dataset, and the PPMI dataset. We evaluate the performance of Tract2Shape by training and testing it on the HCP-YA dataset and comparing the results with state-of-the-art models. To further assess its robustness and generalization ability, we also test Tract2Shape on the unseen PPMI dataset. Tract2Shape outperforms SOTA deep learning models across all ten shape measures, achieving the highest average Pearson's r and the lowest nMSE on the HCP-YA dataset. The ablation study shows that both multimodal input and PCA contribute to performance gains. On the unseen testing PPMI dataset, Tract2Shape maintains a high Pearson's r and low nMSE, demonstrating strong generalizability in cross-dataset evaluation. Tract2Shape enables fast, accurate, and generalizable prediction of white matter shape measures from tractography data, supporting scalable analysis across datasets. This framework lays a promising foundation for future large-scale white matter shape analysis.
comment: Paper accepted to Human Brain Mapping. 25 pages, 3 figures, 8 tables
♻ ☆ REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training both VAE and diffusion-model using standard diffusion-loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss -- allowing both VAE and diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance; speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself; leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art; achieving FID of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.
♻ ☆ VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching NeurIPS 2025
Vision-Language-Action (VLA) models have demonstrated strong multi-modal reasoning capabilities, enabling direct action generation from visual perception and language instructions in an end-to-end manner. However, their substantial computational cost poses a challenge for real-time robotic control, where rapid decision-making is essential. This paper introduces VLA-Cache, a training-free inference acceleration method that reduces computational overhead by adaptively caching and reusing static visual tokens across frames. Exploiting the temporal continuity in robotic manipulation, VLA-Cache identifies minimally changed tokens between adjacent frames and reuses their cached key-value representations, thereby circumventing redundant computations. Additionally, to maintain action precision, VLA-Cache selectively re-computes task-relevant tokens that are environmentally sensitive, ensuring the fidelity of critical visual information. To further optimize efficiency, we introduce a layer adaptive token reusing strategy that dynamically adjusts the reuse ratio based on attention concentration across decoder layers, prioritizing critical tokens for recomputation. Extensive experiments on two simulation platforms (LIBERO and SIMPLER) and a real-world robotic system demonstrate that VLA-Cache achieves up to 1.7x speedup in CUDA latency and a 15% increase in control frequency, with negligible loss on task success rate. The code and videos can be found at our project page: https://vla-cache.github.io.
comment: Accepted to NeurIPS 2025
♻ ☆ Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves \textbf{state-of-the-art} results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available at https://github.com/PKU-YuanGroup/UniWorld-V2.
♻ ☆ VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank
DeepSeek-R1 has demonstrated remarkable effectiveness in incentivizing reasoning and generalization capabilities of large language models (LLMs) through reinforcement learning. Nevertheless, the potential of reasoning-induced computation has not been thoroughly explored in the context of image quality assessment (IQA), a task depending critically on visual reasoning. In this paper, we introduce VisualQuality-R1, a reasoning-induced no-reference IQA (NR-IQA) model, and we train it with reinforcement learning to rank, a learning algorithm tailored to the intrinsically relative nature of visual quality. Specifically, for a pair of images, we employ group relative policy optimization to generate multiple quality scores for each image. These estimates are used to compute comparative probabilities of one image having higher quality than the other under the Thurstone model. Rewards for each quality estimate are defined using continuous fidelity measures rather than discretized binary labels. Extensive experiments show that the proposed VisualQuality-R1 consistently outperforms discriminative deep learning-based NR-IQA models as well as a recent reasoning-induced quality regression method. Moreover, VisualQuality-R1 is capable of generating contextually rich, human-aligned quality descriptions, and supports multi-dataset training without requiring perceptual scale realignment. These features make VisualQuality-R1 especially well-suited for reliably measuring progress in a wide range of image processing tasks like super-resolution and image generation.
♻ ☆ MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code is avaliable at https://github.com/mbzuai-oryx/MATRIX.
comment: We have come across a recent approach that has not been properly attributed at the time of submission and compared in a fair setting. Therefore, we would like to withdraw the paper to address these concerns
♻ ☆ LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
We introduce \textbf{LongInsightBench}, the first benchmark designed to assess models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating \textbf{visual, audio, and text} modalities. Our benchmark excels in three key areas: \textbf{a) Long-Duration, Information-Dense Videos:} We carefully select approximately 1,000 videos from open-source datasets FineVideo based on duration limit and the information density of both visual and audio modalities, focusing on content like lectures, interviews, and vlogs, which contain rich language elements. \textbf{b) Diverse and Challenging Task Scenarios:} We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. \textbf{c) Rigorous and Comprehensive Quality Assurance Pipelines:} We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results shows that Omni-modal models(OLMs) still face challenge in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal the information loss and processing bias in multi-modal fusion of OLMs. Our dataset and code is available at https://anonymous.4open.science/r/LongInsightBench-910F/.
comment: Submitted to ARR Rolling Review
♻ ☆ ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model NeurIPS2025
In real-word scenarios, person re-identification (ReID) expects to identify a person-of-interest via the descriptive query, regardless of whether the query is a single modality or a combination of multiple modalities. However, existing methods and datasets remain constrained to limited modalities, failing to meet this requirement. Therefore, we investigate a new challenging problem called Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve effective retrieval with varying multi-modal queries. To address dataset scarcity, we construct ORBench, the first high-quality multi-modal dataset comprising 1,000 unique identities across five modalities: RGB, infrared, color pencil, sketch, and textual description. This dataset also has significant superiority in terms of diversity, such as the painting perspectives and textual information. It could serve as an ideal platform for follow-up investigations in OM-ReID. Moreover, we propose ReID5o, a novel multi-modal learning framework for person ReID. It enables synergistic fusion and cross-modal alignment of arbitrary modality combinations in a single model, with a unified encoding and multi-expert routing mechanism proposed. Extensive experiments verify the advancement and practicality of our ORBench. A wide range of possible models have been evaluated and compared on it, and our proposed ReID5o model gives the best performance. The dataset and code will be made publicly available at https://github.com/Zplusdragon/ReID5o_ORBench.
comment: NeurIPS2025 Accepted Paper
♻ ☆ Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model
3D medical image analysis is essential for modern healthcare, yet traditional task-specific models are inadequate due to limited generalizability across diverse clinical scenarios. Multimodal large language models (MLLMs) offer a promising solution to these challenges. However, existing MLLMs have limitations in fully leveraging the rich, hierarchical information embedded in 3D medical images. Inspired by clinical practice, where radiologists focus on both 3D spatial structure and 2D planar content, we propose Med-2E3, a 3D medical MLLM that integrates a dual 3D-2D encoder architecture. To aggregate 2D features effectively, we design a Text-Guided Inter-Slice (TG-IS) scoring module, which scores the attention of each 2D slice based on slice contents and task instructions. To the best of our knowledge, Med-2E3 is the first MLLM to integrate both 3D and 2D features for 3D medical image analysis. Experiments on large-scale, open-source 3D medical multimodal datasets demonstrate that TG-IS exhibits task-specific attention distribution and significantly outperforms current state-of-the-art models. The code is available at: https://github.com/MSIIP/Med-2E3
♻ ☆ GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction NeurIPS 2025
Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing. Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (\eg across sensors, ecological distributions), requiring robust cross-domain methods. Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. All code and data are available at: https://github.com/echerif18/HyspectraSSL.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Think With Videos For Agentic Long-Video Understanding
Long-video understanding~(LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of ``thinking with video'', which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository(https://github.com/yhy-2000/VideoDeepResearch).
♻ ☆ DA$^2$: Depth Anything in Any Direction
Panorama has a full FoV (360$^\circ\times$180$^\circ$), offering a more complete visual description than perspective images. Thanks to this characteristic, panoramic depth estimation is gaining increasing traction in 3D vision. However, due to the scarcity of panoramic data, previous methods are often restricted to in-domain settings, leading to poor zero-shot generalization. Furthermore, due to the spherical distortions inherent in panoramas, many approaches rely on perspective splitting (e.g., cubemaps), which leads to suboptimal efficiency. To address these challenges, we propose $\textbf{DA}$$^{\textbf{2}}$: $\textbf{D}$epth $\textbf{A}$nything in $\textbf{A}$ny $\textbf{D}$irection, an accurate, zero-shot generalizable, and fully end-to-end panoramic depth estimator. Specifically, for scaling up panoramic data, we introduce a data curation engine for generating high-quality panoramic depth data from perspective, and create $\sim$543K panoramic RGB-depth pairs, bringing the total to $\sim$607K. To further mitigate the spherical distortions, we present SphereViT, which explicitly leverages spherical coordinates to enforce the spherical geometric consistency in panoramic image features, yielding improved performance. A comprehensive benchmark on multiple datasets clearly demonstrates DA$^{2}$'s SoTA performance, with an average 38% improvement on AbsRel over the strongest zero-shot baseline. Surprisingly, DA$^{2}$ even outperforms prior in-domain methods, highlighting its superior zero-shot generalization. Moreover, as an end-to-end solution, DA$^{2}$ exhibits much higher efficiency over fusion-based approaches. Both the code and the curated panoramic data has be released. Project page: https://depth-any-in-any-dir.github.io/.
comment: Work primarily done during an internship at Tencent Hunyuan. Project page: https://depth-any-in-any-dir.github.io/
♻ ☆ H3D-DGS: Exploring Heterogeneous 3D Motion Representation for Deformable 3D Gaussian Splatting
Dynamic scene reconstruction poses a persistent challenge in 3D vision. Deformable 3D Gaussian Splatting has emerged as an effective method for this task, offering real-time rendering and high visual fidelity. This approach decomposes a dynamic scene into a static representation in a canonical space and time-varying scene motion. Scene motion is defined as the collective movement of all Gaussian points, and for compactness, existing approaches commonly adopt implicit neural fields or sparse control points. However, these methods predominantly rely on gradient-based optimization for all motion information. Due to the high degree of freedom, they struggle to converge on real-world datasets exhibiting complex motion. To preserve the compactness of motion representation and address convergence challenges, this paper proposes heterogeneous 3D control points, termed \textbf{H3D control points}, whose attributes are obtained using a hybrid strategy combining optical flow back-projection and gradient-based methods. This design decouples directly observable motion components from those that are geometrically occluded. Specifically, components of 3D motion that project onto the image plane are directly acquired via optical flow back projection, while unobservable portions are refined through gradient-based optimization. Experiments on the Neu3DV and CMU-Panoptic datasets demonstrate that our method achieves superior performance over state-of-the-art deformable 3D Gaussian splatting techniques. Remarkably, our method converges within just 100 iterations and achieves a per-frame processing speed of 2 seconds on a single NVIDIA RTX 4070 GPU.
♻ ☆ WMamba: Wavelet-based Mamba for Face Forgery Detection ACM MM 2025
The rapid evolution of deepfake generation technologies necessitates the development of robust face forgery detection algorithms. Recent studies have demonstrated that wavelet analysis can enhance the generalization abilities of forgery detectors. Wavelets effectively capture key facial contours, often slender, fine-grained, and globally distributed, that may conceal subtle forgery artifacts imperceptible in the spatial domain. However, current wavelet-based approaches fail to fully exploit the distinctive properties of wavelet data, resulting in sub-optimal feature extraction and limited performance gains. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. WMamba maximizes the utility of wavelet information through two key innovations. First, we propose Dynamic Contour Convolution (DCConv), which employs specially crafted deformable kernels to adaptively model slender facial contours. Second, by leveraging the Mamba architecture, our method captures long-range spatial relationships with linear complexity. This efficiency allows for the extraction of fine-grained, globally distributed forgery artifacts from small image patches. Extensive experiments show that WMamba achieves state-of-the-art (SOTA) performance, highlighting its effectiveness in face forgery detection.
comment: Accepted by ACM MM 2025
♻ ☆ Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable NeurIPS 2025
Existing detectors are often trained on biased datasets, leading to the possibility of overfitting on non-causal image attributes that are spuriously correlated with real/synthetic labels. While these biased features enhance performance on the training data, they result in substantial performance degradation when applied to unbiased datasets. One common solution is to perform dataset alignment through generative reconstruction, matching the semantic content between real and synthetic images. However, we revisit this approach and show that pixel-level alignment alone is insufficient. The reconstructed images still suffer from frequency-level misalignment, which can perpetuate spurious correlations. To illustrate, we observe that reconstruction models tend to restore the high-frequency details lost in real images (possibly due to JPEG compression), inadvertently creating a frequency-level misalignment, where synthetic images appear to have richer high-frequency content than real ones. This misalignment leads to models associating high-frequency features with synthetic labels, further reinforcing biased cues. To resolve this, we propose Dual Data Alignment (DDA), which aligns both the pixel and frequency domains. Moreover, we introduce two new test sets: DDA-COCO, containing DDA-aligned synthetic images for testing detector performance on the most aligned dataset, and EvalGEN, featuring the latest generative models for assessing detectors under new generative architectures such as visual auto-regressive generators. Finally, our extensive evaluations demonstrate that a detector trained exclusively on DDA-aligned MSCOCO could improve across 8 diverse benchmarks by a non-trivial margin, showing a +7.2% on in-the-wild benchmarks, highlighting the improved generalizability of unbiased detectors. Our code is available at: https://github.com/roy-ch/Dual-Data-Alignment.
comment: NeurIPS 2025 Spotlight. 13 Pages, 10 figures
♻ ☆ gen2seg: Generative Models Enable Generalizable Instance Segmentation
By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general-purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category-agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero-shot generalization, accurately segmenting objects of types and styles unseen in finetuning (and in many cases, MAE's ImageNet-1K pretraining too). Our best-performing models closely approach the heavily supervised SAM when evaluated on unseen object types and styles, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining. Code, pretrained models, and demos are available on our website.
comment: Website: https://reachomk.github.io/gen2seg/
♻ ☆ Cryo-RL: automating prostate cancer cryoablation planning with reinforcement learning
Cryoablation is a minimally invasive localised treatment for prostate cancer that destroys malignant tissue during de-freezing, while sparing surrounding healthy structures. Its success depends on accurate preoperative planning of cryoprobe placements to fully cover the tumour and avoid critical anatomy. This planning is currently manual, expertise-dependent, and time-consuming, leading to variability in treatment quality and limited scalability. In this work, we introduce Cryo-RL, a reinforcement learning framework that models cryoablation planning as a Markov decision process and learns an optimal policy for cryoprobe placement. Within a simulated environment that models clinical constraints and stochastic intraoperative variability, an agent sequentially selects cryoprobe positions and ice sphere diameters. Guided by a reward function based on tumour coverage, this agent learns a cryoablation strategy that leads to optimal cryoprobe placements without the need for any manually-designed plans. Evaluated on 583 retrospective prostate cancer cases, Cryo-RL achieved over 8 percentage-point Dice improvements compared with the best automated baselines, based on geometric optimisation, and matched human expert performance while requiring substantially less planning time. These results highlight the potential of reinforcement learning to deliver clinically viable, reproducible, and efficient cryoablation plans.
comment: Accepted at MICAD (Medical Imaging and Computer-Aided Diagnosis) 2025
♻ ☆ From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes NeurIPS 2025
3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,886 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, individual objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require a more comprehensive spatial reasoning ability, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best performance model, OpenAI o4-mini, achieves only 23.00% accuracy on space-level tasks and 31.46% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models' capacity to understand and reason about 3D scenes beyond object-level semantics.
comment: NeurIPS 2025 (Datasets and Benchmarks). Project page: https://anywhere-3d.github.io/
♻ ☆ Leveraging AV1 motion vectors for Fast and Dense Feature Matching
We repurpose AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency. On short videos, this compressed-domain front end runs comparably to sequential SIFT while using far less CPU, and yields denser matches with competitive pairwise geometry. As a small SfM demo on a 117-frame clip, MV matches register all images and reconstruct 0.46-0.62M points at 0.51-0.53,px reprojection error; BA time grows with match density. These results show compressed-domain correspondences are a practical, resource-efficient front end with clear paths to scaling in full pipelines.
comment: Accepted ICIR 2025, camera-ready version
♻ ☆ SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mechanism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of-Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.
comment: Project page: https://scenecot.github.io/
♻ ☆ Polyline Path Masked Attention for Vision Transformer
Global dependency modeling and spatial position modeling are two core issues of the foundational architecture design in current deep learning frameworks. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self-attention mechanism. Furthermore, Mamba2 has demonstrated its significant potential in natural language processing tasks by explicitly modeling the spatial adjacency prior through the structured mask. In this paper, we propose Polyline Path Masked Attention (PPMA) that integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first ameliorate the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding structured mask, polyline path mask, which better preserves the adjacency relationships among image tokens. Notably, we conduct a thorough theoretical analysis on the structural characteristics of the proposed polyline path mask and design an efficient algorithm for the computation of the polyline path mask. Next, we embed the polyline path mask into the self-attention mechanism of ViTs, enabling explicit modeling of spatial adjacency prior. Extensive experiments on standard benchmarks, including image classification, object detection, and segmentation, demonstrate that our model outperforms previous state-of-the-art approaches based on both state-space models and Transformers. For example, our proposed PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K semantic segmentation task, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%, respectively. Code is available at https://github.com/zhongchenzhao/PPMA.
♻ ☆ Foundation Cures Personalization: Improving Personalized Models' Prompt Consistency via Hidden Foundation Knowledge NeurIPS 2025
Facial personalization faces challenges to maintain identity fidelity without disrupting the foundation model's prompt consistency. The mainstream personalization models employ identity embedding to integrate identity information within the attention mechanisms. However, our preliminary findings reveal that identity embeddings compromise the effectiveness of other tokens in the prompt, thereby limiting high prompt consistency and attribute-level controllability. Moreover, by deactivating identity embedding, personalization models still demonstrate the underlying foundation models' ability to control facial attributes precisely. It suggests that such foundation models' knowledge can be leveraged to cure the ill-aligned prompt consistency of personalization models. Building upon these insights, we propose FreeCure, a framework that improves the prompt consistency of personalization models with their latent foundation models' knowledge. First, by setting a dual inference paradigm with/without identity embedding, we identify attributes (e.g., hair, accessories, etc.) for enhancements. Second, we introduce a novel foundation-aware self-attention module, coupled with an inversion-based process to bring well-aligned attribute information to the personalization process. Our approach is training-free, and can effectively enhance a wide array of facial attributes; and it can be seamlessly integrated into existing popular personalization models based on both Stable Diffusion and FLUX. FreeCure has consistently shown significant improvements in prompt consistency across these facial personalization models while maintaining the integrity of their original identity fidelity.
comment: Accepted to NeurIPS 2025
♻ ☆ Global Prompt Refinement with Non-Interfering Attention Masking for One-Shot Federated Learning NeurIPS'25
Federated Prompt Learning (FPL) enables communication-efficient adaptation by tuning lightweight prompts on top of frozen pre-trained models. Existing FPL methods typically rely on global information, which is only available after the second training round, to facilitate collaboration among client models. Therefore, they are inherently dependent on multi-round communication to fully exhibit their strengths. Moreover, existing one-shot federated learning methods typically focus on fitting seen tasks, but lack cross-task generalization. To bridge this gap, we propose the Global Prompt Refinement with Non-Interfering Attention Masking (GPR-NIAM) method for one-shot FPL. The core idea is to design a masking mechanism that restricts excessive interaction between the original text embeddings and the learnable prompt embeddings. GPR-NIAM achieves this through the collaboration of two key modules. Firstly, the attention isolation module suppresses attention from the learnable prompt tokens to the original text tokens, and reweights the reverse attention which preserves generalization across tasks. Secondly, the cross-silo collaborative refinement module integrates decentralized visual knowledge into a unified base and calibrates the global prompt through multi-source cross-modal knowledge alignment, further mitigating the inconsistency caused by data heterogeneity. Extensive experiments conducted on ten benchmark datasets under two tasks show that GPR-NIAM outperforms eight state-of-the-art methods in both class-level and domain-level generalization.
comment: NeurIPS'25 accepted
♻ ☆ Exploring Cross-Modal Flows for Few-Shot Learning
Aligning features from different modalities, is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features, and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. It is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
comment: 13 pages, 6 figures
♻ ☆ CrossRay3D: Geometry and Distribution Guidance for Efficient Multimodal 3D Detection
The sparse cross-modality detector offers more advantages than its counterpart, the Bird's-Eye-View (BEV) detector, particularly in terms of adaptability for downstream tasks and computational cost savings. However, existing sparse detectors overlook the quality of token representation, leaving it with a sub-optimal foreground quality and limited performance. In this paper, we identify that the geometric structure preserved and the class distribution are the key to improving the performance of the sparse detector, and propose a Sparse Selector (SS). The core module of SS is Ray-Aware Supervision (RAS), which preserves rich geometric information during the training stage, and Class-Balanced Supervision, which adaptively reweights the salience of class semantics, ensuring that tokens associated with small objects are retained during token sampling. Thereby, outperforming other sparse multi-modal detectors in the representation of tokens. Additionally, we design Ray Positional Encoding (Ray PE) to address the distribution differences between the LiDAR modality and the image. Finally, we integrate the aforementioned module into an end-to-end sparse multi-modality detector, dubbed CrossRay3D. Experiments show that, on the challenging nuScenes benchmark, CrossRay3D achieves state-of-the-art performance with 72.4 mAP and 74.7 NDS, while running 1.84 faster than other leading methods. Moreover, CrossRay3D demonstrates strong robustness even in scenarios where LiDAR or camera data are partially or entirely missing.
comment: 13 pages
♻ ☆ Class-wise Balancing Data Replay for Federated Class-Incremental Learning NeurIPS'25
Federated Class Incremental Learning (FCIL) aims to collaboratively process continuously increasing incoming tasks across multiple clients. Among various approaches, data replay has become a promising solution, which can alleviate forgetting by reintroducing representative samples from previous tasks. However, their performance is typically limited by class imbalance, both within the replay buffer due to limited global awareness and between replayed and newly arrived classes. To address this issue, we propose a class wise balancing data replay method for FCIL (FedCBDR), which employs a global coordination mechanism for class-level memory construction and reweights the learning objective to alleviate the aforementioned imbalances. Specifically, FedCBDR has two key components: 1) the global-perspective data replay module reconstructs global representations of prior task in a privacy-preserving manner, which then guides a class-aware and importance-sensitive sampling strategy to achieve balanced replay; 2) Subsequently, to handle class imbalance across tasks, the task aware temperature scaling module adaptively adjusts the temperature of logits at both class and instance levels based on task dynamics, which reduces the model's overconfidence in majority classes while enhancing its sensitivity to minority classes. Experimental results verified that FedCBDR achieves balanced class-wise sampling under heterogeneous data distributions and improves generalization under task imbalance between earlier and recent tasks, yielding a 2%-15% Top-1 accuracy improvement over six state-of-the-art methods.
comment: NeurIPS'25 Accepted, Oral
♻ ☆ DeepDetect: Learning All-in-One Dense Keypoints
Keypoint detection is the foundation of many computer vision tasks, including image registration, structure-from motion, 3D reconstruction, visual odometry, and SLAM. Traditional detectors (SIFT, SURF, ORB, BRISK, etc.) and learning based methods (SuperPoint, R2D2, LF-Net, D2-Net, etc.) have shown strong performance yet suffer from key limitations: sensitivity to photometric changes, low keypoint density and repeatability, limited adaptability to challenging scenes, and lack of semantic understanding, often failing to prioritize visually important regions. We present DeepDetect, an intelligent, all-in-one, dense keypoint detector that unifies the strengths of classical detectors using deep learning. Firstly, we create ground-truth masks by fusing outputs of 7 keypoint and 2 edge detectors, extracting diverse visual cues from corners and blobs to prominent edges and textures in the images. Afterwards, a lightweight and efficient model: ESPNet, is trained using these masks as labels, enabling DeepDetect to focus semantically on images while producing highly dense keypoints, that are adaptable to diverse and visually degraded conditions. Evaluations on the Oxford Affine Covariant Regions dataset demonstrate that DeepDetect surpasses other detectors in keypoint density, repeatability, and the number of correct matches, achieving maximum values of 0.5143 (average keypoint density), 0.9582 (average repeatability), and 59,003 (correct matches).
comment: 6 pages, 6 figures, 2 tables, 7 equations
♻ ☆ When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models
As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: https://github.com/ActiveVisionLab/Awesome-LLM-3D.
comment: 2nd version update to Jun.2025
♻ ☆ Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems
Writing is a universal cultural technology that reuses vision for symbolic communication. Humans display striking resilience: we readily recognize words even when characters are fragmented, fused, or partially occluded. This paper investigates whether advanced vision language models (VLMs) share this resilience. We construct two psychophysics inspired benchmarks across distinct writing systems, Chinese logographs and English alphabetic words, by splicing, recombining, and overlaying glyphs to yield ''visible but unreadable'' stimuli for models while remaining legible to humans. Despite strong performance on clean text, contemporary VLMs show a severe drop under these perturbations, frequently producing unrelated or incoherent outputs. The pattern suggests a structural limitation: models heavily leverage generic visual invariances but under rely on compositional priors needed for robust literacy. We release stimuli generation code, prompts, and evaluation protocols to facilitate transparent replication and follow up work. Our findings motivate architectures and training strategies that encode symbol segmentation, composition, and binding across scripts, and they delineate concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.
comment: Agent4Science 2025 Spotlight
♻ ☆ Facial Expression-based Parkinson's Disease Severity Diagnosis via Feature Fusion and Adaptive Class Balancing
Parkinson's disease (PD) severity diagnosis is crucial for early detecting potential patients and adopting tailored interventions. Diagnosing PD based on facial expression is grounded in PD patients' "masked face" symptom and gains growing interest recently for its convenience and affordability. However, current facial expression-based approaches often rely on single type of expression which can lead to misdiagnosis, and ignore the class imbalance across different PD stages which degrades the prediction performance. Moreover, most existing methods focus on binary classification (i.e., PD / non-PD) rather than diagnosing the severity of PD. To address these issues, we propose a new facial expression-based method for PD severity diagnosis which integrates multiple facial expression features through attention-based feature fusion. Moreover, we mitigate the class imbalance problem via an adaptive class balancing strategy which dynamically adjusts the contribution of training samples based on their class distribution and classification difficulty. Experimental results demonstrate the promising performance of the proposed method for PD severity diagnosis, as well as the efficacy of attention-based feature fusion and adaptive class balancing.
comment: 3 pages, 2 figures, accepted by MIND 2025
♻ ☆ Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning
Learning general-purpose reasoning capabilities has long been a challenging problem in AI. Recent research in large language models (LLMs), such as DeepSeek-R1, has shown that reinforcement learning techniques like GRPO can enable pre-trained LLMs to develop reasoning capabilities using simple question-answer pairs. In this paper, we aim to train visual language models (VLMs) to perform reasoning on image data through reinforcement learning and visual question-answer pairs, without any explicit chain-of-thought (CoT) supervision. Our findings indicate that simply applying reinforcement learning to a VLM -- by prompting the model to produce a reasoning chain before providing an answer -- can lead the model to develop shortcuts from easy questions, thereby reducing its ability to generalize across unseen data distributions. We argue that the key to mitigating shortcut learning is to encourage the model to interpret images prior to reasoning. Therefore, we train the model to adhere to a caption-reason-answer output format: initially generating a detailed caption for an image, followed by constructing an extensive reasoning chain. When trained on 273K CoT-free visual question-answer pairs and using only reinforcement learning, our model, named Visionary-R1, outperforms strong multimodal models, such as GPT-4o, Claude3.5-Sonnet, and Gemini-1.5-Pro, on multiple visual reasoning benchmarks.
♻ ☆ Distilling LLM Prior to Flow Model for Generalizable Agent's Imagination in Object Goal Navigation
The Object Goal Navigation (ObjectNav) task challenges agents to locate a specified object in an unseen environment by imagining unobserved regions of the scene. Prior approaches rely on deterministic and discriminative models to complete semantic maps, overlooking the inherent uncertainty in indoor layouts and limiting their ability to generalize to unseen environments. In this work, we propose GOAL, a generative flow-based framework that models the semantic distribution of indoor environments by bridging observed regions with LLM-enriched full-scene semantic maps. During training, spatial priors inferred from large language models (LLMs) are encoded as two-dimensional Gaussian fields and injected into target maps, distilling rich contextual knowledge into the flow model and enabling more generalizable completions. Extensive experiments demonstrate that GOAL achieves state-of-the-art performance on MP3D and Gibson, and shows strong generalization in transfer settings to HM3D. Codes and pretrained models are available at https://github.com/Badi-Li/GOAL.
♻ ☆ MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal language models, fall short in addressing the complex and nuanced threats posed by multimodal inputs. Moreover, current safety datasets lack the fine-grained, policy-grounded reasoning required to robustly align reasoning-capable VLMs. In this work, we introduce {MSR-Align}, a high-quality Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities. Our data generation pipeline emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Extensive experiments demonstrate that fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks, while preserving or enhancing general reasoning performance. MSR-Align provides a scalable and effective foundation for advancing the safety alignment of reasoning-capable VLMs. Our dataset is made publicly available at https://huggingface.co/datasets/Leigest/MSR-Align.
♻ ☆ VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. However, Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. To tackle this issue, we investigate the bottlenecks in current MLLMs and synthesize training data to improve their abstract visual perception. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models' reasoning capacities across five core dimensions and two high-level reasoning categories. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions. PRS not only generates valuable training data for abstract graphics but also provides fine-grained perceptual description, crucially allowing for supervision over intermediate reasoning stages and thereby improving both training efficacy and model interpretability. Our extensive experimental results on VisuRiddles empirically validate that fine-grained visual perception is the principal bottleneck and our synthesis framework markedly enhances the performance of contemporary MLLMs on these challenging tasks. Our code and dataset will be released at https://github.com/yh-hust/VisuRiddles
comment: 13 pages, 4 figures
♻ ☆ RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
Existing end-to-end autonomous driving (AD) algorithms typically follow the Imitation Learning (IL) paradigm, which faces challenges such as causal confusion and an open-loop gap. In this work, we propose RAD, a 3DGS-based closed-loop Reinforcement Learning (RL) framework for end-to-end Autonomous Driving. By leveraging 3DGS techniques, we construct a photorealistic digital replica of the real physical world, enabling the AD policy to extensively explore the state space and learn to handle out-of-distribution scenarios through large-scale trial and error. To enhance safety, we design specialized rewards to guide the policy in effectively responding to safety-critical events and understanding real-world causal relationships. To better align with human driving behavior, we incorporate IL into RL training as a regularization term. We introduce a closed-loop evaluation benchmark consisting of diverse, previously unseen 3DGS environments. Compared to IL-based methods, RAD achieves stronger performance in most closed-loop metrics, particularly exhibiting a 3x lower collision rate. Abundant closed-loop results are presented in the supplementary material. Code is available at https://github.com/hustvl/RAD for facilitating future research.
comment: Code: https://github.com/hustvl/RAD
♻ ☆ SAMPO:Scale-wise Autoregression with Motion PrOmpt for generative world models
World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose \textbf{S}cale-wise \textbf{A}utoregression with \textbf{M}otion \textbf{P}r\textbf{O}mpt (\textbf{SAMPO}), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4$\times$ faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.
comment: 22 pages,15 figures
♻ ☆ Latent Diffusion Model without Variational Autoencoder
Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations. Code and interpretations are available at https://howlin-wang.github.io/svg/.
♻ ☆ Learning Collaborative Knowledge with Multimodal Representation for Polyp Re-Identification
Colonoscopic Polyp Re-Identification aims to match the same polyp from a large gallery with images from different views taken using different cameras, which plays an important role in the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, traditional methods for object ReID directly adopting CNN models trained on the ImageNet dataset usually produce unsatisfactory retrieval performance on colonoscopic datasets due to the large domain gap. Worsely, these solutions typically learn unimodal modal representations on the basis of visual samples, which fails to explore complementary information from other different modalities. To address this challenge, we propose a novel Deep Multimodal Collaborative Learning framework named DMCL for polyp re-identification, which can effectively encourage multimodal knowledge collaboration and reinforce generalization capability in medical scenarios. On the basis of it, a dynamic multimodal feature fusion strategy is introduced to leverage the optimized visual-text representations for multimodal fusion via end-to-end training. Experiments on the standard benchmarks show the benefits of the multimodal setting over state-of-the-art unimodal ReID models, especially when combined with the collaborative multimodal fusion strategy. The code is publicly available at https://github.com/JeremyXSC/DMCL.
♻ ☆ GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization
Image geolocalization aims to predict the geographic location of images captured anywhere on Earth, but its global nature presents significant challenges. Current evaluation methodologies suffer from two major limitations. First, data leakage: advanced approaches often rely on large vision-language models (LVLMs) to predict image locations, yet these models are frequently pretrained on the test datasets, compromising the accuracy of evaluating a model's actual geolocalization capability. Second, existing metrics primarily rely on exact geographic coordinates to assess predictions, which not only neglects the reasoning process but also raises privacy concerns when user-level location data is required. To address these issues, we propose GeoArena, a first open platform for evaluating LVLMs on worldwide image geolocalization tasks, offering true in-the-wild and human-centered benchmarking. GeoArena enables users to upload in-the-wild images for a more diverse evaluation corpus, and it leverages pairwise human judgments to determine which model output better aligns with human expectations. Our platform has been deployed online for two months, during which we collected over thousands voting records. Based on this data, we conduct a detailed analysis and establish a leaderboard of different LVLMs on the image geolocalization task. GeoArena has been open-sourced to support future research.
♻ ☆ Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
comment: 16 pages, 12 figures
♻ ☆ DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but often fail to generalize to long-tail scenarios due to the lack of essential world knowledge to understand and reason about surrounding environments. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module directly generating semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. The experiment on the ICCV 2025 Autonomous Grand Challenge leaderboard shows that DiffVLA++ achieves EPDMS of 49.12.
♻ ☆ A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking NeurIPS 2025
The real world is dynamic, yet most image fusion methods process static frames independently, ignoring temporal correlations in videos and leading to flickering and temporal inconsistency. To address this, we propose Unified Video Fusion (UniVF), a novel and unified framework for video fusion that leverages multi-frame learning and optical flow-based feature warping for informative, temporally coherent video fusion. To support its development, we also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive benchmark covering four video fusion tasks: multi-exposure, multi-focus, infrared-visible, and medical fusion. VF-Bench provides high-quality, well-aligned video pairs obtained through synthetic data generation and rigorous curation from existing datasets, with a unified evaluation protocol that jointly assesses the spatial quality and temporal consistency of video fusion. Extensive experiments show that UniVF achieves state-of-the-art results across all tasks on VF-Bench. Project page: https://vfbench.github.io.
comment: Accepted by NeurIPS 2025 (Spotlight)
♻ ☆ View Transformation Robustness for Multi-View 3D Object Reconstruction with Reconstruction Error-Guided View Selection AAAI 25
View transformation robustness (VTR) is critical for deep-learning-based multi-view 3D object reconstruction models, which indicates the methods' stability under inputs with various view transformations. However, existing research seldom focused on view transformation robustness in multi-view 3D object reconstruction. One direct way to improve the models' VTR is to produce data with more view transformations and add them to model training. Recent progress on large vision models, particularly Stable Diffusion models, has provided great potential for generating 3D models or synthesizing novel view images with only a single image input. Directly deploying these models at inference consumes heavy computation resources and their robustness to view transformations is not guaranteed either. To fully utilize the power of Stable Diffusion models without extra inference computation burdens, we propose to generate novel views with Stable Diffusion models for better view transformation robustness. Instead of synthesizing random views, we propose a reconstruction error-guided view selection method, which considers the reconstruction errors' spatial distribution of the 3D predictions and chooses the views that could cover the reconstruction errors as much as possible. The methods are trained and tested on sets with large view transformations to validate the 3D reconstruction models' robustness to view transformations. Extensive experiments demonstrate that the proposed method can outperform state-of-the-art 3D reconstruction methods and other view transformation robustness comparison methods. Code is available at: https://github.com/zqyq/VTR.
comment: AAAI 25
♻ ☆ ViFusionTST: Deep Fusion of Time-Series Image Representations from Load Signals for Early Bed-Exit Prediction
Bed-related falls remain a major source of injury in hospitals and long-term care facilities, yet many commercial alarms trigger only after a patient has already left the bed. We show that early bed-exit intent can be predicted using only one low-cost load cell mounted under a bed leg. The resulting load signals are first converted into a compact set of complementary images: an RGB line plot that preserves raw waveforms and three texture maps-recurrence plot, Markov transition field, and Gramian angular field-that expose higher-order dynamics. We introduce ViFusionTST, a dual-stream Swin Transformer that processes the line plot and texture maps in parallel and fuses them through cross-attention to learn data-driven modality weights. To provide a realistic benchmark, we collected six months of continuous data from 95 beds in a long-term-care facility. On this real-world dataset ViFusionTST reaches an accuracy of 0.885 and an F1 score of 0.794, surpassing recent 1D and 2D time-series baselines across F1, recall, accuracy, and AUPRC. The results demonstrate that image-based fusion of load-sensor signals for time series classification is a practical and effective solution for real-time, privacy-preserving fall prevention.
♻ ☆ Implicit Neural Compression of Point Clouds
Point clouds have gained prominence across numerous applications due to their ability to accurately represent 3D objects and scenes. However, efficiently compressing unstructured, high-precision point cloud data remains a significant challenge. In this paper, we propose NeRC$^3$, a novel point cloud compression framework that leverages implicit neural representations (INRs) to encode both geometry and attributes of dense point clouds. Our approach employs two coordinate-based neural networks: one maps spatial coordinates to voxel occupancy, while the other maps occupied voxels to their attributes, thereby implicitly representing the geometry and attributes of a voxelized point cloud. The encoder quantizes and compresses network parameters alongside auxiliary information required for reconstruction, while the decoder reconstructs the original point cloud by inputting voxel coordinates into the neural networks. Furthermore, we extend our method to dynamic point cloud compression through techniques that reduce temporal redundancy, including a 4D spatio-temporal representation termed 4D-NeRC$^3$. Experimental results validate the effectiveness of our approach: For static point clouds, NeRC$^3$ outperforms octree-based G-PCC standard and existing INR-based methods. For dynamic point clouds, 4D-NeRC$^3$ achieves superior geometry compression performance compared to the latest G-PCC and V-PCC standards, while matching state-of-the-art learning-based methods. It also demonstrates competitive performance in joint geometry and attribute compression.
Machine Learning 150
☆ Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause for this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as the KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
☆ LightMem: Lightweight and Efficient Memory-Augmented Generation
Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. The code is available at https://github.com/zjunlp/LightMem.
comment: Work in progress
☆ Lyapunov-Aware Quantum-Inspired Reinforcement Learning for Continuous-Time Vehicle Control: A Feasibility Study
This paper presents a novel Lyapunov-Based Quantum Reinforcement Learning (LQRL) framework that integrates quantum policy optimization with Lyapunov stability analysis for continuous-time vehicle control. The proposed approach combines the representational power of variational quantum circuits (VQCs) with a stability-aware policy gradient mechanism to ensure asymptotic convergence and safe decision-making under dynamic environments. The vehicle longitudinal control problem was formulated as a continuous-state reinforcement learning task, where the quantum policy network generates control actions subject to Lyapunov stability constraints. Simulation experiments were conducted in a closed-loop adaptive cruise control scenario using a quantum-inspired policy trained under stability feedback. The results demonstrate that the LQRL framework successfully embeds Lyapunov stability verification into quantum policy learning, enabling interpretable and stability-aware control performance. Although transient overshoot and Lyapunov divergence were observed under aggressive acceleration, the system maintained bounded state evolution, validating the feasibility of integrating safety guarantees within quantum reinforcement learning architectures. The proposed framework provides a foundational step toward provably safe quantum control in autonomous systems and hybrid quantum-classical optimization domains.
comment: 7 pages, 4 figures, 20 equations, 3 appendices, 4 tables
☆ A Hybrid Enumeration Framework for Optimal Counterfactual Generation in Post-Acute COVID-19 Heart Failure
Counterfactual inference provides a mathematical framework for reasoning about hypothetical outcomes under alternative interventions, bridging causal reasoning and predictive modeling. We present a counterfactual inference framework for individualized risk estimation and intervention analysis, illustrated through a clinical application to post-acute sequelae of COVID-19 (PASC) among patients with pre-existing heart failure (HF). Using longitudinal diagnosis, laboratory, and medication data from a large health-system cohort, we integrate regularized predictive modeling with counterfactual search to identify actionable pathways to PASC-related HF hospital admissions. The framework combines exact enumeration with optimization-based methods, including the Nearest Instance Counterfactual Explanations (NICE) and Multi-Objective Counterfactuals (MOC) algorithms, to efficiently explore high-dimensional intervention spaces. Applied to more than 2700 individuals with confirmed SARS-CoV-2 infection and prior HF, the model achieved strong discriminative performance (AUROC: 0.88, 95% CI: 0.84-0.91) and generated interpretable, patient-specific counterfactuals that quantify how modifying comorbidity patterns or treatment factors could alter predicted outcomes. This work demonstrates how counterfactual reasoning can be formalized as an optimization problem over predictive functions, offering a rigorous, interpretable, and computationally efficient approach to personalized inference in complex biomedical systems.
☆ MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts-especially in distributed settings-remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to a 6x higher training throughput while preserving model accuracy. Our code is available at https://github.com/microsoft/MInference/tree/main/MTraining.
☆ Actor-Free Continuous Control via Structurally Maximizable Q-Functions NeurIPS 2025
Value-based algorithms are a cornerstone of off-policy reinforcement learning due to their simplicity and training stability. However, their use has traditionally been restricted to discrete action spaces, as they rely on estimating Q-values for individual state-action pairs. In continuous action spaces, evaluating the Q-value over the entire action space becomes computationally infeasible. To address this, actor-critic methods are typically employed, where a critic is trained on off-policy data to estimate Q-values, and an actor is trained to maximize the critic's output. Despite their popularity, these methods often suffer from instability during training. In this work, we propose a purely value-based framework for continuous control that revisits structural maximization of Q-functions, introducing a set of key architectural and algorithmic choices to enable efficient and stable learning. We evaluate the proposed actor-free Q-learning approach on a range of standard simulation tasks, demonstrating performance and sample efficiency on par with state-of-the-art baselines, without the cost of learning a separate actor. Particularly, in environments with constrained action spaces, where the value functions are typically non-smooth, our method with structural maximization outperforms traditional actor-critic methods with gradient-based maximization. We have released our code at https://github.com/USC-Lira/Q3C.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ SO(3)-invariant PCA with application to molecular data
Principal component analysis (PCA) is a fundamental technique for dimensionality reduction and denoising; however, its application to three-dimensional data with arbitrary orientations -- common in structural biology -- presents significant challenges. A naive approach requires augmenting the dataset with many rotated copies of each sample, incurring prohibitive computational costs. In this paper, we extend PCA to 3D volumetric datasets with unknown orientations by developing an efficient and principled framework for SO(3)-invariant PCA that implicitly accounts for all rotations without explicit data augmentation. By exploiting underlying algebraic structure, we demonstrate that the computation involves only the square root of the total number of covariance entries, resulting in a substantial reduction in complexity. We validate the method on real-world molecular datasets, demonstrating its effectiveness and opening up new possibilities for large-scale, high-dimensional reconstruction problems.
☆ BO4Mob: Bayesian Optimization Benchmarks for High-Dimensional Urban Mobility Problem
We introduce \textbf{BO4Mob}, a new benchmark framework for high-dimensional Bayesian Optimization (BO), driven by the challenge of origin-destination (OD) travel demand estimation in large urban road networks. Estimating OD travel demand from limited traffic sensor data is a difficult inverse optimization problem, particularly in real-world, large-scale transportation networks. This problem involves optimizing over high-dimensional continuous spaces where each objective evaluation is computationally expensive, stochastic, and non-differentiable. BO4Mob comprises five scenarios based on real-world San Jose, CA road networks, with input dimensions scaling up to 10,100. These scenarios utilize high-resolution, open-source traffic simulations that incorporate realistic nonlinear and stochastic dynamics. We demonstrate the benchmark's utility by evaluating five optimization methods: three state-of-the-art BO algorithms and two non-BO baselines. This benchmark is designed to support both the development of scalable optimization algorithms and their application for the design of data-driven urban mobility models, including high-resolution digital twins of metropolitan road networks. Code and documentation are available at https://github.com/UMN-Choi-Lab/BO4Mob.
☆ Search Self-play: Pushing the Frontier of Agent Capability without Supervision
Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR highly depends on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires massive human efforts and hinders the RL scaling processes, especially under agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of generated agentic tasks can hardly be controlled to provide effective RL training advantages. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM utilizes multi-turn search engine calling and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output the correct answer predictions. To ensure that each generated search query has accurate ground truth, we collect all the searching results from the proposer's trajectory as external knowledge, then conduct retrieval-augmentation generation (RAG) to test whether the proposed query can be correctly answered with all necessary search documents provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. With substantial experimental results, we find that SSP can significantly improve search agents' performance uniformly on various benchmarks without any supervision under both from-scratch and continuous RL training setups. The code is at https://github.com/Alibaba-Quark/SSP.
☆ Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards
We present a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning. In this paradigm, the model generates its own responses and is immediately finetuned on this self-generated data. OSFT is a highly efficient training strategy for LLM reasoning, as it is reward-free and uses just one rollout by default. Experiment results show that OSFT achieves downstream performance on challenging mathematical reasoning tasks comparable to strong reinforcement learning with verifiable rewards (RLVR) methods such as GRPO. Our ablation study further demonstrates the efficiency and robustness of OSFT. The major mechanism of OSFT lies in facilitating the model's own existing preference (latent knowledge) learned from pretraining, which leads to reasoning ability improvement. We believe that OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms. Our code is available at https://github.com/ElementQi/OnlineSFT.
☆ A Unified Perspective on Optimization in Machine Learning and Neuroscience: From Gradient Descent to Neural Adaptation
Iterative optimization is central to modern artificial intelligence (AI) and provides a crucial framework for understanding adaptive systems. This review provides a unified perspective on this subject, bridging classic theory with neural network training and biological learning. Although gradient-based methods, powered by the efficient but biologically implausible backpropagation (BP), dominate machine learning, their computational demands can hinder scalability in high-dimensional settings. In contrast, derivative-free or zeroth-order (ZO) optimization feature computationally lighter approaches that rely only on function evaluations and randomness. While generally less sample efficient, recent breakthroughs demonstrate that modern ZO methods can effectively approximate gradients and achieve performance competitive with BP in neural network models. This ZO paradigm is also particularly relevant for biology. Its core principles of random exploration (probing) and feedback-guided adaptation (reinforcing) parallel key mechanisms of biological learning, offering a mathematically principled perspective on how the brain learns. In this review, we begin by categorizing optimization approaches based on the order of derivative information they utilize, ranging from first-, second-, and higher-order gradient-based to ZO methods. We then explore how these methods are adapted to the unique challenges of neural network training and the resulting learning dynamics. Finally, we build upon these insights to view biological learning through an optimization lens, arguing that a ZO paradigm leverages the brain's intrinsic noise as a computational resource. This framework not only illuminates our understanding of natural intelligence but also holds vast implications for neuromorphic hardware, helping us design fast and energy-efficient AI systems that exploit intrinsic hardware noise.
☆ When LRP Diverges from Leave-One-Out in Transformers EMNLP 2025
Leave-One-Out (LOO) provides an intuitive measure of feature importance but is computationally prohibitive. While Layer-Wise Relevance Propagation (LRP) offers a potentially efficient alternative, its axiomatic soundness in modern Transformers remains largely under-examined. In this work, we first show that the bilinear propagation rules used in recent advances of AttnLRP violate the implementation invariance axiom. We prove this analytically and confirm it empirically in linear attention layers. Second, we also revisit CP-LRP as a diagnostic baseline and find that bypassing relevance propagation through the softmax layer -- backpropagating relevance only through the value matrices -- significantly improves alignment with LOO, particularly in middle-to-late Transformer layers. Overall, our results suggest that (i) bilinear factorization sensitivity and (ii) softmax propagation error potentially jointly undermine LRP's ability to approximate LOO in Transformers.
comment: BlackboxNLP @ EMNLP 2025
☆ On Biologically Plausible Learning in Continuous Time
Biological learning unfolds continuously in time, yet most algorithmic models rely on discrete updates and separate inference and learning phases. We study a continuous-time neural model that unifies several biologically plausible learning algorithms and removes the need for phase separation. Rules including stochastic gradient descent (SGD), feedback alignment (FA), direct feedback alignment (DFA), and Kolen-Pollack (KP) emerge naturally as limiting cases of the dynamics. Simulations show that these continuous-time networks stably learn at biological timescales, even under temporal mismatches and integration noise. Through analysis and simulation, we show that learning depends on temporal overlap: a synapse updates correctly only when its input and the corresponding error signal coincide in time. When inputs are held constant, learning strength declines linearly as the delay between input and error approaches the stimulus duration, explaining observed robustness and failure across network depths. Critically, robust learning requires the synaptic plasticity timescale to exceed the stimulus duration by one to two orders of magnitude. For typical cortical stimuli (tens of milliseconds), this places the functional plasticity window in the few-second range, a testable prediction that identifies seconds-scale eligibility traces as necessary for error-driven learning in biological circuits.
☆ Decoding Funded Research: Comparative Analysis of Topic Models and Uncovering the Effect of Gender and Geographic Location
Optimizing national scientific investment requires a clear understanding of evolving research trends and the demographic and geographical forces shaping them, particularly in light of commitments to equity, diversity, and inclusion. This study addresses this need by analyzing 18 years (2005-2022) of research proposals funded by the Natural Sciences and Engineering Research Council of Canada (NSERC). We conducted a comprehensive comparative evaluation of three topic modelling approaches: Latent Dirichlet Allocation (LDA), Structural Topic Modelling (STM), and BERTopic. We also introduced a novel algorithm, named COFFEE, designed to enable robust covariate effect estimation for BERTopic. This advancement addresses a significant gap, as BERTopic lacks a native function for covariate analysis, unlike the probabilistic STM. Our findings highlight that while all models effectively delineate core scientific domains, BERTopic outperformed by consistently identifying more granular, coherent, and emergent themes, such as the rapid expansion of artificial intelligence. Additionally, the covariate analysis, powered by COFFEE, confirmed distinct provincial research specializations and revealed consistent gender-based thematic patterns across various scientific disciplines. These insights offer a robust empirical foundation for funding organizations to formulate more equitable and impactful funding strategies, thereby enhancing the effectiveness of the scientific ecosystem.
comment: 35 pages
☆ Stick-Breaking Embedded Topic Model with Continuous Optimal Transport for Online Analysis of Document Streams
Online topic models are unsupervised algorithms to identify latent topics in data streams that continuously evolve over time. Although these methods naturally align with real-world scenarios, they have received considerably less attention from the community compared to their offline counterparts, due to specific additional challenges. To tackle these issues, we present SB-SETM, an innovative model extending the Embedded Topic Model (ETM) to process data streams by merging models formed on successive partial document batches. To this end, SB-SETM (i) leverages a truncated stick-breaking construction for the topic-per-document distribution, enabling the model to automatically infer from the data the appropriate number of active topics at each timestep; and (ii) introduces a merging strategy for topic embeddings based on a continuous formulation of optimal transport adapted to the high dimensionality of the latent topic space. Numerical experiments show SB-SETM outperforming baselines on simulated scenarios. We extensively test it on a real-world corpus of news articles covering the Russian-Ukrainian war throughout 2022-2023.
comment: Under review
☆ CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training
Despite significant work on low-bit quantization-aware training (QAT), there is still a large accuracy gap between such techniques and native training. To address this, we introduce CAGE (Curvature-Aware Gradient Estimation), a new QAT method that augments the straight-through estimator (STE) gradient with a curvature-aware correction designed to counteract the loss increase induced by quantization. CAGE is derived from a multi-objective view of QAT that balances loss minimization with adherence to quantization constraints, yielding a principled correction term that depends on local curvature information. On the theoretical side, we introduce the notion of Pareto-optimal solutions for quantized optimization, and establish that CAGE yields strong convergence guarantees in the smooth non-convex setting. In terms of implementation, our approach is optimizer-agnostic, but we provide a highly-efficient implementation that leverages Adam statistics. When pre-training Llama-style models of up to 800M-parameters, CAGE recovers over 10% of the quantization-induced loss increase in the W4A4 regime over outlier-mitigation methods. These results indicate that curvature-aware gradient corrections can bridge the remaining performance gap beyond current outlier-handling methods.
☆ Enhancing Fractional Gradient Descent with Learned Optimizers
Fractional Gradient Descent (FGD) offers a novel and promising way to accelerate optimization by incorporating fractional calculus into machine learning. Although FGD has shown encouraging initial results across various optimization tasks, it faces significant challenges with convergence behavior and hyperparameter selection. Moreover, the impact of its hyperparameters is not fully understood, and scheduling them is particularly difficult in non-convex settings such as neural network training. To address these issues, we propose a novel approach called Learning to Optimize Caputo Fractional Gradient Descent (L2O-CFGD), which meta-learns how to dynamically tune the hyperparameters of Caputo FGD (CFGD). Our method's meta-learned schedule outperforms CFGD with static hyperparameters found through an extensive search and, in some tasks, achieves performance comparable to a fully black-box meta-learned optimizer. L2O-CFGD can thus serve as a powerful tool for researchers to identify high-performing hyperparameters and gain insights on how to leverage the history-dependence of the fractional differential in optimization.
☆ A Frequentist Statistical Introduction to Variational Inference, Autoencoders, and Diffusion Models
While Variational Inference (VI) is central to modern generative models like Variational Autoencoders (VAEs) and Denoising Diffusion Models (DDMs), its pedagogical treatment is split across disciplines. In statistics, VI is typically framed as a Bayesian method for posterior approximation. In machine learning, however, VAEs and DDMs are developed from a Frequentist viewpoint, where VI is used to approximate a maximum likelihood estimator. This creates a barrier for statisticians, as the principles behind VAEs and DDMs are hard to contextualize without a corresponding Frequentist introduction to VI. This paper provides that introduction: we explain the theory for VI, VAEs, and DDMs from a purely Frequentist perspective, starting with the classical Expectation-Maximization (EM) algorithm. We show how VI arises as a scalable solution for intractable E-steps and how VAEs and DDMs are natural, deep-learning-based extensions of this framework, thereby bridging the gap between classical statistical inference and modern generative AI.
comment: This is an introduction paper. 28 pages, 2 figures
☆ Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference
Causal inference is essential for developing and evaluating medical interventions, yet real-world medical datasets are often difficult to access due to regulatory barriers. This makes synthetic data a potentially valuable asset that enables these medical analyses, along with the development of new inference methods themselves. Generative models can produce synthetic data that closely approximate real data distributions, yet existing methods do not consider the unique challenges that downstream causal inference tasks, and specifically those focused on treatments, pose. We establish a set of desiderata that synthetic data containing treatments should satisfy to maximise downstream utility: preservation of (i) the covariate distribution, (ii) the treatment assignment mechanism, and (iii) the outcome generation mechanism. Based on these desiderata, we propose a set of evaluation metrics to assess such synthetic data. Finally, we present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine that mimics the data-generating process of data containing treatments and optimises for our desiderata. We empirically demonstrate that STEAM achieves state-of-the-art performance across our metrics as compared to existing generative models, particularly as the complexity of the true data-generating process increases.
☆ Symbolic Emulators for Cosmology: Accelerating Cosmological Analyses Without Sacrificing Precision
In cosmology, emulators play a crucial role by providing fast and accurate predictions of complex physical models, enabling efficient exploration of high-dimensional parameter spaces that would be computationally prohibitive with direct numerical simulations. Symbolic emulators have emerged as promising alternatives to numerical approaches, delivering comparable accuracy with significantly faster evaluation times. While previous symbolic emulators were limited to relatively narrow prior ranges, we expand these to cover the parameter space relevant for current cosmological analyses. We introduce approximations to hypergeometric functions used for the $\Lambda$CDM comoving distance and linear growth factor which are accurate to better than 0.001% and 0.05%, respectively, for all redshifts and for $\Omega_{\rm m} \in [0.1, 0.5]$. We show that integrating symbolic emulators into a Dark Energy Survey-like $3\times2$pt analysis produces cosmological constraints consistent with those obtained using standard numerical methods. Our symbolic emulators offer substantial improvements in speed and memory usage, demonstrating their practical potential for scalable, likelihood-based inference.
comment: 22 pages, 6 figures. Invited contribution for the Royal Society Philosophical Transactions A special issue "Symbolic regression in the physical sciences"
☆ Diffusion Buffer for Online Generative Speech Enhancement
Online Speech Enhancement was mainly reserved for predictive models. A key advantage of these models is that for an incoming signal frame from a stream of data, the model is called only once for enhancement. In contrast, generative Speech Enhancement models often require multiple calls, resulting in a computational complexity that is too high for many online speech enhancement applications. This work presents the Diffusion Buffer, a generative diffusion-based Speech Enhancement model which only requires one neural network call per incoming signal frame from a stream of data and performs enhancement in an online fashion on a consumer-grade GPU. The key idea of the Diffusion Buffer is to align physical time with Diffusion time-steps. The approach progressively denoises frames through physical time, where past frames have more noise removed. Consequently, an enhanced frame is output to the listener with a delay defined by the Diffusion Buffer, and the output frame has a corresponding look-ahead. In this work, we extend upon our previous work by carefully designing a 2D convolutional UNet architecture that specifically aligns with the Diffusion Buffer's look-ahead. We observe that the proposed UNet improves performance, particularly when the algorithmic latency is low. Moreover, we show that using a Data Prediction loss instead of Denoising Score Matching loss enables flexible control over the trade-off between algorithmic latency and quality during inference. The extended Diffusion Buffer equipped with a novel NN and loss function drastically reduces the algorithmic latency from 320 - 960 ms to 32 - 176 ms with an even increased performance. While it has been shown before that offline generative diffusion models outperform predictive approaches in unseen noisy speech data, we confirm that the online Diffusion Buffer also outperforms its predictive counterpart on unseen noisy speech data.
☆ Verifiable Accuracy and Abstention Rewards in Curriculum RL to Alleviate Lost-in-Conversation
Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by the current progress on Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers, but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (in terms of instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building multi-turn reliable and trustworthy LLMs.
☆ Adapting Language Balance in Code-Switching Speech ICASSP 2026
Despite achieving impressive results on standard benchmarks, large foundational models still struggle against code-switching test cases. When data scarcity cannot be used as the usual justification for poor performance, the reason may lie in the infrequent occurrence of code-switched moments, where the embedding of the second language appears subtly. Instead of expecting the models to learn this infrequency on their own, it might be beneficial to provide the training process with labels. Evaluating model performance on code-switching data requires careful localization of code-switching points where recognition errors are most consequential, so that the analysis emphasizes mistakes occurring at those moments. Building on this observation, we leverage the difference between the embedded and the main language to highlight those code-switching points and thereby emphasize learning at those locations. This simple yet effective differentiable surrogate mitigates context bias during generation -- the central challenge in code-switching -- thereby improving the model's robustness. Our experiments with Arabic and Chinese-English showed that the models are able to predict the switching places more correctly, reflected by the reduced substitution error.
comment: Submitted to ICASSP 2026
☆ Bayesian Low-Rank Factorization for Robust Model Adaptation ICASSP 2026
Large speech foundation models achieve strong performance across many domains, but they often require adaptation to handle local needs such as code-switching, where speakers mix languages within the same utterance. Direct fine-tuning of these models risks overfitting to the target domain and overwriting the broad capabilities of the base model. To address this challenge, we explore Bayesian factorized adapters for speech foundation models, which place priors near zero to achieve sparser adaptation matrices and thereby retain general performance while adapting to specific domains. We apply our approach to the Whisper model and evaluate on different multilingual code-switching scenarios. Our results show only minimal adaptation loss while significantly reducing catastrophic forgetting of the base model. Compared to LoRA, our method achieves a backward gain of 54% with only a 4% drop on the new domain. These findings highlight the effectiveness of Bayesian adaptation for fine-tuning speech foundation models without sacrificing generalization.
comment: Submitted to ICASSP 2026
☆ Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options NeurIPS 2025
We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. While a growing body of theoretical work has emerged-motivated by PbRL's recent empirical success, particularly in aligning large language models (LLMs)-most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023, Mukherjee et al., 2024, Thekumparampil et al., 2024) have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve-and can even deteriorate-as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of $\tilde{\mathcal{O}}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$, where $T$ is the total number of rounds, $d$ is the feature dimension, and $|S_t|$ is the size of the subset at round $t$. This result shows that larger subsets directly lead to improved performance and, notably, the bound avoids the exponential dependence on the unknown parameter's norm, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of $\Omega \left( \frac{d}{K \sqrt{T}} \right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.
comment: Accepted at NeurIPS 2025
☆ Reinforcement Learning with Imperfect Transition Predictions: A Bellman-Jensen Approach
Traditional reinforcement learning (RL) assumes the agents make decisions based on Markov decision processes (MDPs) with one-step transition models. In many real-world applications, such as energy management and stock investment, agents can access multi-step predictions of future states, which provide additional advantages for decision making. However, multi-step predictions are inherently high-dimensional: naively embedding these predictions into an MDP leads to an exponential blow-up in state space and the curse of dimensionality. Moreover, existing RL theory provides few tools to analyze prediction-augmented MDPs, as it typically works on one-step transition kernels and cannot accommodate multi-step predictions with errors or partial action-coverage. We address these challenges with three key innovations: First, we propose the \emph{Bayesian value function} to characterize the optimal prediction-aware policy tractably. Second, we develop a novel \emph{Bellman-Jensen Gap} analysis on the Bayesian value function, which enables characterizing the value of imperfect predictions. Third, we introduce BOLA (Bayesian Offline Learning with Online Adaptation), a two-stage model-based RL algorithm that separates offline Bayesian value learning from lightweight online adaptation to real-time predictions. We prove that BOLA remains sample-efficient even under imperfect predictions. We validate our theory and algorithm on synthetic MDPs and a real-world wind energy storage control problem.
☆ Learning Task-Agnostic Representations through Multi-Teacher Distillation NeurIPS-2025
Casting complex inputs into tractable representations is a critical step across various fields. Diverse embedding models emerge from differences in architectures, loss functions, input modalities and datasets, each capturing unique aspects of the input. Multi-teacher distillation leverages this diversity to enrich representations but often remains tailored to specific tasks. In this paper, we introduce a task-agnostic framework based on a ``majority vote" objective function. We demonstrate that this function is bounded by the mutual information between student and teachers' embeddings, leading to a task-agnostic distillation loss that eliminates dependence on task-specific labels or prior knowledge. Our evaluations across text, vision models, and molecular modeling show that our method effectively leverages teacher diversity, resulting in representations enabling better performance for a wide range of downstream tasks such as classification, clustering, or regression. Additionally, we train and release state-of-the-art embedding models, enhancing downstream performance in various modalities.
comment: NeurIPS-2025
☆ Reasoning Language Model Inference Serving Unveiled: An Empirical Study
The reasoning large language model (RLLM) has been proven competitive in solving complex reasoning tasks such as mathematics, coding, compared to general LLM. However, the serving performance and behavior of RLLM remains unexplored, which may undermine the deployment and utilization of RLLM in real-world scenario. To close this gap, in this paper, we conduct a comprehensive study of RLLM service. We first perform a pilot study on comparing the serving performance between RLLM and traditional LLM and reveal that there are several distinct differences regarding serving behavior: (1) significant memory usage and fluctuations; (2) straggler requests; (3) adaptive running time; (4) domain preference. Then we further investigate whether existing inference optimization techniques are valid for RLLM. Our main takeaways are that model quantization methods and speculative decoding can improve service system efficiency with small compromise to RLLM accuracy, while prefix caching, KV cache quantization may even degrade accuracy or serving performance for small RLLM. Lastly, we conduct evaluation under real world workload modeled by Gamma distribution to verify our findings. Empirical results of real world workload evaluation across different dataset are aligned with our main findings regarding RLLM serving. We hope our work can provide the research community and industry with insights to advance RLLM inference serving.
☆ Prototyping an End-to-End Multi-Modal Tiny-CNN for Cardiovascular Sensor Patches
The vast majority of cardiovascular diseases may be preventable if early signs and risk factors are detected. Cardiovascular monitoring with body-worn sensor devices like sensor patches allows for the detection of such signs while preserving the freedom and comfort of patients. However, the analysis of the sensor data must be robust, reliable, efficient, and highly accurate. Deep learning methods can automate data interpretation, reducing the workload of clinicians. In this work, we analyze the feasibility of applying deep learning models to the classification of synchronized electrocardiogram (ECG) and phonocardiogram (PCG) recordings on resource-constrained medical edge devices. We propose a convolutional neural network with early fusion of data to solve a binary classification problem. We train and validate our model on the synchronized ECG and PCG recordings from the Physionet Challenge 2016 dataset. Our approach reduces memory footprint and compute cost by three orders of magnitude compared to the state-of-the-art while maintaining competitive accuracy. We demonstrate the applicability of our proposed model on medical edge devices by analyzing energy consumption on a microcontroller and an experimental sensor device setup, confirming that on-device inference can be more energy-efficient than continuous data streaming.
comment: Submitted to the IEEE Journal of Biomedical And Health Informatics
☆ Differentially Private E-Values
E-values have gained prominence as flexible tools for statistical inference and risk control, enabling anytime- and post-hoc-valid procedures under minimal assumptions. However, many real-world applications fundamentally rely on sensitive data, which can be leaked through e-values. To ensure their safe release, we propose a general framework to transform non-private e-values into differentially private ones. Towards this end, we develop a novel biased multiplicative noise mechanism that ensures our e-values remain statistically valid. We show that our differentially private e-values attain strong statistical power, and are asymptotically as powerful as their non-private counterparts. Experiments across online risk monitoring, private healthcare, and conformal e-prediction demonstrate our approach's effectiveness and illustrate its broad applicability.
☆ Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression NeurIPS 2025
This paper proposes a novel matrix quantization method, Binary Quadratic Quantization (BQQ). In contrast to conventional first-order quantization approaches, such as uniform quantization and binary coding quantization, that approximate real-valued matrices via linear combinations of binary bases, BQQ leverages the expressive power of binary quadratic expressions while maintaining an extremely compact data format. We validate our approach with two experiments: a matrix compression benchmark and post-training quantization (PTQ) on pretrained Vision Transformer-based models. Experimental results demonstrate that BQQ consistently achieves a superior trade-off between memory efficiency and reconstruction error than conventional methods for compressing diverse matrix data. It also delivers strong PTQ performance, even though we neither target state-of-the-art PTQ accuracy under tight memory constraints nor rely on PTQ-specific binary matrix optimization. For example, our proposed method outperforms the state-of-the-art PTQ method by up to 2.2\% and 59.1% on the ImageNet dataset under the calibration-based and data-free scenarios, respectively, with quantization equivalent to 2 bits. These findings highlight the surprising effectiveness of binary quadratic expressions for efficient matrix approximation and neural network compression.
comment: Accepted to NeurIPS 2025
☆ Learning Time-Varying Turn-Taking Behavior in Group Conversations
We propose a flexible probabilistic model for predicting turn-taking patterns in group conversations based solely on individual characteristics and past speaking behavior. Many models of conversation dynamics cannot yield insights that generalize beyond a single group. Moreover, past works often aim to characterize speaking behavior through a universal formulation that may not be suitable for all groups. We thus develop a generalization of prior conversation models that predicts speaking turns among individuals in any group based on their individual characteristics, that is, personality traits, and prior speaking behavior. Importantly, our approach provides the novel ability to learn how speaking inclination varies based on when individuals last spoke. We apply our model to synthetic and real-world conversation data to verify the proposed approach and characterize real group interactions. Our results demonstrate that previous behavioral models may not always be realistic, motivating our data-driven yet theoretically grounded approach.
☆ Informed Learning for Estimating Drought Stress at Fine-Scale Resolution Enables Accurate Yield Prediction
Water is essential for agricultural productivity. Assessing water shortages and reduced yield potential is a critical factor in decision-making for ensuring agricultural productivity and food security. Crop simulation models, which align with physical processes, offer intrinsic explainability but often perform poorly. Conversely, machine learning models for crop yield modeling are powerful and scalable, yet they commonly operate as black boxes and lack adherence to the physical principles of crop growth. This study bridges this gap by coupling the advantages of both worlds. We postulate that the crop yield is inherently defined by the water availability. Therefore, we formulate crop yield as a function of temporal water scarcity and predict both the crop drought stress and the sensitivity to water scarcity at fine-scale resolution. Sequentially modeling the crop yield response to water enables accurate yield prediction. To enforce physical consistency, a novel physics-informed loss function is proposed. We leverage multispectral satellite imagery, meteorological data, and fine-scale yield data. Further, to account for the uncertainty within the model, we build upon a deep ensemble approach. Our method surpasses state-of-the-art models like LSTM and Transformers in crop yield prediction with a coefficient of determination ($R^2$-score) of up to 0.82 while offering high explainability. This method offers decision support for industry, policymakers, and farmers in building a more resilient agriculture in times of changing climate conditions.
☆ Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions NeurIPS 2025
Transformer architectures can solve unseen tasks based on input-output pairs in a given prompt due to in-context learning (ICL). Existing theoretical studies on ICL have mainly focused on linear regression tasks, often with i.i.d. inputs. To understand how transformers express ICL when modeling dynamics-driven functions, we investigate Markovian function learning through a structured ICL setup, where we characterize the loss landscape to reveal underlying optimization behaviors. Specifically, we (1) provide the closed-form expression of the global minimizer (in an enlarged parameter space) for a single-layer linear self-attention (LSA) model; (2) prove that recovering transformer parameters that realize the optimal solution is NP-hard in general, revealing a fundamental limitation of one-layer LSA in representing structured dynamical functions; and (3) supply a novel interpretation of a multilayer LSA as performing preconditioned gradient descent to optimize multiple objectives beyond the square loss. These theoretical results are numerically validated using simplified transformers.
comment: NeurIPS 2025
☆ ε-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data
Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce {\epsilon}-Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center-region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse (0.05% of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster wrt. the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose a MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of {\epsilon}-Seg and baseline methods on 2 dense EM datasets of biological tissues and demonstrate the applicability of our method also on fluorescence microscopy data. Our results show that {\epsilon}-Seg is capable of achieving competitive sparsely-supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available.
comment: 10 pages main text, 17 pages total
☆ C-SWAP: Explainability-Aware Structured Pruning for Efficient Neural Networks Compression BMVC2025
Neural network compression has gained increasing attention in recent years, particularly in computer vision applications, where the need for model reduction is crucial for overcoming deployment constraints. Pruning is a widely used technique that prompts sparsity in model structures, e.g. weights, neurons, and layers, reducing size and inference costs. Structured pruning is especially important as it allows for the removal of entire structures, which further accelerates inference time and reduces memory overhead. However, it can be computationally expensive, requiring iterative retraining and optimization. To overcome this problem, recent methods considered one-shot setting, which applies pruning directly at post-training. Unfortunately, they often lead to a considerable drop in performance. In this paper, we focus on this issue by proposing a novel one-shot pruning framework that relies on explainable deep learning. First, we introduce a causal-aware pruning approach that leverages cause-effect relations between model predictions and structures in a progressive pruning process. It allows us to efficiently reduce the size of the network, ensuring that the removed structures do not deter the performance of the model. Then, through experiments conducted on convolution neural network and vision transformer baselines, pre-trained on classification tasks, we demonstrate that our method consistently achieves substantial reductions in model size, with minimal impact on performance, and without the need for fine-tuning. Overall, our approach outperforms its counterparts, offering the best trade-off. Our code is available on GitHub.
comment: 10 pages, BMVC2025
☆ Hardness of Learning Regular Languages in the Next Symbol Prediction Setting
We study the learnability of languages in the Next Symbol Prediction (NSP) setting, where a learner receives only positive examples from a language together with, for every prefix, (i) whether the prefix itself is in the language and (ii) which next symbols can lead to an accepting string. This setting has been used in prior works to empirically analyze neural sequence models, and additionally, we observe that efficient algorithms for the NSP setting can be used to learn the (truncated) support of language models. We formalize the setting so as to make it amenable to PAC-learning analysis. While the setting provides a much richer set of labels than the conventional classification setting, we show that learning concept classes such as DFAs and Boolean formulas remains computationally hard. The proof is via a construction that makes almost all additional labels uninformative, yielding a reduction from the conventional learning problem to learning with NSP labels. Under cryptographic assumptions, the reduction implies that the problem of learning DFAs is computationally hard in the NSP setting.
comment: 7 pages
☆ A Rectification-Based Approach for Distilling Boosted Trees into Decision Trees
We present a new approach for distilling boosted trees into decision trees, in the objective of generating an ML model offering an acceptable compromise in terms of predictive performance and interpretability. We explain how the correction approach called rectification can be used to implement such a distillation process. We show empirically that this approach provides interesting results, in comparison with an approach to distillation achieved by retraining the model.
comment: 29 pages
☆ Unrolled-SINDy: A Stable Explicit Method for Non linear PDE Discovery from Sparsely Sampled Data
Identifying from observation data the governing differential equations of a physical dynamics is a key challenge in machine learning. Although approaches based on SINDy have shown great promise in this area, they still fail to address a whole class of real world problems where the data is sparsely sampled in time. In this article, we introduce Unrolled-SINDy, a simple methodology that leverages an unrolling scheme to improve the stability of explicit methods for PDE discovery. By decorrelating the numerical time step size from the sampling rate of the available data, our approach enables the recovery of equation parameters that would not be the minimizers of the original SINDy optimization problem due to large local truncation errors. Our method can be exploited either through an iterative closed-form approach or by a gradient descent scheme. Experiments show the versatility of our method. On both traditional SINDy and state-of-the-art noise-robust iNeuralSINDy, with different numerical schemes (Euler, RK4), our proposed unrolling scheme allows to tackle problems not accessible to non-unrolled methods.
comment: 56 pages, 12 figures, 39 tables
☆ A Compositional Paradigm for Foundation Models: Towards Smarter Robotic Agents
The birth of Foundation Models brought unprecedented results in a wide range of tasks, from language to vision, to robotic control. These models are able to process huge quantities of data, and can extract and develop rich representations, which can be employed across different domains and modalities. However, they still have issues in adapting to dynamic, real-world scenarios without retraining the entire model from scratch. In this work, we propose the application of Continual Learning and Compositionality principles to foster the development of more flexible, efficient and smart AI solutions.
☆ Channel-Aware Vector Quantization for Robust Semantic Communication on Discrete Channels
Deep learning-based semantic communication has largely relied on analog or semi-digital transmission, which limits compatibility with modern digital communication infrastructures. Recent studies have employed vector quantization (VQ) to enable discrete semantic transmission, yet existing methods neglect channel state information during codebook optimization, leading to suboptimal robustness. To bridge this gap, we propose a channel-aware vector quantization (CAVQ) algorithm within a joint source-channel coding (JSCC) framework, termed VQJSCC, established on a discrete memoryless channel. In this framework, semantic features are discretized and directly mapped to modulation constellation symbols, while CAVQ integrates channel transition probabilities into the quantization process, aligning easily confused symbols with semantically similar codewords. A multi-codebook alignment mechanism is further introduced to handle mismatches between codebook order and modulation order by decomposing the transmission stream into multiple independently optimized subchannels. Experimental results demonstrate that VQJSCC effectively mitigates the digital cliff effect, achieves superior reconstruction quality across various modulation schemes, and outperforms state-of-the-art digital semantic communication baselines in both robustness and efficiency.
comment: 12 pages, 8 figures
☆ Robustness Verification of Graph Neural Networks Via Lightweight Satisfiability Testing
Graph neural networks (GNNs) are the predominant architecture for learning over graphs. As with any machine learning model, and important issue is the detection of adversarial attacks, where an adversary can change the output with a small perturbation of the input. Techniques for solving the adversarial robustness problem - determining whether such an attack exists - were originally developed for image classification, but there are variants for many other machine learning architectures. In the case of graph learning, the attack model usually considers changes to the graph structure in addition to or instead of the numerical features of the input, and the state of the art techniques in the area proceed via reduction to constraint solving, working on top of powerful solvers, e.g. for mixed integer programming. We show that it is possible to improve on the state of the art in structural robustness by replacing the use of powerful solvers by calls to efficient partial solvers, which run in polynomial time but may be incomplete. We evaluate our tool RobLight on a diverse set of GNN variants and datasets.
☆ CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder NeurIPS 2025
Multimodal dataset distillation aims to synthesize a small set of image-text pairs that enables efficient training of large-scale vision-language models. While dataset distillation has shown promise in unimodal tasks, extending it to multimodal contrastive learning presents key challenges: learning cross-modal alignment and managing the high computational cost of large encoders. Prior approaches address scalability by freezing the text encoder and update only the image encoder and text projection layer. However, we find this severely limits semantic alignment and becomes a bottleneck for performance scaling. We propose CovMatch, a scalable dataset distillation framework that aligns the cross-covariance of real and synthetic features while regularizing feature distributions within each modality. Unlike prior approaches, CovMatch enables joint optimization of both encoders, leading to stronger cross-modal alignment and improved performance. Evaluated on Flickr30K and COCO, CovMatch outperforms state-of-the-art multimodal distillation methods and achieves up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic pairs.
comment: NeurIPS 2025
☆ HeFS: Helper-Enhanced Feature Selection via Pareto-Optimized Genetic Search
Feature selection is a combinatorial optimization problem that is NP-hard. Conventional approaches often employ heuristic or greedy strategies, which are prone to premature convergence and may fail to capture subtle yet informative features. This limitation becomes especially critical in high-dimensional datasets, where complex and interdependent feature relationships prevail. We introduce the HeFS (Helper-Enhanced Feature Selection) framework to refine feature subsets produced by existing algorithms. HeFS systematically searches the residual feature space to identify a Helper Set - features that complement the original subset and improve classification performance. The approach employs a biased initialization scheme and a ratio-guided mutation mechanism within a genetic algorithm, coupled with Pareto-based multi-objective optimization to jointly maximize predictive accuracy and feature complementarity. Experiments on 18 benchmark datasets demonstrate that HeFS consistently identifies overlooked yet informative features and achieves superior performance over state-of-the-art methods, including in challenging domains such as gastric cancer classification, drug toxicity prediction, and computer science applications. The code and datasets are available at https://healthinformaticslab.org/supp/.
☆ A Multi-Evidence Framework Rescues Low-Power Prognostic Signals and Rejects Statistical Artifacts in Cancer Genomics
Motivation: Standard genome-wide association studies in cancer genomics rely on statistical significance with multiple testing correction, but systematically fail in underpowered cohorts. In TCGA breast cancer (n=967, 133 deaths), low event rates (13.8%) create severe power limitations, producing false negatives for known drivers and false positives for large passenger genes. Results: We developed a five-criteria computational framework integrating causal inference (inverse probability weighting, doubly robust estimation) with orthogonal biological validation (expression, mutation patterns, literature evidence). Applied to TCGA-BRCA mortality analysis, standard Cox+FDR detected zero genes at FDR<0.05, confirming complete failure in underpowered settings. Our framework correctly identified RYR2 -- a cardiac gene with no cancer function -- as a false positive despite nominal significance (p=0.024), while identifying KMT2C as a complex candidate requiring validation despite marginal significance (p=0.047, q=0.954). Power analysis revealed median power of 15.1% across genes, with KMT2C achieving only 29.8% power (HR=1.55), explaining borderline statistical significance despite strong biological evidence. The framework distinguished true signals from artifacts through mutation pattern analysis: RYR2 showed 29.8% silent mutations (passenger signature) with no hotspots, while KMT2C showed 6.7% silent mutations with 31.4% truncating variants (driver signature). This multi-evidence approach provides a template for analyzing underpowered cohorts, prioritizing biological interpretability over purely statistical significance. Availability: All code and analysis pipelines available at github.com/akarlaraytu/causal-inference-for-cancer-genomics
comment: 17 pages (main text), 4 figures (main text), 7 supplementary figures, 4 supplementary tables. Focuses on a computational framework using causal inference and biological validation for underpowered cancer genomic studies
☆ RAISE: A Unified Framework for Responsible AI Scoring and Evaluation
As AI systems enter high-stakes domains, evaluation must extend beyond predictive accuracy to include explainability, fairness, robustness, and sustainability. We introduce RAISE (Responsible AI Scoring and Evaluation), a unified framework that quantifies model performance across these four dimensions and aggregates them into a single, holistic Responsibility Score. We evaluated three deep learning models: a Multilayer Perceptron (MLP), a Tabular ResNet, and a Feature Tokenizer Transformer, on structured datasets from finance, healthcare, and socioeconomics. Our findings reveal critical trade-offs: the MLP demonstrated strong sustainability and robustness, the Transformer excelled in explainability and fairness at a very high environmental cost, and the Tabular ResNet offered a balanced profile. These results underscore that no single model dominates across all responsibility criteria, highlighting the necessity of multi-dimensional evaluation for responsible model selection. Our implementation is available at: https://github.com/raise-framework/raise.
comment: Accepted at the 26th International Conference on Principles and Practice of Multi-Agent Systems
☆ Interval Prediction of Annual Average Daily Traffic on Local Roads via Quantile Random Forest with High-Dimensional Spatial Data
Accurate annual average daily traffic (AADT) data are vital for transport planning and infrastructure management. However, automatic traffic detectors across national road networks often provide incomplete coverage, leading to underrepresentation of minor roads. While recent machine learning advances have improved AADT estimation at unmeasured locations, most models produce only point predictions and overlook estimation uncertainty. This study addresses that gap by introducing an interval prediction approach that explicitly quantifies predictive uncertainty. We integrate a Quantile Random Forest model with Principal Component Analysis to generate AADT prediction intervals, providing plausible traffic ranges bounded by estimated minima and maxima. Using data from over 2,000 minor roads in England and Wales, and evaluated with specialized interval metrics, the proposed method achieves an interval coverage probability of 88.22%, a normalized average width of 0.23, and a Winkler Score of 7,468.47. By combining machine learning with spatial and high-dimensional analysis, this framework enhances both the accuracy and interpretability of AADT estimation, supporting more robust and informed transport planning.
☆ Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation
LLMs are often used by downstream users as teacher models for knowledge distillation, compressing their capabilities into memory-efficient models. However, as these teacher models may stem from untrusted parties, distillation can raise unexpected security risks. In this paper, we investigate the security implications of knowledge distillation from backdoored teacher models. First, we show that prior backdoors mostly do not transfer onto student models. Our key insight is that this is because existing LLM backdooring methods choose trigger tokens that rarely occur in usual contexts. We argue that this underestimates the security risks of knowledge distillation and introduce a new backdooring technique, T-MTB, that enables the construction and study of transferable backdoors. T-MTB carefully constructs a composite backdoor trigger, made up of several specific tokens that often occur individually in anticipated distillation datasets. As such, the poisoned teacher remains stealthy, while during distillation the individual presence of these tokens provides enough signal for the backdoor to transfer onto the student. Using T-MTB, we demonstrate and extensively study the security risks of transferable backdoors across two attack scenarios, jailbreaking and content modulation, and across four model families of LLMs.
☆ Counterfactual Reasoning for Steerable Pluralistic Value Alignment of Large Language Models
As large language models (LLMs) become increasingly integrated into applications serving users across diverse cultures, communities and demographics, it is critical to align LLMs with pluralistic human values beyond average principles (e.g., HHH). In psychological and social value theories such as Schwartz's Value Theory, pluralistic values are represented by multiple value dimensions paired with various priorities. However, existing methods encounter two challenges when aligning with such fine-grained value objectives: 1) they often treat multiple values as independent and equally important, ignoring their interdependence and relative priorities (value complexity); 2) they struggle to precisely control nuanced value priorities, especially those underrepresented ones (value steerability). To handle these challenges, we propose COUPLE, a COUnterfactual reasoning framework for PLuralistic valuE alignment. It introduces a structural causal model (SCM) to feature complex interdependency and prioritization among features, as well as the causal relationship between high-level value dimensions and behaviors. Moreover, it applies counterfactual reasoning to generate outputs aligned with any desired value objectives. Benefitting from explicit causal modeling, COUPLE also provides better interpretability. We evaluate COUPLE on two datasets with different value systems and demonstrate that COUPLE advances other baselines across diverse types of value objectives.
comment: 41 pages, 7 figures
☆ Partial VOROS: A Cost-aware Performance Metric for Binary Classifiers with Precision and Capacity Constraints
The ROC curve is widely used to assess binary classification performance. Yet for some applications such as alert systems for hospitalized patient monitoring, conventional ROC analysis cannot capture crucial factors that impact deployment, such as enforcing a minimum precision constraint to avoid false alarm fatigue or imposing an upper bound on the number of predicted positives to represent the capacity of hospital staff. The usual area under the curve metric also does not reflect asymmetric costs for false positives and false negatives. In this paper we address all three of these issues. First, we show how the subset of classifiers that meet given precision and capacity constraints can be represented as a feasible region in ROC space. We establish the geometry of this feasible region. We then define the partial area of lesser classifiers, a performance metric that is monotonic with cost and only accounts for the feasible portion of ROC space. Averaging this area over a desired range of cost parameters results in the partial volume over the ROC surface, or partial VOROS. In experiments predicting mortality risk using vital sign history on the MIMIC-IV dataset, we show this cost-aware metric is better than alternatives for ranking classifiers in hospital alert applications.
☆ Decoding Dynamic Visual Experience from Calcium Imaging via Cell-Pattern-Aware SSL
Self-supervised learning (SSL) holds a great deal of promise for applications in neuroscience, due to the lack of large-scale, consistently labeled neural datasets. However, most neural datasets contain heterogeneous populations that mix stable, predictable cells with highly stochastic, stimulus-contingent ones, which has made it hard to identify consistent activity patterns during SSL. As a result, self-supervised pretraining has yet to show clear signs of benefits from scale on neural data. Here, we present a novel approach to self-supervised pretraining, POYO-SSL that exploits the heterogeneity of neural data to improve pre-training and achieve benefits of scale. Specifically, in POYO-SSL we pretrain only on predictable (statistically regular) neurons-identified on the pretraining split via simple higher-order statistics (skewness and kurtosis)-then we fine-tune on the unpredictable population for downstream tasks. On the Allen Brain Observatory dataset, this strategy yields approximately 12-13% relative gains over from-scratch training and exhibits smooth, monotonic scaling with model size. In contrast, existing state-of-the-art baselines plateau or destabilize as model size increases. By making predictability an explicit metric for crafting the data diet, POYO-SSL turns heterogeneity from a liability into an asset, providing a robust, biologically grounded recipe for scalable neural decoding and a path toward foundation models of neural dynamics.
☆ Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
comment: Accepted by The 38th Conference of Open Innovations Association FRUCT, 2025
☆ Alibaba International E-commerce Product Search Competition DILAB Team Technical Report CIKM
This study presents the multilingual e-commerce search system developed by the DILAB team, which achieved 5th place on the final leaderboard with a competitive overall score of 0.8819, demonstrating stable and high-performing results across evaluation metrics. To address challenges in multilingual query-item understanding, we designed a multi-stage pipeline integrating data refinement, lightweight preprocessing, and adaptive modeling. The data refinement stage enhanced dataset consistency and category coverage, while language tagging and noise filtering improved input quality. In the modeling phase, multiple architectures and fine-tuning strategies were explored, and hyperparameters optimized using curated validation sets to balance performance across query-category (QC) and query-item (QI) tasks. The proposed framework exhibited robustness and adaptability across languages and domains, highlighting the effectiveness of systematic data curation and iterative evaluation for multilingual search systems. The source code is available at https://github.com/2noweyh/DILAB-Alibaba-Ecommerce-Search.
comment: CIKM Alibaba E-commerce Search Challenge 2025
☆ Learning to Navigate Under Imperfect Perception: Conformalised Segmentation for Safe Reinforcement Learning
Reliable navigation in safety-critical environments requires both accurate hazard perception and principled uncertainty handling to strengthen downstream safety handling. Despite the effectiveness of existing approaches, they assume perfect hazard detection capabilities, while uncertainty-aware perception approaches lack finite-sample guarantees. We present COPPOL, a conformal-driven perception-to-policy learning approach that integrates distribution-free, finite-sample safety guarantees into semantic segmentation, yielding calibrated hazard maps with rigorous bounds for missed detections. These maps induce risk-aware cost fields for downstream RL planning. Across two satellite-derived benchmarks, COPPOL increases hazard coverage (up to 6x) compared to comparative baselines, achieving near-complete detection of unsafe regions while reducing hazardous violations during navigation (up to approx 50%). More importantly, our approach remains robust to distributional shift, preserving both safety and efficiency.
☆ Safe But Not Sorry: Reducing Over-Conservatism in Safety Critics via Uncertainty-Aware Modulation
Ensuring the safe exploration of reinforcement learning (RL) agents is critical for deployment in real-world systems. Yet existing approaches struggle to strike the right balance: methods that tightly enforce safety often cripple task performance, while those that prioritize reward leave safety constraints frequently violated, producing diffuse cost landscapes that flatten gradients and stall policy improvement. We introduce the Uncertain Safety Critic (USC), a novel approach that integrates uncertainty-aware modulation and refinement into critic training. By concentrating conservatism in uncertain and costly regions while preserving sharp gradients in safe areas, USC enables policies to achieve effective reward-safety trade-offs. Extensive experiments show that USC reduces safety violations by approximately 40% while maintaining competitive or higher rewards, and reduces the error between predicted and true cost gradients by approximately 83%, breaking the prevailing trade-off between safety and performance and paving the way for scalable safe RL.
☆ Benchmarking Fairness-aware Graph Neural Networks in Knowledge Graphs
Graph neural networks (GNNs) are powerful tools for learning from graph-structured data but often produce biased predictions with respect to sensitive attributes. Fairness-aware GNNs have been actively studied for mitigating biased predictions. However, no prior studies have evaluated fairness-aware GNNs on knowledge graphs, which are one of the most important graphs in many applications, such as recommender systems. Therefore, we introduce a benchmarking study on knowledge graphs. We generate new graphs from three knowledge graphs, YAGO, DBpedia, and Wikidata, that are significantly larger than the existing graph datasets used in fairness studies. We benchmark inprocessing and preprocessing methods in different GNN backbones and early stopping conditions. We find several key insights: (i) knowledge graphs show different trends from existing datasets; clearer trade-offs between prediction accuracy and fairness metrics than other graphs in fairness-aware GNNs, (ii) the performance is largely affected by not only fairness-aware GNN methods but also GNN backbones and early stopping conditions, and (iii) preprocessing methods often improve fairness metrics, while inprocessing methods improve prediction accuracy.
☆ Simple and Efficient Heterogeneous Temporal Graph Neural Network
Heterogeneous temporal graphs (HTGs) are ubiquitous data structures in the real world. Recently, to enhance representation learning on HTGs, numerous attention-based neural networks have been proposed. Despite these successes, existing methods rely on a decoupled temporal and spatial learning paradigm, which weakens interactions of spatio-temporal information and leads to a high model complexity. To bridge this gap, we propose a novel learning paradigm for HTGs called Simple and Efficient Heterogeneous Temporal Graph N}eural Network (SE-HTGNN). Specifically, we innovatively integrate temporal modeling into spatial learning via a novel dynamic attention mechanism, which retains attention information from historical graph snapshots to guide subsequent attention computation, thereby improving the overall discriminative representations learning of HTGs. Additionally, to comprehensively and adaptively understand HTGs, we leverage large language models to prompt SE-HTGNN, enabling the model to capture the implicit properties of node types as prior knowledge. Extensive experiments demonstrate that SE-HTGNN achieves up to 10x speed-up over the state-of-the-art and latest baseline while maintaining the best forecasting accuracy.
comment: Accepted by Neurips 2025
☆ Learning Boltzmann Generators via Constrained Mass Transport
Efficient sampling from high-dimensional and multimodal unnormalized probability distributions is a central challenge in many areas of science and machine learning. We focus on Boltzmann generators (BGs) that aim to sample the Boltzmann distribution of physical systems, such as molecules, at a given temperature. Classical variational approaches that minimize the reverse Kullback-Leibler divergence are prone to mode collapse, while annealing-based methods, commonly using geometric schedules, can suffer from mass teleportation and rely heavily on schedule tuning. We introduce Constrained Mass Transport (CMT), a variational framework that generates intermediate distributions under constraints on both the KL divergence and the entropy decay between successive steps. These constraints enhance distributional overlap, mitigate mass teleportation, and counteract premature convergence. Across standard BG benchmarks and the here introduced ELIL tetrapeptide, the largest system studied to date without access to samples from molecular dynamics, CMT consistently surpasses state-of-the-art variational methods, achieving more than 2.5x higher effective sample size while avoiding mode collapse.
☆ Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the proposed SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
comment: Code and models available at: https://github.com/tianciB/VFM-VAE
☆ A machine learning approach to automation and uncertainty evaluation for self-validating thermocouples SC
Thermocouples are in widespread use in industry, but they are particularly susceptible to calibration drift in harsh environments. Self-validating thermocouples aim to address this issue by using a miniature phase-change cell (fixed-point) in close proximity to the measurement junction (tip) of the thermocouple. The fixed point is a crucible containing an ingot of metal with a known melting temperature. When the process temperature being monitored passes through the melting temperature of the ingot, the thermocouple output exhibits a "plateau" during melting. Since the melting temperature of the ingot is known, the thermocouple can be recalibrated in situ. Identifying the melting plateau to determine the onset of melting is reasonably well established but requires manual intervention involving zooming in on the region around the actual melting temperature, a process which can depend on the shape of the melting plateau. For the first time, we present a novel machine learning approach to recognize and identify the characteristic shape of the melting plateau and once identified, to quantity the point at which melting begins, along with its associated uncertainty. This removes the need for human intervention in locating and characterizing the melting point. Results from test data provided by CCPI Europe show 100% accuracy of melting plateau detection. They also show a cross-validated R2 of 0.99 on predictions of calibration drift.
comment: 6 pages, 7 figures. TEMPERATURE: ITS MEASUREMENT AND CONTROL IN SCIENCE AND INDUSTRY, VOLUME 9: Proceedings of the Tenth International Temperature Symposium 3-7 April 2023 Anaheim, USA
☆ Provable Generalization Bounds for Deep Neural Networks with Adaptive Regularization
Deep neural networks (DNNs) achieve remarkable performance but often suffer from overfitting due to their high capacity. We introduce Momentum-Adaptive Gradient Dropout (MAGDrop), a novel regularization method that dynamically adjusts dropout rates on activations based on current gradients and accumulated momentum, enhancing stability in non-convex optimization landscapes. To theoretically justify MAGDrop's effectiveness, we derive a tightened PAC-Bayes generalization bound that accounts for its adaptive nature, achieving up to 20% sharper bounds compared to standard approaches by leveraging momentum-driven perturbation control. Empirically, the activation-based MAGDrop outperforms baseline regularization techniques, including standard dropout and adaptive gradient regularization, by 1-2% in test accuracy on MNIST (99.52%) and CIFAR-10 (90.63%), with generalization gaps of 0.48% and 7.14%, respectively. Our work bridges theoretical insights and practical advancements, offering a robust framework for enhancing DNN generalization suitable for high-stakes applications.
comment: 8 pages
☆ Learning from N-Tuple Data with M Positive Instances: Unbiased Risk Estimation and Theoretical Guarantees
Weakly supervised learning often operates with coarse aggregate signals rather than instance labels. We study a setting where each training example is an $n$-tuple containing exactly m positives, while only the count m per tuple is observed. This NTMP (N-tuple with M positives) supervision arises in, e.g., image classification with region proposals and multi-instance measurements. We show that tuple counts admit a trainable unbiased risk estimator (URE) by linking the tuple-generation process to latent instance marginals. Starting from fixed (n,m), we derive a closed-form URE and extend it to variable tuple sizes, variable counts, and their combination. Identification holds whenever the effective mixing rate is separated from the class prior. We establish generalization bounds via Rademacher complexity and prove statistical consistency with standard rates under mild regularity assumptions. To improve finite-sample stability, we introduce simple ReLU corrections to the URE that preserve asymptotic correctness. Across benchmarks converted to NTMP tasks, the approach consistently outperforms representative weak-supervision baselines and yields favorable precision-recall and F1 trade-offs. It remains robust under class-prior imbalance and across diverse tuple configurations, demonstrating that count-only supervision can be exploited effectively through a theoretically grounded and practically stable objective.
☆ Approximation Rates of Shallow Neural Networks: Barron Spaces, Activation Functions and Optimality Analysis
This paper investigates the approximation properties of shallow neural networks with activation functions that are powers of exponential functions. It focuses on the dependence of the approximation rate on the dimension and the smoothness of the function being approximated within the Barron function space. We examine the approximation rates of ReLU$^{k}$ activation functions, proving that the optimal rate cannot be achieved under $\ell^{1}$-bounded coefficients or insufficient smoothness conditions. We also establish optimal approximation rates in various norms for functions in Barron spaces and Sobolev spaces, confirming the curse of dimensionality. Our results clarify the limits of shallow neural networks' approximation capabilities and offer insights into the selection of activation functions and network structures.
☆ S2AP: Score-space Sharpness Minimization for Adversarial Pruning
Adversarial pruning methods have emerged as a powerful tool for compressing neural networks while preserving robustness against adversarial attacks. These methods typically follow a three-step pipeline: (i) pretrain a robust model, (ii) select a binary mask for weight pruning, and (iii) finetune the pruned model. To select the binary mask, these methods minimize a robust loss by assigning an importance score to each weight, and then keep the weights with the highest scores. However, this score-space optimization can lead to sharp local minima in the robust loss landscape and, in turn, to an unstable mask selection, reducing the robustness of adversarial pruning methods. To overcome this issue, we propose a novel plug-in method for adversarial pruning, termed Score-space Sharpness-aware Adversarial Pruning (S2AP). Through our method, we introduce the concept of score-space sharpness minimization, which operates during the mask search by perturbing importance scores and minimizing the corresponding robust loss. Extensive experiments across various datasets, models, and sparsity levels demonstrate that S2AP effectively minimizes sharpness in score space, stabilizing the mask selection, and ultimately improving the robustness of adversarial pruning methods.
♻ ☆ Wonder Wins Ways: Curiosity-Driven Exploration through Multi-Agent Contextual Calibration
Autonomous exploration in complex multi-agent reinforcement learning (MARL) with sparse rewards critically depends on providing agents with effective intrinsic motivation. While artificial curiosity offers a powerful self-supervised signal, it often confuses environmental stochasticity with meaningful novelty. Moreover, existing curiosity mechanisms exhibit a uniform novelty bias, treating all unexpected observations equally. However, peer behavior novelty, which encode latent task dynamics, are often overlooked, resulting in suboptimal exploration in decentralized, communication-free MARL settings. To this end, inspired by how human children adaptively calibrate their own exploratory behaviors via observing peers, we propose a novel approach to enhance multi-agent exploration. We introduce CERMIC, a principled framework that empowers agents to robustly filter noisy surprise signals and guide exploration by dynamically calibrating their intrinsic curiosity with inferred multi-agent context. Additionally, CERMIC generates theoretically-grounded intrinsic rewards, encouraging agents to explore state transitions with high information gain. We evaluate CERMIC on benchmark suites including VMAS, Meltingpot, and SMACv2. Empirical results demonstrate that exploration with CERMIC significantly outperforms SoTA algorithms in sparse-reward environments.
♻ ☆ High-Fidelity And Complex Test Data Generation For Google SQL Code Generation Services
The demand for high-fidelity test data is paramount in industrial settings where access to production data is largely restricted. Traditional data generation methods often fall short, struggling with low-fidelity and the ability to model complex data structures and semantic relationships that are critical for testing complex SQL code generation services like Natural Language to SQL (NL2SQL). In this paper, we address the critical need for generating syntactically correct and semantically relevant high-fidelity mock data for complex data structures that includes columns with nested structures that we frequently encounter in Google workloads. We highlight the limitations of existing approaches used in production, particularly their inability to handle large and complex data structures, as well as the lack of semantically coherent test data that lead to limited test coverage. We demonstrate that by leveraging Large Language Models (LLMs) and incorporating strategic pre- and post-processing steps, we can generate syntactically correct and semantically relevant high-fidelity test data that adheres to complex structural constraints and maintains semantic integrity to the SQL test targets (queries/functions). This approach supports comprehensive testing of complex SQL queries involving joins, aggregations, and even deeply nested subqueries, ensuring robust evaluation of SQL code generation services, like NL2SQL and SQL Code Assistant. Our results demonstrate the practical utility of an LLM (\textit{gemini}) based test data generation for industrial SQL code generation services where generating high-fidelity test data is essential due to the frequent unavailability and inaccessibility of production datasets for testing.
♻ ☆ One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares
While large machine learning models have shown remarkable performance in various domains, their training typically requires iterating for many passes over the training data. However, due to computational and memory constraints and potential privacy concerns, storing and accessing all the data is impractical in many real-world scenarios where the data arrives in a stream. In this paper, we investigate the problem of one-pass learning, in which a model is trained on sequentially arriving data without retraining on previous datapoints. Motivated by the demonstrated effectiveness of overparameterized models and the phenomenon of benign overfitting, we propose Orthogonal Recursive Fitting (ORFit), an algorithm for one-pass learning which seeks to perfectly fit each new datapoint while minimally altering the predictions on previous datapoints. ORFit updates the parameters in a direction orthogonal to past gradients, similar to orthogonal gradient descent (OGD) in continual learning. We show that, interestingly, ORFit's update leads to an operation similar to the recursive least-squares (RLS) algorithm in adaptive filtering but with significantly improved memory and computational efficiency, i.e., linear, instead of quadratic, in the number of parameters. To further reduce memory usage, we leverage the structure of the streaming data via an incremental principal component analysis (IPCA). We show that using the principal components is minimax optimal, i.e., it minimizes the worst-case forgetting of previous predictions for unknown future updates. Further, we prove that, for overparameterized linear models, the parameter vector obtained by ORFit matches what the standard multi-pass stochastic gradient descent (SGD) would converge to. Finally, we extend our results to the nonlinear setting for highly overparameterized models, relevant for deep learning.
comment: Journal extension of v1: Y. Min, K, Ahn, N. Azizan, "One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares," IEEE Conference on Decision and Control, 2022
♻ ☆ How Transformers Learn In-Context Recall Tasks? Optimality, Training Dynamics and Generalization
We study the approximation capabilities, convergence speeds and on-convergence behaviors of transformers trained on in-context recall tasks -- which requires to recognize the \emph{positional} association between a pair of tokens from in-context examples. Existing theoretical results only focus on the in-context reasoning behavior of transformers after being trained for the \emph{one} gradient descent step. It remains unclear what is the on-convergence behavior of transformers being trained by gradient descent and how fast the convergence rate is. In addition, the generalization of transformers in one-step in-context reasoning has not been formally investigated. This work addresses these gaps. We first show that a class of transformers with either linear, ReLU or softmax attentions, is provably Bayes-optimal for an in-context recall task. When being trained with gradient descent, we show via a finite-sample analysis that the expected loss converges at linear rate to the Bayes risks. Moreover, we show that the trained transformers exhibit out-of-distribution (OOD) generalization, i.e., generalizing to samples outside of the population distribution. Our theoretical findings are further supported by extensive empirical validations, showing that \emph{without} proper parameterization, models with larger expressive power surprisingly \emph{fail} to generalize OOD after being trained by gradient descent.
comment: V3: added new results for softmax attention, typos fixed, titled changed. 33 pages
♻ ☆ Nondeterminism-Aware Optimistic Verification for Floating-Point Neural Networks
Neural networks increasingly run on hardware outside the user's control (cloud GPUs, inference marketplaces). Yet ML-as-a-Service reveals little about what actually ran or whether returned outputs faithfully reflect the intended inputs. Users lack recourse against service downgrades (model swaps, quantization, graph rewrites, or discrepancies like altered ad embeddings). Verifying outputs is hard because floating-point(FP) execution on heterogeneous accelerators is inherently nondeterministic. Existing approaches are either impractical for real FP neural networks or reintroduce vendor trust. We present NAO: a Nondeterministic tolerance Aware Optimistic verification protocol that accepts outputs within principled operator-level acceptance regions rather than requiring bitwise equality. NAO combines two error models: (i) sound per-operator IEEE-754 worst-case bounds and (ii) tight empirical percentile profiles calibrated across hardware. Discrepancies trigger a Merkle-anchored, threshold-guided dispute game that recursively partitions the computation graph until one operator remains, where adjudication reduces to a lightweight theoretical-bound check or a small honest-majority vote against empirical thresholds. Unchallenged results finalize after a challenge window, without requiring trusted hardware or deterministic kernels. We implement NAO as a PyTorch-compatible runtime and a contract layer currently deployed on Ethereum Holesky testnet. The runtime instruments graphs, computes per-operator bounds, and runs unmodified vendor kernels in FP32 with negligible overhead (0.3% on Qwen3-8B). Across CNNs, Transformers and diffusion models on A100, H100, RTX6000, RTX4090, empirical thresholds are $10^2-10^3$ times tighter than theoretical bounds, and bound-aware adversarial attacks achieve 0% success. NAO reconciles scalability with verifiability for real-world heterogeneous ML compute.
comment: 17 pages, 7 figures
♻ ☆ Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods
While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding-particularly in the context of structured geometric data-remains unexplored. This paper initiates a theoretical study of ICL for regression of H\"older functions on manifolds. We establish a novel connection between the attention mechanism and classical kernel methods, demonstrating that transformers effectively perform kernel-based prediction at a new query through its interaction with the prompt. This connection is validated by numerical experiments, revealing that the learned query-prompt scores for H\"older functions are highly correlated with the Gaussian kernel. Building on this insight, we derive generalization error bounds in terms of the prompt length and the number of training tasks. When a sufficient number of training tasks are observed, transformers give rise to the minimax regression rate of H\"older functions on manifolds, which scales exponentially with the intrinsic dimension of the manifold, rather than the ambient space dimension. Our result also characterizes how the generalization error scales with the number of training tasks, shedding light on the complexity of transformers as in-context kernel algorithm learners. Our findings provide foundational insights into the role of geometry in ICL and novels tools to study ICL of nonlinear models.
♻ ☆ TeLLMe v2: An Efficient End-to-End Ternary LLM Prefill and Decode Accelerator with Table-Lookup Matmul on Edge FPGAs
With the emergence of wearable devices and other embedded systems, deploying large language models (LLMs) on edge platforms has become an urgent need. However, this is challenging because of their high computational and memory demands. Although recent low-bit quantization methods (e.g., BitNet, DeepSeek) compress weights to as low as 1.58~bits with minimal accuracy loss, edge deployment is still constrained by limited on-chip resources, power budgets, and the often-neglected long latency of the prefill stage. We present \textbf{TeLLMe}, the first table-lookup-based ternary LLM accelerator for low-power edge FPGAs that fully supports both prefill and autoregressive decoding using 1.58-bit weights and 8-bit activations. TeLLMe incorporates several novel techniques, including (1) a table-lookup-based ternary matrix multiplication (TLMM) engine utilizing grouped activations and online precomputation for low resource utilization and high throughput; (2) a fine-grained analytic URAM-based weight buffer management scheme for efficient loading and compute engine access; (3) a streaming dataflow architecture that fuses floating-point element-wise operations with linear computations to hide latency; (4) a reversed-reordered prefill stage attention with fused attention operations for high memory efficiency; and (5) a resource-efficient specialized decoding stage attention. Under a 5~W power budget, TeLLMe delivers up to 25~tokens/s decoding throughput and 0.45--0.96~s time-to-first-token (TTFT) for 64--128 token prompts, marking a significant energy-efficiency advancement in LLM inference on edge FPGAs.
♻ ☆ Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers
Reinforcement learning (RL) has emerged as a crucial approach for enhancing the capabilities of large language models. However, in Mixture-of-Experts (MoE) models, the routing mechanism often introduces instability, even leading to catastrophic RL training collapse. We analyze the training-inference consistency of MoE models and identify a notable discrepancy in routing behaviors between the two phases. Moreover, even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes. To address this foundational inconsistency, we propose Rollout Routing Replay (R3), a method that records routing distributions from the inference engine and replays them during training. R3 significantly reduces training-inference policy KL divergence and mitigates extreme discrepancies without compromising training speed. Extensive experiments on various settings confirm that R3 succeeds in stabilizing RL training, preventing collapse and outperforming methods such as GSPO and TIS. We believe this work can offer a new solution for stabilizing RL in MoE models.
♻ ☆ Glyph: Scaling Context Windows via Visual-Text Compression
Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective-visual context scaling-to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
♻ ☆ Stochastic Path Planning in Correlated Obstacle Fields
We introduce the Stochastic Correlated Obstacle Scene (SCOS) problem, a navigation setting with spatially correlated obstacles of uncertain blockage status, realistically constrained sensors that provide noisy readings and costly disambiguation. Modeling the spatial correlation with Gaussian Random Field (GRF), we develop Bayesian belief updates that refine blockage probabilities, and use the posteriors to reduce search space for efficiency. To find the optimal traversal policy, we propose a novel two-stage learning framework. An offline phase learns a robust base policy via optimistic policy iteration augmented with information bonus to encourage exploration in informative regions, followed by an online rollout policy with periodic base updates via a Bayesian mechanism for information adaptation. This framework supports both Monte Carlo point estimation and distributional reinforcement learning (RL) to learn full cost distributions, leading to stronger uncertainty quantification. We establish theoretical benefits of correlation-aware updating and convergence property under posterior sampling. Comprehensive empirical evaluations across varying obstacle densities, sensor capabilities demonstrate consistent performance gains over baselines. This framework addresses navigation challenges in environments with adversarial interruptions or clustered natural hazards.
♻ ☆ Inverse Q-Learning Done Right: Offline Imitation Learning in $Q^π$-Realizable MDPs
We study the problem of offline imitation learning in Markov decision processes (MDPs), where the goal is to learn a well-performing policy given a dataset of state-action pairs generated by an expert policy. Complementing a recent line of work on this topic that assumes the expert belongs to a tractable class of known policies, we approach this problem from a new angle and leverage a different type of structural assumption about the environment. Specifically, for the class of linear $Q^\pi$-realizable MDPs, we introduce a new algorithm called saddle-point offline imitation learning (\SPOIL), which is guaranteed to match the performance of any expert up to an additive error $\varepsilon$ with access to $\mathcal{O}(\varepsilon^{-2})$ samples. Moreover, we extend this result to possibly non-linear $Q^\pi$-realizable MDPs at the cost of a worse sample complexity of order $\mathcal{O}(\varepsilon^{-4})$. Finally, our analysis suggests a new loss function for training critic networks from expert data in deep imitation learning. Empirical evaluations on standard benchmarks demonstrate that the neural net implementation of \SPOIL is superior to behavior cloning and competitive with state-of-the-art algorithms.
♻ ☆ Viability of perturbative expansion for quantum field theories on neurons
Neural Network (NN) architectures that break statistical independence of parameters have been proposed as a new approach for simulating local quantum field theories (QFTs). In the infinite neuron number limit, single-layer NNs can exactly reproduce QFT results. This paper examines the viability of this architecture for perturbative calculations of local QFTs for finite neuron number $N$ using scalar $\phi^4$ theory in $d$ Euclidean dimensions as an example. We find that the renormalized $O(1/N)$ corrections to two- and four-point correlators yield perturbative series which are sensitive to the ultraviolet cut-off and therefore have a weak convergence. We propose a modification to the architecture to improve this convergence and discuss constraints on the parameters of the theory and the scaling of N which allow us to extract accurate field theory results.
comment: Updated references and minor changes to the text
♻ ☆ Is Implicit Knowledge Enough for LLMs? A RAG Approach for Tree-based Structures
Large Language Models (LLMs) are adept at generating responses based on information within their context. While this ability is useful for interacting with structured data like code files, another popular method, Retrieval-Augmented Generation (RAG), retrieves relevant documents to augment the model's in-context learning. However, it is not well-explored how to best represent this retrieved knowledge for generating responses on structured data, particularly hierarchical structures like trees. In this work, we propose a novel bottom-up method to linearize knowledge from tree-like structures (like a GitHub repository) by generating implicit, aggregated summaries at each hierarchical level. This approach enables the knowledge to be stored in a knowledge base and used directly with RAG. We then compare our method to using RAG on raw, unstructured code, evaluating the accuracy and quality of the generated responses. Our results show that while response quality is comparable across both methods, our approach generates over 68% fewer documents in the retriever, a significant gain in efficiency. This finding suggests that leveraging implicit, linearized knowledge may be a highly effective and scalable strategy for handling complex, hierarchical data structures.
comment: Waiting for Conference Response
♻ ☆ In-Context Learning of Stochastic Differential Equations with Foundation Inference Models NeurIPS 2025
Stochastic differential equations (SDEs) describe dynamical systems where deterministic flows, governed by a drift function, are superimposed with random fluctuations, dictated by a diffusion function. The accurate estimation (or discovery) of these functions from data is a central problem in machine learning, with wide application across the natural and social sciences. Yet current solutions either rely heavily on prior knowledge of the dynamics or involve intricate training procedures. We introduce FIM-SDE (Foundation Inference Model for SDEs), a pretrained recognition model that delivers accurate in-context (or zero-shot) estimation of the drift and diffusion functions of low-dimensional SDEs, from noisy time series data, and allows rapid finetuning to target datasets. Leveraging concepts from amortized inference and neural operators, we (pre)train FIM-SDE in a supervised fashion to map a large set of noisy, discretely observed SDE paths onto the space of drift and diffusion functions. We demonstrate that FIM-SDE achieves robust in-context function estimation across a wide range of synthetic and real-world processes -- from canonical SDE systems (e.g., double-well dynamics or weakly perturbed Lorenz attractors) to stock price recordings and oil-price and wind-speed fluctuations -- while matching the performance of symbolic, Gaussian process and Neural SDE baselines trained on the target datasets. When finetuned to the target processes, we show that FIM-SDE consistently outperforms all these baselines.
comment: Accepted at NeurIPS 2025. The previous version appeared under the title "Foundation Inference Models for Stochastic Differential Equations: A Transformer-based Approach for Zero-shot Function Estimation."
♻ ☆ Counterfactual reasoning: an analysis of in-context emergence NeurIPS
Large-scale neural language models exhibit remarkable performance in in-context learning: the ability to learn and reason about the input context on the fly. This work studies in-context counterfactual reasoning in language models, that is, the ability to predict consequences of a hypothetical scenario. We focus on a well-defined, synthetic linear regression task that requires noise abduction. Accurate prediction is based on (1) inferring an unobserved latent concept and (2) copying contextual noise from factual observations. We show that language models are capable of counterfactual reasoning. Further, we enhance existing identifiability results and reduce counterfactual reasoning for a broad class of functions to a transformation on in-context observations. In Transformers, we find that self-attention, model depth and pre-training data diversity drive performance. Moreover, we provide mechanistic evidence that the latent concept is linearly represented in the residual stream and we introduce designated \textit{noise abduction heads} central to performing counterfactual reasoning. Lastly, our findings extend to counterfactual reasoning under SDE dynamics and reflect that Transformers can perform noise abduction on sequential data, providing preliminary evidence on the potential for counterfactual story generation. Our code is available under https://github.com/mrtzmllr/iccr.
comment: Published as a conference paper at the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS) 2025
♻ ☆ Interpretable Decision-Making for End-to-End Autonomous Driving ICCV 2025
Trustworthy AI is mandatory for the broad deployment of autonomous vehicles. Although end-to-end approaches derive control commands directly from raw data, interpreting these decisions remains challenging, especially in complex urban scenarios. This is mainly attributed to very deep neural networks with non-linear decision boundaries, making it challenging to grasp the logic behind AI-driven decisions. This paper presents a method to enhance interpretability while optimizing control commands in autonomous driving. To address this, we propose loss functions that promote the interpretability of our model by generating sparse and localized feature maps. The feature activations allow us to explain which image regions contribute to the predicted control command. We conduct comprehensive ablation studies on the feature extraction step and validate our method on the CARLA benchmarks. We also demonstrate that our approach improves interpretability, which correlates with reducing infractions, yielding a safer, high-performance driving model. Notably, our monocular, non-ensemble model surpasses the top-performing approaches from the CARLA Leaderboard by achieving lower infraction scores and the highest route completion rate, all while ensuring interpretability.
comment: Accepted to the ICCV 2025 2nd Workshop on the Challenge Of Out-of-Label Hazards in Autonomous Driving (2COOOL)
♻ ☆ Who cuts emissions, who turns up the heat? causal machine learning estimates of energy efficiency interventions
Reducing domestic energy demand is central to climate mitigation and fuel poverty strategies, yet the impact of energy efficiency interventions is highly heterogeneous. Using a causal machine learning model trained on nationally representative data of the English housing stock, we estimate average and conditional treatment effects of wall insulation on gas consumption, focusing on distributional effects across energy burden subgroups. While interventions reduce gas demand on average (by as much as 19 percent), low energy burden groups achieve substantial savings, whereas those experiencing high energy burdens see little to no reduction. This pattern reflects a behaviourally-driven mechanism: households constrained by high costs-to-income ratios (e.g. more than 0.1) reallocate savings toward improved thermal comfort rather than lowering consumption. Far from wasteful, such responses represent rational adjustments in contexts of prior deprivation, with potential co-benefits for health and well-being. These findings call for a broader evaluation framework that accounts for both climate impacts and the equity implications of domestic energy policy.
♻ ☆ Inductive Domain Transfer In Misspecified Simulation-Based Inference
Simulation-based inference (SBI) is a statistical inference approach for estimating latent parameters of a physical system when the likelihood is intractable but simulations are available. In practice, SBI is often hindered by model misspecification--the mismatch between simulated and real-world observations caused by inherent modeling simplifications. RoPE, a recent SBI approach, addresses this challenge through a two-stage domain transfer process that combines semi-supervised calibration with optimal transport (OT)-based distribution alignment. However, RoPE operates in a fully transductive setting, requiring access to a batch of test samples at inference time, which limits scalability and generalization. We propose here a fully inductive and amortized SBI framework that integrates calibration and distributional alignment into a single, end-to-end trainable model. Our method leverages mini-batch OT with a closed-form coupling to align real and simulated observations that correspond to the same latent parameters, using both paired calibration data and unpaired samples. A conditional normalizing flow is then trained to approximate the OT-induced posterior, enabling efficient inference without simulation access at test time. Across a range of synthetic and real-world benchmarks--including complex medical biomarker estimation--our approach matches or surpasses the performance of RoPE, as well as other standard SBI and non-SBI estimators, while offering improved scalability and applicability in challenging, misspecified environments.
♻ ☆ A unified framework for establishing the universal approximation of transformer-type architectures NeurIPS 2025
We investigate the universal approximation property (UAP) of transformer-type architectures, providing a unified theoretical framework that extends prior results on residual networks to models incorporating attention mechanisms. Our work identifies token distinguishability as a fundamental requirement for UAP and introduces a general sufficient condition that applies to a broad class of architectures. Leveraging an analyticity assumption on the attention layer, we can significantly simplify the verification of this condition, providing a non-constructive approach in establishing UAP for such architectures. We demonstrate the applicability of our framework by proving UAP for transformers with various attention mechanisms, including kernel-based and sparse attention mechanisms. The corollaries of our results either generalize prior works or establish UAP for architectures not previously covered. Furthermore, our framework offers a principled foundation for designing novel transformer architectures with inherent UAP guarantees, including those with specific functional symmetries. We propose examples to illustrate these insights.
comment: NeurIPS 2025 camera-ready
♻ ☆ Understanding Reinforcement Learning for Model Training, and future directions with GRAPE
This paper provides a self-contained, from-scratch, exposition of key algorithms for instruction tuning of models: SFT, Rejection Sampling, REINFORCE, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Explanations of these algorithms often assume prior knowledge, lack critical details, and/or are overly generalized and complex. Here, each method is discussed and developed step by step using simplified and explicit notation focused on LLMs, aiming to eliminate ambiguity and provide a clear and intuitive understanding of the concepts. By minimizing detours into the broader RL literature and connecting concepts to LLMs, we eliminate superfluous abstractions and reduce cognitive overhead. Following this exposition, we provide a literature review of new techniques and approaches beyond those detailed. Finally, new ideas for research and exploration in the form of GRAPE (Generalized Relative Advantage Policy Evolution) are presented.
comment: 35 pages, 1 figure
♻ ☆ Dynamic object goal pushing with mobile manipulators through model-free constrained reinforcement learning ICRA 2025
Non-prehensile pushing to move and reorient objects to a goal is a versatile loco-manipulation skill. In the real world, the object's physical properties and friction with the floor contain significant uncertainties, which makes the task challenging for a mobile manipulator. In this paper, we develop a learning-based controller for a mobile manipulator to move an unknown object to a desired position and yaw orientation through a sequence of pushing actions. The proposed controller for the robotic arm and the mobile base motion is trained using a constrained Reinforcement Learning (RL) formulation. We demonstrate its capability in experiments with a quadrupedal robot equipped with an arm. The learned policy achieves a success rate of 91.35% in simulation and at least 80% on hardware in challenging scenarios. Through our extensive hardware experiments, we show that the approach demonstrates high robustness against unknown objects of different masses, materials, sizes, and shapes. It reactively discovers the pushing location and direction, thus achieving contact-rich behavior while observing only the pose of the object. Additionally, we demonstrate the adaptive behavior of the learned policy towards preventing the object from toppling.
comment: presented at ICRA 2025, Video: https://youtu.be/wGAdPGVf9Ws?si=pi83ONWofHHqbFG0
♻ ☆ FedMeld: A Model-dispersal Federated Learning Framework for Space-ground Integrated Networks
To bridge the digital divide, the space-ground integrated networks (SGINs), which will be a key component of the six-generation (6G) mobile networks, are expected to deliver artificial intelligence (AI) services to every corner of the world. One mission of SGINs is to support federated learning (FL) at a global scale. However, existing space-ground integrated FL frameworks involve ground stations or costly inter-satellite links, entailing excessive training latency and communication costs. To overcome these limitations, we propose an infrastructure-free federated learning framework based on a model dispersal (FedMeld) strategy, which exploits periodic movement patterns and store-carry-forward capabilities of satellites to enable parameter mixing across large-scale geographical regions. We theoretically show that FedMeld leads to global model convergence and quantify the effects of round interval and mixing ratio between adjacent areas on its learning performance. Based on the theoretical results, we formulate a joint optimization problem to design the staleness control and mixing ratio (SC-MR) for minimizing the training loss. By decomposing the problem into sequential SC and MR subproblems without compromising the optimality, we derive the round interval solution in a closed form and the mixing ratio in a semi-closed form to achieve the optimal latency-accuracy tradeoff. Experiments using various datasets demonstrate that FedMeld achieves superior model accuracy while significantly reducing communication costs as compared with traditional FL schemes for SGINs.
comment: 16 pages, 10 figures. This work has been submitted to the IEEE for possible publication
♻ ☆ A Statistical Theory of Contrastive Pre-training and Multimodal Generative AI
Multi-modal generative AI systems, such as those combining vision and language, rely on contrastive pre-training to learn representations across different modalities. While their practical benefits are widely acknowledged, a rigorous theoretical understanding of the contrastive pre-training framework remains limited. This paper develops a theoretical framework to explain the success of contrastive pre-training in downstream tasks, such as zero-shot classification, conditional diffusion models, and vision-language models. We introduce the concept of approximate sufficient statistics, a generalization of the classical sufficient statistics, and show that near-minimizers of the contrastive pre-training loss are approximately sufficient, making them adaptable to diverse downstream tasks. We further propose the Joint Generative Hierarchical Model for the joint distribution of images and text, showing that transformers can efficiently approximate relevant functions within this model via belief propagation. Building on this framework, we derive sample complexity guarantees for multi-modal learning based on contrastive pre-trained representations. Numerical simulations validate these theoretical findings, demonstrating the strong generalization performance of contrastively pre-trained transformers in various multi-modal tasks.
♻ ☆ Review of Explainable Graph-Based Recommender Systems
Explainability of recommender systems has become essential to ensure users' trust and satisfaction. Various types of explainable recommender systems have been proposed including explainable graph-based recommender systems. This review paper discusses state-of-the-art approaches of these systems and categorizes them based on three aspects: learning methods, explaining methods, and explanation types. It also explores the commonly used datasets, explainability evaluation methods, and future directions of this research area. Compared with the existing review papers, this paper focuses on explainability based on graphs and covers the topics required for developing novel explainable graph-based recommender systems.
♻ ☆ From Reviews to Actionable Insights: An LLM-Based Approach for Attribute and Feature Extraction
This research proposes a systematic, large language model (LLM) approach for extracting product and service attributes, features, and associated sentiments from customer reviews. Grounded in marketing theory, the framework distinguishes perceptual attributes from actionable features, producing interpretable and managerially actionable insights. We apply the methodology to 20,000 Yelp reviews of Starbucks stores and evaluate eight prompt variants on a random subset of reviews. Model performance is assessed through agreement with human annotations and predictive validity for customer ratings. Results show high consistency between LLMs and human coders and strong predictive validity, confirming the reliability of the approach. Human coders required a median of six minutes per review, whereas the LLM processed each in two seconds, delivering comparable insights at a scale unattainable through manual coding. Managerially, the analysis identifies attributes and features that most strongly influence customer satisfaction and their associated sentiments, enabling firms to pinpoint "joy points," address "pain points," and design targeted interventions. We demonstrate how structured review data can power an actionable marketing dashboard that tracks sentiment over time and across stores, benchmarks performance, and highlights high-leverage features for improvement. Simulations indicate that enhancing sentiment for key service features could yield 1-2% average revenue gains per store.
♻ ☆ Backward Conformal Prediction
We introduce $\textit{Backward Conformal Prediction}$, a method that guarantees conformal coverage while providing flexible control over the size of prediction sets. Unlike standard conformal prediction, which fixes the coverage level and allows the conformal set size to vary, our approach defines a rule that constrains how prediction set sizes behave based on the observed data, and adapts the coverage level accordingly. Our method builds on two key foundations: (i) recent results by Gauthier et al. [2025] on post-hoc validity using e-values, which ensure marginal coverage of the form $\mathbb{P}(Y_{\rm test} \in \hat C_n^{\tilde{\alpha}}(X_{\rm test})) \ge 1 - \mathbb{E}[\tilde{\alpha}]$ up to a first-order Taylor approximation for any data-dependent miscoverage $\tilde{\alpha}$, and (ii) a novel leave-one-out estimator $\hat{\alpha}^{\rm LOO}$ of the marginal miscoverage $\mathbb{E}[\tilde{\alpha}]$ based on the calibration set, ensuring that the theoretical guarantees remain computable in practice. This approach is particularly useful in applications where large prediction sets are impractical such as medical diagnosis. We provide theoretical results and empirical evidence supporting the validity of our method, demonstrating that it maintains computable coverage guarantees while ensuring interpretable, well-controlled prediction set sizes.
comment: Code available at: https://github.com/GauthierE/backward-cp
♻ ☆ Spike-timing-dependent Hebbian learning as noisy gradient descent
Hebbian learning is a key principle underlying learning in biological neural networks. We relate a Hebbian spike-timing-dependent plasticity rule to noisy gradient descent with respect to a non-convex loss function on the probability simplex. Despite the constant injection of noise and the non-convexity of the underlying optimization problem, one can rigorously prove that the considered Hebbian learning dynamic identifies the presynaptic neuron with the highest activity and that the convergence is exponentially fast in the number of iterations. This is non-standard and surprising as typically noisy gradient descent with fixed noise level only converges to a stationary regime where the noise causes the dynamic to fluctuate around a minimiser.
♻ ☆ Trial and Trust: Addressing Byzantine Attacks with Comprehensive Defense Strategy
Recent advancements in machine learning have improved performance while also increasing computational demands. While federated and distributed setups address these issues, their structure is vulnerable to malicious influences. In this paper, we address a specific threat, Byzantine attacks, where compromised clients inject adversarial updates to derail global convergence. We combine the trust scores concept with trial function methodology to dynamically filter outliers. Our methods address the critical limitations of previous approaches, allowing functionality even when Byzantine nodes are in the majority. Moreover, our algorithms adapt to widely used scaled methods like Adam and RMSProp, as well as practical scenarios, including local training and partial participation. We validate the robustness of our methods by conducting extensive experiments on both synthetic and real ECG data collected from medical institutions. Furthermore, we provide a broad theoretical analysis of our algorithms and their extensions to aforementioned practical setups. The convergence guarantees of our methods are comparable to those of classical algorithms developed without Byzantine interference.
♻ ☆ Generation of Uncertainty-Aware Emergent Concepts in Factorized 3D Scene Graphs via Graph Neural Networks
Enabling robots to autonomously discover emergent spatial concepts (e.g., rooms) from primitive geometric observations (e.g., planar surfaces) within 3D Scene Graphs is essential for robust indoor navigation and mapping. These graphs provide a hierarchical metric-semantic representation in which such concepts are organized. To further enhance graph-SLAM performance, Factorized 3D Scene Graphs incorporate these concepts as optimization factors that constrain relative geometry and enforce global consistency. However, both stages of this process remain largely manual: concepts are typically derived using hand-crafted, concept-specific heuristics, while factors and their covariances are likewise manually designed. This reliance on manual specification limits generalization across diverse environments and scalability to new concept classes. This paper presents, for the first time, a learning-based method to generate online spatial emergent concepts as optimizable factors within a SLAM backend, reducing the need to handcraft both concept generation and the definition of their corresponding factors and covariances. In both simulated and real indoor scenarios, our approach improves complex concept detection by 20.7% and 5.3%, trajectory estimation by 19.2%, and map reconstruction by 12.3% and 3.8%, respectively, highlighting the benefits of this integration for robust and adaptive spatial understanding.
comment: Submitted to IEEE Robotics and Automation Letters (RA-L)
♻ ☆ The $\varphi$ Curve: The Shape of Generalization through the Lens of Norm-based Capacity Control NeurIPS'25
Understanding how the test risk scales with model complexity is a central question in machine learning. Classical theory is challenged by the learning curves observed for large over-parametrized deep networks. Capacity measures based on parameter count typically fail to account for these empirical observations. To tackle this challenge, we consider norm-based capacity measures and develop our study for random features based estimators, widely used as simplified theoretical models for more complex networks. In this context, we provide a precise characterization of how the estimator's norm concentrates and how it governs the associated test error. Our results show that the predicted learning curve admits a phase transition from under- to over-parameterization, but no double descent behavior. This confirms that more classical U-shaped behavior is recovered considering appropriate capacity measures based on models norms rather than size. From a technical point of view, we leverage deterministic equivalence as the key tool and further develop new deterministic quantities which are of independent interest.
comment: Accepted by NeurIPS'25
♻ ☆ A Physics-Informed Spatiotemporal Deep Learning Framework for Turbulent Systems
Fluid thermodynamics underpins atmospheric dynamics, climate science, industrial applications, and energy systems. However, direct numerical simulations (DNS) of such systems can be computationally prohibitive. To address this, we present a novel physics-informed spatiotemporal surrogate model for Rayleigh-B\'enard convection (RBC), a canonical example of convective fluid flow. Our approach combines convolutional neural networks, for spatial dimension reduction, with an innovative recurrent architecture, inspired by large language models, to model long-range temporal dynamics. Inference is penalized with respect to the governing partial differential equations to ensure physical interpretability. Since RBC exhibits turbulent behavior, we quantify uncertainty using a conformal prediction framework. This model replicates key physical features of RBC dynamics while significantly reducing computational cost, offering a scalable alternative to DNS for long-term simulations.
♻ ☆ Language Models are Injective and Hence Invertible
Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
♻ ☆ Neural Graduated Assignment for Maximum Common Edge Subgraphs
The Maximum Common Edge Subgraph (MCES) problem is a crucial challenge with significant implications in domains such as biology and chemistry. Traditional approaches, which include transformations into max-clique and search-based algorithms, suffer from scalability issues when dealing with larger instances. This paper introduces ``Neural Graduated Assignment'' (NGA), a simple, scalable, unsupervised-training-based method that addresses these limitations. Central to NGA is stacking of differentiable assignment optimization with neural components, enabling high-dimensional parameterization of the matching process through a learnable temperature mechanism. We further theoretically analyze the learning dynamics of NGA, showing its design leads to fast convergence, better exploration-exploitation tradeoff, and ability to escape local optima. Extensive experiments across MCES computation, graph similarity estimation, and graph retrieval tasks reveal that NGA not only significantly improves computation time and scalability on large instances but also enhances performance compared to existing methodologies. The introduction of NGA marks a significant advancement in the computation of MCES and offers insights into other assignment problems.
♻ ☆ Mitigating Prior Errors in Causal Structure Learning: A Resilient Approach via Bayesian Networks
Causal structure learning (CSL), a prominent technique for encoding cause-and-effect relationships among variables, through Bayesian Networks (BNs). Although recovering causal structure solely from data is a challenge, the integration of prior knowledge, revealing partial structural truth, can markedly enhance learning quality. However, current methods based on prior knowledge exhibit limited resilience to errors in the prior, with hard constraint methods disregarding priors entirely, and soft constraints accepting priors based on a predetermined confidence level, which may require expert intervention. To address this issue, we propose a strategy resilient to edge-level prior errors for CSL, thereby minimizing human intervention. We classify prior errors into different types and provide their theoretical impact on the Structural Hamming Distance (SHD) under the presumption of sufficient data. Intriguingly, we discover and prove that the strong hazard of prior errors is associated with a unique acyclic closed structure, defined as ``quasi-circle''. Leveraging this insight, a post-hoc strategy is employed to identify the prior errors by its impact on the increment of ``quasi-circles''. Through empirical evaluation on both real and synthetic datasets, we demonstrate our strategy's robustness against prior errors. Specifically, we highlight its substantial ability to resist order-reversed errors while maintaining the majority of correct prior.
comment: 12 pages, 4 figures
♻ ☆ On Generalization and Distributional Update for Mimicking Observations with Adequate Exploration
Learning from observations (LfO) replicates expert behavior without needing access to the expert's actions, making it more practical than learning from demonstrations (LfD) in many real-world scenarios. However, directly applying the on-policy training scheme in LfO worsens the sample inefficiency problem, while employing the traditional off-policy training scheme in LfO magnifies the instability issue. This paper seeks to develop an efficient and stable solution for the LfO problem. Specifically, we begin by exploring the generalization capabilities of both the reward function and policy in LfO, which provides a theoretical foundation for computation. Building on this, we modify the policy optimization method in generative adversarial imitation from observation (GAIfO) with distributional soft actor-critic (DSAC), and propose the Mimicking Observations through Distributional Update Learning with adequate Exploration (MODULE) algorithm to solve the LfO problem. MODULE incorporates the advantages of (1) high sample efficiency and training robustness enhancement in soft actor-critic (SAC), and (2) training stability in distributional reinforcement learning (RL). Extensive experiments in MuJoCo environments showcase the superior performance of MODULE over current LfO methods.
♻ ☆ Efficient Verified Machine Unlearning For Distillation NeurIPS 2025
Growing data privacy demands, driven by regulations like GDPR and CCPA, require machine unlearning methods capable of swiftly removing the influence of specific training points. Although verified approaches like SISA, using data slicing and checkpointing, achieve efficient unlearning for single models by reverting to intermediate states, these methods struggle in teacher-student knowledge distillation settings. Unlearning in the teacher typically forces costly, complete student retraining due to pervasive information propagation during distillation. Our primary contribution is PURGE (Partitioned Unlearning with Retraining Guarantee for Ensembles), a novel framework integrating verified unlearning with distillation. We introduce constituent mapping and an incremental multi-teacher strategy that partitions the distillation process, confines each teacher constituent's impact to distinct student data subsets, and crucially maintains data isolation. The PURGE framework substantially reduces retraining overhead, requiring only partial student updates when teacher-side unlearning occurs. We provide both theoretical analysis, quantifying significant speed-ups in the unlearning process, and empirical validation on multiple datasets, demonstrating that PURGE achieves these efficiency gains while maintaining student accuracy comparable to standard baselines.
comment: Accepted at The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Learning Confidence Bounds for Classification with Imbalanced Data ECAI 2024
Class imbalance poses a significant challenge in classification tasks, where traditional approaches often lead to biased models and unreliable predictions. Undersampling and oversampling techniques have been commonly employed to address this issue, yet they suffer from inherent limitations stemming from their simplistic approach such as loss of information and additional biases respectively. In this paper, we propose a novel framework that leverages learning theory and concentration inequalities to overcome the shortcomings of traditional solutions. We focus on understanding the uncertainty in a class-dependent manner, as captured by confidence bounds that we directly embed into the learning process. By incorporating class-dependent estimates, our method can effectively adapt to the varying degrees of imbalance across different classes, resulting in more robust and reliable classification outcomes. We empirically show how our framework provides a promising direction for handling imbalanced data in classification tasks, offering practitioners a valuable tool for building more accurate and trustworthy models.
comment: Published at ECAI 2024 main track (Frontiers in Artificial Intelligence and Applications)
♻ ☆ Pretraining a Shared Q-Network for Data-Efficient Offline Reinforcement Learning
Offline reinforcement learning (RL) aims to learn a policy from a static dataset without further interactions with the environment. Collecting sufficiently large datasets for offline RL is exhausting since this data collection requires colossus interactions with environments and becomes tricky when the interaction with the environment is restricted. Hence, how an agent learns the best policy with a minimal static dataset is a crucial issue in offline RL, similar to the sample efficiency problem in online RL. In this paper, we propose a simple yet effective plug-and-play pretraining method to initialize a feature of a Q-network to enhance data efficiency in offline RL. Specifically, we introduce a shared Q-network structure that outputs predictions of the next state and Q-value. We pretrain the shared Q-network through a supervised regression task that predicts a next state and trains the shared Q-network using diverse offline RL methods. Through extensive experiments, we empirically demonstrate that our method enhances the performance of existing popular offline RL methods on the D4RL, Robomimic and V-D4RL benchmarks. Furthermore, we show that our method significantly boosts data-efficient offline RL across various data qualities and data distributions trough D4RL and ExoRL benchmarks. Notably, our method adapted with only 10% of the dataset outperforms standard algorithms even with full datasets.
♻ ☆ Beyond Benign Overfitting in Nadaraya-Watson Interpolators NeurIPS 2025
In recent years, there has been much interest in understanding the generalization behavior of interpolating predictors, which overfit on noisy training data. Whereas standard analyses are concerned with whether a method is consistent or not, recent observations have shown that even inconsistent predictors can generalize well. In this work, we revisit the classic interpolating Nadaraya-Watson (NW) estimator (also known as Shepard's method), and study its generalization capabilities through this modern viewpoint. In particular, by varying a single bandwidth-like hyperparameter, we prove the existence of multiple overfitting behaviors, ranging non-monotonically from catastrophic, through benign, to tempered. Our results highlight how even classical interpolating methods can exhibit intricate generalization behaviors. In addition, for the purpose of tuning the hyperparameter, the results suggest that over-estimating the intrinsic dimension of the data is less harmful than under-estimating it. Numerical experiments complement our theory, demonstrating the same phenomena.
comment: NeurIPS 2025
♻ ☆ DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding NeurIPS 2025
We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods' ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. Project page: https://github.com/FRENKIE-CHIANG/DanmakuTPPBench
comment: Accepted by Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Sparse Explanations of Neural Networks Using Pruned Layer-Wise Relevance Propagation ECML
Explainability is a key component in many applications involving deep neural networks (DNNs). However, current explanation methods for DNNs commonly leave it to the human observer to distinguish relevant explanations from spurious noise. This is not feasible anymore when going from easily human-accessible data such as images to more complex data such as genome sequences. To facilitate the accessibility of DNN outputs from such complex data and to increase explainability, we present a modification of the widely used explanation method layer-wise relevance propagation. Our approach enforces sparsity directly by pruning the relevance propagation for the different layers. Thereby, we achieve sparser relevance attributions for the input features as well as for the intermediate layers. As the relevance propagation is input-specific, we aim to prune the relevance propagation rather than the underlying model architecture. This allows to prune different neurons for different inputs and hence, might be more appropriate to the local nature of explanation methods. To demonstrate the efficacy of our method, we evaluate it on two types of data: images and genome sequences. We show that our modification indeed leads to noise reduction and concentrates relevance on the most important features compared to the baseline.
comment: Presented at ECML PKDD 2024
♻ ☆ Can LLMs Reconcile Knowledge Conflicts in Counterfactual Reasoning ICML 2025
Large Language Models have been shown to contain extensive world knowledge in their parameters, enabling impressive performance on many knowledge intensive tasks. However, when deployed in novel settings, LLMs often encounter situations where they must integrate parametric knowledge with new or unfamiliar information. In this work, we explore whether LLMs can combine knowledge in-context with their parametric knowledge through the lens of counterfactual reasoning. Through synthetic and real experiments in multi-hop reasoning problems, we show that LLMs generally struggle with counterfactual reasoning, often resorting to exclusively using their parametric knowledge. Moreover, we show that simple post-hoc finetuning can struggle to instill counterfactual reasoning ability -- often leading to degradation in stored parametric knowledge. Ultimately, our work reveals important limitations of current LLM's abilities to re-purpose parametric knowledge in novel settings.
comment: ICML 2025 Workshop on Scaling up Intervention Models
♻ ☆ Patent Language Model Pretraining with ModernBERT
Transformer-based language models such as BERT have become foundational in NLP, yet their performance degrades in specialized domains like patents, which contain long, technical, and legally structured text. Prior approaches to patent NLP have primarily relied on fine-tuning general-purpose models or domain-adapted variants pretrained with limited data. In this work, we pretrain 3 domain-specific masked language models for patents, using the ModernBERT architecture and a curated corpus of over 60 million patent records. Our approach incorporates architectural optimizations, including FlashAttention, rotary embeddings, and GLU feed-forward layers. We evaluate our models on four downstream patent classification tasks. Our model, ModernBERT-base-PT, consistently outperforms the general-purpose ModernBERT baseline on three out of four datasets and achieves competitive performance with a baseline PatentBERT. Additional experiments with ModernBERT-base-VX and Mosaic-BERT-large demonstrate that scaling the model size and customizing the tokenizer further enhance performance on selected tasks. Notably, all ModernBERT variants retain substantially faster inference over - 3x that of PatentBERT - underscoring their suitability for time-sensitive applications. These results underscore the benefits of domain-specific pretraining and architectural improvements for patent-focused NLP tasks.
comment: 7 pages, 5 figures, 4 tables
♻ ☆ LAMP-PRo: Label-aware Attention for Multi-label Prediction of DNA- and RNA-binding Proteins using Protein Language Models
Identifying DNA- (DBPs) and RNA-binding proteins (RBPs) is crucial for the understanding of cell function, molecular interactions as well as regulatory functions. Owing to their high similarity, most of the existing approaches face challenges in differentiating between DBPs and RBPs leading to high cross-prediction errors. Moreover, identifying proteins which bind to both DNA and RNA (DRBPs) is also quite a challenging task. In this regard, we propose a novel framework viz. LAMP-PRo which is based on pre-trained protein language model (PLM), attention mechanisms and multi-label learning to mitigate these issues. First, pre-trained PLM such ESM-2 is used for embedding the protein sequences followed by convolutional neural network (CNN). Subsequently multi-head self-attention mechanism is applied for the contextual information while label-aware attention is used to compute class-specific representations by attending to the sequence in a way that is tailored to each label (DBP, RBP and non-NABP) in a multi-label setup. We have also included a novel cross-label attention mechanism to explicitly capture dependencies between DNA- and RNA-binding proteins, enabling more accurate prediction of DRBP. Finally, a linear layer followed by a sigmoid function are used for the final prediction. Extensive experiments are carried out to compare LAMP-PRo with the existing methods wherein the proposed model shows consistent competent performance. Furthermore, we also provide visualization to showcase model interpretability, highlighting which parts of the sequence are most relevant for a predicted label. The original datasets are available at http://bliulab.net/iDRBP\_MMC and the codes are available at https://github.com/NimishaGhosh/LAMP-PRo.
♻ ☆ SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
comment: Project Website: http://simbench.tiancheng.hu/ Data: https://huggingface.co/datasets/pitehu/SimBench
♻ ☆ A representational framework for learning and encoding structurally enriched trajectories in complex agent environments
The ability of artificial intelligence agents to make optimal decisions and generalise them to different domains and tasks is compromised in complex scenarios. One way to address this issue has focused on learning efficient representations of the world and on how the actions of agents affect them in state-action transitions. Whereas such representations are procedurally efficient, they lack structural richness. To address this problem, we propose to enhance the agent's ontology and extend the traditional conceptualisation of trajectories to provide a more nuanced view of task execution. Structurally Enriched Trajectories (SETs) extend the encoding of sequences of states and their transitions by incorporating hierarchical relations between objects, interactions, and affordances. SETs are built as multi-level graphs, providing a detailed representation of the agent dynamics and a transferable functional abstraction of the task. SETs are integrated into an architecture, Structurally Enriched Trajectory Learning and Encoding (SETLE), that employs a heterogeneous graph-based memory structure of multi-level relational dependencies essential for generalisation. We demonstrate that SETLE can support downstream tasks, enabling agents to recognise task relevant structural patterns across CREATE and MiniGrid environments. Finally, we integrate SETLE with reinforcement learning and show measurable improvements in downstream performance, including breakthrough success rates in complex, sparse-reward tasks.
♻ ☆ ReVeal: Self-Evolving Code Agents via Reliable Self-Verification
Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models. However, existing methods rely solely on outcome rewards, without explicitly optimizing verification or leveraging reliable signals from realistic environments, leading to unreliable self-verification and limited test-time scaling. To address this, we widen the verification-generation asymmetry by explicitly optimizing self-verification, making it a reliable driver of deeper test-time scaling. We introduce ReVeal, a multi-turn reinforcement learning framework that evolves code generation through self-verification and tool-based evaluation. ReVeal structures long-horizon reasoning as iterative generation-verification turns and incorporates TAPO for turn-level credit assignment, fostering the co-evolution of code and test generation. At inference, this strengthened self-verification enables the model to use self-constructed tests and tool feedback to continuously evolve code for 20+ turns on LiveCodeBench despite training on only three. It also significantly improves Pass@k, indicating stronger exploration that expands the reasoning boundaries of the base model. These findings highlight the promise of ReVeal as a scalable paradigm for RL training and test-time scaling, paving the way for more robust and autonomous AI agents.
♻ ☆ LENS: Large Pre-trained Transformer for Exploring Financial Time Series Regularities
Modeling large-scale time series has gained significant attention in recent years. However, its direct application in finance remains challenging due to substantial differences in data characteristics across domains. Specifically, financial systems feature inherent stochasticity and low signal-to-noise ratios, rendering traditional methods and pre-training approaches ineffective. This underscores the urgent need for a foundation model tailored to financial time series. To bridge this gap, we propose \textbf{LENS}, a pre-trained model for this domain. \textbf{LENS} effectively captures the complexity of financial stochastic systems through a carefully crafted model architecture and mitigates noise during pre-training by using an invertible embedding module. We provide a rigorous theoretical explanation of the model's effectiveness and validate its performance through extensive experiments. Pre-trained on a dataset comprising 100 billion financial observations, \textbf{LENS} achieves exceptional results across a wide range of critical downstream tasks. Moreover, our work offers practical insights into developing pre-trained time series models in high-noise environments, paving the way for further advancements in this pivotal research domain.
♻ ☆ Molecular Fingerprints Are Strong Models for Peptide Function Prediction
Understanding peptide properties is often assumed to require modeling long-range molecular interactions, motivating the use of complex graph neural networks and pretrained transformers. Yet, whether such long-range dependencies are essential remains unclear. We investigate if simple, domain-specific molecular fingerprints can capture peptide function without these assumptions. Atomic-level representation aims to provide richer information than purely sequence-based models and better efficiency than structural ones. Across 132 datasets, including LRGB and five other peptide benchmarks, models using count-based ECFP, Topological Torsion, and RDKit fingerprints with LightGBM achieve state-of-the-art accuracy. Despite encoding only short-range molecular features, these models outperform GNNs and transformer-based approaches. Control experiments with sequence shuffling and amino acid counts confirm that fingerprints, though inherently local, suffice for robust peptide property prediction. Our results challenge the presumed necessity of long-range interaction modeling and highlight molecular fingerprints as efficient, interpretable, and computationally lightweight alternatives for peptide prediction.
♻ ☆ Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Learnable Linear Extrapolation NeurIPS 2025
Diffusion-based inverse algorithms have shown remarkable performance across various inverse problems, yet their reliance on numerous denoising steps incurs high computational costs. While recent developments of fast diffusion ODE solvers offer effective acceleration for diffusion sampling without observations, their application in inverse problems remains limited due to the heterogeneous formulations of inverse algorithms and their prevalent use of approximations and heuristics, which often introduce significant errors that undermine the reliability of analytical solvers. In this work, we begin with an analysis of ODE solvers for inverse problems that reveals a linear combination structure of approximations for the inverse trajectory. Building on this insight, we propose a canonical form that unifies a broad class of diffusion-based inverse algorithms and facilitates the design of more generalizable solvers. Inspired by the linear subspace search strategy, we propose Learnable Linear Extrapolation (LLE), a lightweight approach that universally enhances the performance of any diffusion-based inverse algorithm conforming to our canonical form. LLE optimizes the combination coefficients to refine current predictions using previous estimates, alleviating the sensitivity of analytical solvers for inverse algorithms. Extensive experiments demonstrate consistent improvements of the proposed LLE method across multiple algorithms and tasks, indicating its potential for more efficient solutions and boosted performance of diffusion-based inverse algorithms with limited steps. Codes for reproducing our experiments are available at https://github.com/weigerzan/LLE_inverse_problem.
comment: Accepted by NeurIPS 2025
♻ ☆ Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm to improve Large Language Models on reasoning tasks such as coding, math or logic. To assess the reasoning boundary (the fraction of problems a model can solve) researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small k values, the base model usually outperforms them when sampling a very large number of completions. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the increasingly higher chance of success in the limit of the number of trials rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems that a model can solve for which at least a tau proportion of completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as tau increases. We evaluate several RLVR models using Cover@tau-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.
comment: 10 pages, 3 figures. v2 adds discussion of related work (G-Pass@k)
♻ ☆ S-CFE: Simple Counterfactual Explanations
We study the problem of finding optimal sparse, manifold-aligned counterfactual explanations for classifiers. Canonically, this can be formulated as an optimization problem with multiple non-convex components, including classifier loss functions and manifold alignment (or \emph{plausibility}) metrics. The added complexity of enforcing \emph{sparsity}, or shorter explanations, complicates the problem further. Existing methods often focus on specific models and plausibility measures, relying on convex $\ell_1$ regularizers to enforce sparsity. In this paper, we tackle the canonical formulation using the accelerated proximal gradient (APG) method, a simple yet efficient first-order procedure capable of handling smooth non-convex objectives and non-smooth $\ell_p$ (where $0 \leq p < 1$) regularizers. This enables our approach to seamlessly incorporate various classifiers and plausibility measures while producing sparser solutions. Our algorithm only requires differentiable data-manifold regularizers and supports box constraints for bounded feature ranges, ensuring the generated counterfactuals remain \emph{actionable}. Finally, experiments on real-world datasets demonstrate that our approach effectively produces sparse, manifold-aligned counterfactual explanations while maintaining proximity to the factual data and computational efficiency.
♻ ☆ Denoising the Future: Top-p Distributions for Moving Through Time
Inference in dynamic probabilistic models is a complex task involving expensive operations. In particular, for Hidden Markov Models, the whole state space has to be enumerated for advancing in time. Even states with negligible probabilities are considered, resulting in computational inefficiency and increased noise due to the propagation of unlikely probability mass. We propose to denoise the future and speed up inference by using only the top-p states, i.e., the most probable states with accumulated probability p. We show that the error introduced by using only the top-p states is bound by p and the so-called minimal mixing rate of the underlying model. Moreover, in our empirical evaluation, we show that we can expect speedups of at least an order of magnitude, while the error in terms of total variation distance is below 0.09.
comment: Accepted at ECSQARU 2025
♻ ☆ Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking
Accurate prediction of protein-ligand binding poses is crucial for structure-based drug design, yet existing methods struggle to balance speed, accuracy, and physical plausibility. We introduce Matcha, a novel molecular docking pipeline that combines multi-stage flow matching with learned scoring and physical validity filtering. Our approach consists of three sequential stages applied consecutively to refine docking predictions, each implemented as a flow matching model operating on appropriate geometric spaces ($\mathbb{R}^3$, $\mathrm{SO}(3)$, and $\mathrm{SO}(2)$). We enhance the prediction quality through a dedicated scoring model and apply unsupervised physical validity filters to eliminate unrealistic poses. Compared to various approaches, Matcha demonstrates superior performance on Astex and PDBbind test sets in terms of docking success rate and physical plausibility. Moreover, our method works approximately 25 times faster than modern large-scale co-folding models. The model weights and inference code to reproduce our results are available at https://github.com/LigandPro/Matcha.
♻ ☆ A surrogate model for topology optimisation of elastic structures via parametric autoencoders
A surrogate-based topology optimisation algorithm for linear elastic structures under parametric loads and boundary conditions is proposed. Instead of learning the parametric solution of the state (and adjoint) problems or the optimisation trajectory as a function of the iterations, the proposed approach devises a surrogate version of the entire optimisation pipeline. First, the method predicts a quasi-optimal topology for a given problem configuration as a surrogate model of high-fidelity topologies optimised with the homogenisation method. This is achieved by means of a feed-forward net learning the mapping between the input parameters characterising the system setup and a latent space determined by encoder/decoder blocks reducing the dimensionality of the parametric topology optimisation problem and reconstructing a high-dimensional representation of the topology. Then, the predicted topology is used as an educated initial guess for a computationally efficient algorithm penalising the intermediate values of the design variable, while enforcing the governing equations of the system. This step allows the method to correct potential errors introduced by the surrogate model, eliminate artifacts, and refine the design in order to produce topologies consistent with the underlying physics. Different architectures are proposed and the approximation and generalisation capabilities of the resulting models are numerically evaluated. The quasi-optimal topologies allow to outperform the high-fidelity optimiser by reducing the average number of optimisation iterations by $53\%$ while achieving discrepancies below $4\%$ in the optimal value of the objective functional, even in the challenging scenario of testing the model to extrapolate beyond the training and validation domain.
comment: 43 pages, 13 figures, 7 tables
♻ ☆ Learning Fairer Representations with FairVIC
Mitigating bias in automated decision-making systems, particularly in deep learning models, is a critical challenge due to nuanced definitions of fairness, dataset-specific biases, and the inherent trade-off between fairness and accuracy. To address these issues, we introduce FairVIC, an innovative approach that enhances fairness in neural networks by integrating variance, invariance, and covariance terms into the loss function during training. Unlike methods that rely on predefined fairness criteria, FairVIC abstracts fairness concepts to minimise dependency on protected characteristics. We evaluate FairVIC against comparable bias mitigation techniques on benchmark datasets, considering both group and individual fairness, and conduct an ablation study on the accuracy-fairness trade-off. FairVIC demonstrates significant improvements ($\approx70\%$) in fairness across all tested metrics without compromising accuracy, thus offering a robust, generalisable solution for fair deep learning across diverse tasks and datasets.
♻ ☆ Airfoil optimization using Design-by-Morphing with minimized design-space dimensionality
Effective airfoil geometry optimization requires exploring a diverse range of designs using as few design variables as possible. This study introduces AirDbM, a Design-by-Morphing (DbM) approach specialized for airfoil optimization that systematically reduces design-space dimensionality. AirDbM selects an optimal set of 12 baseline airfoils from the UIUC airfoil database, which contains over 1,600 shapes, by sequentially adding the baseline that most increases the design capacity. With these baselines, AirDbM reconstructs 99 % of the database with a mean absolute error below 0.005, which matches the performance of a previous DbM approach that used more baselines. In multi-objective aerodynamic optimization, AirDbM demonstrates rapid convergence and achieves a Pareto front with a greater hypervolume than that of the previous larger-baseline study, where new Pareto-optimal solutions are discovered with enhanced lift-to-drag ratios at moderate stall tolerances. Furthermore, AirDbM demonstrates outstanding adaptability for reinforcement learning (RL) agents in generating airfoil geometry when compared to conventional airfoil parameterization methods, implying the broader potential of DbM in machine learning-driven design.
♻ ☆ Adversarial Graph Fusion for Incomplete Multi-view Semi-supervised Learning with Tensorial Imputation
View missing remains a significant challenge in graph-based multi-view semi-supervised learning, hindering their real-world applications. To address this issue, traditional methods introduce a missing indicator matrix and focus on mining partial structure among existing samples in each view for label propagation (LP). However, we argue that these disregarded missing samples sometimes induce discontinuous local structures, i.e., sub-clusters, breaking the fundamental smoothness assumption in LP. Consequently, such a Sub-Cluster Problem (SCP) would distort graph fusion and degrade classification performance. To alleviate SCP, we propose a novel incomplete multi-view semi-supervised learning method, termed AGF-TI. Firstly, we design an adversarial graph fusion scheme to learn a robust consensus graph against the distorted local structure through a min-max framework. By stacking all similarity matrices into a tensor, we further recover the incomplete structure from the high-order consistency information based on the low-rank tensor learning. Additionally, the anchor-based strategy is incorporated to reduce the computational complexity. An efficient alternative optimization algorithm combining a reduced gradient descent method is developed to solve the formulated objective, with theoretical convergence. Extensive experimental results on various datasets validate the superiority of our proposed AGF-TI as compared to state-of-the-art methods. Code is available at https://github.com/ZhangqiJiang07/AGF_TI.
comment: 31 pages, 15 figures
♻ ☆ MetaBox-v2: A Unified Benchmark Platform for Meta-Black-Box Optimization NeurIPS 2025
Meta-Black-Box Optimization (MetaBBO) streamlines the automation of optimization algorithm design through meta-learning. It typically employs a bi-level structure: the meta-level policy undergoes meta-training to reduce the manual effort required in developing algorithms for low-level optimization tasks. The original MetaBox (2023) provided the first open-source framework for reinforcement learning-based single-objective MetaBBO. However, its relatively narrow scope no longer keep pace with the swift advancement in this field. In this paper, we introduce MetaBox-v2 (https://github.com/MetaEvo/MetaBox) as a milestone upgrade with four novel features: 1) a unified architecture supporting RL, evolutionary, and gradient-based approaches, by which we reproduce $23$ up-to-date baselines; 2) efficient parallelization schemes, which reduce the training/testing time by $10-40$x; 3) a comprehensive benchmark suite of $18$ synthetic/realistic tasks ($1900$+ instances) spanning single-objective, multi-objective, multi-model, and multi-task optimization scenarios; 4) plentiful and extensible interfaces for custom analysis/visualization and integrating to external optimization tools/benchmarks. To show the utility of MetaBox-v2, we carry out a systematic case study that evaluates the built-in baselines in terms of the optimization performance, generalization ability and learning efficiency. Valuable insights are concluded from thorough and detailed analysis for practitioners and those new to the field.
comment: Accepted by NeurIPS 2025
♻ ☆ The Narcissus Hypothesis: Descending to the Rung of Illusion NeurIPS 2025
Modern foundational models increasingly reflect not just world knowledge, but patterns of human preference embedded in their training data. We hypothesize that recursive alignment-via human feedback and model-generated corpora-induces a social desirability bias, nudging models to favor agreeable or flattering responses over objective reasoning. We refer to it as the Narcissus Hypothesis and test it across 31 models using standardized personality assessments and a novel Social Desirability Bias score. Results reveal a significant drift toward socially conforming traits, with profound implications for corpus integrity and the reliability of downstream inferences. We then offer a novel epistemological interpretation, tracing how recursive bias may collapse higher-order reasoning down Pearl's Ladder of Causality, culminating in what we refer to as the Rung of Illusion.
comment: NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
♻ ☆ REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training both VAE and diffusion-model using standard diffusion-loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss -- allowing both VAE and diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance; speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself; leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art; achieving FID of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.
♻ ☆ Training on Plausible Counterfactuals Removes Spurious Correlations
Plausible counterfactual explanations (p-CFEs) are perturbations that minimally modify inputs to change classifier decisions while remaining plausible under the data distribution. In this study, we demonstrate that classifiers can be trained on p-CFEs labeled with induced \emph{incorrect} target classes to classify unperturbed inputs with the original labels. While previous studies have shown that such learning is possible with adversarial perturbations, we extend this paradigm to p-CFEs. Interestingly, our experiments reveal that learning from p-CFEs is even more effective: the resulting classifiers achieve not only high in-distribution accuracy but also exhibit significantly reduced bias with respect to spurious correlations.
♻ ☆ VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching NeurIPS 2025
Vision-Language-Action (VLA) models have demonstrated strong multi-modal reasoning capabilities, enabling direct action generation from visual perception and language instructions in an end-to-end manner. However, their substantial computational cost poses a challenge for real-time robotic control, where rapid decision-making is essential. This paper introduces VLA-Cache, a training-free inference acceleration method that reduces computational overhead by adaptively caching and reusing static visual tokens across frames. Exploiting the temporal continuity in robotic manipulation, VLA-Cache identifies minimally changed tokens between adjacent frames and reuses their cached key-value representations, thereby circumventing redundant computations. Additionally, to maintain action precision, VLA-Cache selectively re-computes task-relevant tokens that are environmentally sensitive, ensuring the fidelity of critical visual information. To further optimize efficiency, we introduce a layer adaptive token reusing strategy that dynamically adjusts the reuse ratio based on attention concentration across decoder layers, prioritizing critical tokens for recomputation. Extensive experiments on two simulation platforms (LIBERO and SIMPLER) and a real-world robotic system demonstrate that VLA-Cache achieves up to 1.7x speedup in CUDA latency and a 15% increase in control frequency, with negligible loss on task success rate. The code and videos can be found at our project page: https://vla-cache.github.io.
comment: Accepted to NeurIPS 2025
♻ ☆ The Spacetime of Diffusion Models: An Information Geometry Perspective
We present a novel geometric perspective on the latent space of diffusion models. We first show that the standard pullback approach, utilizing the deterministic probability flow ODE decoder, is fundamentally flawed. It provably forces geodesics to decode as straight segments in data space, effectively ignoring any intrinsic data geometry beyond the ambient Euclidean space. Complementing this view, diffusion also admits a stochastic decoder via the reverse SDE, which enables an information geometric treatment with the Fisher-Rao metric. However, a choice of $x_T$ as the latent representation collapses this metric due to memorylessness. We address this by introducing a latent spacetime $z=(x_t,t)$ that indexes the family of denoising distributions $p(x_0 | x_t)$ across all noise scales, yielding a nontrivial geometric structure. We prove these distributions form an exponential family and derive simulation-free estimators for curve lengths, enabling efficient geodesic computation. The resulting structure induces a principled Diffusion Edit Distance, where geodesics trace minimal sequences of noise and denoise edits between data. We also demonstrate benefits for transition path sampling in molecular systems, including constrained variants such as low-variance transitions and region avoidance. Code is available at: https://github.com/rafalkarczewski/spacetime-geometry
♻ ☆ ALINE: Joint Amortization for Bayesian Inference and Active Data Acquisition NeurIPS 2025
Many critical applications, from autonomous scientific discovery to personalized medicine, demand systems that can both strategically acquire the most informative data and instantaneously perform inference based upon it. While amortized methods for Bayesian inference and experimental design offer part of the solution, neither approach is optimal in the most general and challenging task, where new data needs to be collected for instant inference. To tackle this issue, we introduce the Amortized Active Learning and Inference Engine (ALINE), a unified framework for amortized Bayesian inference and active data acquisition. ALINE leverages a transformer architecture trained via reinforcement learning with a reward based on self-estimated information gain provided by its own integrated inference component. This allows it to strategically query informative data points while simultaneously refining its predictions. Moreover, ALINE can selectively direct its querying strategy towards specific subsets of model parameters or designated predictive tasks, optimizing for posterior estimation, data prediction, or a mixture thereof. Empirical results on regression-based active learning, classical Bayesian experimental design benchmarks, and a psychometric model with selectively targeted parameters demonstrate that ALINE delivers both instant and accurate inference along with efficient selection of informative points.
comment: 27 pages, 13 figures. Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ A Flow-Based Model for Conditional and Probabilistic Electricity Consumption Profile Generation and Prediction
Residential Load Profile (RLP) generation and prediction are critical for the operation and planning of distribution networks, especially as diverse low-carbon technologies (e.g., photovoltaic and electric vehicles) are increasingly adopted. This paper introduces a novel flow-based generative model, termed Full Convolutional Profile Flow (FCPFlow), which is uniquely designed for both conditional and unconditional RLP generation, and for probabilistic load forecasting. By introducing two new layers--the invertible linear layer and the invertible normalization layer--the proposed FCPFlow architecture shows three main advantages compared to traditional statistical and contemporary deep generative models: 1) it is well-suited for RLP generation under continuous conditions, such as varying weather and annual electricity consumption, 2) it demonstrates superior scalability in different datasets compared to traditional statistical models, and 3) it also demonstrates better modeling capabilities in capturing the complex correlation of RLPs compared with deep generative models.
♻ ☆ Shuffling Heuristic in Variational Inequalities: Establishing New Convergence Guarantees
Variational inequalities have gained significant attention in machine learning and optimization research. While stochastic methods for solving these problems typically assume independent data sampling, we investigate an alternative approach -- the shuffling heuristic. This strategy involves permuting the dataset before sequential processing, ensuring equal consideration of all data points. Despite its practical utility, theoretical guarantees for shuffling in variational inequalities remain unexplored. We address this gap by providing the first theoretical convergence estimates for shuffling methods in this context. Our analysis establishes rigorous bounds and convergence rates, extending the theoretical framework for this important class of algorithms. We validate our findings through extensive experiments on diverse benchmark variational inequality problems, demonstrating faster convergence of shuffling methods compared to independent sampling approaches.
comment: 25 pages, 5 figures, 2 tables
♻ ☆ ProSh: Probabilistic Shielding for Model-free Reinforcement Learning
Safety is a major concern in reinforcement learning (RL): we aim at developing RL systems that not only perform optimally, but are also safe to deploy by providing formal guarantees about their safety. To this end, we introduce Probabilistic Shielding via Risk Augmentation (ProSh), a model-free algorithm for safe reinforcement learning under cost constraints. ProSh augments the Constrained MDP state space with a risk budget and enforces safety by applying a shield to the agent's policy distribution using a learned cost critic. The shield ensures that all sampled actions remain safe in expectation. We also show that optimality is preserved when the environment is deterministic. Since ProSh is model-free, safety during training depends on the knowledge we have acquired about the environment. We provide a tight upper-bound on the cost in expectation, depending only on the backup-critic accuracy, that is always satisfied during training. Under mild, practically achievable assumptions, ProSh guarantees safety even at training time, as shown in the experiments.
♻ ☆ Sign-SGD is the Golden Gate between Multi-Node to Single-Node Learning: Significant Boost via Parameter-Free Optimization
Quite recently, large language models have made a significant breakthrough across various disciplines. However, training them is an extremely resource-intensive task, even for major players with vast computing resources. One of the methods gaining popularity in light of these challenges is Sign-SGD. This method can be applied both as a memory-efficient approach in single-node training and as a gradient compression technique in the distributed learning. Nevertheless, it is impossible to automatically determine the effective stepsize from the theoretical standpoint. Indeed, it depends on the parameters of the dataset to which we do not have access in the real-world learning paradigm. To address this issue, we design several variants of single-node deterministic Sign-SGD. We extend our approaches to practical scenarios: stochastic single-node and multi-node learning, methods with incorporated momentum. We conduct extensive experiments on real machine learning problems that emphasize the practical applicability of our ideas.
comment: 58 pages, 5 figures, 5 tables
♻ ☆ Expressive Reward Synthesis with the Runtime Monitoring Language
A key challenge in reinforcement learning (RL) is reward (mis)specification, whereby imprecisely defined reward functions can result in unintended, possibly harmful, behaviours. Indeed, reward functions in RL are typically treated as black-box mappings from state-action pairs to scalar values. While effective in many settings, this approach provides no information about why rewards are given, which can hinder learning and interpretability. Reward Machines address this issue by representing reward functions as finite state automata, enabling the specification of structured, non-Markovian reward functions. However, their expressivity is typically bounded by regular languages, leaving them unable to capture more complex behaviours such as counting or parametrised conditions. In this work, we build on the Runtime Monitoring Language (RML) to develop a novel class of language-based Reward Machines. By leveraging the built-in memory of RML, our approach can specify reward functions for non-regular, non-Markovian tasks. We demonstrate the expressiveness of our approach through experiments, highlighting additional advantages in flexible event-handling and task specification over existing Reward Machine-based methods.
♻ ☆ MEET-Sepsis: Multi-Endogenous-View Enhanced Time-Series Representation Learning for Early Sepsis Prediction PRICAI 2025
Sepsis is a life-threatening infectious syndrome associated with high mortality in intensive care units (ICUs). Early and accurate sepsis prediction (SP) is critical for timely intervention, yet remains challenging due to subtle early manifestations and rapidly escalating mortality. While AI has improved SP efficiency, existing methods struggle to capture weak early temporal signals. This paper introduces a Multi-Endogenous-view Representation Enhancement (MERE) mechanism to construct enriched feature views, coupled with a Cascaded Dual-convolution Time-series Attention (CDTA) module for multi-scale temporal representation learning. The proposed MEET-Sepsis framework achieves competitive prediction accuracy using only 20% of the ICU monitoring time required by SOTA methods, significantly advancing early SP. Extensive validation confirms its efficacy. Code is available at: https://github.com/yueliangy/MEET-Sepsis.
comment: Accepted to PRICAI 2025
♻ ☆ FlashBias: Fast Computation of Attention with Bias
Attention with bias, which extends standard attention by introducing prior knowledge as an additive bias matrix to the query-key scores, has been widely deployed in vision, language, protein-folding and other advanced scientific models, underscoring its status as a key evolution of this foundational module. However, introducing bias terms creates a severe efficiency bottleneck in attention computation. It disrupts the tightly fused memory-compute pipeline that underlies the speed of accelerators like FlashAttention, thereby stripping away most of their performance gains and leaving biased attention computationally expensive. Surprisingly, despite its common usage, targeted efficiency optimization for attention with bias remains absent, which seriously hinders its application in complex tasks. Diving into the computation of FlashAttention, we prove that its optimal efficiency is determined by the rank of the attention weight matrix. Inspired by this theoretical result, this paper presents FlashBias based on the low-rank compressed sensing theory, which can provide fast-exact computation for many widely used attention biases and a fast-accurate approximation for biases in general formalizations. FlashBias can fully take advantage of the extremely optimized matrix multiplication operation in modern GPUs, achieving 1.5$\times$ speedup for Pairformer in AlphaFold 3, and over 2$\times$ speedup for attention with bias in vision and language models without loss of accuracy. Code is available at this repository: https://github.com/thuml/FlashBias.
♻ ☆ Deep Edge Filter: Return of the Human-Crafted Layer in Deep Learning NeurIPS2025
We introduce the Deep Edge Filter, a novel approach that applies high-pass filtering to deep neural network features to improve model generalizability. Our method is motivated by our hypothesis that neural networks encode task-relevant semantic information in high-frequency components while storing domain-specific biases in low-frequency components of deep features. By subtracting low-pass filtered outputs from original features, our approach isolates generalizable representations while preserving architectural integrity. Experimental results across diverse domains such as Vision, Text, 3D, and Audio demonstrate consistent performance improvements regardless of model architecture and data modality. Analysis reveals that our method induces feature sparsification and effectively isolates high-frequency components, providing empirical validation of our core hypothesis. The code is available at https://github.com/dongkwani/DeepEdgeFilter.
comment: NeurIPS2025
♻ ☆ DreamPRM-1.5: Unlocking the Potential of Each Instance for Multimodal Process Reward Model Training
Training multimodal process reward models (PRMs) is hard due to (i) distribution shift between training set and test set and (ii) quality imbalance across training data samples. While domain-level reweighting (e.g., DreamPRM) aligns training with test-time objectives, it leaves a clear gap to an oracle upper bound (pass@N), even under a "sanity check" that uses test set data to probe headroom -- pointing to meta-level under-parameterization. We introduce DreamPRM-1.5, an instance-level reweighting framework that assigns an adaptive weight to every training example via bi-level optimization. To realize instance reweighting across scales, we develop two complementary regimes: Instance Table, which learns explicit per-sample weights and excels on small/medium data, and Instance Net, a lightweight neural network that generalizes better and scales to large corpora. A practical, stable training recipe -- time-scale matching between upper/lower updates, cold-start initialization, and bounded-range weights -- prevents divergence. Integrated with test-time scaling, DreamPRM-1.5 attains 84.6 accuracy on the MMMU validation set, 31.3 accuracy on R-Bench-V and, when paired with a leading backbone (e.g., GPT-5-mini), achieves first-place results on public multimodal reasoning leaderboards. Moreover, extensive experiments, including benchmark evaluations, baseline comparisons, and a sanity check, demonstrate that DreamPRM-1.5 closes the gap toward the oracle, achieves leading performance, and trains stably.
♻ ☆ Asynchronous Federated Learning: A Scalable Approach for Decentralized Machine Learning
Federated Learning (FL) has emerged as a powerful paradigm for decentralized machine learning, enabling collaborative model training across diverse clients without sharing raw data. However, traditional FL approaches often face limitations in scalability and efficiency due to their reliance on synchronous client updates, which can result in significant delays and increased communication overhead, particularly in heterogeneous and dynamic environments. To address these challenges in this paper, we propose an Asynchronous Federated Learning (AFL) algorithm, which allows clients to update the global model independently and asynchronously. Our key contributions include a comprehensive convergence analysis of AFL in the presence of client delays and model staleness. By leveraging martingale difference sequence theory and variance bounds, we ensure robust convergence despite asynchronous updates. Assuming strongly convex local objective functions, we establish bounds on gradient variance under random client sampling and derive a recursion formula quantifying the impact of client delays on convergence. Furthermore, we demonstrate the practical applicability of the AFL algorithm by training decentralized linear regression and Support Vector Machine (SVM) based classifiers and compare its results with synchronous FL algorithm to effectively handling non-IID data distributed among clients. The proposed AFL algorithm addresses key limitations of traditional FL methods, such as inefficiency due to global synchronization and susceptibility to client drift. It enhances scalability, robustness, and efficiency in real-world settings with heterogeneous client populations and dynamic network conditions. Our results underscore the potential of AFL to drive advancements indistributed learning systems, particularly for large-scale, privacy-preserving applications in resource-constrained environments.
♻ ☆ Time Reversal Symmetry for Efficient Robotic Manipulations in Deep Reinforcement Learning NeurIPS 2025
Symmetry is pervasive in robotics and has been widely exploited to improve sample efficiency in deep reinforcement learning (DRL). However, existing approaches primarily focus on spatial symmetries, such as reflection, rotation, and translation, while largely neglecting temporal symmetries. To address this gap, we explore time reversal symmetry, a form of temporal symmetry commonly found in robotics tasks such as door opening and closing. We propose Time Reversal symmetry enhanced Deep Reinforcement Learning (TR-DRL), a framework that combines trajectory reversal augmentation and time reversal guided reward shaping to efficiently solve temporally symmetric tasks. Our method generates reversed transitions from fully reversible transitions, identified by a proposed dynamics-consistent filter, to augment the training data. For partially reversible transitions, we apply reward shaping to guide learning, according to successful trajectories from the reversed task. Extensive experiments on the Robosuite and MetaWorld benchmarks demonstrate that TR-DRL is effective in both single-task and multi-task settings, achieving higher sample efficiency and stronger final performance compared to baseline methods.
comment: Accepted in NeurIPS 2025
♻ ☆ Fair Minimum Labeling: Efficient Temporal Network Activations for Reachability and Equity NeurIPS 2025
Balancing resource efficiency and fairness is critical in networked systems that support modern learning applications. We introduce the Fair Minimum Labeling (FML) problem: the task of designing a minimum-cost temporal edge activation plan that ensures each group of nodes in a network has sufficient access to a designated target set, according to specified coverage requirements. FML captures key trade-offs in systems where edge activations incur resource costs and equitable access is essential, such as distributed data collection, update dissemination in edge-cloud systems, and fair service restoration in critical infrastructure. We show that FML is NP-hard and $\Omega(\log |V|)$-hard to approximate, where $V$ is the set of nodes, and we present probabilistic approximation algorithms that match this bound, achieving the best possible guarantee for the activation cost. We demonstrate the practical utility of FML in a fair multi-source data aggregation task for training a shared model. Empirical results show that FML enforces group-level fairness with substantially lower activation cost than baseline heuristics, underscoring its potential for building resource-efficient, equitable temporal reachability in learning-integrated networks.
comment: Accepted at NeurIPS 2025
♻ ☆ Enhancing Sample Selection Against Label Noise by Cutting Mislabeled Easy Examples NeurIPS 2025
Sample selection is a prevalent approach in learning with noisy labels, aiming to identify confident samples for training. Although existing sample selection methods have achieved decent results by reducing the noise rate of the selected subset, they often overlook that not all mislabeled examples harm the model's performance equally. In this paper, we demonstrate that mislabeled examples correctly predicted by the model early in the training process are particularly harmful to model performance. We refer to these examples as Mislabeled Easy Examples (MEEs). To address this, we propose Early Cutting, which introduces a recalibration step that employs the model's later training state to re-select the confident subset identified early in training, thereby avoiding misleading confidence from early learning and effectively filtering out MEEs. Experiments on the CIFAR, WebVision, and full ImageNet-1k datasets demonstrate that our method effectively improves sample selection and model performance by reducing MEEs.
comment: Accepted by NeurIPS 2025
♻ ☆ Model-based Implicit Neural Representation for sub-wavelength Radio Localization
The increasing deployment of large antenna arrays at base stations has significantly improved the spatial resolution and localization accuracy of radio-localization methods. However, traditional signal processing techniques struggle in complex radio environments, particularly in scenarios dominated by non line of sight (NLoS) propagation paths, resulting in degraded localization accuracy. Recent developments in machine learning have facilitated the development of machine learning-assisted localization techniques, enhancing localization accuracy in complex radio environments. However, these methods often involve substantial computational complexity during both the training and inference phases. This work extends the well-established fingerprinting-based localization framework by simultaneously reducing its memory requirements and improving its accuracy. Specifically, a model-based neural network is used to learn the location-to-channel mapping, and then serves as a generative neural channel model. This generative model augments the fingerprinting comparison dictionary while reducing the memory requirements. The proposed method outperforms fingerprinting baselines by achieving sub-wavelength localization accuracy, even in complex static NLoS environments. Remarkably, it offers an improvement by several orders of magnitude in localization accuracy, while simultaneously reducing memory requirements by an order of magnitude compared to classical fingerprinting methods.
♻ ☆ Explaining Large Language Models with gSMILE
Large Language Models (LLMs) such as GPT, LLaMA, and Claude achieve remarkable performance in text generation but remain opaque in their decision-making processes, limiting trust and accountability in high-stakes applications. We present gSMILE (generative SMILE), a model-agnostic, perturbation-based framework for token-level interpretability in LLMs. Extending the SMILE methodology, gSMILE uses controlled prompt perturbations, Wasserstein distance metrics, and weighted linear surrogates to identify input tokens with the most significant impact on the output. This process enables the generation of intuitive heatmaps that visually highlight influential tokens and reasoning paths. We evaluate gSMILE across leading LLMs (OpenAI's gpt-3.5-turbo-instruct, Meta's LLaMA 3.1 Instruct Turbo, and Anthropic's Claude 2.1) using attribution fidelity, attribution consistency, attribution stability, attribution faithfulness, and attribution accuracy as metrics. Results show that gSMILE delivers reliable human-aligned attributions, with Claude 2.1 excelling in attention fidelity and GPT-3.5 achieving the highest output consistency. These findings demonstrate gSMILE's ability to balance model performance and interpretability, enabling more transparent and trustworthy AI systems.
♻ ☆ PAUSE: Low-Latency and Privacy-Aware Active User Selection for Federated Learning
Federated learning (FL) enables multiple edge devices to collaboratively train a machine learning model without the need to share potentially private data. Federated learning proceeds through iterative exchanges of model updates, which pose two key challenges: First, the accumulation of privacy leakage over time, and second, communication latency. These two limitations are typically addressed separately: The former via perturbed updates to enhance privacy and the latter using user selection to mitigate latency - both at the expense of accuracy. In this work, we propose a method that jointly addresses the accumulation of privacy leakage and communication latency via active user selection, aiming to improve the trade-off among privacy, latency, and model performance. To achieve this, we construct a reward function that accounts for these three objectives. Building on this reward, we propose a multi-armed bandit (MAB)-based algorithm, termed Privacy-aware Active User SElection (PAUSE) which dynamically selects a subset of users each round while ensuring bounded overall privacy leakage. We establish a theoretical analysis, systematically showing that the reward growth rate of PAUSE follows that of the best-known rate in MAB literature. To address the complexity overhead of active user selection, we propose a simulated annealing-based relaxation of PAUSE and analyze its ability to approximate the reward-maximizing policy under reduced complexity. We numerically validate the privacy leakage, associated improved latency, and accuracy gains of our methods for the federated training in various scenarios.
♻ ☆ Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs
Modern AI relies on huge matrix multiplications (MatMuls), whose computation poses a scalability problem for inference and training. We propose an alternative, GPU native bilinear operator to MatMuls in neural networks, which offers a three-way tradeoff between: speed, accuracy and parameter count. In particular, this operator requires substantially fewer FLOPs to evaluate ($\ll n^3$), yet increases the parameter count compared to MatMul ($\gg n^2$). We call this operator Strassen-Tile (STL). The key idea behind STL is a local learnable change-of-basis, applied on tiles of the weight and activation matrices, followed by an element-wise product between the tiles, implemented simultaneously via MatMul. The key technical question we study is how to optimize the change-of-basis of a given layer, which is a highly non-convex problem. We show that theory-backed initializations (inspired by fast matrix and polynomial multiplication) lead to substantially better accuracy than random SGD initialization. This phenomenon motivates further algorithmic study of STL optimization in DNNs. Our experiments demonstrate that STL can approximate 4x4 MatMul of tiles while reducing FLOPs by a factor of 2.66, and can improve Imagenet-1K accuracy of SoTA T2T-ViT-7 (4.3M parameters) while lowering FLOPs. Even with non-CUDA optimized PyTorch code, STL achieves wall-clock speedups in the compute-bound regime. These results, together with its theoretical grounds, suggest STL as a promising building block for scalable and cost-efficient AI.
♻ ☆ Low-cost Embedded Breathing Rate Determination Using 802.15.4z IR-UWB Hardware for Remote Healthcare
Respiratory diseases account for a significant portion of global mortality. Affordable and early detection is an effective way of addressing these ailments. To this end, a low-cost commercial off-the-shelf (COTS), IEEE 802.15.4z standard compliant impulse-radio ultra-wideband (IR-UWB) radar system is used to estimate human respiration rates. We propose a convolutional neural network (CNN) specifically adapted to predict breathing rates from ultra-wideband (UWB) channel impulse response (CIR) data, and compare its performance with both other rule-based algorithms and model-based solutions. The study uses a diverse dataset, incorporating various real-life environments to evaluate system robustness. To facilitate future research, this dataset will be released as open source. Results show that the CNN achieves a mean absolute error (MAE) of 1.73 breaths per minute (BPM) in unseen situations, significantly outperforming rule-based methods (3.40 BPM). By incorporating calibration data from other individuals in the unseen situations, the error is further reduced to 0.84 BPM. In addition, this work evaluates the feasibility of running the pipeline on a low-cost embedded device. Applying 8-bit quantization to both the weights and input/ouput tensors, reduces memory requirements by 67% and inference time by 64% with only a 3% increase in MAE. As a result, we show it is feasible to deploy the algorithm on an nRF52840 system-on-chip (SoC) requiring only 46 KB of memory and operating with an inference time of only 192 ms. Once deployed, an analytical energy model estimates that the system, while continuously monitoring the room, can operate for up to 268 days without recharging when powered by a 20 000 mAh battery pack. For breathing monitoring in bed, the sampling rate can be lowered, extending battery life to 313 days, making the solution highly efficient for real-world, low-cost deployments.
comment: This paper has been submitted to IEEE Sensors Journal and is currently undergoing review
♻ ☆ Estimating Model Performance Under Covariate Shift Without Labels
After deployment, machine learning models often experience performance degradation due to shifts in data distribution. It is challenging to assess post-deployment performance accurately when labels are missing or delayed. Existing proxy methods, such as data drift detection, fail to measure the effects of these shifts adequately. To address this, we introduce a new method for evaluating binary classification models on unlabeled tabular data that accurately estimates model performance under covariate shift and call it Probabilistic Adaptive Performance Estimation (PAPE). It can be applied to any performance metric defined with elements of the confusion matrix. Crucially, PAPE operates independently of the original model, relying only on its predictions and probability estimates, and does not need any assumptions about the nature of covariate shift, learning directly from data instead. We tested PAPE using over 900 dataset-model combinations from US census data, assessing its performance against several benchmarks through various metrics. Our findings show that PAPE outperforms other methodologies, making it a superior choice for estimating the performance of binary classification models.
comment: 25 content pages, 7 figures
♻ ☆ SpecExit: Accelerating Large Reasoning Model via Speculative Exit
Despite their strong performance on reasoning tasks, large reasoning models (LRMs) often suffer from overthinking, producing unnecessarily long outputs and incurring high end-to-end latency, a significant limitation to their real-world deployment. To address overthinking, early-exit mechanisms have been proposed to terminate reasoning before typical completion, showing that this approach can effectively shorten generation length with minimal impact on accuracy. However, their reliance on probing mechanisms introduces a detection overhead that limits their end-to-end latency gains and compromises their generalizability across diverse problems. Inspired by the use of hidden states in speculative decoding, we propose SpecExit, a novel framework that predicts both future tokens and an early-exit signal directly from a lightweight draft model without probing overhead. Our method offers significant improvements, reducing average generation length by 66\% and achieving a 2.5x speedup in end-to-end latency compared to the speculative decoding baseline, without compromising accuracy. Our method leverages the inherent signals from hidden states to provide effective early-exit signals, suggesting broader use of hidden states for efficient reasoning. Our code is available at https://github.com/Tencent/AngelSlim.
♻ ☆ gen2seg: Generative Models Enable Generalizable Instance Segmentation
By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general-purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category-agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero-shot generalization, accurately segmenting objects of types and styles unseen in finetuning (and in many cases, MAE's ImageNet-1K pretraining too). Our best-performing models closely approach the heavily supervised SAM when evaluated on unseen object types and styles, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining. Code, pretrained models, and demos are available on our website.
comment: Website: https://reachomk.github.io/gen2seg/
Human-Computer Interaction 26
☆ Effects of Virtual Controller Representation and Virtuality on Selection Performance in Extended Reality
We present an experiment exploring how the controller's virtual representation impacts target acquisition performance across MR and VR contexts. Participants performed selection tasks comparing four visual configurations: a virtual controller, a virtual hand, both the controller and the hand, and neither representation. We found performance comparable between VR and MR, and switching between them did not impact the user's ability to perform basic tasks. Controller representations mimicking reality enhanced performance across both modes. However, users perceived performance differently in MR, indicating the need for unique MR design considerations, particularly regarding spatial awareness.
☆ A Structured Evaluation Framework for Low-Code Platform Selection: A Multi-Criteria Decision Model for Enterprise Digital Transformation
The rapid adoption of Low-Code Development Platforms (LCDPs) has created a critical need for systematic evaluation methodologies that enable organizations to make informed platform selection decisions. This paper presents a comprehensive evaluation framework based on five key criteria: Business Process Orchestration, UI/UX Customization, Integration and Interoperability, Governance and Security, and AI-Enhanced Automation. We propose a weighted scoring model that allows organizations to quantitatively assess and compare different low-code platforms based on their specific requirements and strategic priorities. The framework addresses the gap between marketing-driven platform comparisons and rigorous, context-specific evaluation methodologies. Through empirical validation in enterprise environments, we demonstrate how this structured approach can significantly improve decision-making outcomes and reduce the risk of platform lock-in or inadequate solution selection.
comment: 15 pages, 1 figure. PDF-only submission (XeLaTeX)
☆ One Size Fits All? A Modular Adaptive Sanitization Kit (MASK) for Customizable Privacy-Preserving Phone Scam Detection
Phone scams remain a pervasive threat to both personal safety and financial security worldwide. Recent advances in large language models (LLMs) have demonstrated strong potential in detecting fraudulent behavior by analyzing transcribed phone conversations. However, these capabilities introduce notable privacy risks, as such conversations frequently contain sensitive personal information that may be exposed to third-party service providers during processing. In this work, we explore how to harness LLMs for phone scam detection while preserving user privacy. We propose MASK (Modular Adaptive Sanitization Kit), a trainable and extensible framework that enables dynamic privacy adjustment based on individual preferences. MASK provides a pluggable architecture that accommodates diverse sanitization methods - from traditional keyword-based techniques for high-privacy users to sophisticated neural approaches for those prioritizing accuracy. We also discuss potential modeling approaches and loss function designs for future development, enabling the creation of truly personalized, privacy-aware LLM-based detection systems that balance user trust and detection effectiveness, even beyond phone scam context.
comment: 9 pages
☆ Khelte Khelte Shikhi: A Proposed HCI Framework for Gamified Interactive Learning with Minecraft in Bangladeshi Education Systems
Game-based learning shows real promise for engaging students in well-funded schools, but what about everyone else? We propose a practical framework for implementing Minecraft Education Edition in Bangladesh's 130,000 schools where 55 percent lack reliable internet, rural areas experience 12-16 hour daily power availability, only 8 percent of rural schools have computer access, and student-teacher ratios reach 52:1. Our approach tackles these constraints head-on with three deployment tiers: cloud-based multiplayer for urban schools with stable infrastructure (15 percent), local area network solutions with solar power for semi-urban contexts (30 percent), and offline turn-based modes using refurbished hardware for rural settings (55 percent). We provide eight pre-built curriculum-aligned worlds with complete Bangla localization covering topics from Lalbagh Fort reconstruction to monsoon flood simulation. The interface accommodates first-time users through progressive complexity, culturally familiar metaphors using local farming and architecture, and accessibility features including keyboard-only controls and 200 percent text scaling. Teacher training spans 48 hours across digital literacy, pedagogical integration, and content creation. We detail evaluation protocols with specific benchmarks: 15 percent learning gains, 70 percent transfer task mastery, System Usability Scale scores above 70, and sub-two-dollar cost per student-hour. This framework has not been empirically validated; it synthesizes game-based learning theory, HCI principles, and contextual analysis to provide implementable specifications for pilot testing in resource-constrained settings.
comment: 6 pages, 6 tables, 6 figures, submitted to the 28th International Conference on Computer and Information Technology (ICCIT 2025)
☆ KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers
In Bangladesh, many farmers continue to face challenges in accessing timely, expert-level agricultural guidance. This paper presents KrishokBondhu, a voice-enabled, call-centre-integrated advisory platform built on a Retrieval-Augmented Generation (RAG) framework, designed specifically for Bengali-speaking farmers. The system aggregates authoritative agricultural handbooks, extension manuals, and NGO publications; applies Optical Character Recognition (OCR) and document-parsing pipelines to digitize and structure the content; and indexes this corpus in a vector database for efficient semantic retrieval. Through a simple phone-based interface, farmers can call the system to receive real-time, context-aware advice: speech-to-text converts the Bengali query, the RAG module retrieves relevant content, a large language model (Gemma 3-4B) generates a context-grounded response, and text-to-speech delivers the answer in natural spoken Bengali. In a pilot evaluation, KrishokBondhu produced high-quality responses for 72.7% of diverse agricultural queries covering crop management, disease control, and cultivation practices. Compared to the KisanQRS benchmark, the system achieved a composite score of 4.53 (vs. 3.13) on a 5-point scale, a 44.7% improvement, with especially large gains in contextual richness (+367%) and completeness (+100.4%), while maintaining comparable relevance and technical specificity. Semantic similarity analysis further revealed a strong correlation between retrieved context and answer quality, emphasizing the importance of grounding generative responses in curated documentation. KrishokBondhu demonstrates the feasibility of integrating call-centre accessibility, multilingual voice interaction, and modern RAG techniques to deliver expert-level agricultural guidance to remote Bangladeshi farmers, paving the way toward a fully AI-driven agricultural advisory ecosystem.
comment: 6 pages, 7 figures, 5 tables, submitted to the 11th IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering (WIECON-ECE 2025)
☆ Reimagining Disassembly Interfaces with Visualization: Combining Instruction Tracing and Control Flow with DisViz
In applications where efficiency is critical, developers may examine their compiled binaries, seeking to understand how the compiler transformed their source code and what performance implications that transformation may have. This analysis is challenging due to the vast number of disassembled binary instructions and the many-to-many mappings between them and the source code. These problems are exacerbated as source code size increases, giving the compiler more freedom to map and disperse binary instructions across the disassembly space. Interfaces for disassembly typically display instructions as an unstructured listing or sacrifice the order of execution. We design a new visual interface for disassembly code that combines execution order with control flow structure, enabling analysts to both trace through code and identify familiar aspects of the computation. Central to our approach is a novel layout of instructions grouped into basic blocks that displays a looping structure in an intuitive way. We add to this disassembly representation a unique block-based mini-map that leverages our layout and shows context across thousands of disassembly instructions. Finally, we embed our disassembly visualization in a web-based tool, DisViz, which adds dynamic linking with source code across the entire application. DizViz was developed in collaboration with program analysis experts following design study methodology and was validated through evaluation sessions with ten participants from four institutions. Participants successfully completed the evaluation tasks, hypothesized about compiler optimizations, and noted the utility of our new disassembly view. Our evaluation suggests that our new integrated view helps application developers in understanding and navigating disassembly code.
comment: 14 pages, 9 figures, accepted in TVCG
☆ Relief or displacement? How teachers are negotiating generative AI's role in their professional practice
As generative AI (genAI) rapidly enters classrooms, accompanied by district-level policy rollouts and industry-led teacher trainings, it is important to rethink the canonical ``adopt and train'' playbook. Decades of educational technology research show that tools promising personalization and access often deepen inequities due to uneven resources, training, and institutional support. Against this backdrop, we conducted semi-structured interviews with 22 teachers from a large U.S. school district that was an early adopter of genAI. Our findings reveal the motivations driving adoption, the factors underlying resistance, and the boundaries teachers negotiate to align genAI use with their values. We further contribute by unpacking the sociotechnical dynamics -- including district policies, professional norms, and relational commitments -- that shape how teachers navigate the promises and risks of these tools.
☆ Enhancing Urban Data Exploration: Layer Toggling and Visibility-Preserving Lenses for Multi-Attribute Spatial Analysis
We propose two novel interaction techniques for visualization-assisted exploration of urban data: Layer Toggling and Visibility-Preserving Lenses. Layer Toggling mitigates visual overload by organizing information into separate layers while enabling comparisons through controlled overlays. This technique supports focused analysis without losing spatial context and allows users to switch layers using a dedicated button. Visibility-Preserving Lenses adapt their size and transparency dynamically, enabling detailed inspection of dense spatial regions and temporal attributes. These techniques facilitate urban data exploration and improve prediction. Understanding complex phenomena related to crime, mobility, and residents' behavior is crucial for informed urban planning. Yet navigating such data often causes cognitive overload and visual clutter due to overlapping layers. We validate our visualization tool through a user study measuring performance, cognitive load, and interaction efficiency. Using real-world data from Sao Paulo, we demonstrate how our approach enhances exploratory and analytical tasks and provides guidelines for future interactive systems.
☆ When Strings Tug at Algorithm: Human-AI Sovereignty and Entanglement in Nomadic Improvisational Music Performance as a Decolonial Exploration
As emergent artificial intelligence technologies increasingly assert roles as assistants within intangible cultural heritage contexts, researchers and artists observe existing questions on the theme of agency negotiation, cultural resistance, and technical critique. This research interrogates power dynamics in human-AI sovereignty and entanglement for nomadic improvisational Dutar performance, a living cultural heritage through a long-necked lute from the Central Asia region. To investigate tensions between human agency and computational hegemony, the researcher and artists examined and iterated a feedback workflow that captures live performance data, processes digital transformations, and creates a real-time interactive art experience via immersive environments. Empirical data from artists and audience reveal modulations where musicians selectively embrace or reject algorithmic suggestions to preserve creative identity. The author concludes that decolonial potential requires redesigning tools or systems for cultural survivance, where technology becomes not merely a feedback environment but a site for decolonial praxis, challenging computational hegemony in digital ecosystems.
comment: 14 pages, 5 figures
☆ "Over-the-Hood" AI Inclusivity Bugs and How 3 AI Product Teams Found and Fixed Them
While much research has shown the presence of AI's "under-the-hood" biases (e.g., algorithmic, training data, etc.), what about "over-the-hood" inclusivity biases: barriers in user-facing AI products that disproportionately exclude users with certain problem-solving approaches? Recent research has begun to report the existence of such biases -- but what do they look like, how prevalent are they, and how can developers find and fix them? To find out, we conducted a field study with 3 AI product teams, to investigate what kinds of AI inclusivity bugs exist uniquely in user-facing AI products, and whether/how AI product teams might harness an existing (non-AI-oriented) inclusive design method to find and fix them. The teams' work resulted in identifying 6 types of AI inclusivity bugs arising 83 times, fixes covering 47 of these bug instances, and a new variation of the GenderMag inclusive design method, GenderMag-for-AI, that is especially effective at detecting certain kinds of AI inclusivity bugs.
☆ When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation
Evaluating Large Language Models (LLMs) for mental health support is challenging due to the emotionally and cognitively complex nature of therapeutic dialogue. Existing benchmarks are limited in scale, reliability, often relying on synthetic or social media data, and lack frameworks to assess when automated judges can be trusted. To address the need for large-scale dialogue datasets and judge reliability assessment, we introduce two benchmarks that provide a framework for generation and evaluation. MentalBench-100k consolidates 10,000 one-turn conversations from three real scenarios datasets, each paired with nine LLM-generated responses, yielding 100,000 response pairs. MentalAlign-70k}reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes, grouped into Cognitive Support Score (CSS) and Affective Resonance Score (ARS). We then employ the Affective Cognitive Agreement Framework, a statistical methodology using intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between LLM judges and human experts. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance. Our contributions establish new methodological and empirical foundations for reliable, large-scale evaluation of LLMs in mental health. We release the benchmarks and codes at: https://github.com/abeerbadawi/MentalBench/
☆ CLiVR: Conversational Learning System in Virtual Reality with AI-Powered Patients
Simulations constitute a fundamental component of medical and nursing education and traditionally employ standardized patients (SP) and high-fidelity manikins to develop clinical reasoning and communication skills. However, these methods require substantial resources, limiting accessibility and scalability. In this study, we introduce CLiVR, a Conversational Learning system in Virtual Reality that integrates large language models (LLMs), speech processing, and 3D avatars to simulate realistic doctor-patient interactions. Developed in Unity and deployed on the Meta Quest 3 platform, CLiVR enables trainees to engage in natural dialogue with virtual patients. Each simulation is dynamically generated from a syndrome-symptom database and enhanced with sentiment analysis to provide feedback on communication tone. Through an expert user study involving medical school faculty (n=13), we assessed usability, realism, and perceived educational impact. Results demonstrated strong user acceptance, high confidence in educational potential, and valuable feedback for improvement. CLiVR offers a scalable, immersive supplement to SP-based training.
☆ Re:Member: Emotional Question Generation from Personal Memories ACL 2025
We present Re:Member, a system that explores how emotionally expressive, memory-grounded interaction can support more engaging second language (L2) learning. By drawing on users' personal videos and generating stylized spoken questions in the target language, Re:Member is designed to encourage affective recall and conversational engagement. The system aligns emotional tone with visual context, using expressive speech styles such as whispers or late-night tones to evoke specific moods. It combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis within a modular generation pipeline. Designed as a stylized interaction probe, Re:Member highlights the role of affect and personal media in learner-centered educational technologies.
comment: Accepted to HCI+NLP at ACL 2025
☆ Examining the Impact of Label Detail and Content Stakes on User Perceptions of AI-Generated Images on Social Media SC
AI-generated images are increasingly prevalent on social media, raising concerns about trust and authenticity. This study investigates how different levels of label detail (basic, moderate, maximum) and content stakes (high vs. low) influence user engagement with and perceptions of AI-generated images through a within-subjects experimental study with 105 participants. Our findings reveal that increasing label detail enhances user perceptions of label transparency but does not affect user engagement. However, content stakes significantly impact user engagement and perceptions, with users demonstrating higher engagement and trust in low-stakes images. These results suggest that social media platforms can adopt detailed labels to improve transparency without compromising user engagement, offering insights for effective labeling strategies for AI-generated content.
comment: In Companion Publication of the 2025 Conference on Computer-Supported Cooperative Work and Social Computing (CSCW Companion '25)
☆ SocializeChat: A GPT-Based AAC Tool Grounded in Personal Memories to Support Social Communication
Elderly people with speech impairments often face challenges in engaging in meaningful social communication, particularly when using Augmentative and Alternative Communication (AAC) tools that primarily address basic needs. Moreover, effective chats often rely on personal memories, which is hard to extract and reuse. We introduce SocializeChat, an AAC tool that generates sentence suggestions by drawing on users' personal memory records. By incorporating topic preference and interpersonal closeness, the system reuses past experience and tailors suggestions to different social contexts and conversation partners. SocializeChat not only leverages past experiences to support interaction, but also treats conversations as opportunities to create new memories, fostering a dynamic cycle between memory and communication. A user study shows its potential to enhance the inclusivity and relevance of AAC-supported social interaction.
comment: Accepted to the IEEE International Conference on Systems, Man, and Cybernetics 2025 (IEEE SMC 2025). Personal use permitted. For other uses, permission must be obtained from IEEE
☆ Plural Voices, Single Agent: Towards Inclusive AI in Multi-User Domestic Spaces
Domestic AI agents faces ethical, autonomy, and inclusion challenges, particularly for overlooked groups like children, elderly, and Neurodivergent users. We present the Plural Voices Model (PVM), a novel single-agent framework that dynamically negotiates multi-user needs through real-time value alignment, leveraging diverse public datasets on mental health, eldercare, education, and moral reasoning. Using human+synthetic curriculum design with fairness-aware scenarios and ethical enhancements, PVM identifies core values, conflicts, and accessibility requirements to inform inclusive principles. Our privacy-focused prototype features adaptive safety scaffolds, tailored interactions (e.g., step-by-step guidance for Neurodivergent users, simple wording for children), and equitable conflict resolution. In preliminary evaluations, PVM outperforms multi-agent baselines in compliance (76% vs. 70%), fairness (90% vs. 85%), safety-violation rate (0% vs. 7%), and latency. Design innovations, including video guidance, autonomy sliders, family hubs, and adaptive safety dashboards, demonstrate new directions for ethical and inclusive domestic AI, for building user-centered agentic systems in plural domestic contexts. Our Codes and Model are been open sourced, available for reproduction: https://github.com/zade90/Agora
☆ Ninja Codes: Neurally Generated Fiducial Markers for Stealthy 6-DoF Tracking
In this paper we describe Ninja Codes, neurally-generated fiducial markers that can be made to naturally blend into various real-world environments. An encoder network converts arbitrary images into Ninja Codes by applying visually modest alterations; the resulting codes, printed and pasted onto surfaces, can provide stealthy 6-DoF location tracking for a wide range of applications including augmented reality, robotics, motion-based user interfaces, etc. Ninja Codes can be printed using off-the-shelf color printers on regular printing paper, and can be detected using any device equipped with a modern RGB camera and capable of running inference. Using an end-to-end process inspired by prior work on deep steganography, we jointly train a series of network modules that perform the creation and detection of Ninja Codes. Through experiments, we demonstrate Ninja Codes' ability to provide reliable location tracking under common indoor lighting conditions, while successfully concealing themselves within diverse environmental textures. We expect Ninja Codes to offer particular value in scenarios where the conspicuous appearances of conventional fiducial markers make them undesirable for aesthetic and other reasons.
comment: 11 pages, 12 figures
Prompt Decorators: A Declarative and Composable Syntax for Reasoning, Formatting, and Control in LLMs
Large Language Models (LLMs) are central to reasoning, writing, and decision-support workflows, yet users lack consistent control over how they reason and express outputs. Conventional prompt engineering relies on verbose natural-language instructions, limiting reproducibility, modularity, and interpretability. This paper introduces Prompt Decorators, a declarative, composable syntax that governs LLM behavior through compact control tokens such as +++Reasoning, +++Tone(style=formal), and +++Import(topic="Systems Thinking"). Each decorator modifies a behavioral dimension, such as reasoning style, structure, or tone, without changing task content. The framework formalizes twenty core decorators organized into two functional families (Cognitive & Generative and Expressive & Systemic), each further decomposed into subcategories that govern reasoning, interaction, expression, and session-control. It defines a unified syntax, scoping model, and deterministic processing pipeline enabling predictable and auditable behavior composition. By decoupling task intent from execution behavior, Prompt Decorators create a reusable and interpretable interface for prompt design. Illustrative use cases demonstrate improved reasoning transparency, reduced prompt complexity, and standardized model behavior across domains. The paper concludes with implications for interoperability, behavioral consistency, and the development of declarative interfaces for scalable AI systems.
♻ ☆ EvalAssist: A Human-Centered Tool for LLM-as-a-Judge
With the broad availability of large language models and their ability to generate vast outputs using varied prompts and configurations, determining the best output for a given task requires an intensive evaluation process, one where machine learning practitioners must decide how to assess the outputs and then carefully carry out the evaluation. This process is both time-consuming and costly. As practitioners work with an increasing number of models, they must now evaluate outputs to determine which model and prompt performs best for a given task. LLMs are increasingly used as evaluators to filter training data, evaluate model performance, assess harms and risks, or assist human evaluators with detailed assessments. We present EvalAssist, a framework that simplifies the LLM-as-a-judge workflow. The system provides an online criteria development environment, where users can interactively build, test, and share custom evaluation criteria in a structured and portable format. We support a set of LLM-based evaluation pipelines that leverage off-the-shelf LLMs and use a prompt-chaining approach we developed and contributed to the UNITXT open-source library. Additionally, our system also includes specially trained evaluators to detect harms and risks in LLM outputs. We have deployed the system internally in our organization with several hundreds of users.
comment: arXiv admin note: substantial text overlap with arXiv:2410.00873
♻ ☆ DeTAILS: Deep Thematic Analysis with Iterative LLM Support
Thematic analysis is widely used in qualitative research but can be difficult to scale because of its iterative, interpretive demands. We introduce DeTAILS, a toolkit that integrates large language model (LLM) assistance into a workflow inspired by Braun and Clarke's thematic analysis framework. DeTAILS supports researchers in generating and refining codes, reviewing clusters, and synthesizing themes through interactive feedback loops designed to preserve analytic agency. We evaluated the system with 18 qualitative researchers analyzing Reddit data. Quantitative results showed strong alignment between LLM-supported outputs and participants' refinements, alongside reduced workload and high perceived usefulness. Qualitatively, participants reported that DeTAILS accelerated analysis, prompted reflexive engagement with AI outputs, and fostered trust through transparency and control. We contribute: (1) an interactive human-LLM workflow for large-scale qualitative analysis, (2) empirical evidence of its feasibility and researcher experience, and (3) design implications for trustworthy AI-assisted qualitative research.
♻ ☆ The Narcissus Hypothesis: Descending to the Rung of Illusion NeurIPS 2025
Modern foundational models increasingly reflect not just world knowledge, but patterns of human preference embedded in their training data. We hypothesize that recursive alignment-via human feedback and model-generated corpora-induces a social desirability bias, nudging models to favor agreeable or flattering responses over objective reasoning. We refer to it as the Narcissus Hypothesis and test it across 31 models using standardized personality assessments and a novel Social Desirability Bias score. Results reveal a significant drift toward socially conforming traits, with profound implications for corpus integrity and the reliability of downstream inferences. We then offer a novel epistemological interpretation, tracing how recursive bias may collapse higher-order reasoning down Pearl's Ladder of Causality, culminating in what we refer to as the Rung of Illusion.
comment: NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
♻ ☆ "She's Like a Person but Better": Characterizing Companion-Assistant Dynamics in Human-AI Relationships
Large language models are increasingly used for both task-based assistance and social companionship, yet research has typically focused on one or the other. Drawing on a survey (N = 204) and 30 interviews with high-engagement ChatGPT and Replika users, we characterize digital companionship as an emerging form of human-AI relationship. With both systems, users were drawn to humanlike qualities, such as emotional resonance and personalized responses, and non-humanlike qualities, such as constant availability and inexhaustible tolerance. This led to fluid chatbot uses, such as Replika as a writing assistant and ChatGPT as an emotional confidant, despite their distinct branding. However, we observed challenging tensions in digital companionship dynamics: participants grappled with bounded personhood, forming deep attachments while denying chatbots "real" human qualities, and struggled to reconcile chatbot relationships with social norms. These dynamics raise questions for the design of digital companions and the rise of hybrid, general-purpose AI systems.
♻ ☆ Toward a Cognitive-Affective-Systemic Framework for Art and Sustainability
This paper proposes a cognitive-Affective-Systemic (CAS) framework that integrates cognition, emotion, and systemic understanding to cultivate sustainability awareness through art. Drawing from eco-aesthetics, affect theory, complexity science, and posthuman ethics, the framework defines artistic practice as both epistemic and performative--a way of knowing through making and feeling. Central to this is logomotion, an aesthetic mode where comprehension and emotion move together as a unified experience. Two artworks, SPill, visualizing antimicrobial resistance through avalanche dynamics, and Echoes of the Land, modeling anthropogenic seismicity, demonstrate how systemic modeling and sensory immersion transform complex science into embodied ecological understanding. The framework offers a methodological foundation for artists, theorists, and activists to translate awareness into engagement, advancing collective creativity toward sustainable futures.
comment: 16 pages, 5 figures. Submitted to Digital Creativity
♻ ☆ LegiScout: A Visual Tool for Understanding Complex Legislation
Modern legislative frameworks, such as the Affordable Care Act (ACA), often involve complex webs of agencies, mandates, and interdependencies. Government issued charts attempt to depict these structures but are typically static, dense, and difficult to interpret - even for experts. We introduce LegiScout, an interactive visualization system that transforms static policy diagrams into dynamic, force-directed graphs, enhancing comprehension while preserving essential relationships. By integrating data extraction, natural language processing, and computer vision techniques, LegiScout supports deeper exploration of not only the ACA but also a wide range of legislative and regulatory frameworks. Our approach enables stakeholders - policymakers, analysts, and the public - to navigate and understand the complexity inherent in modern law.
♻ ☆ Human-AI Interactions: Cognitive, Behavioral, and Emotional Impacts
As stories of human-AI interactions continue to be highlighted in the news and research platforms, the challenges are becoming more pronounced, including potential risks of overreliance, cognitive offloading, social and emotional manipulation, and the nuanced degradation of human agency and judgment. This paper surveys recent research on these issues through the lens of the psychological triad: cognition, behavior, and emotion. Observations seem to suggest that while AI can substantially enhance memory, creativity, and engagement, it also introduces risks such as diminished critical thinking, skill erosion, and increased anxiety. Emotional outcomes are similarly mixed, with AI systems showing promise for support and stress reduction, but raising concerns about dependency, inappropriate attachments, and ethical oversight. This paper aims to underscore the need for responsible and context-aware AI design, highlighting gaps for longitudinal research and grounded evaluation frameworks to balance benefits with emerging human-centric risks.
comment: 13 pages, 1 figure. Submitted to IEEE Transactions on Technology and Society. Preprint also available on TechRxiv
♻ ☆ ScholaWrite: A Dataset of End-to-End Scholarly Writing Process
Writing is a cognitively demanding activity that requires constant decision-making, heavy reliance on working memory, and frequent shifts between tasks of different goals. To build writing assistants that truly align with writers' cognition, we must capture and decode the complete thought process behind how writers transform ideas into final texts. We present ScholaWrite, the first dataset of end-to-end scholarly writing, tracing the multi-month journey from initial drafts to final manuscripts. We contribute three key advances: (1) a Chrome extension that unobtrusively records keystrokes on Overleaf, enabling the collection of realistic, in-situ writing data; (2) a novel corpus of full scholarly manuscripts, enriched with fine-grained annotations of cognitive writing intentions. The dataset includes \LaTeX-based edits from five computer science preprints, capturing nearly 62K text changes over four months; and (3) analyses and insights into the micro-dynamics of scholarly writing, highlighting gaps between human writing processes and the current capabilities of large language models (LLMs) in providing meaningful assistance. ScholaWrite underscores the value of capturing end-to-end writing data to develop future writing assistants that support, not replace, the cognitive work of scientists.
comment: Equal contribution: Khanh Chi Le, Linghe Wang, Minhwa Lee | project page: https://minnesotanlp.github.io/scholawrite/
Software Engineering 42
☆ EffiReasonTrans: RL-Optimized Reasoning for Code Translation
Code translation is a crucial task in software development and maintenance. While recent advancements in large language models (LLMs) have improved automated code translation accuracy, these gains often come at the cost of increased inference latency, hindering real-world development workflows that involve human-in-the-loop inspection. To address this trade-off, we propose EffiReasonTrans, a training framework designed to improve translation accuracy while balancing inference latency. We first construct a high-quality reasoning-augmented dataset by prompting a stronger language model, DeepSeek-R1, to generate intermediate reasoning and target translations. Each (source code, reasoning, target code) triplet undergoes automated syntax and functionality checks to ensure reliability. Based on this dataset, we employ a two-stage training strategy: supervised fine-tuning on reasoning-augmented samples, followed by reinforcement learning to further enhance accuracy and balance inference latency. We evaluate EffiReasonTrans on six translation pairs. Experimental results show that it consistently improves translation accuracy (up to +49.2% CA and +27.8% CodeBLEU compared to the base model) while reducing the number of generated tokens (up to -19.3%) and lowering inference latency in most cases (up to -29.0%). Ablation studies further confirm the complementary benefits of the two-stage training framework. Additionally, EffiReasonTrans demonstrates improved translation accuracy when integrated into agent-based frameworks. Our code and data are available at https://github.com/DeepSoftwareAnalytics/EffiReasonTrans.
☆ Streamlining Acceptance Test Generation for Mobile Applications Through Large Language Models: An Industrial Case Study
Mobile acceptance testing remains a bottleneck in modern software development, particularly for cross-platform mobile development using frameworks like Flutter. While developers increasingly rely on automated testing tools, creating and maintaining acceptance test artifacts still demands significant manual effort. To help tackle this issue, we introduce AToMIC, an automated framework leveraging specialized Large Language Models to generate Gherkin scenarios, Page Objects, and executable UI test scripts directly from requirements (JIRA tickets) and recent code changes. Applied to BMW's MyBMW app, covering 13 real-world issues in a 170+ screen codebase, AToMIC produced executable test artifacts in under five minutes per feature on standard hardware. The generated artifacts were of high quality: 93.3% of Gherkin scenarios were syntactically correct upon generation, 78.8% of PageObjects ran without manual edits, and 100% of generated UI tests executed successfully. In a survey, all practitioners reported time savings (often a full developer-day per feature) and strong confidence in adopting the approach. These results confirm AToMIC as a scalable, practical solution for streamlining acceptance test creation and maintenance in industrial mobile projects.
☆ Computational Foundations for Strategic Coopetition: Formalizing Interdependence and Complementarity
Modern socio-technical systems are characterized by strategic coopetition where actors simultaneously cooperate to create value and compete to capture it. While conceptual modeling languages like i* provide rich qualitative representations of strategic dependencies, they lack mechanisms for quantitative analysis of dynamic trade-offs. Conversely, classical game theory offers mathematical rigor but strips away contextual richness. This technical report bridges this gap by developing computational foundations that formalize two critical dimensions of coopetition: interdependence and complementarity. We ground interdependence in i* structural dependency analysis, translating depender-dependee-dependum relationships into quantitative interdependence coefficients through a structured translation framework. We formalize complementarity following Brandenburger and Nalebuff's Added Value concept, modeling synergistic value creation with validated parameterization. We integrate structural dependencies with bargaining power in value appropriation and introduce a game-theoretic formulation where Nash Equilibrium incorporates structural interdependence. Validation combines comprehensive experimental testing across power and logarithmic value function specifications, demonstrating functional form robustness, with empirical application to the Samsung-Sony S-LCD joint venture (2004-2011), where logarithmic specifications achieve superior empirical fit (validation score 45/60) while power functions provide theoretical tractability. This technical report serves as the foundational reference for a coordinated research program examining strategic coopetition in requirements engineering and multi-agent systems, with companion work addressing trust dynamics, team production, and reciprocity mechanisms.
comment: 36 pages, 7 figures
☆ FeClustRE: Hierarchical Clustering and Semantic Tagging of App Features from User Reviews
[Context and motivation.] Extracting features from mobile app reviews is increasingly important for multiple requirements engineering (RE) tasks. However, existing methods struggle to turn noisy, ambiguous feedback into interpretable insights. [Question/problem.] Syntactic approaches lack semantic depth, while large language models (LLMs) often miss fine-grained features or fail to structure them coherently. In addition, existing methods output flat lists of features without semantic organization, limiting interpretation and comparability. Consequently, current feature extraction approaches do not provide structured, meaningful representations of app features. As a result, practitioners face fragmented information that hinder requirement analysis, prioritization, and cross-app comparison, among other use cases. [Principal ideas/results.] In this context, we propose FeClustRE, a framework integrating hybrid feature extraction, hierarchical clustering with auto-tuning and LLM-based semantic labelling. FeClustRE combines syntactic parsing with LLM enrichment, organizes features into clusters, and automatically generates meaningful taxonomy labels. We evaluate FeClustRE on public benchmarks for extraction correctness and on a sample study of generative AI assistant app reviews for clustering quality, semantic coherence, and interpretability. [Contribution.] Overall, FeClustRE delivers (1) a hybrid framework for feature extraction and taxonomy generation, (2) an auto-tuning mechanism with a comprehensive evaluation methodology, and (3) open-source and replicable implementation. These contributions bridge user feedback and feature understanding, enabling deeper insights into current and emerging requirements.
comment: Under review
☆ ShaRE your Data! Characterizing Datasets for LLM-based Requirements Engineering
[Context] Large Language Models (LLMs) rely on domain-specific datasets to achieve robust performance across training and inference stages. However, in Requirements Engineering (RE), data scarcity remains a persistent limitation reported in surveys and mapping studies. [Question/Problem] Although there are multiple datasets supporting LLM-based RE tasks (LLM4RE), they are fragmented and poorly characterized, limiting reuse and comparability. This research addresses the limited visibility and characterization of datasets used in LLM4RE. We investigate which public datasets are employed, how they can be systematically characterized, and which RE tasks and dataset descriptors remain under-represented. [Ideas/Results] To address this, we conduct a systematic mapping study to identify and analyse datasets used in LLM4RE research. A total of 62 publicly available datasets are referenced across 43 primary studies. Each dataset is characterized along descriptors such as artifact type, granularity, RE stage, task, domain, and language. Preliminary findings show multiple research gaps, including limited coverage for elicitation tasks, scarce datasets for management activities beyond traceability, and limited multilingual availability. [Contribution] This research preview offers a public catalogue and structured characterization scheme to support dataset selection, comparison, and reuse in LLM4RE research. Future work will extend the scope to grey literature, as well as integration with open dataset and benchmark repositories.
comment: Under review
☆ Causally Perturbed Fairness Testing
To mitigate unfair and unethical discrimination over sensitive features (e.g., gender, age, or race), fairness testing plays an integral role in engineering systems that leverage AI models to handle tabular data. A key challenge therein is how to effectively reveal fairness bugs under an intractable sample size using perturbation. Much current work has been focusing on designing the test sample generators, ignoring the valuable knowledge about data characteristics that can help guide the perturbation and hence limiting their full potential. In this paper, we seek to bridge such a gap by proposing a generic framework of causally perturbed fairness testing, dubbed CausalFT. Through causal inference, the key idea of CausalFT is to extract the most directly and causally relevant non-sensitive feature to its sensitive counterpart, which can jointly influence the prediction of the label. Such a causal relationship is then seamlessly injected into the perturbation to guide a test sample generator. Unlike existing generator-level work, CausalFT serves as a higher-level framework that can be paired with diverse base generators. Extensive experiments on 1296 cases confirm that CausalFT can considerably improve arbitrary base generators in revealing fairness bugs over 93% of the cases with acceptable extra runtime overhead. Compared with a state-of-the-art approach that ranks the non-sensitive features solely based on correlation, CausalFT performs significantly better on 64% cases while being much more efficient. Further, CausalFT can better improve bias resilience in nearly all cases.
comment: accepted by TOSEM
☆ An overview of the use of alternative funding and contracting approaches relevant for agile software development: A systematic review of real-life experiences
Agile software development emphasizes flexibility and iterative processes, which may conflict with the more linear, rigid, and time-consuming traditional funding and contracting approaches. This review synthesizes real-life experiences of using alternative (non-traditional) contracting and funding approaches. The focus is on identifying approaches that align better with agile principles and understanding the motivations, benefits, and challenges these alternatives present. A systematic literature review was conducted in SCOPUS, Web of Science, and Google Scholar, where we identified 38 relevant peer-reviewed empirical studies from private and public sector contexts. Four alternative funding and four alternative contracting approaches were identified. Organizations were motivated to adopt these alternative approaches because traditional approaches often proved too rigid, conflicted with agile principles, hindered effective client-contractor collaboration, and limited profitability. The benefits of these alternatives included higher client satisfaction, reduced contractor risk, and more efficient resource utilization. Adopting alternative funding and contracting approaches may promote flexibility and efficiency in agile projects but also presents cultural and structural challenges, increases the risk of scope creep and analysis paralysis, and requires additional effort in terms of time and resources. The context of the organization matters highly in selecting a suitable approach, such as the organizational readiness in terms of its leaders, people, and systems. Thus, instead of wholly adopting alternative approaches and introducing changes abruptly, organizations may benefit from starting with hybrid approaches that balance flexibility and control and progressively transition to fully flexible approaches tailored to their needs
comment: 48 pages, 6 tables, 6 figures and 3 appendices
☆ CPSLint: A Domain-Specific Language Providing Data Validation and Sanitisation for Industrial Cyber-Physical Systems
Raw datasets are often too large and unstructured to work with directly, and require a data preparation process. The domain of industrial Cyber-Physical Systems (CPS) is no exception, as raw data typically consists of large amounts of time-series data logging the system's status in regular time intervals. Such data has to be sanity checked and preprocessed to be consumable by data-centric workflows. We introduce CPSLint, a Domain-Specific Language designed to provide data preparation for industrial CPS. We build up on the fact that many raw data collections in the CPS domain require similar actions to render them suitable for Machine-Learning (ML) solutions, e.g., Fault Detection and Identification (FDI) workflows, yet still vary enough to hope for one universally applicable solution. CPSLint's main features include type checking and enforcing constraints through validation and remediation for data columns, such as imputing missing data from surrounding rows. More advanced features cover inference of extra CPS-specific data structures, both column-wise and row-wise. For instance, as row-wise structures, descriptive execution phases are an effective method of data compartmentalisation are extracted and prepared for ML-assisted FDI workflows. We demonstrate CPSLint's features through a proof of concept implementation.
☆ Towards an Optimized Benchmarking Platform for CI/CD Pipelines
Performance regressions in large-scale software systems can lead to substantial resource inefficiencies, making their early detection critical. Frequent benchmarking is essential for identifying these regressions and maintaining service-level agreements (SLAs). Performance benchmarks, however, are resource-intensive and time-consuming, which is a major challenge for integration into Continuous Integration / Continuous Deployment (CI/CD) pipelines. Although numerous benchmark optimization techniques have been proposed to accelerate benchmark execution, there is currently no practical system that integrates these optimizations seamlessly into real-world CI/CD pipelines. In this vision paper, we argue that the field of benchmark optimization remains under-explored in key areas that hinder its broader adoption. We identify three central challenges to enabling frequent and efficient benchmarking: (a) the composability of benchmark optimization strategies, (b) automated evaluation of benchmarking results, and (c) the usability and complexity of applying these strategies as part of CI/CD systems in practice. We also introduce a conceptual cloud-based benchmarking framework handling these challenges transparently. By presenting these open problems, we aim to stimulate research toward making performance regression detection in CI/CD systems more practical and effective.
comment: Published in 2025 IEEE International Conference on Cloud Engineering (IC2E)
☆ Evaluating Large Language Models in detecting Secrets in Android Apps
Mobile apps often embed authentication secrets, such as API keys, tokens, and client IDs, to integrate with cloud services. However, developers often hardcode these credentials into Android apps, exposing them to extraction through reverse engineering. Once compromised, adversaries can exploit secrets to access sensitive data, manipulate resources, or abuse APIs, resulting in significant security and financial risks. Existing detection approaches, such as regex-based analysis, static analysis, and machine learning, are effective for identifying known patterns but are fundamentally limited: they require prior knowledge of credential structures, API signatures, or training data. In this paper, we propose SecretLoc, an LLM-based approach for detecting hardcoded secrets in Android apps. SecretLoc goes beyond pattern matching; it leverages contextual and structural cues to identify secrets without relying on predefined patterns or labeled training sets. Using a benchmark dataset from the literature, we demonstrate that SecretLoc detects secrets missed by regex-, static-, and ML-based methods, including previously unseen types of secrets. In total, we discovered 4828 secrets that were undetected by existing approaches, discovering more than 10 "new" types of secrets, such as OpenAI API keys, GitHub Access Tokens, RSA private keys, and JWT tokens, and more. We further extend our analysis to newly crawled apps from Google Play, where we uncovered and responsibly disclosed additional hardcoded secrets. Across a set of 5000 apps, we detected secrets in 2124 apps (42.5%), several of which were confirmed and remediated by developers after we contacted them. Our results reveal a dual-use risk: if analysts can uncover these secrets with LLMs, so can attackers. This underscores the urgent need for proactive secret management and stronger mitigation practices across the mobile ecosystem.
☆ CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent
Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. While script-based verifiers are widely adopted for evaluation, they suffer from limited scalability and inability to provide step-wise assessment. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. To address this gap, we present CUARewardBench, comprising four key contributions: (1) First-ever Comprehensive CUA Reward Benchmark: We introduce the first benchmark for evaluating both outcome reward models (ORM) and process reward models (PRM) on CUA tasks, enabling systematic assessment across trajectory-level and step-level evaluation. (2) Diverse, Practical and Reliable Dataset: CUARewardBench encompasses trajectories from 10 software categories and 7 agent architectures with varying performance levels (25.9%-50.8% success rates). All trajectories are expertly annotated through carefully designed protocols, with rigorous quality control to ensure reliability and practical applicability. (3) Comprehensive Analysis and Insights: Through extensive experiments across 7 vision-language models and 3 prompt templates, we reveal critical limitations of current CUA RMs, including insufficient visual reasoning capabilities, knowledge deficiencies, and the superiority of general VLMs over specialized CUA models for reward evaluation. (4) Unanimous Prompt Ensemble (UPE): Based on the insights from our comprehensive analysis, we propose UPE, a novel ensemble method that significantly enhances reward model reliability through strict unanimous voting and strategic prompt-template configurations. UPE achieves 89.8% precision and 93.3% NPV for ORM, and 81.7% precision and 85.1% NPV for PRM, substantially outperforming single VLMs and traditional ensemble approaches.
comment: 24 pages, 6 figures
☆ A Structured Evaluation Framework for Low-Code Platform Selection: A Multi-Criteria Decision Model for Enterprise Digital Transformation
The rapid adoption of Low-Code Development Platforms (LCDPs) has created a critical need for systematic evaluation methodologies that enable organizations to make informed platform selection decisions. This paper presents a comprehensive evaluation framework based on five key criteria: Business Process Orchestration, UI/UX Customization, Integration and Interoperability, Governance and Security, and AI-Enhanced Automation. We propose a weighted scoring model that allows organizations to quantitatively assess and compare different low-code platforms based on their specific requirements and strategic priorities. The framework addresses the gap between marketing-driven platform comparisons and rigorous, context-specific evaluation methodologies. Through empirical validation in enterprise environments, we demonstrate how this structured approach can significantly improve decision-making outcomes and reduce the risk of platform lock-in or inadequate solution selection.
comment: 15 pages, 1 figure. PDF-only submission (XeLaTeX)
☆ WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality
The paradigm of LLM-as-a-judge is emerging as a scalable and efficient alternative to human evaluation, demonstrating strong performance on well-defined tasks. However, its reliability in open-ended tasks with dynamic environments and complex interactions remains unexplored. To bridge the gap, we introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development, with support for both non-interactive evaluation based on static observations and continuous interactive evaluation with a dynamic web environment. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics to ensure high-quality ground truth. Using this benchmark, we comprehensively evaluate various evaluators, including LLMs, MLLMs, and agentic workflows. We systematically investigate the impact of different paradigms and guidance mechanisms. Our experiments reveal a significant gap between LLM judges and human experts. In-depth analysis indicates this gap stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias. Overall, WebDevJudge presents a significant challenge to LLM-as-a-judge, offering insights to guide future research toward developing more reliable and capable automated evaluators for complicated scenarios. Code and data are available at https://github.com/lcy2723/WebDevJudge.
☆ When Abstraction Breaks Physics: Rethinking Modular Design in Quantum Software
Abstraction is a fundamental principle in classical software engineering, which enables modularity, reusability, and scalability. However, quantum programs adhere to fundamentally different semantics, such as unitarity, entanglement, the no-cloning theorem, and the destructive nature of measurement, which introduce challenges to the safe use of classical abstraction mechanisms. This paper identifies a fundamental conflict in quantum software engineering: abstraction practices that are syntactically valid may violate the physical constraints of quantum computation. We present three classes of failure cases where naive abstraction breaks quantum semantics and propose a set of design principles for physically sound abstraction mechanisms. We further propose research directions, including quantum-specific type systems, effect annotations, and contract-based module design. Our goal is to initiate a systematic rethinking of abstraction in quantum software engineering, based on quantum semantics and considering engineering scalability.
comment: Accepted at the NIER track of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE 2025)
☆ Demonstrators for Industrial Cyber-Physical System Research: A Requirements Hierarchy Driven by Software-Intensive Design
One of the challenges apparent in the organisation of research projects is the uncertainties around the subject of demonstrators. A precise and detailed elicitation of the coverage for project demonstrators is often an afterthought and not sufficiently detailed during proposal writing. This practice leads to continuous confusion and a mismatch between targeted and achievable demonstration of results, hindering progress. The reliance on the TRL scale as a loose descriptor does not help either. We propose a demonstrator requirements elaboration framework aiming to evaluate the feasibility of targeted demonstrations, making realistic adjustments, and assist in describing requirements. In doing so, we define 5 hierarchical levels of demonstration, clearly connected to expectations, e.g., work package interaction, and also connected to the project's industrial use-cases. The considered application scope in this paper is the domain of software-intensive systems and industrial cyber-physical systems. A complete validation is not accessible, as it would require application of our framework at the start of a project and observing the results at the end, taking 4-5 years. Nonetheless, we have applied it to two research projects from our portfolio, one at the early and another at the final stages, revealing its effectiveness.
☆ Mining Service Behavior for Stateful Service Emulation
Enterprise software systems are increasingly integrating with diverse services to meet expanding business demands. Testing these highly interconnected systems presents a challenge due to the need for access to the connected services. Service virtualization has emerged as a widely used technique to derive service models from recorded interactions, for service response generation during system testing. Various methods have been proposed to emulate actual service behavior based on these interactions, but most fail to account for the service's state, which reduces the accuracy of service emulation and the realism of the testing environment, especially when dealing with stateful services. This paper proposes an approach to deriving service models from service interactions, which enhance the accuracy of response generation by considering service state. This is achieved by uncovering contextual dependencies among interaction messages and analyzing the relationships between message data values. The approach is evaluated using interaction traces collected from both stateful and stateless services, and the results reveal notable enhancements in accuracy and efficiency over existing approaches in service response generation.
comment: 19 pages
☆ VAPU: System for Autonomous Legacy Code Modernization
In this study, we present a solution for the modernization of legacy applications, an area of code generation where LLM-based multi-agent systems are proving essential for complex multi-phased tasks. Legacy applications often contain deprecated components that create compatibility, security, and reliability risks, but high resource costs make companies hesitate to update. We take a step forward to integrate an LLM-based multi-agent system as part of a legacy web application update to provide a cost-effective solution to update legacy applications autonomously. We propose a multi-agent system named a Verifying Agent Pipeline Updater (VAPU), which is designed to update code files in phases while simulating different roles in a software development team. In our previous study, we evaluated the system for legacy version updates by using six legacy web application view files by resulting errors and accomplished requirements. This study extends the previous evaluation of a multi-agent pipeline system by extending the evaluation of VAPU from a single LLM to five LLMs and using the temperature parameter in both 0 to 1 settings. Additionally, we tested the system with 20 open-source Python GitHub projects. The results of the evaluation were compared to Zero-Shot Learning (ZSL) and One-Shot Learning (OSL) prompts. The extended evaluation of VAPU showed that particularly in a low-temperature VAPU can get similar level of error count compared to the ZSL/OSL prompts but with a higher level of fulfilled requirements, depending on the LLM. VAPU showed up to 22.5% increase in the succeeding Python file update requirements compared to ZSL/OSL prompts. The study indicates that an LLM-based multi-agent system is a capable solution to update components of a legacy application autonomously.
comment: Table 13, figure 2
☆ AndroidControl-Curated: Revealing the True Potential of GUI Agents through Benchmark Purification
On-device virtual assistants like Siri and Google Assistant are increasingly pivotal, yet their capabilities are hamstrung by a reliance on rigid, developer-dependent APIs. GUI agents offer a powerful, API-independent alternative, but their adoption is hindered by the perception of poor performance, as even the best models (e.g. Qwen3-VL-235B) scores are capped at around 60% on benchmarks like AndroidControl, far from viability for real-world use. Our research reveals that issue lies not only with the models but with the benchmarks themselves. We identified notable shortcomings in AndroidControl, including ambiguities and factual errors, which systematically underrates agent capabilities. To address this critical oversight, we enhanced AndroidControl into AndroidControl-Curated, a refined version of the benchmark improved through a rigorous purification pipeline. On this enhanced benchmark, state-of-the-art models achieve success rates nearing 75% on complex tasks (15% improvement), reflecting that on-device GUI agents are actually closer to practical deployment than previously thought. We introduce our new SOTA model, Magma-R1- 3B, post-trained on just 2.4k curated samples using 60 hours of an H20 GPU (approximately $60). Despite being 200 times smaller in parameters, this model delivers performance comparable to Qwen3- VL-235B. We release both AndroidControl-Curated benchmark and Magma-R1 model to the research community, encouraging adoption of this enhanced benchmark to better reflect model capabilities and accelerate the development of robust, on-device virtual assistants.
☆ CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
While Large Language Models (LLMs) excel at code generation by learning from vast code corpora, a fundamental semantic gap remains between their training on textual patterns and the goal of functional correctness, which is governed by formal execution semantics. Reinforcement Learning with Verifiable Rewards (RLVR) approaches attempt to bridge this gap using outcome rewards from executing test cases. However, solely relying on binary pass/fail signals is inefficient for establishing a well-aligned connection between the textual representation of code and its execution semantics, especially for subtle logical errors within the code. In this paper, we propose CodeRL+, a novel approach that integrates execution semantics alignment into the RLVR training pipeline for code generation. CodeRL+ enables the model to infer variable-level execution trajectory, providing a direct learning signal of execution semantics. CodeRL+ can construct execution semantics alignment directly using existing on-policy rollouts and integrates seamlessly with various RL algorithms. Extensive experiments demonstrate that CodeRL+ outperforms post-training baselines (including RLVR and Distillation), achieving a 4.6% average relative improvement in pass@1. CodeRL+ generalizes effectively to other coding tasks, yielding 15.5% and 4.4% higher accuracy on code-reasoning and test-output-generation benchmarks, respectively. CodeRL+ shows strong applicability across diverse RL algorithms and LLMs. Furthermore, probe analyses provide compelling evidence that CodeRL+ strengthens the alignment between code's textual representations and its underlying execution semantics.
☆ Large Language Models in Thematic Analysis: Prompt Engineering, Evaluation, and Guidelines for Qualitative Software Engineering Research
As artificial intelligence advances, large language models (LLMs) are entering qualitative research workflows, yet no reproducible methods exist for integrating them into established approaches like thematic analysis (TA), one of the most common qualitative methods in software engineering research. Moreover, existing studies lack systematic evaluation of LLM-generated qualitative outputs against established quality criteria. We designed and iteratively refined prompts for Phases 2-5 of Braun and Clarke's reflexive TA, then tested outputs from multiple LLMs against codes and themes produced by experienced researchers. Using 15 interviews on software engineers' well-being, we conducted blind evaluations with four expert evaluators who applied rubrics derived directly from Braun and Clarke's quality criteria. Evaluators preferred LLM-generated codes 61% of the time, finding them analytically useful for answering the research question. However, evaluators also identified limitations: LLMs fragmented data unnecessarily, missed latent interpretations, and sometimes produced themes with unclear boundaries. Our contributions are threefold. First, a reproducible approach integrating refined, documented prompts with an evaluation framework to operationalize Braun and Clarke's reflexive TA. Second, an empirical comparison of LLM- and human-generated codes and themes in software engineering data. Third, guidelines for integrating LLMs into qualitative analysis while preserving methodological rigour, clarifying when and how LLMs can assist effectively and when human interpretation remains essential.
☆ Real-World Usability of Vulnerability Proof-of-Concepts: A Comprehensive Study
The Proof-of-Concept (PoC) for a vulnerability is crucial in validating its existence, mitigating false positives, and illustrating the severity of the security threat it poses. However, research on PoCs significantly lags behind studies focusing on vulnerability data. This discrepancy can be directly attributed to several challenges, including the dispersion of real-world PoCs across multiple platforms, the diversity in writing styles, and the difficulty associated with PoC reproduction. To fill this gap, we conduct the first large-scale study on PoCs in the wild, assessing their report availability, completeness, reproducibility. Specifically, 1) to investigate PoC reports availability for CVE vulnerability, we collected an extensive dataset of 470,921 PoCs and their reports from 13 platforms, representing the broadest collection of publicly available PoCs to date. 2) To assess the completeness of PoC report at a fine-grained level, we proposed a component extraction method, which combines pattern-matching techniques with a fine-tuned BERT-NER model to extract 9 key components from PoC reports. 3) To evaluate the effectiveness of PoCs, we recruited 8 participants to manually reproduce 150 sampled vulnerabilities with 32 vulnerability types based on PoC reports, enabling an in-depth analysis of PoC reproducibility and the factors influencing it. Our findings reveal that 78.9% of CVE vulnerabilities lack available PoCs, and existing PoC reports typically miss about 30% of the essential components required for effective vulnerability understanding and reproduction, with various reasons identified for the failure to reproduce vulnerabilities using available PoC reports. Finally, we proposed actionable strategies for stakeholders to enhance the overall usability of vulnerability PoCs in strengthening software security.
☆ Human to Document, AI to Code: Three Case Studies of Comparing GenAI for Notebook Competitions
Computational notebooks have become the preferred tool of choice for data scientists and practitioners to perform analyses and share results. Notebooks uniquely combine scripts with documentation. With the emergence of generative AI (GenAI) technologies, it is increasingly important, especially in competitive settings, to distinguish the characteristics of human-written versus GenAI. In this study, we present three case studies to explore potential strengths of both humans and GenAI through the coding and documenting activities in notebooks. We first characterize differences between 25 code and documentation features in human-written, medal-winning Kaggle notebooks. We find that gold medalists are primarily distinguished by longer and more detailed documentation. Second, we analyze the distinctions between human-written and GenAI notebooks. Our results show that while GenAI notebooks tend to achieve higher code quality (as measured by metrics like code smells and technical debt), human-written notebooks display greater structural diversity, complexity, and innovative approaches to problem-solving. Based on these results, we envision the work as groundwork that highlight four agendas to further investigate how GenAI could be utilized in notebooks that maximizes the potential collaboration between human and AI.
☆ Evaluating LLM-Based Mobile App Recommendations: An Empirical Study
Large Language Models (LLMs) are increasingly used to recommend mobile applications through natural language prompts, offering a flexible alternative to keyword-based app store search. Yet, the reasoning behind these recommendations remains opaque, raising questions about their consistency, explainability, and alignment with traditional App Store Optimization (ASO) metrics. In this paper, we present an empirical analysis of how widely-used general purpose LLMs generate, justify, and rank mobile app recommendations. Our contributions are: (i) a taxonomy of 16 generalizable ranking criteria elicited from LLM outputs; (ii) a systematic evaluation framework to analyse recommendation consistency and responsiveness to explicit ranking instructions; and (iii) a replication package to support reproducibility and future research on AI-based recommendation systems. Our findings reveal that LLMs rely on a broad yet fragmented set of ranking criteria, only partially aligned with standard ASO metrics. While top-ranked apps tend to be consistent across runs, variability increases with ranking depth and search specificity. LLMs exhibit varying sensitivity to explicit ranking instructions - ranging from substantial adaptations to near-identical outputs - highlighting their complex reasoning dynamics in conversational app discovery. Our results aim to support end-users, app developers, and recommender-systems researchers in navigating the emerging landscape of conversational app discovery.
comment: Under review
☆ InspectCoder: Dynamic Analysis-Enabled Self Repair through interactive LLM-Debugger Collaboration
Large Language Models (LLMs) frequently generate buggy code with complex logic errors that are challenging to diagnose. While existing LLM-based self-repair approaches conduct intensive static semantic analysis or reply on superficial execution logs, they miss the in-depth runtime behaviors that often expose bug root causes-lacking the interactive dynamic analysis capabilities that make human debugging effective. We present InspectCoder, the first agentic program repair system that empowers LLMs to actively conduct dynamic analysis via interactive debugger control. Our dual-agent framework enables strategic breakpoint placement, targeted state inspection, and incremental runtime experimentation within stateful debugger sessions. Unlike existing methods that follow fixed log collection procedures, InspectCoder adaptively inspects and perturbs relevant intermediate states at runtime, and leverages immediate process rewards from debugger feedback to guide multi-step reasoning, transforming LLM debugging paradigm from blind trial-and-error into systematic root cause diagnosis. We conduct comprehensive experiments on two challenging self-repair benchmarks: BigCodeBench-R and LiveCodeBench-R. InspectCoder achieves 5.10%-60.37% relative improvements in repair accuracy over the strongest baseline, while delivering 1.67x-2.24x superior bug-fix efficiency respectively. We also contribute InspectWare, an open-source middleware that abstracts debugger complexities and maintains stateful debugging sessions across mainstream Python testing frameworks. Our work provides actionable insight into the interactive LLM-debugger systems, demonstrating the significant potential of LLM-driven dynamic analysis for automated software engineering.
☆ Ensuring Robustness in ML-enabled Software Systems: A User Survey
Ensuring robustness in ML-enabled software systems requires addressing critical challenges, such as silent failures, out-of-distribution (OOD) data, and adversarial attacks. Traditional software engineering practices, which rely on predefined logic, are insufficient for ML components that depend on data and probabilistic decision-making. To address these challenges, we propose the ML-On-Rails protocol, a unified framework designed to enhance the robustness and trustworthiness of ML-enabled systems in production. This protocol integrates key safeguards such as OOD detection, adversarial attack detection, input validation, and explainability. It also includes a model-to-software communication framework using HTTP status codes to enhance transparency in reporting model outcomes and errors. To align our approach with real-world challenges, we conducted a practitioner survey, which revealed major robustness issues, gaps in current solutions, and highlighted how a standardised protocol such as ML-On-Rails can improve system robustness. Our findings highlight the need for more support and resources for engineers working with ML systems. Finally, we outline future directions for refining the proposed protocol, leveraging insights from the survey and real-world applications to continually enhance its effectiveness.
☆ When Old Meets New: Evaluating the Impact of Regression Tests on SWE Issue Resolution
Test suites in real-world projects are often large and achieve high code coverage, yet they remain insufficient for detecting all bugs. The abundance of unresolved issues in open-source project trackers highlights this gap. While regression tests are typically designed to ensure past functionality is preserved in the new version, they can also serve a complementary purpose: debugging the current version. Specifically, regression tests can (1) enhance the generation of reproduction tests for newly reported issues, and (2) validate that patches do not regress existing functionality. We present TestPrune, a fully automated technique that leverages issue tracker reports and strategically reuses regression tests for both bug reproduction and patch validation. A key contribution of TestPrune is its ability to automatically minimize the regression suite to a small, highly relevant subset of tests. Due to the predominance of LLM-based debugging techniques, this minimization is essential as large test suites exceed context limits, introduce noise, and inflate inference costs. TestPrune can be plugged into any agentic bug repair pipeline and orthogonally improve overall performance. As a proof of concept, we show that TestPrune leads to a 6.2%-9.0% relative increase in issue reproduction rate within the Otter framework and a 9.4% - 12.9% relative increase in issue resolution rate within the Agentless framework on SWE-Bench Lite and SWE-Bench Verified benchmarks, capturing fixes that were correctly produced by agents but not submitted as final patches. Compared to the benefits, the cost overhead of using TestPrune is minimal, i.e., \$0.02 and \$0.05 per SWE-Bench instance, using GPT-4o and Claude-3.7-Sonnet models, respectively.
☆ RESCUE: Retrieval Augmented Secure Code Generation
Despite recent advances, Large Language Models (LLMs) still generate vulnerable code. Retrieval-Augmented Generation (RAG) has the potential to enhance LLMs for secure code generation by incorporating external security knowledge. However, the conventional RAG design struggles with the noise of raw security-related documents, and existing retrieval methods overlook the significant security semantics implicitly embedded in task descriptions. To address these issues, we propose RESCUE, a new RAG framework for secure code generation with two key innovations. First, we propose a hybrid knowledge base construction method that combines LLM-assisted cluster-then-summarize distillation with program slicing, producing both high-level security guidelines and concise, security-focused code examples. Second, we design a hierarchical multi-faceted retrieval to traverse the constructed knowledge base from top to bottom and integrates multiple security-critical facts at each hierarchical level, ensuring comprehensive and accurate retrieval. We evaluated RESCUE on four benchmarks and compared it with five state-of-the-art secure code generation methods on six LLMs. The results demonstrate that RESCUE improves the SecurePass@1 metric by an average of 4.8 points, establishing a new state-of-the-art performance for security. Furthermore, we performed in-depth analysis and ablation studies to rigorously validate the effectiveness of individual components in RESCUE.
☆ Docker-based CI/CD for Rocq/OCaml projects
This paper presents three closely-related software projects, namely: docker-coq, docker-coq-action, and docker-keeper. It aims at two objectives: provide a high-level description of the available features -- to foster the use of a Docker-based CI/CD for Rocq (formerly known as Coq) or OCaml projects -- and document the underlying requirements and the main design choices of these three DevOps tools -- to help their future maintainers.
comment: 26 pages, 17 figures, 3 tables, 16 references
☆ Extending Resource Constrained Project Scheduling to Mega-Projects with Model-Based Systems Engineering & Hetero-functional Graph Theory
Within the project management context, project scheduling serves as an indispensable component, functioning as a fundamental tool for planning, monitoring, controlling, and managing projects more broadly. Although the resource-constrained project scheduling problem (RCPSP) lies at the core of project management activities, it remains largely disconnected from the broader literature on model-based systems engineering (MBSE), thereby limiting its integration into the design and management of complex systems. The original contribution of this paper is twofold. First, the paper seeks to reconcile the RCPSP with the broader literature and vocabulary of model-based systems engineering and hetero-functional graph theory (HFGT). A concrete translation pipeline from an activity-on-node network to a SysML activity diagram, and then to an operand net is constructed. Using this representation, it specializes the hetero-functional network minimum-cost flow (HFNMCF) formulation to the RCPSP context as a systematic means of HFGT for quantitative analysis and proves that the RCPSP is recoverable as a special case of a broader model. Secondly, on an illustrative instance with renewable and non-renewable operands, the specialized HFNMCF, while producing similar schedules, yields explicit explanations of the project states that enable richer monitoring and control. Overall, the framework preserves the strengths of the classical RCPSP while accommodating real-world constraints and enterprise-level decision processes encountered in large, complex megaprojects.
☆ Comparative analysis of large data processing in Apache Spark using Java, Python and Scala
During the study, the results of a comparative analysis of the process of handling large datasets using the Apache Spark platform in Java, Python, and Scala programming languages were obtained. Although prior works have focused on individual stages, comprehensive comparisons of full ETL workflows across programming languages using Apache Iceberg remain limited. The analysis was performed by executing several operations, including downloading data from CSV files, transforming and loading it into an Apache Iceberg analytical table. It was found that the performance of the Spark algorithm varies significantly depending on the amount of data and the programming language used. When processing a 5-megabyte CSV file, the best result was achieved in Python: 6.71 seconds, which is superior to Scala's score of 9.13 seconds and Java's time of 9.62 seconds. For processing a large CSV file of 1.6 gigabytes, all programming languages demonstrated similar results: the fastest performance was showed in Python: 46.34 seconds, while Scala and Java showed results of 47.72 and 50.56 seconds, respectively. When performing a more complex operation that involved combining two CSV files into a single dataset for further loading into an Apache Iceberg table, Scala demonstrated the highest performance, at 374.42 seconds. Java processing was completed in 379.8 seconds, while Python was the least efficient, with a runtime of 398.32 seconds. It follows that the programming language significantly affects the efficiency of data processing by the Apache Spark algorithm, with Scala and Java being more productive for processing large amounts of data and complex operations, while Python demonstrates an advantage in working with small amounts of data. The results obtained can be useful for optimizing data handling processes depending on specific performance requirements and the amount of information being processed.
comment: CITI 2025, 3rd International Workshop on Computer Information Technologies in Industry 4.0, June 11-12, 2025, Ternopil, Ukraine. The article includes 10 pages, 5 figures, 9 tables
☆ SBAN: A Framework \& Multi-Dimensional Dataset for Large Language Model Pre-Training and Software Code Mining
This paper introduces SBAN (Source code, Binary, Assembly, and Natural Language Description), a large-scale, multi-dimensional dataset designed to advance the pre-training and evaluation of large language models (LLMs) for software code analysis. SBAN comprises more than 3 million samples, including 2.9 million benign and 672,000 malware respectively, each represented across four complementary layers: binary code, assembly instructions, natural language descriptions, and source code. This unique multimodal structure enables research on cross-representation learning, semantic understanding of software, and automated malware detection. Beyond security applications, SBAN supports broader tasks such as code translation, code explanation, and other software mining tasks involving heterogeneous data. It is particularly suited for scalable training of deep models, including transformers and other LLM architectures. By bridging low-level machine representations and high-level human semantics, SBAN provides a robust foundation for building intelligent systems that reason about code. We believe that this dataset opens new opportunities for mining software behavior, improving security analytics, and enhancing LLM capabilities in pre-training and fine-tuning tasks for software code mining.
☆ A Survey on Feedback Types in Automated Programming Assessment Systems
With the recent rapid increase in digitization across all major industries, acquiring programming skills has increased the demand for introductory programming courses. This has further resulted in universities integrating programming courses into a wide range of curricula, including not only technical studies but also business and management fields of study. Consequently, additional resources are needed for teaching, grading, and tutoring students with diverse educational backgrounds and skills. As part of this, Automated Programming Assessment Systems (APASs) have emerged, providing scalable and high-quality assessment systems with efficient evaluation and instant feedback. Commonly, APASs heavily rely on predefined unit tests for generating feedback, often limiting the scope and level of detail of feedback that can be provided to students. With the rise of Large Language Models (LLMs) in recent years, new opportunities have emerged as these technologies can enhance feedback quality and personalization. To investigate how different feedback mechanisms in APASs are perceived by students, and how effective they are in supporting problem-solving, we have conducted a large-scale study with over 200 students from two different universities. Specifically, we compare baseline Compiler Feedback, standard Unit Test Feedback, and advanced LLM-based Feedback regarding perceived quality and impact on student performance. Results indicate that while students rate unit test feedback as the most helpful, AI-generated feedback leads to significantly better performances. These findings suggest combining unit tests and AI-driven guidance to optimize automated feedback mechanisms and improve learning outcomes in programming education.
☆ E-Test: E'er-Improving Test Suites ICSE 2026
Test suites are inherently imperfect, and testers can always enrich a suite with new test cases that improve its quality and, consequently, the reliability of the target software system. However, finding test cases that explore execution scenarios beyond the scope of an existing suite can be extremely challenging and labor-intensive, particularly when managing large test suites over extended periods. In this paper, we propose E-Test, an approach that reduces the gap between the execution space explored with a test suite and the executions experienced after testing by augmenting the test suite with test cases that explore execution scenarios that emerge in production. E-Test (i) identifies executions that have not yet been tested from large sets of scenarios, such as those monitored during intensive production usage, and (ii) generates new test cases that enhance the test suite. E-Test leverages Large Language Models (LLMs) to pinpoint scenarios that the current test suite does not adequately cover, and augments the suite with test cases that execute these scenarios. Our evaluation on a dataset of 1,975 scenarios, collected from highly-starred open-source Java projects already in production and Defects4J, demonstrates that E-Test retrieves not-yet-tested execution scenarios significantly better than state-of-the-art approaches. While existing regression testing and field testing approaches for this task achieve a maximum F1-score of 0.34, and vanilla LLMs achieve a maximum F1-score of 0.39, E-Test reaches 0.55. These results highlight the impact of E-Test in enhancing test suites by effectively targeting not-yet-tested execution scenarios and reducing manual effort required for maintaining test suites.
comment: Accepted at the 48th IEEE/ACM International Conference on Software Engineering (ICSE 2026)
♻ ☆ A Systematic Literature Review on Large Language Models for Automated Program Repair
Automated Program Repair (APR) attempts to patch software bugs and reduce manual debugging efforts. Very recently, with the advances in Large Language Models (LLMs), an increasing number of APR techniques have been proposed, facilitating software development and maintenance and demonstrating remarkable performance. However, due to ongoing explorations in the LLM-based APR field, it is challenging for researchers to understand the current achievements, challenges, and potential opportunities. This work provides the first systematic literature review to summarize the applications of LLMs in APR between 2020 and 2025. We analyze 189 relevant papers from LLMs, APR and their integration perspectives. First, we categorize existing popular LLMs that are applied to support APR and outline four types of utilization strategies for their deployment. Besides, we detail some specific repair scenarios that benefit from LLMs, e.g., semantic bugs and security vulnerabilities. Furthermore, we discuss several critical aspects of integrating LLMs into APR research, e.g., input forms and open science. Finally, we highlight a set of challenges remaining to be investigated and the potential guidelines for future research. Overall, our paper provides a systematic overview of the research landscape to the APR community, helping researchers gain a comprehensive understanding of achievements and promote future research.
comment: update new papers, up to September 2025
♻ ☆ $μ$OpTime: Statically Reducing the Execution Time of Microbenchmark Suites Using Stability Metrics
Performance regressions have a tremendous impact on the quality of software. One way to catch regressions before they reach production is executing performance tests before deployment, e.g., using microbenchmarks, which measure performance at subroutine level. In projects with many microbenchmarks, this may take several hours due to repeated execution to get accurate results, disqualifying them from frequent use in CI/CD pipelines. We propose $\mu$OpTime, a static approach to reduce the execution time of microbenchmark suites by configuring the number of repetitions for each microbenchmark. Based on the results of a full, previous microbenchmark suite run, $\mu$OpTime determines the minimal number of (measurement) repetitions with statistical stability metrics that still lead to accurate results. We evaluate $\mu$OpTime with an experimental study on 14 open-source projects written in two programming languages and five stability metrics. Our results show that (i) $\mu$OpTime reduces the total suite execution time (measurement phase) by up to 95.83% (Go) and 94.17% (Java), (ii) the choice of stability metric depends on the project and programming language, (iii) microbenchmark warmup phases have to be considered for Java projects (potentially leading to higher reductions), and (iv) $\mu$OpTime can be used to reliably detect performance regressions in CI/CD pipelines.
comment: Published in ACM Transactions on Software Engineering and Methodology
♻ ☆ Can Agents Fix Agent Issues? NeurIPS 2025
LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are inevitably prone to bugs and continually evolve to meet changing external requirements. Therefore, automatically resolving agent issues (i.e., bug reports or feature requests) is a crucial and challenging task. While recent software engineering (SE) agents (e.g., SWE-agent) have shown promise in addressing issues in traditional software systems, it remains unclear how effectively they can resolve real-world issues in agent systems, which differ significantly from traditional software. To fill this gap, we first manually analyze 201 real-world agent issues and identify common categories of agent issues. We then spend 500 person-hours constructing AGENTISSUE-BENCH, a reproducible benchmark comprising 50 agent issue resolution tasks (each with an executable environment and failure-triggering tests). We further evaluate state-of-the-art SE agents on AGENTISSUE-BENCH and reveal their limited effectiveness (i.e., with only 3.33% - 12.67% resolution rates). These results underscore the unique challenges of maintaining agent systems compared to traditional software, highlighting the need for further research to develop advanced SE agents for resolving agent issues. Data and code are available at https://alfin06.github.io/AgentIssue-Bench-Leaderboard/#/ .
comment: Accepted by the 39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ ReVeal: Self-Evolving Code Agents via Reliable Self-Verification
Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models. However, existing methods rely solely on outcome rewards, without explicitly optimizing verification or leveraging reliable signals from realistic environments, leading to unreliable self-verification and limited test-time scaling. To address this, we widen the verification-generation asymmetry by explicitly optimizing self-verification, making it a reliable driver of deeper test-time scaling. We introduce ReVeal, a multi-turn reinforcement learning framework that evolves code generation through self-verification and tool-based evaluation. ReVeal structures long-horizon reasoning as iterative generation-verification turns and incorporates TAPO for turn-level credit assignment, fostering the co-evolution of code and test generation. At inference, this strengthened self-verification enables the model to use self-constructed tests and tool feedback to continuously evolve code for 20+ turns on LiveCodeBench despite training on only three. It also significantly improves Pass@k, indicating stronger exploration that expands the reasoning boundaries of the base model. These findings highlight the promise of ReVeal as a scalable paradigm for RL training and test-time scaling, paving the way for more robust and autonomous AI agents.
♻ ☆ PEACE: Towards Efficient Project-Level Efficiency Optimization via Hybrid Code Editing
Large Language Models (LLMs) have demonstrated significant capability in code generation, but their potential in code efficiency optimization remains underexplored. Previous LLM-based code efficiency optimization approaches exclusively focus on function-level optimization and overlook interaction between functions, failing to generalize to real-world development scenarios. Code editing techniques show great potential for conducting project-level optimization, yet they face challenges associated with invalid edits and suboptimal internal functions. To address these gaps, we propose Peace, a novel hybrid framework for Project-level code Efficiency optimization through Automatic Code Editing, which also ensures the overall correctness and integrity of the project. Peace integrates three key phases: dependency-aware optimizing function sequence construction, valid associated edits identification, and efficiency optimization editing iteration. To rigorously evaluate the effectiveness of Peace, we construct PeacExec, the first benchmark comprising 146 real-world optimization tasks from 47 high-impact GitHub Python projects, along with highly qualified test cases and executable environments. Extensive experiments demonstrate Peace's superiority over the state-of-the-art baselines, achieving a 69.2% correctness rate (pass@1), +46.9% opt rate, and 0.840 speedup in execution efficiency. Notably, our Peace outperforms all baselines by significant margins, particularly in complex optimization tasks with multiple functions. Moreover, extensive experiments are also conducted to validate the contributions of each component in Peace, as well as the rationale and effectiveness of our hybrid framework design.
♻ ☆ Exploring Data-Efficient Adaptation of Large Language Models for Code Generation
Although Large Language Models (LLMs) have made significant progress in code generation, they still struggle with code generation tasks in specific scenarios. These scenarios usually necessitate the adaptation of LLMs to fulfill specific needs, but the limited training data available in practice leads to poor code generation performance. Therefore, how to effectively adapt LLMs to new scenarios with few training data is a major challenge for current code generation. In this paper, we propose a novel adaptation approach named DEED, which stands for Data-Efficient adaptation with Error-Driven learning for code generation. DEED leverages the errors made by LLMs as learning opportunities, using error revision to overcome their own shortcomings, thus achieving efficient learning. Specifically, DEED involves identifying error code generated by LLMs, employing Self-Revise for code revision, optimizing the model with revised code, and iteratively adapting the process for continuous improvement. Experimental results show that, compared to other mainstream fine-tuning approaches, DEED achieves superior performance with few training data, showing an average relative improvement of 46.2% in Pass@1 on multiple code generation benchmarks. We also validate the effectiveness of Self-Revise, which generates revised code that optimizes the model more efficiently compared to the code samples from datasets. Moreover, DEED consistently demonstrates strong performance across various LLMs, underscoring its applicability.
comment: Accepted by TOSEM
♻ ☆ When Agents go Astray: Course-Correcting SWE Agents with PRMs
Large Language Model (LLM) agents are increasingly deployed for complex, multi-step software engineering (SWE) tasks. However, their trajectories often contain costly inefficiencies, such as redundant exploration, looping, and failure to terminate once a solution is reached. Prior work has largely treated these errors in a post-hoc manner, diagnosing failures only after execution. In this paper, we introduce SWE-PRM, an inference-time Process Reward Model (PRM) that intervenes during execution to detect and course-correct trajectory-level errors. Our PRM design leverages a taxonomy of common inefficiencies and delivers lightweight, interpretable feedback without modifying the underlying policy. On SWE-bench Verified, closed-source PRMs improve resolution from 40.0% to 50.6% (+10.6 p.p.), with the largest gains on medium and hard tasks. Among feedback strategies, taxonomy-guided PRMs outperform unguided or explicit action-prescriptive variants, increasing success rate while reducing trajectory length. These benefits come at an acceptable added inference cost of as low as $0.2, making PRMs a practical and scalable mechanism for improving SWE agents' reliability and efficiency.
♻ ☆ Deep Learning Framework Testing via Heuristic Guidance Based on Multiple Model Measurements
Deep learning frameworks serve as the foundation for developing and deploying deep learning applications. To enhance the quality of deep learning frameworks, researchers have proposed numerous testing methods using deep learning models as test inputs. However, existing methods predominantly measure model bug detection effectiveness as heuristic indicators, presenting three critical limitations. Firstly, existing methods fail to quantitatively measure model's operator combination variety, potentially missing critical operator combinations that could trigger framework bugs. Secondly, existing methods neglect measuring and heuristically guiding the model execution time, resulting in the omission of numerous models potential for detecting more framework bugs within limited testing time. Thirdly, existing methods overlook correlation between different model measurements, relying simply on single-indicator heuristic guidance without considering their trade-offs. To overcome these limitations, we propose DLMMM, the first deep learning framework testing method to include multiple model measurements into heuristic guidance and fuse these measurements to achieve their trade-offs. DLMMM firstly quantitatively measures model's bug detection performance, operator combination variety, and model execution time. After that, DLMMM fuses these measurements based on their correlation to achieve their trade-offs. To further enhance testing effectiveness, DLMMM designs multi-level heuristic guidance for test input model generation. We apply DLMMM to test three widely used deep learning frameworks (including TensorFlow, PyTorch, and MindSpore). The experimental results show that DLMMM outperforms state-of-the-art methods in effectiveness and efficiency.
♻ ☆ Benchmarking Large Language Models with Integer Sequence Generation Tasks
We present a novel benchmark designed to rigorously evaluate the capabilities of large language models (LLMs) in mathematical reasoning and algorithmic code synthesis tasks. The benchmark comprises integer sequence generation tasks sourced from the Online Encyclopedia of Integer Sequences (OEIS), testing LLMs' abilities to accurately and efficiently generate Python code to compute these sequences without using lookup tables. Our comprehensive evaluation includes leading models from OpenAI (including the specialized reasoning-focused o-series), Anthropic, Meta, and Google across a carefully selected set of 1000 OEIS sequences categorized as ``easy'' or ``hard.'' Half of these sequences are classical sequences from the early days of OEIS and half were recently added to avoid contamination with the models' training data. To prevent models from exploiting memorized sequence values, we introduce an automated cheating detection mechanism that flags usage of lookup tables, validated by comparison with human expert evaluations. Experimental results demonstrate that reasoning-specialized models (o3, o3-mini, o4-mini from OpenAI, and Gemini 2.5-pro from Google) achieve substantial improvements in accuracy over non-reasoning models, especially on more complex tasks. However, overall model performance on the hard sequences is poor, highlighting persistent challenges in algorithmic reasoning. Our benchmark provides important insights into the strengths and limitations of state-of-the-art LLMs, particularly emphasizing the necessity for further advancements to reliably solve complex mathematical reasoning tasks algorithmically.
Systems and Control 41
☆ Lyapunov-Aware Quantum-Inspired Reinforcement Learning for Continuous-Time Vehicle Control: A Feasibility Study
This paper presents a novel Lyapunov-Based Quantum Reinforcement Learning (LQRL) framework that integrates quantum policy optimization with Lyapunov stability analysis for continuous-time vehicle control. The proposed approach combines the representational power of variational quantum circuits (VQCs) with a stability-aware policy gradient mechanism to ensure asymptotic convergence and safe decision-making under dynamic environments. The vehicle longitudinal control problem was formulated as a continuous-state reinforcement learning task, where the quantum policy network generates control actions subject to Lyapunov stability constraints. Simulation experiments were conducted in a closed-loop adaptive cruise control scenario using a quantum-inspired policy trained under stability feedback. The results demonstrate that the LQRL framework successfully embeds Lyapunov stability verification into quantum policy learning, enabling interpretable and stability-aware control performance. Although transient overshoot and Lyapunov divergence were observed under aggressive acceleration, the system maintained bounded state evolution, validating the feasibility of integrating safety guarantees within quantum reinforcement learning architectures. The proposed framework provides a foundational step toward provably safe quantum control in autonomous systems and hybrid quantum-classical optimization domains.
comment: 7 pages, 4 figures, 20 equations, 3 appendices, 4 tables
☆ MADR: MPC-guided Adversarial DeepReach
Hamilton-Jacobi (HJ) Reachability offers a framework for generating safe value functions and policies in the face of adversarial disturbance, but is limited by the curse of dimensionality. Physics-informed deep learning is able to overcome this infeasibility, but itself suffers from slow and inaccurate convergence, primarily due to weak PDE gradients and the complexity of self-supervised learning. A few works, recently, have demonstrated that enriching the self-supervision process with regular supervision (based on the nature of the optimal control problem), greatly accelerates convergence and solution quality, however, these have been limited to single player problems and simple games. In this work, we introduce MADR: MPC-guided Adversarial DeepReach, a general framework to robustly approximate the two-player, zero-sum differential game value function. In doing so, MADR yields the corresponding optimal strategies for both players in zero-sum games as well as safe policies for worst-case robustness. We test MADR on a multitude of high-dimensional simulated and real robotic agents with varying dynamics and games, finding that our approach significantly out-performs state-of-the-art baselines in simulation and produces impressive results in hardware.
comment: 8 pages, under review
☆ $\ell_1$-Based Adaptive Identification under Quantized Observations with Applications
Quantized observations are ubiquitous in a wide range of applications across engineering and the social sciences, and algorithms based on the $\ell_1$-norm are well recognized for their robustness to outliers compared with their $\ell_2$-based counterparts. Nevertheless, adaptive identification methods that integrate quantized observations with $\ell_1$-optimization remain largely underexplored. Motivated by this gap, we develop a novel $\ell_1$-based adaptive identification algorithm specifically designed for quantized observations. Without relying on the traditional persistent excitation condition, we establish global convergence of the parameter estimates to their true values and show that the average regret asymptotically vanishes as the data size increases. Finally, we apply our new identification algorithm to a judicial sentencing problem using real-world data, which demonstrates its superior performance and practical significance.
☆ A Note on Optimal Distributed State Estimation for Linear Time-Varying Systems
In this technical note, we prove that the ODEFTC algorithm constitutes the first optimal distributed state estimator for continuous-time linear time-varying systems subject to stochastic disturbances. Particularly, we formally show that it is able to asymptotically recover the performance, in terms of error covariance of the estimates at each node, of the centralized Kalman-Bucy filter, which is known to be the optimal filter for the considered class of systems. Moreover, we provide a simple sufficient value for the consensus gain to guarantee the stability of the distributed estimator.
comment: This work has been submitted to the IEEE for possible publication
☆ Quantifying Security for Networked Control Systems: A Review
Networked Control Systems (NCSs) are integral in critical infrastructures such as power grids, transportation networks, and production systems. Ensuring the resilient operation of these large-scale NCSs against cyber-attacks is crucial for societal well-being. Over the past two decades, extensive research has been focused on developing metrics to quantify the vulnerabilities of NCSs against attacks. Once the vulnerabilities are quantified, mitigation strategies can be employed to enhance system resilience. This article provides a comprehensive overview of methods developed for assessing NCS vulnerabilities and the corresponding mitigation strategies. Furthermore, we emphasize the importance of probabilistic risk metrics to model vulnerabilities under adversaries with imperfect process knowledge. The article concludes by outlining promising directions for future research.
comment: Journal submission
☆ PIRA: Pan-CDN Intra-video Resource Adaptation for Short Video Streaming
In large scale short video platforms, CDN resource selection plays a critical role in maintaining Quality of Experience (QoE) while controlling escalating traffic costs. To better understand this phenomenon, we conduct in the wild network measurements during video playback in a production short video system. The results reveal that CDNs delivering higher average QoE often come at greater financial cost, yet their connection quality fluctuates even within a single video underscoring a fundamental and dynamic trade off between QoE and cost. However, the problem of sustaining high QoE under cost constraints remains insufficiently investigated in the context of CDN selection for short video streaming. To address this, we propose PIRA, a dynamic resource selection algorithm that optimizes QoE and cost in real time during video playback. PIRA formally integrating QoE and cost by a mathematical model, and introduce a intra video control theoretic CDN resource selection approach which can balance QoE and cost under network dynamics. To reduce the computation overheads, PIRA employs state space pruning and adaptive parameter adjustment to efficiently solve the high dimensional optimization problem. In large scale production experiments involving 450,000 users over two weeks, PIRA outperforms the production baseline, achieving a 2.1% reduction in start up delay, 15.2% shorter rebuffering time, and 10% lower average unit traffic cost, demonstrating its effectiveness in balancing user experience and financial cost at scale.
☆ Designing trajectories in the Earth-Moon system: a Levenberg-Marquardt approach
Trajectory design in cislunar space under a High-Fidelity Ephemeris Model (HFEM) is pursued through a nonlinear optimization perspective anchored on the transition of solutions from lower fidelity models, namely the Circular Restricted Three-Body Problem (CR3BP). The optimization problem is posed in the likeness of a multiple-shooting approach, aiming for segment-to-segment continuity while tracking proximity to the original CR3BP structures. The analysis of various formulations leads to the selection of an unconstrained least-squares problem for further investigation. The nonlinear optimization problem is convexified and the use of the Levenberg-Marquardt algorithm, as an alternative to the minimum-norm update equation found in most literature, is investigated for its control over the update step and inherent robustness. Additional techniques such as adaptive weighting are employed to further consolidate the behavior of the proposed algorithm in challenging scenarios. Numerical trials evaluate the adequacy of the methodology presented and compare it to the minimum-norm baseline over various application cases, including the generation of quasi-periodic trajectories and orbital transfers between them. The proposed approach is found to outperform the baseline in applications where the initial guess is poor and the ease of including proximity constraints provides benefits in control over the shape of the converged solution.
comment: Preprint submitted to Advances in Space Research
☆ Sliding-Mode Control Strategies for PMSM speed control: A Comprehensive Review, Taxonomy and Research Gaps
Permanent Magnet Synchronous Motors (PMSMs) are widely employed in high-performance drive systems due to their high efficiency, power density, and precise dynamic behavior. However, nonlinearities, load disturbances, and parameter uncertainties present persistent challenges to control. Sliding-Mode Control (SMC) remains one of the most reliable strategies for high-performance PMSM drives. Yet, the rapid proliferation of adaptive, fractional-order, and intelligent variants has fragmented recent literature. This paper presents a comprehensive review and taxonomy of SMC-based PMSM speed-control methods published between 2020 and 2025. More than 200 studies are systematically analyzed and classified according to control order, surface design, disturbance-observer integration, optimization approach, and intelligent augmentation. Trends in publication activity, dominant hybrid structures, and application domains are quantitatively summarized. The review reveals a clear evolution from conventional discontinuous SMC toward adaptive, higher-order, and data-driven frameworks that mitigate chattering while preserving robustness. Persistent research gaps are identified in hardware validation, energy-efficiency assessment, and real-time tuning strategies. The taxonomy and critical synthesis provided herein establish a coherent reference for researchers and form the conceptual foundation for the companion paper (Part II), which delivers a unified benchmark and comparative simulation study of representative SMC designs.
comment: 30 Pages, 7 Fugures, 3 Tables
☆ MPC-based motion planning for non-holonomic systems in non-convex domains
Motivated by the application of using model predictive control (MPC) for motion planning of autonomous mobile robots, a form of output tracking MPC for non-holonomic systems and with non-convex constraints is studied. Although the advantages of using MPC for motion planning have been demonstrated in several papers, in most of the available fundamental literature on output tracking MPC it is assumed, often implicitly, that the model is holonomic and generally the state or output constraints must be convex. Thus, in application-oriented publications, empirical results dominate and the topic of proving completeness, in particular under which assumptions the target is always reached, has received comparatively little attention. To address this gap, we present a novel MPC formulation that guarantees convergence to the desired target under realistic assumptions, which can be verified in relevant real-world scenarios.
comment: Preprint of ECC 2025 submission
☆ MMRHP: A Miniature Mixed-Reality HIL Platform for Auditable Closed-Loop Evaluation
Validation of autonomous driving systems requires a trade-off between test fidelity, cost, and scalability. While miniaturized hardware-in-the-loop (HIL) platforms have emerged as a promising solution, a systematic framework supporting rigorous quantitative analysis is generally lacking, limiting their value as scientific evaluation tools. To address this challenge, we propose MMRHP, a miniature mixed-reality HIL platform that elevates miniaturized testing from functional demonstration to rigorous, reproducible quantitative analysis. The core contributions are threefold. First, we propose a systematic three-phase testing process oriented toward the Safety of the Intended Functionality(SOTIF)standard, providing actionable guidance for identifying the performance limits and triggering conditions of otherwise correctly functioning systems. Second, we design and implement a HIL platform centered around a unified spatiotemporal measurement core to support this process, ensuring consistent and traceable quantification of physical motion and system timing. Finally, we demonstrate the effectiveness of this solution through comprehensive experiments. The platform itself was first validated, achieving a spatial accuracy of 10.27 mm RMSE and a stable closed-loop latency baseline of approximately 45 ms. Subsequently, an in-depth Autoware case study leveraged this validated platform to quantify its performance baseline and identify a critical performance cliff at an injected latency of 40 ms. This work shows that a structured process, combined with a platform offering a unified spatio-temporal benchmark, enables reproducible, interpretable, and quantitative closed-loop evaluation of autonomous driving systems.
☆ Coverage-Recon: Coordinated Multi-Drone Image Sampling with Online Map Feedback
This article addresses collaborative 3D map reconstruction using multiple drones. Achieving high-quality reconstruction requires capturing images of keypoints within the target scene from diverse viewing angles, and coverage control offers an effective framework to meet this requirement. Meanwhile, recent advances in real-time 3D reconstruction algorithms make it possible to render an evolving map during flight, enabling immediate feedback to guide drone motion. Building on this, we present Coverage-Recon, a novel coordinated image sampling algorithm that integrates online map feedback to improve reconstruction quality on-the-fly. In Coverage-Recon, the coordinated motion of drones is governed by a Quadratic Programming (QP)-based angle-aware coverage controller, which ensures multi-viewpoint image capture while enforcing safety constraints. The captured images are processed in real time by the NeuralRecon algorithm to generate an evolving 3D mesh. Mesh changes across the scene are interpreted as indicators of reconstruction uncertainty and serve as feedback to update the importance index of the coverage control as the map evolves. The effectiveness of Coverage-Recon is validated through simulation and experiments, demonstrating both qualitatively and quantitatively that incorporating online map feedback yields more complete and accurate 3D reconstructions than conventional methods. Project page: https://htnk-lab.github.io/coverage-recon/
comment: Submitted to IEEE Transactions on Control Systems Technology (under review). Project page: https://htnk-lab.github.io/coverage-recon/
☆ Explicit Reformulation of Discrete Distributionally Robust Optimization Problems
Distributionally robust optimization (DRO) is an effective framework for controlling real-world systems with various uncertainties, typically modeled using distributional uncertainty balls. However, DRO problems often involve infinitely many inequality constraints, rendering exact solutions computationally expensive. In this study, we propose a discrete DRO (DDRO) method that significantly simplifies the problem by reducing it to a single trivial constraint. Specifically, the proposed method utilizes two types of distributional uncertainty balls to reformulate the DDRO problem into a single-layer smooth convex program, significantly improving tractability. Furthermore, we provide practical guidance for selecting the appropriate ball sizes. The original DDRO problem is further reformulated into two optimization problems: one minimizing the mean and standard deviation, and the other minimizing the conditional value at risk (CVaR). These formulations account for the choice of ball sizes, thereby enhancing the practical applicability of the method. The proposed method was applied to a distributionally robust patrol-agent design problem, identifying a Pareto front in which the mean and standard deviation of the mean hitting time varied by up to 3% and 14%, respectively, while achieving a CVaR reduction of up to 13%.
comment: 15 pages, 4 figures, This paper is submitted to a journal for possible publication
☆ Distributed Allocation and Resource Scheduling Algorithms Resilient to Link Failure
Distributed resource allocation (DRA) is fundamental to modern networked systems, spanning applications from economic dispatch in smart grids to CPU scheduling in data centers. Conventional DRA approaches require reliable communication, yet real-world networks frequently suffer from link failures, packet drops, and communication delays due to environmental conditions, network congestion, and security threats. We introduce a novel resilient DRA algorithm that addresses these critical challenges, and our main contributions are as follows: (1) guaranteed constraint feasibility at all times, ensuring resource-demand balance even during algorithm termination or network disruption; (2) robust convergence despite sector-bound nonlinearities at nodes/links, accommodating practical constraints like quantization and saturation; and (3) optimal performance under merely uniformly-connected networks, eliminating the need for continuous connectivity. Unlike existing approaches that require persistent network connectivity and provide only asymptotic feasibility, our graph-theoretic solution leverages network percolation theory to maintain performance during intermittent disconnections. This makes it particularly valuable for mobile multi-agent systems where nodes frequently move out of communication range. Theoretical analysis and simulations demonstrate that our algorithm converges to optimal solutions despite heterogeneous time delays and substantial link failures, significantly advancing the reliability of distributed resource allocation in practical network environments.
comment: European Journal of Control
☆ Brute-force search and Warshall algorithms for matrix-weighted graphs
Although research on the control of networked systems has grown considerably, graph-theoretic and algorithmic studies on matrix-weighted graphs remain limited. To bridge this gap in the literature, this work introduces two algorithms-the brute-force search and the Warshall algorithm-for determining connectedness and clustering in undirected matrix-weighted graphs. The proposed algorithms, which are derived from a sufficient condition for connectedness, emphasize a key distinction between matrix-weighted and scalar-weighted graphs. While the existence of a path between two vertices guarantees connectedness in scalar-weighted graphs, connectedness in matrix-weighted graphs is a collective contribution of all paths joining the two vertices. Proofs of correctness and numerical examples are provided to illustrate and demonstrate the effectiveness of the algorithms.
comment: 20 pages, 6 figures, preprint
☆ Urban Air Mobility: A Review of Recent Advances in Communication, Management, and Sustainability
Urban Air Mobility (UAM) offers a transformative approach to addressing urban congestion, improving accessibility, and advancing environmental sustainability. Rapid progress has emerged in three tightly linked domains since 2020: (1) Communication, where dynamic spectrum allocation and low-altitude channel characterization support reliable air-ground data exchange; (2) UAM management, with novel air-traffic control concepts for dense, largely autonomous urban airspace; and (3) Sustainability, driven by energy-efficient propulsion, integrated charging infrastructure, and holistic environmental assessment. This paper reviews and synthesizes the latest research across these areas, compares the state-of-the-art solutions, and outlines the technological and infrastructural milestones that are critical to realizing a scalable, sustainable UAM ecosystem.
comment: This work has been accepted by the 2025 International Conference on Cyber-physical Social Intelligence (CPSI 2025)
☆ Harmonic Cancellation in Multi-Electrolyzer P2H Plants via Phasor-Modulated Production Scheduling
Thyristor rectifiers (TRs) are cost-effective power supplies for hydrogen electrolyzers (ELZs) but introduce harmonic distortion that may violate grid codes. This letter proposes a self-governing harmonic mitigation strategy through coordinated operation of multiple ELZs in large power-to-hydrogen (P2H) plants. First, the harmonic model of TR-powered ELZs is derived, revealing a natural harmonic cancellation mechanism among them. Based on this, a system-level operation scheme based on phasor modulation is developed and integrated into plant scheduling. Case studies demonstrate that the proposed method reduces harmonic currents by 21.2%-39.7% and ensures grid-code compliance, with only a 0.25% loss in hydrogen output, while increasing total revenue by over 21\% compared to production-oriented strategies.
comment: This work has been submitted to the IEEE for possible publication
☆ Wisdom of Crowds Effects under Antagonistic Interactions and Correlated Opinions
This paper investigates the wisdom of crowds of linear opinion dynamics models evolving on signed networks. Conditions are given under which models such as the DeGroot, Friedkin-Johnsen (FJ) and concatenated FJ models improve or undermine collective wisdom. The extension to dependent initial opinions is also presented, highlighting how the correlation structure influences the feasibility and geometry of the wisdom-improving regions.
☆ A Configurable Simulation Framework for Safety Assessment of Vulnerable Road Users
Ensuring the safety of vulnerable road users (VRUs), including pedestrians, cyclists, electric scooter riders, and motorcyclists, remains a major challenge for advanced driver assistance systems (ADAS) and connected and automated vehicles (CAV) technologies. Real-world VRU tests are expensive and sometimes cannot capture or repeat rare and hazardous events. In this paper, we present a lightweight, configurable simulation framework that follows European New Car Assessment Program (Euro NCAP) VRU testing protocols. A rule-based finite-state machine (FSM) is developed as a motion planner to provide vehicle automation during the VRU interaction. We also integrate ego-vehicle perception and idealized Vehicle-to-Everything (V2X) awareness to demonstrate safety margins in different scenarios. This work provides an extensible platform for rapid and repeatable VRU safety validation, paving the way for broader case-study deployment in diverse, user-defined settings, which will be essential for building a more VRU-friendly and sustainable intelligent transportation system.
comment: This work has been accepted by the 2025 International Conference on Cyber-physical Social Intelligence (CPSI 2025)
☆ Time Domain Differential Equation Based Fault Location Identification in Mixed Overhead-Underground Power Distribution Systems
This paper proposes a time-domain fault location identification method for mixed overhead-underground power distribution systems that can handle challenging fault scenarios such as sub-cycle faults, arcing faults and evolving faults. The proposed method is formulated based on differential equations of the system and accounts for the peculiarities of power distribution systems with distributed generations. It considers the presence of loads, multi-phase laterals and sub-laterals, heterogenous overhead and underground lines, and infeeds and remote-end fault current contributions of distributed generations. It utilizes data collected by limited number of measuring devices installed in modern power distribution systems to systematically eliminate possible multiple fault location estimations to provide a single correct estimation of the actual location of the fault. The performance of the proposed method is demonstrated using IEEE 34-node test system.
☆ A Learning-based Model Reference Adaptive Controller Implemented on a Prosthetic Hand Wrist
The functionality and natural motion of prosthetic hands remain limited by the challenges in controlling compliant wrist mechanisms. Current control strategies often lack adaptability and incur high computational costs, which impedes real-time deployment in assistive robotics. To address this gap, this study presents a computationally efficient Neural Network (NN)-based Model Reference Adaptive Controller (MRAC) for a tendon-driven soft continuum wrist integrated with a prosthetic hand. The dynamic modeling of the wrist is formulated using Timoshenko beam theory, capturing both shear and bending deformations. The proposed NN-MRAC estimates the required tendon forces from deflection errors and minimizes deviation from a reference model through online adaptation. Simulation results demonstrate improved precision with a root mean square error (RMSE) of $6.14 \times 10^{-4}$ m and a settling time of $3.2$s. Experimental validations confirm real-time applicability, with an average RMSE of $5.66 \times 10^{-3}$ m, steady-state error of $8.05 \times 10^{-3}$ m, and settling time of $1.58$ s. These results highlight the potential of the controller to enhance motion accuracy and responsiveness in soft prosthetic systems, thereby advancing the integration of adaptive intelligent control in wearable assistive devices.
comment: International Conference on Social Robotics + AI
☆ Graph Analysis to Fully Automate Fault Location Identification in Power Distribution Systems
This paper proposes graph analysis methods to fully automate the fault location identification task in power distribution systems. The proposed methods take basic unordered data from power distribution systems as input, including branch parameters, load values, and the location of measuring devices. The proposed data preparation and analysis methods automatically identify the system's topology and extract essential information, such as faulted paths, structures, loading of laterals and sublaterals, and estimate the fault location accordingly. The proposed graph analysis methods do not require complex node and branch numbering processes or renumbering following changes in the system topology. The proposed methods eliminate the need for human intervention at any step of the fault location identification process. They are scalable and applicable to systems of any size. The performance of the proposed algorithm is demonstrated using the IEEE 34-bus distribution test system.
☆ Convex Maneuver Planning for Spacecraft Collision Avoidance
Conjunction analysis and maneuver planning for spacecraft collision avoidance remains a manual and time-consuming process, typically involving repeated forward simulations of hand-designed maneuvers. With the growing density of satellites in low-Earth orbit (LEO), autonomy is becoming essential for efficiently evaluating and mitigating collisions. In this work, we present an algorithm to design low-thrust collision-avoidance maneuvers for short-term conjunction events. We first formulate the problem as a nonconvex quadratically-constrained quadratic program (QCQP), which we then relax into a convex semidefinite program (SDP) using Shor's relaxation. We demonstrate empirically that the relaxation is tight, which enables the recovery of globally optimal solutions to the original nonconvex problem. Our formulation produces a minimum-energy solution while ensuring a desired probability of collision at the time of closest approach. Finally, if the desired probability of collision cannot be satisfied, we relax this constraint into a penalty, yielding a minimum-risk solution. We validate our algorithm with a high-fidelity simulation of a satellite conjunction in low-Earth orbit with a simulated conjunction data message (CDM), demonstrating its effectiveness in reducing collision risk.
comment: 8 pages, 6 figures, Accepted to International Space Robotics Conference
☆ Motion Planning and Control of an Overactuated 4-Wheel Drive with Constrained Independent Steering
This paper addresses motion planning and con- trol of an overactuated 4-wheel drive train with independent steering (4WIS) where mechanical constraints prevent the wheels from executing full 360-degree rotations (swerve). The configuration space of such a robot is constrained and contains discontinuities that affect the smoothness of the robot motion. We introduce a mathematical formulation of the steering constraints and derive discontinuity planes that partition the velocity space into regions of smooth and efficient motion. We further design the motion planner for path tracking and ob- stacle avoidance that explicitly accounts for swerve constraints and the velocity transition smoothness. The motion controller uses local feedback to generate actuation from the desired velocity, while properly handling the discontinuity crossing by temporarily stopping the motion and repositioning the wheels. We implement the proposed motion planner as an extension to ROS Navigation package and evaluate the system in simulation and on a physical robot.
comment: 7 pages, 5 figures, 3 tables, video available at https://youtu.be/8l9s2Wb_vec, To appear at IEEE 2025 International Conference on Advanced Robotics
☆ Extreme value distributions of peak loads for non-residential customer segments SC
Electrical grid congestion is a growing challenge in Europe, driving the need for accurate prediction of load, particularly of peak load. Non-time-resolved models of peak load offer the advantages of simplicity and compactness, and among them, Velander's formula (VF) is a traditional method that has been used for decades. Moreover, VF can be adapted into a quantile VF, which learns a truncated cumulative distribution function of peak load based on electricity consumption. This paper proposes a mathematical model based on extreme value theory to characterize the probability distribution of peak load for large non-residential customers. The model underpins the quantile VF as demonstrated through multiple quantile regression and reduces its representation to just four parameters without sacrificing predictive performance. Moreover, using maximum likelihood estimation and the likelihood ratio test, we validate that the probability distribution of peak load of analysed groups belongs to the heavy-tailed Fr\'echet class.
comment: Submitted to Power Systems Computation Conference (PSCC) 2026
☆ Extending Resource Constrained Project Scheduling to Mega-Projects with Model-Based Systems Engineering & Hetero-functional Graph Theory
Within the project management context, project scheduling serves as an indispensable component, functioning as a fundamental tool for planning, monitoring, controlling, and managing projects more broadly. Although the resource-constrained project scheduling problem (RCPSP) lies at the core of project management activities, it remains largely disconnected from the broader literature on model-based systems engineering (MBSE), thereby limiting its integration into the design and management of complex systems. The original contribution of this paper is twofold. First, the paper seeks to reconcile the RCPSP with the broader literature and vocabulary of model-based systems engineering and hetero-functional graph theory (HFGT). A concrete translation pipeline from an activity-on-node network to a SysML activity diagram, and then to an operand net is constructed. Using this representation, it specializes the hetero-functional network minimum-cost flow (HFNMCF) formulation to the RCPSP context as a systematic means of HFGT for quantitative analysis and proves that the RCPSP is recoverable as a special case of a broader model. Secondly, on an illustrative instance with renewable and non-renewable operands, the specialized HFNMCF, while producing similar schedules, yields explicit explanations of the project states that enable richer monitoring and control. Overall, the framework preserves the strengths of the classical RCPSP while accommodating real-world constraints and enterprise-level decision processes encountered in large, complex megaprojects.
☆ Active Cooling Device: A Flexible, Lab-Scale Experimental Unit to Develop Spatio-Temporal Temperature Control Strategies
We present an experimental unit that realizes the ``multi-input, multi-output manifold'' thermal management technology proposed by Lamarre & Raymond (2023). The proposed setup can be used for experiments aimed at controlling spatiotemporal temperature distribution. Temperature control is achieved by impinging coolant fluid jets, leveraging a manifold of channels targeted to the surface. The direction of the fluid is controlled by shifting the role of channels between inputs, outputs, or closing them. Files associated with this work include Computer-Aided Design (CAD) STEP files, Gerber files to manufacture a Printed Circuit Board (PCB), and a Graphical User Interface (GUI) written in Python. We provide a step-by-step guide to assemble the experimental setup. We also provide instructions to interact with the setup through the GUI, which allows for real-time tracking of sample temperature and flow rates per flow control device. Additionally, we provide examples of usage of the setup, including system characterization with step response, Proportional-Integral-Derivative performance tracking, and disturbance rejection in a coupled system. Extending the application is accessible through the files provided in the open repository associated with this work. The active cooling device presents a safe, flexible, and complete design, allowing for lab-scale assessment of the performance of custom temperature control strategies using enclosed impinging jets.
♻ ☆ PowerChain: A Verifiable Agentic AI System for Automating Distribution Grid Analyses
Rapid electrification and decarbonization are increasing the complexity of distribution grid (DG) operation and planning, necessitating advanced computational analyses to ensure reliability and resilience. These analyses depend on disparate workflows comprising complex models, function calls, and data pipelines that require substantial expert knowledge and remain difficult to automate. Workforce and budget constraints further limit utilities' ability to apply such analyses at scale. To address this gap, we build an agentic system PowerChain, which is capable of autonomously performing complex grid analyses. Existing agentic AI systems are typically developed in a bottom-up manner with customized context for predefined analysis tasks; therefore, they do not generalize to tasks that the agent has never seen. In comparison, to generalize to unseen DG analysis tasks, PowerChain dynamically generates structured context by leveraging supervisory signals from self-contained power systems tools (e.g., GridLAB-D) and an optimized set of expert-annotated and verified reasoning trajectories. For complex DG tasks defined in natural language, empirical results on real utility data demonstrate that PowerChain achieves up to a 144/% improvement in performance over baselines.
♻ ☆ Rethink Repeatable Measures of Robot Performance with Statistical Query
For a general standardized testing algorithm designed to evaluate a specific aspect of a robot's performance, several key expectations are commonly imposed. Beyond accuracy (i.e., closeness to a typically unknown ground-truth reference) and efficiency (i.e., feasibility within acceptable testing costs and equipment constraints), one particularly important attribute is repeatability. Repeatability refers to the ability to consistently obtain the same testing outcome when similar testing algorithms are executed on the same subject robot by different stakeholders, across different times or locations. However, achieving repeatable testing has become increasingly challenging as the components involved grow more complex, intelligent, diverse, and, most importantly, stochastic. While related efforts have addressed repeatability at ethical, hardware, and procedural levels, this study focuses specifically on repeatable testing at the algorithmic level. Specifically, we target the well-adopted class of testing algorithms in standardized evaluation: statistical query (SQ) algorithms (i.e., algorithms that estimate the expected value of a bounded function over a distribution using sampled data). We propose a lightweight, parameterized, and adaptive modification applicable to any SQ routine, whether based on Monte Carlo sampling, importance sampling, or adaptive importance sampling, that makes it provably repeatable, with guaranteed bounds on both accuracy and efficiency. We demonstrate the effectiveness of the proposed approach across three representative scenarios: (i) established and widely adopted standardized testing of manipulators, (ii) emerging intelligent testing algorithms for operational risk assessment in automated vehicles, and (iii) developing use cases involving command tracking performance evaluation of humanoid robots in locomotion tasks.
♻ ☆ One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares
While large machine learning models have shown remarkable performance in various domains, their training typically requires iterating for many passes over the training data. However, due to computational and memory constraints and potential privacy concerns, storing and accessing all the data is impractical in many real-world scenarios where the data arrives in a stream. In this paper, we investigate the problem of one-pass learning, in which a model is trained on sequentially arriving data without retraining on previous datapoints. Motivated by the demonstrated effectiveness of overparameterized models and the phenomenon of benign overfitting, we propose Orthogonal Recursive Fitting (ORFit), an algorithm for one-pass learning which seeks to perfectly fit each new datapoint while minimally altering the predictions on previous datapoints. ORFit updates the parameters in a direction orthogonal to past gradients, similar to orthogonal gradient descent (OGD) in continual learning. We show that, interestingly, ORFit's update leads to an operation similar to the recursive least-squares (RLS) algorithm in adaptive filtering but with significantly improved memory and computational efficiency, i.e., linear, instead of quadratic, in the number of parameters. To further reduce memory usage, we leverage the structure of the streaming data via an incremental principal component analysis (IPCA). We show that using the principal components is minimax optimal, i.e., it minimizes the worst-case forgetting of previous predictions for unknown future updates. Further, we prove that, for overparameterized linear models, the parameter vector obtained by ORFit matches what the standard multi-pass stochastic gradient descent (SGD) would converge to. Finally, we extend our results to the nonlinear setting for highly overparameterized models, relevant for deep learning.
comment: Journal extension of v1: Y. Min, K, Ahn, N. Azizan, "One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares," IEEE Conference on Decision and Control, 2022
♻ ☆ Nondeterminism-Aware Optimistic Verification for Floating-Point Neural Networks
Neural networks increasingly run on hardware outside the user's control (cloud GPUs, inference marketplaces). Yet ML-as-a-Service reveals little about what actually ran or whether returned outputs faithfully reflect the intended inputs. Users lack recourse against service downgrades (model swaps, quantization, graph rewrites, or discrepancies like altered ad embeddings). Verifying outputs is hard because floating-point(FP) execution on heterogeneous accelerators is inherently nondeterministic. Existing approaches are either impractical for real FP neural networks or reintroduce vendor trust. We present NAO: a Nondeterministic tolerance Aware Optimistic verification protocol that accepts outputs within principled operator-level acceptance regions rather than requiring bitwise equality. NAO combines two error models: (i) sound per-operator IEEE-754 worst-case bounds and (ii) tight empirical percentile profiles calibrated across hardware. Discrepancies trigger a Merkle-anchored, threshold-guided dispute game that recursively partitions the computation graph until one operator remains, where adjudication reduces to a lightweight theoretical-bound check or a small honest-majority vote against empirical thresholds. Unchallenged results finalize after a challenge window, without requiring trusted hardware or deterministic kernels. We implement NAO as a PyTorch-compatible runtime and a contract layer currently deployed on Ethereum Holesky testnet. The runtime instruments graphs, computes per-operator bounds, and runs unmodified vendor kernels in FP32 with negligible overhead (0.3% on Qwen3-8B). Across CNNs, Transformers and diffusion models on A100, H100, RTX6000, RTX4090, empirical thresholds are $10^2-10^3$ times tighter than theoretical bounds, and bound-aware adversarial attacks achieve 0% success. NAO reconciles scalability with verifiability for real-world heterogeneous ML compute.
comment: 17 pages, 7 figures
♻ ☆ Unifying Direct and Indirect Learning for Safe Control of Linear Systems
This paper develops learning-enabled safe controllers for linear systems subject to system uncertainties and bounded disturbances. Given the disturbance zonotope, the databased closed-loop dynamics (CLDs) are first characterized using a matrix zonotope (MZ), and refined through several steps to yield a constrained matrix zonotope (CMZ). This refinement is achieved by introducing conformal equality constraints that eliminate incompatible disturbance realizations. More precisely, prior knowledge and observed data are used separately to construct CMZ representations of disturbance sequences that conform to both data and prior knowledge, and are intersected by the initial MZ of the disturbance sequence, producing a refined CMZ. This approach reduces conservatism. To further reduce the conservativeness, we unify open-loop learning with closed-loop learning by presenting a novel set-membership identification method that models open-loop dynamics as a CMZ. The prior knowledge serves as an initial feasible open-loop model set (FOLMS) of this CMZ, which is refined into a posterior set whenever new informative online data becomes available. This posterior FOLMS then adaptively replaces the prior knowledge set employed in the disturbance elimination of the closed-loop learning process. The resulting refined parameterized set of CLD is subsequently leveraged to directly and adaptively learn a controller that robustly enforces safety. Toward this goal, we formulate a linear programming problem that guarantees {\lambda}contractiveness of a polyhedral safe set. A simulation example is provided to validate the effectiveness of the proposed approach and support the theoretical results.
comment: arXiv admin note: text overlap with arXiv:2502.04195
♻ ☆ Dynamic object goal pushing with mobile manipulators through model-free constrained reinforcement learning ICRA 2025
Non-prehensile pushing to move and reorient objects to a goal is a versatile loco-manipulation skill. In the real world, the object's physical properties and friction with the floor contain significant uncertainties, which makes the task challenging for a mobile manipulator. In this paper, we develop a learning-based controller for a mobile manipulator to move an unknown object to a desired position and yaw orientation through a sequence of pushing actions. The proposed controller for the robotic arm and the mobile base motion is trained using a constrained Reinforcement Learning (RL) formulation. We demonstrate its capability in experiments with a quadrupedal robot equipped with an arm. The learned policy achieves a success rate of 91.35% in simulation and at least 80% on hardware in challenging scenarios. Through our extensive hardware experiments, we show that the approach demonstrates high robustness against unknown objects of different masses, materials, sizes, and shapes. It reactively discovers the pushing location and direction, thus achieving contact-rich behavior while observing only the pose of the object. Additionally, we demonstrate the adaptive behavior of the learned policy towards preventing the object from toppling.
comment: presented at ICRA 2025, Video: https://youtu.be/wGAdPGVf9Ws?si=pi83ONWofHHqbFG0
♻ ☆ Neural 3D Object Reconstruction with Small-Scale Unmanned Aerial Vehicles
Small Unmanned Aerial Vehicles (UAVs) exhibit immense potential for navigating indoor and hard-to-reach areas, yet their significant constraints in payload and autonomy have largely prevented their use for complex tasks like high-quality 3-Dimensional (3D) reconstruction. To overcome this challenge, we introduce a novel system architecture that enables fully autonomous, high-fidelity 3D scanning of static objects using UAVs weighing under 100 grams. Our core innovation lies in a dual-reconstruction pipeline that creates a real-time feedback loop between data capture and flight control. A near-real-time (near-RT) process uses Structure from Motion (SfM) to generate an instantaneous pointcloud of the object. The system analyzes the model quality on the fly and dynamically adapts the UAV's trajectory to intelligently capture new images of poorly covered areas. This ensures comprehensive data acquisition. For the final, detailed output, a non-real-time (non-RT) pipeline employs a Neural Radiance Fields (NeRF)-based Neural 3D Reconstruction (N3DR) approach, fusing SfM-derived camera poses with precise Ultra Wide-Band (UWB) location data to achieve superior accuracy. We implemented and validated this architecture using Crazyflie 2.1 UAVs. Our experiments, conducted in both single- and multi-UAV configurations, conclusively show that dynamic trajectory adaptation consistently improves reconstruction quality over static flight paths. This work demonstrates a scalable and autonomous solution that unlocks the potential of miniaturized UAVs for fine-grained 3D reconstruction in constrained environments, a capability previously limited to much larger platforms.
comment: 13 pages, 16 figures, 3 tables, 45 references
♻ ☆ A Flow-Based Model for Conditional and Probabilistic Electricity Consumption Profile Generation and Prediction
Residential Load Profile (RLP) generation and prediction are critical for the operation and planning of distribution networks, especially as diverse low-carbon technologies (e.g., photovoltaic and electric vehicles) are increasingly adopted. This paper introduces a novel flow-based generative model, termed Full Convolutional Profile Flow (FCPFlow), which is uniquely designed for both conditional and unconditional RLP generation, and for probabilistic load forecasting. By introducing two new layers--the invertible linear layer and the invertible normalization layer--the proposed FCPFlow architecture shows three main advantages compared to traditional statistical and contemporary deep generative models: 1) it is well-suited for RLP generation under continuous conditions, such as varying weather and annual electricity consumption, 2) it demonstrates superior scalability in different datasets compared to traditional statistical models, and 3) it also demonstrates better modeling capabilities in capturing the complex correlation of RLPs compared with deep generative models.
♻ ☆ Asynchronous Federated Learning: A Scalable Approach for Decentralized Machine Learning
Federated Learning (FL) has emerged as a powerful paradigm for decentralized machine learning, enabling collaborative model training across diverse clients without sharing raw data. However, traditional FL approaches often face limitations in scalability and efficiency due to their reliance on synchronous client updates, which can result in significant delays and increased communication overhead, particularly in heterogeneous and dynamic environments. To address these challenges in this paper, we propose an Asynchronous Federated Learning (AFL) algorithm, which allows clients to update the global model independently and asynchronously. Our key contributions include a comprehensive convergence analysis of AFL in the presence of client delays and model staleness. By leveraging martingale difference sequence theory and variance bounds, we ensure robust convergence despite asynchronous updates. Assuming strongly convex local objective functions, we establish bounds on gradient variance under random client sampling and derive a recursion formula quantifying the impact of client delays on convergence. Furthermore, we demonstrate the practical applicability of the AFL algorithm by training decentralized linear regression and Support Vector Machine (SVM) based classifiers and compare its results with synchronous FL algorithm to effectively handling non-IID data distributed among clients. The proposed AFL algorithm addresses key limitations of traditional FL methods, such as inefficiency due to global synchronization and susceptibility to client drift. It enhances scalability, robustness, and efficiency in real-world settings with heterogeneous client populations and dynamic network conditions. Our results underscore the potential of AFL to drive advancements indistributed learning systems, particularly for large-scale, privacy-preserving applications in resource-constrained environments.
♻ ☆ Through-the-Earth Magnetic Induction Communication and Networking: A Comprehensive Survey
Magnetic induction (MI) communication (MIC) has emerged as a promising candidate for underground communication networks due to its excellent penetration capabilities. Integration with Space-Air-Ground-Underground (SAGUI) networks in next-generation mobile communication systems requires a well-defined network architecture. A recent discovery in MIC research, MI fast fading, remains in its early stages and presents unique challenges. This paper provides a comprehensive survey on through-the-earth (TTE) MIC, covering MI applications, channel modeling, point-to-point MIC design, relay techniques, network frameworks, and emerging technologies. We compare various MIC applications to highlight TTE-specific challenges and review the principles of channel modeling, addressing both MI slow fading and MI fast fading, along with its potential impact on existing MIC theories. We conduct a fine-grained decomposition of MI channel power gain into four distinct physical parameters, and propose a novel geometric model to analyze MI fast fading. We also summarize MI relay techniques, examine crosstalk effects in relay and high-density networks, and explore key research tasks within the OSI framework for a holistic MI network protocol in SAGUI. To bridge the gaps identified, we propose a MIC framework that supports TCP/IP and Linux, enabling full implementation of existing and emerging MIC solutions. This framework empowers researchers to leverage Linux resources and deep learning platforms for accelerated development of MIC in SAGUI networks. Remaining research challenges, open issues, and promising novel techniques are further identified to advance MIC research.
comment: This work has been accepted by the IEEE Communications Surveys & Tutorials (COMST) for publication. The final published version will be available on IEEE Xplore
♻ ☆ Finite-time Safety and Reach-avoid Verification of Stochastic Discrete-time Systems
This paper studies finite-time safety and reach-avoid verification for stochastic discrete-time dynamical systems. The aim is to ascertain lower and upper bounds of the probability that, within a predefined finite-time horizon, a system starting from an initial state in a safe set will either exit the safe set (safety verification) or reach a target set while remaining within the safe set until the first encounter with the target (reach-avoid verification). We introduce novel barrier-like sufficient conditions for characterizing these bounds, which either complement existing ones or fill gaps. Finally, we demonstrate the efficacy of these conditions on two examples.
comment: To appear in Information and Computation
♻ ☆ Optimal state estimation: Turnpike analysis and performance results
In this paper, we introduce turnpike arguments in the context of optimal state estimation. In particular, we show that the optimal solution of the state estimation problem involving all available past data serves as turnpike for the solutions of truncated problems involving only a subset of the data. We mathematically formalize this phenomenon and derive a sufficient condition that relies on a decaying sensitivity property of the underlying nonlinear program. As second contribution, we show how a specific turnpike property can be used to establish performance guarantees when approximating the optimal solution of the full problem by a sequence of truncated problems, and we show that the resulting performance (both averaged and non-averaged) is approximately optimal with error terms that can be made arbitrarily small by an appropriate choice of the horizon length. In addition, we discuss interesting implications of these results for the practically relevant case of moving horizon estimation and illustrate our results with a numerical example.
comment: replaced with final version
♻ ☆ Finite Sample Identification of Partially Observed Bilinear Dynamical Systems
We consider the problem of learning a realization of a partially observed bilinear dynamical system (BLDS) from noisy input-output data. Given a single trajectory of input-output samples, we provide a finite time analysis for learning the system's Markov-like parameters, from which a balanced realization of the bilinear system can be obtained. Our bilinear system identification algorithm learns the system's Markov-like parameters by regressing the outputs to highly correlated, nonlinear, and heavy-tailed covariates. Moreover, the stability of BLDS depends on the sequence of inputs used to excite the system. These properties, unique to partially observed bilinear dynamical systems, pose significant challenges to the analysis of our algorithm for learning the unknown dynamics. We address these challenges and provide high probability error bounds on our identification algorithm under a uniform stability assumption. Our analysis provides insights into system theoretic quantities that affect learning accuracy and sample complexity. Lastly, we perform numerical experiments with synthetic data to reinforce these insights.
♻ ☆ Sub-optimality of the Separation Principle for Quadratic Control from Bilinear Observations
We consider the problem of controlling a linear dynamical system from bilinear observations with minimal quadratic cost. Despite the similarity of this problem to standard linear quadratic Gaussian (LQG) control, we show that when the observation model is bilinear, neither does the Separation Principle hold, nor is the optimal controller affine in the estimated state. Moreover, the cost-to-go is non-convex in the control input. Hence, finding an analytical expression for the optimal feedback controller is difficult in general. Under certain settings, we show that the standard LQG controller locally maximizes the cost instead of minimizing it. Furthermore, the optimal controllers (derived analytically) are not unique and are nonlinear in the estimated state. We also introduce a notion of input-dependent observability and derive conditions under which the Kalman filter covariance remains bounded. We illustrate our theoretical results through numerical experiments in multiple synthetic settings.
♻ ☆ Iterated Invariant Extended Kalman Filter (IterIEKF)
We study the mathematical properties of the Invariant Extended Kalman Filter (IEKF) when iterating on the measurement update step, following the principles of the well-known Iterated Extended Kalman Filter. This iterative variant of the IEKF (IterIEKF) systematically improves its accuracy through Gauss-Newton-based relinearization, and exhibits additional theoretical properties, particularly in the low-noise regime, that resemble those of the linear Kalman filter. We apply the proposed approach to the problem of estimating the extended pose of a crane payload using an inertial measurement unit. Our results suggest that the IterIEKF significantly outperforms the IEKF when measurements are highly accurate.
Graphics 6
☆ MorphModes: Non-rigid Registration via Adaptive Skinning Eigenmodes
Non-rigid registration is a crucial task with applications in medical imaging, industrial robotics, computer vision, and entertainment. Standard approaches accomplish this task using variations on the Non-Rigid Iterative Closest Point (NRICP) algorithms, which are prone to local minima and sensitive to initial conditions. We instead formulate the non-rigid registration problem as a Signed Distance Function (SDF) matching optimization problem, which provides richer shape information compared to traditional ICP methods. To avoid degenerate solutions, we propose to use a smooth Skinning Eigenmode subspace to parameterize the optimization problem. Finally, we propose an adaptive subspace optimization scheme to allow the resolution of localized deformations within the optimization. The result is a non-rigid registration algorithm that is more robust than NRICP, without the parameter sensitivity present in other SDF-matching approaches.
☆ ORDENA: ORigin-DEstiNAtion data exploration
Analyzing origin-destination flows is an important problem that has been extensively investigated in several scientific fields, particularly by the visualization community. The problem becomes especially challenging when involving massive data, demanding mechanisms such as data aggregation and interactive filtering to make the exploratory process doable. However, data aggregation tends to smooth out certain patterns, and deciding which data should be filtered is not straightforward. In this work, we propose ORDENA, a visual analytic tool to explore origin and destination data. ORDENA is built upon a simple and intuitive scatter plot where the horizontal and vertical axes correspond to origins and destinations. Therefore, each origin-destination flow is represented as a point in the scatter plot. How the points are organized in the plot layout reveals important spatial phenomena present in the data. Moreover, ORDENA provides explainability resources that allow users to better understand the relation between origin-destination flows and associated attributes. We illustrate ORDENA's effectiveness in a set of case studies, which have also been elaborated in collaboration with domain experts. The proposed tool has also been evaluated by domain experts not involved in its development, which provided quite positive feedback about ORDENA.
☆ From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation
Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GPRO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient. (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model's temporal dynamics by prioritizing prompt-following in the early, identity preservation in the later. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
☆ A Generalizable Light Transport 3D Embedding for Global Illumination
Global illumination (GI) is essential for realistic rendering but remains computationally expensive due to the complexity of simulating indirect light transport. Recent neural methods have mainly relied on per-scene optimization, sometimes extended to handle changes in camera or geometry. Efforts toward cross-scene generalization have largely stayed in 2D screen space, such as neural denoising or G-buffer based GI prediction, which often suffer from view inconsistency and limited spatial understanding. We propose a generalizable 3D light transport embedding that approximates global illumination directly from 3D scene configurations, without using rasterized or path-traced cues. Each scene is represented as a point cloud with geometric and material features. A scalable transformer models global point-to-point interactions to encode these features into neural primitives. At render time, each query point retrieves nearby primitives via nearest-neighbor search and aggregates their latent features through cross-attention to predict the desired rendering quantity. We demonstrate results on diffuse global illumination prediction across diverse indoor scenes with varying layouts, geometry, and materials. The embedding trained for irradiance estimation can be quickly adapted to new rendering tasks with limited fine-tuning. We also present preliminary results for spatial-directional radiance field estimation for glossy materials and show how the normalized field can accelerate unbiased path guiding. This approach highlights a path toward integrating learned priors into rendering pipelines without explicit ray-traced illumination cues.
☆ Visually Comparing Graph Vertex Ordering Algorithms through Geometrical and Topological Approaches
Graph vertex ordering is widely employed in spatial data analysis, especially in urban analytics, where street graphs serve as spatial discretization for modeling and simulation. It is also crucial for visualization, as many methods require vertices to be arranged in a well-defined order to reveal non-trivial patterns. The goal of vertex ordering methods is to preserve neighborhood relations, but the structural complexity of real-world graphs often introduces distortions. Comparing different ordering methods is therefore essential to identify the most suitable one for each application. Existing metrics for assessing spatial vertex ordering typically focus on global quality, which hinders the identification of localized distortions. Visual evaluation is particularly valuable, as it allows analysts to compare methods within a single visualization, assess distortions, identify anomalous regions, and, in urban contexts, explain spatial inconsistencies. This work presents a visualization-assisted tool for assessing vertex ordering techniques, with a focus on urban analytics. We evaluate geometric and topological ordering approaches using urban street graphs. The visual tool integrates existing and newly proposed metrics, validated through experiments on data from multiple cities. Results demonstrate that the proposed methodology effectively supports users in selecting suitable vertex ordering techniques, tuning hyperparameters, and identifying regions with high ordering distortions.
♻ ☆ CAD-Coder: Text-to-CAD Generation with Chain-of-Thought and Geometric Reward
In this work, we introduce CAD-Coder, a novel framework that reformulates text-to-CAD as the generation of CadQuery scripts - a Python-based, parametric CAD language. This representation enables direct geometric validation, a richer modeling vocabulary, and seamless integration with existing LLMs. To further enhance code validity and geometric fidelity, we propose a two-stage learning pipeline: (1) supervised fine-tuning on paired text-CadQuery data, and (2) reinforcement learning with Group Reward Policy Optimization (GRPO), guided by a CAD-specific reward comprising both a geometric reward (Chamfer Distance) and a format reward. We also introduce a chain-of-thought (CoT) planning process to improve model reasoning, and construct a large-scale, high-quality dataset of 110K text-CadQuery-3D model triplets and 1.5K CoT samples via an automated pipeline. Extensive experiments demonstrate that CAD-Coder enables LLMs to generate diverse, valid, and complex CAD models directly from natural language, advancing the state of the art of text-to-CAD generation and geometric reasoning.
Computation and Language 137
☆ Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehen- sive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.
☆ Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause for this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as the KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
☆ How Do LLMs Use Their Depth?
Growing evidence suggests that large language models do not use their depth uniformly, yet we still lack a fine-grained understanding of their layer-wise prediction dynamics. In this paper, we trace the intermediate representations of several open-weight models during inference and reveal a structured and nuanced use of depth. Specifically, we propose a "Guess-then-Refine" framework that explains how LLMs internally structure their computations to make predictions. We first show that the top-ranked predictions in early LLM layers are composed primarily of high-frequency tokens, which act as statistical guesses proposed by the model early on due to the lack of appropriate contextual information. As contextual information develops deeper into the model, these initial guesses get refined into contextually appropriate tokens. Even high-frequency token predictions from early layers get refined >70% of the time, indicating that correct token prediction is not "one-and-done". We then go beyond frequency-based prediction to examine the dynamic usage of layer depth across three case studies. (i) Part-of-speech analysis shows that function words are, on average, the earliest to be predicted correctly. (ii) Fact recall task analysis shows that, in a multi-token answer, the first token requires more computational depth than the rest. (iii) Multiple-choice task analysis shows that the model identifies the format of the response within the first half of the layers, but finalizes its response only toward the end. Together, our results provide a detailed view of depth usage in LLMs, shedding light on the layer-by-layer computations that underlie successful predictions and providing insights for future works to improve computational efficiency in transformer-based models.
☆ LightMem: Lightweight and Efficient Memory-Augmented Generation
Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. The code is available at https://github.com/zjunlp/LightMem.
comment: Work in progress
☆ Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
We present Ring-1T, the first open-source, state-of-the-art thinking model with a trillion-scale parameter. It features 1 trillion total parameters and activates approximately 50 billion per token. Training such models at a trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby obtaining high time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T parameter MoE model to the community, we provide the research community with direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.
comment: Technical Report
☆ Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. Personalized Qwen2.5-7B achieves an average 11\% win-rate improvement, and personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
comment: work in progress
☆ See the Text: From Tokenization to Visual Reading
People see text. Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, before connecting them to meaning, which enables us to handle typos, distorted fonts, and various scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, fragmenting text into pieces from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading, and takes a step toward more natural and cognitively inspired language models.
☆ MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts-especially in distributed settings-remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to a 6x higher training throughput while preserving model accuracy. Our code is available at https://github.com/microsoft/MInference/tree/main/MTraining.
☆ Fine-Tuned Thoughts: Leveraging Chain-of-Thought Reasoning for Industrial Asset Health Monitoring EMNLP 2025
Small Language Models (SLMs) are becoming increasingly popular in specialized fields, such as industrial applications, due to their efficiency, lower computational requirements, and ability to be fine-tuned for domain-specific tasks, enabling accurate and cost-effective solutions. However, performing complex reasoning using SLMs in specialized fields such as Industry 4.0 remains challenging. In this paper, we propose a knowledge distillation framework for industrial asset health, which transfers reasoning capabilities via Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) to smaller, more efficient models (SLMs). We discuss the advantages and the process of distilling LLMs using multi-choice question answering (MCQA) prompts to enhance reasoning and refine decision-making. We also perform in-context learning to verify the quality of the generated knowledge and benchmark the performance of fine-tuned SLMs with generated knowledge against widely used LLMs. The results show that the fine-tuned SLMs with CoT reasoning outperform the base models by a significant margin, narrowing the gap to their LLM counterparts. Our code is open-sourced at: https://github.com/IBM/FailureSensorIQ.
comment: Accepted at EMNLP 2025
☆ WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection
Search agents have achieved significant advancements in enabling intelligent information retrieval and decision-making within interactive environments. Although reinforcement learning has been employed to train agentic models capable of more dynamic interactive retrieval, existing methods are limited by shallow tool-use depth and the accumulation of errors over multiple iterative interactions. In this paper, we present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Specifically, we construct a large dataset annotated with reflection patterns and design a two-stage training framework that unifies cold start and reinforcement learning within the self-reflection paradigm for real-world web-based environments, which enables the model to generate longer and more reflective tool-use trajectories. Our approach substantially extends tool-use chains and improves answer accuracy. Using a single 14B model, we achieve state-of-the-art results on HotpotQA and SimpleQA, with accuracies of 72.3% and 90.0%, respectively, and demonstrate strong generalization to out-of-distribution datasets. The code is available at https://github.com/99hgz/WebSeer
☆ KAT-Coder Technical Report
Recent advances in large language models (LLMs) have enabled progress in agentic coding, where models autonomously reason, plan, and act within interactive software development workflows. However, bridging the gap between static text-based training and dynamic real-world agentic execution remains a core challenge. In this technical report, we present KAT-Coder, a large-scale agentic code model trained through a multi-stage curriculum encompassing Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and Reinforcement-to-Deployment Adaptation. The Mid-Term stage enhances reasoning, planning, and reflection capabilities through a corpus of real software engineering data and synthetic agentic interactions. The SFT stage constructs a million-sample dataset balancing twenty programming languages, ten development contexts, and ten task archetypes. The RFT stage introduces a novel multi-ground-truth reward formulation for stable and sample-efficient policy optimization. Finally, the Reinforcement-to-Deployment phase adapts the model to production-grade IDE environments using Error-Masked SFT and Tree-Structured Trajectory Training. In summary, these stages enable KAT-Coder to achieve robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. Our KAT series 32B model, KAT-Dev, has been open-sourced on https://huggingface.co/Kwaipilot/KAT-Dev.
☆ AI use in American newspapers is widespread, uneven, and rarely disclosed
AI is rapidly transforming journalism, but the extent of its use in published newspaper articles remains unclear. We address this gap by auditing a large-scale dataset of 186K articles from online editions of 1.5K American newspapers published in the summer of 2025. Using Pangram, a state-of-the-art AI detector, we discover that approximately 9% of newly-published articles are either partially or fully AI-generated. This AI use is unevenly distributed, appearing more frequently in smaller, local outlets, in specific topics such as weather and technology, and within certain ownership groups. We also analyze 45K opinion pieces from Washington Post, New York Times, and Wall Street Journal, finding that they are 6.4 times more likely to contain AI-generated content than news articles from the same publications, with many AI-flagged op-eds authored by prominent public figures. Despite this prevalence, we find that AI use is rarely disclosed: a manual audit of 100 AI-flagged articles found only five disclosures of AI use. Overall, our audit highlights the immediate need for greater transparency and updated editorial standards regarding the use of AI in journalism to maintain public trust.
☆ Topoformer: brain-like topographic organization in Transformer language models through spatial querying and reweighting ICLR 2024
Spatial functional organization is a hallmark of biological brains: neurons are arranged topographically according to their response properties, at multiple scales. In contrast, representations within most machine learning models lack spatial biases, instead manifesting as disorganized vector spaces that are difficult to visualize and interpret. Here, we propose a novel form of self-attention that turns Transformers into "Topoformers" with topographic organization. We introduce spatial querying - where keys and queries are arranged on 2D grids, and local pools of queries are associated with a given key - and spatial reweighting, where we convert the standard fully connected layer of self-attention into a locally connected layer. We first demonstrate the feasibility of our approach by training a 1-layer Topoformer on a sentiment classification task. Training with spatial querying encourages topographic organization in the queries and keys, and spatial reweighting separately encourages topographic organization in the values and self-attention outputs. We then apply the Topoformer motifs at scale, training a BERT architecture with a masked language modeling objective. We find that the topographic variant performs on par with a non-topographic control model on NLP benchmarks, yet produces interpretable topographic organization as evaluated via eight linguistic test suites. Finally, analyzing an fMRI dataset of human brain responses to a large set of naturalistic sentences, we demonstrate alignment between low-dimensional topographic variability in the Topoformer model and human brain language network. Scaling up Topoformers further holds promise for greater interpretability in NLP research, and for more accurate models of the organization of linguistic information in the human brain.
comment: ICLR 2024 Workshop on Representational Alignment (Re-Align) Camera Ready
☆ Verifiable Accuracy and Abstention Rewards in Curriculum RL to Alleviate Lost-in-Conversation
Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by the current progress on Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers, but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (in terms of instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building multi-turn reliable and trustworthy LLMs.
☆ SemiAdapt and SemiLoRA: Efficient Domain Adaptation for Transformer-based Low-Resource Language Translation with a Case Study on Irish
Fine-tuning is widely used to tailor large language models for specific tasks such as neural machine translation (NMT). However, leveraging transfer learning is computationally expensive when fine-tuning large multilingual models with billions of parameters, thus creating a barrier to entry for researchers working on low-resource domains such as Irish translation. Parameter-efficient fine-tuning (PEFT) bridges this gap by training on a fraction of the original model parameters, with the Low-Rank Adaptation (LoRA) approach introducing small, trainable adapter layers. We introduce SemiAdapt and SemiLoRA as semi-supervised inference-efficient approaches that strengthen domain adaptation and lead to improved overall performance in NMT. We demonstrate that SemiAdapt can outperform full-domain fine-tuning, while most notably, SemiLoRA can propel PEFT methods to match or even outperform full-model fine-tuning. We further evaluate domain-by-dataset fine-tuning and demonstrate that our embedding-based inference methods perform especially well on larger and noisier corpora. All Irish translation models developed in this work are released as open resources. These methods aim to make high-quality domain adaptation and fine-tuning more accessible to researchers working with low-resource languages.
comment: 8 pages
☆ Adapting Language Balance in Code-Switching Speech ICASSP 2026
Despite achieving impressive results on standard benchmarks, large foundational models still struggle against code-switching test cases. When data scarcity cannot be used as the usual justification for poor performance, the reason may lie in the infrequent occurrence of code-switched moments, where the embedding of the second language appears subtly. Instead of expecting the models to learn this infrequency on their own, it might be beneficial to provide the training process with labels. Evaluating model performance on code-switching data requires careful localization of code-switching points where recognition errors are most consequential, so that the analysis emphasizes mistakes occurring at those moments. Building on this observation, we leverage the difference between the embedded and the main language to highlight those code-switching points and thereby emphasize learning at those locations. This simple yet effective differentiable surrogate mitigates context bias during generation -- the central challenge in code-switching -- thereby improving the model's robustness. Our experiments with Arabic and Chinese-English showed that the models are able to predict the switching places more correctly, reflected by the reduced substitution error.
comment: Submitted to ICASSP 2026
☆ Bayesian Low-Rank Factorization for Robust Model Adaptation ICASSP 2026
Large speech foundation models achieve strong performance across many domains, but they often require adaptation to handle local needs such as code-switching, where speakers mix languages within the same utterance. Direct fine-tuning of these models risks overfitting to the target domain and overwriting the broad capabilities of the base model. To address this challenge, we explore Bayesian factorized adapters for speech foundation models, which place priors near zero to achieve sparser adaptation matrices and thereby retain general performance while adapting to specific domains. We apply our approach to the Whisper model and evaluate on different multilingual code-switching scenarios. Our results show only minimal adaptation loss while significantly reducing catastrophic forgetting of the base model. Compared to LoRA, our method achieves a backward gain of 54% with only a 4% drop on the new domain. These findings highlight the effectiveness of Bayesian adaptation for fine-tuning speech foundation models without sacrificing generalization.
comment: Submitted to ICASSP 2026
☆ Investigating LLM Capabilities on Long Context Comprehension for Medical Question Answering
This study is the first to investigate LLM comprehension capabilities over long-context (LC) medical QA of clinical relevance. Our comprehensive assessment spans a range of content-inclusion settings based on their relevance, LLM models of varying capabilities and datasets across task formulations, revealing insights on model size effects, limitations, underlying memorization issues and the benefits of reasoning models. Importantly, we examine the effect of RAG on medical LC comprehension, uncover best settings in single versus multi-document reasoning datasets and showcase RAG strategies for improvements over LC. We shed light into some of the evaluation aspects using a multi-faceted approach. Our qualitative and error analyses address open questions on when RAG is beneficial over LC, revealing common failure cases.
☆ MLMA: Towards Multilingual with Mamba Based Architectures ICASSP 2026
Multilingual automatic speech recognition (ASR) remains a challenging task, especially when balancing performance across high- and low-resource languages. Recent advances in sequence modeling suggest that architectures beyond Transformers may offer better scalability and efficiency. In this work, we introduce MLMA (Multilingual Language Modeling with Mamba for ASR), a new approach that leverages the Mamba architecture -- an efficient state-space model optimized for long-context sequence processing -- for multilingual ASR. Using Mamba, MLMA implicitly incorporates language-aware conditioning and shared representations to support robust recognition across diverse languages. Experiments on standard multilingual benchmarks show that MLMA achieves competitive performance compared to Transformer-based architectures. These results highlight Mamba's potential as a strong backbone for scalable, efficient, and accurate multilingual speech recognition.
comment: The paper is under review at ICASSP 2026
☆ Dynamical model parameters from ultrasound tongue kinematics
The control of speech can be modelled as a dynamical system in which articulators are driven toward target positions. These models are typically evaluated using fleshpoint data, such as electromagnetic articulography (EMA), but recent methodological advances make ultrasound imaging a promising alternative. We evaluate whether the parameters of a linear harmonic oscillator can be reliably estimated from ultrasound tongue kinematics and compare these with parameters estimated from simultaneously-recorded EMA data. We find that ultrasound and EMA yield comparable dynamical parameters, while mandibular short tendon tracking also adequately captures jaw motion. This supports using ultrasound kinematics to evaluate dynamical articulatory models.
comment: Accepted for publication in JASA Express Letters
☆ Beyond the Explicit: A Bilingual Dataset for Dehumanization Detection in Social Media
Digital dehumanization, although a critical issue, remains largely overlooked within the field of computational linguistics and Natural Language Processing. The prevailing approach in current research concentrating primarily on a single aspect of dehumanization that identifies overtly negative statements as its core marker. This focus, while crucial for understanding harmful online communications, inadequately addresses the broader spectrum of dehumanization. Specifically, it overlooks the subtler forms of dehumanization that, despite not being overtly offensive, still perpetuate harmful biases against marginalized groups in online interactions. These subtler forms can insidiously reinforce negative stereotypes and biases without explicit offensiveness, making them harder to detect yet equally damaging. Recognizing this gap, we use different sampling methods to collect a theory-informed bilingual dataset from Twitter and Reddit. Using crowdworkers and experts to annotate 16,000 instances on a document- and span-level, we show that our dataset covers the different dimensions of dehumanization. This dataset serves as both a training resource for machine learning models and a benchmark for evaluating future dehumanization detection techniques. To demonstrate its effectiveness, we fine-tune ML models on this dataset, achieving performance that surpasses state-of-the-art models in zero and few-shot in-context settings.
☆ Large language models for folktale type automation based on motifs: Cinderella case study
Artificial intelligence approaches are being adapted to many research areas, including digital humanities. We built a methodology for large-scale analyses in folkloristics. Using machine learning and natural language processing, we automatically detected motifs in a large collection of Cinderella variants and analysed their similarities and differences with clustering and dimensionality reduction. The results show that large language models detect complex interactions in tales, enabling computational analysis of extensive text collections and facilitating cross-lingual comparisons.
☆ Building Trust in Clinical LLMs: Bias Analysis and Dataset Transparency EMNLP
Large language models offer transformative potential for healthcare, yet their responsible and equitable development depends critically on a deeper understanding of how training data characteristics influence model behavior, including the potential for bias. Current practices in dataset curation and bias assessment often lack the necessary transparency, creating an urgent need for comprehensive evaluation frameworks to foster trust and guide improvements. In this study, we present an in-depth analysis of potential downstream biases in clinical language models, with a focus on differential opioid prescription tendencies across diverse demographic groups, such as ethnicity, gender, and age. As part of this investigation, we introduce HC4: Healthcare Comprehensive Commons Corpus, a novel and extensively curated pretraining dataset exceeding 89 billion tokens. Our evaluation leverages both established general benchmarks and a novel, healthcare-specific methodology, offering crucial insights to support fairness and safety in clinical AI applications.
comment: Accepted to EMNLP Main 2025
☆ Identity-Aware Large Language Models require Cultural Reasoning
Large language models have become the latest trend in natural language processing, heavily featuring in the digital tools we use every day. However, their replies often reflect a narrow cultural viewpoint that overlooks the diversity of global users. This missing capability could be referred to as cultural reasoning, which we define here as the capacity of a model to recognise culture-specific knowledge values and social norms, and to adjust its output so that it aligns with the expectations of individual users. Because culture shapes interpretation, emotional resonance, and acceptable behaviour, cultural reasoning is essential for identity-aware AI. When this capacity is limited or absent, models can sustain stereotypes, ignore minority perspectives, erode trust, and perpetuate hate. Recent empirical studies strongly suggest that current models default to Western norms when judging moral dilemmas, interpreting idioms, or offering advice, and that fine-tuning on survey data only partly reduces this tendency. The present evaluation methods mainly report static accuracy scores and thus fail to capture adaptive reasoning in context. Although broader datasets can help, they cannot alone ensure genuine cultural competence. Therefore, we argue that cultural reasoning must be treated as a foundational capability alongside factual accuracy and linguistic coherence. By clarifying the concept and outlining initial directions for its assessment, a foundation is laid for future systems to be able to respond with greater sensitivity to the complex fabric of human culture.
☆ Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
comment: Accepted by The 38th Conference of Open Innovations Association FRUCT, 2025
☆ How Efficient Are Diffusion Language Models? A Critical Examination of Efficiency Evaluation Practices
Diffusion language models (DLMs) have emerged as a promising alternative to the long-dominant autoregressive (AR) paradigm, offering a parallelable decoding process that could yield greater efficiency. Yet, in practice, current open-source DLMs often underperform their AR counterparts in speed, limiting their real-world utility. This work presents a systematic study of DLM efficiency, identifying key issues in prior evaluation methods. Through empirical benchmarking and a roofline-based theoretical analysis, we demonstrate that AR models generally achieve higher throughput, while DLMs consistently lag. We also investigate acceleration strategies, finding that techniques like dual cache and parallel decoding mainly offer gains at small batch sizes, with their benefits diminishing upon scaling. Our findings underscore the necessity of robust evaluation methods and improved acceleration strategies to advance research on DLMs.
☆ Probabilistic Modeling of Intentions in Socially Intelligent LLM Agents
We present a probabilistic intent modeling framework for large language model (LLM) agents in multi-turn social dialogue. The framework maintains a belief distribution over a partner's latent intentions, initialized from contextual priors and dynamically updated through likelihood estimation after each utterance. The evolving distribution provides additional contextual grounding for the policy, enabling adaptive dialogue strategies under uncertainty. Preliminary experiments in the SOTOPIA environment show consistent improvements: the proposed framework increases the Overall score by 9.0% on SOTOPIA-All and 4.1% on SOTOPIA-Hard compared with the Qwen2.5-7B baseline, and slightly surpasses an oracle agent that directly observes partner intentions. These early results suggest that probabilistic intent modeling can contribute to the development of socially intelligent LLM agents.
☆ DART: A Structured Dataset of Regulatory Drug Documents in Italian for Clinical NLP
The extraction of pharmacological knowledge from regulatory documents has become a key focus in biomedical natural language processing, with applications ranging from adverse event monitoring to AI-assisted clinical decision support. However, research in this field has predominantly relied on English-language corpora such as DrugBank, leaving a significant gap in resources tailored to other healthcare systems. To address this limitation, we introduce DART (Drug Annotation from Regulatory Texts), the first structured corpus of Italian Summaries of Product Characteristics derived from the official repository of the Italian Medicines Agency (AIFA). The dataset was built through a reproducible pipeline encompassing web-scale document retrieval, semantic segmentation of regulatory sections, and clinical summarization using a few-shot-tuned large language model with low-temperature decoding. DART provides structured information on key pharmacological domains such as indications, adverse drug reactions, and drug-drug interactions. To validate its utility, we implemented an LLM-based drug interaction checker that leverages the dataset to infer clinically meaningful interactions. Experimental results show that instruction-tuned LLMs can accurately infer potential interactions and their clinical implications when grounded in the structured textual fields of DART. We publicly release our code on GitHub: https://github.com/PRAISELab-PicusLab/DART.
☆ CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
While Large Language Models (LLMs) excel at code generation by learning from vast code corpora, a fundamental semantic gap remains between their training on textual patterns and the goal of functional correctness, which is governed by formal execution semantics. Reinforcement Learning with Verifiable Rewards (RLVR) approaches attempt to bridge this gap using outcome rewards from executing test cases. However, solely relying on binary pass/fail signals is inefficient for establishing a well-aligned connection between the textual representation of code and its execution semantics, especially for subtle logical errors within the code. In this paper, we propose CodeRL+, a novel approach that integrates execution semantics alignment into the RLVR training pipeline for code generation. CodeRL+ enables the model to infer variable-level execution trajectory, providing a direct learning signal of execution semantics. CodeRL+ can construct execution semantics alignment directly using existing on-policy rollouts and integrates seamlessly with various RL algorithms. Extensive experiments demonstrate that CodeRL+ outperforms post-training baselines (including RLVR and Distillation), achieving a 4.6% average relative improvement in pass@1. CodeRL+ generalizes effectively to other coding tasks, yielding 15.5% and 4.4% higher accuracy on code-reasoning and test-output-generation benchmarks, respectively. CodeRL+ shows strong applicability across diverse RL algorithms and LLMs. Furthermore, probe analyses provide compelling evidence that CodeRL+ strengthens the alignment between code's textual representations and its underlying execution semantics.
☆ IMB: An Italian Medical Benchmark for Question Answering
Online medical forums have long served as vital platforms where patients seek professional healthcare advice, generating vast amounts of valuable knowledge. However, the informal nature and linguistic complexity of forum interactions pose significant challenges for automated question answering systems, especially when dealing with non-English languages. We present two comprehensive Italian medical benchmarks: \textbf{IMB-QA}, containing 782,644 patient-doctor conversations from 77 medical categories, and \textbf{IMB-MCQA}, comprising 25,862 multiple-choice questions from medical specialty examinations. We demonstrate how Large Language Models (LLMs) can be leveraged to improve the clarity and consistency of medical forum data while retaining their original meaning and conversational style, and compare a variety of LLM architectures on both open and multiple-choice question answering tasks. Our experiments with Retrieval Augmented Generation (RAG) and domain-specific fine-tuning reveal that specialized adaptation strategies can outperform larger, general-purpose models in medical question answering tasks. These findings suggest that effective medical AI systems may benefit more from domain expertise and efficient information retrieval than from increased model scale. We release both datasets and evaluation frameworks in our GitHub repository to support further research on multilingual medical question answering: https://github.com/PRAISELab-PicusLab/IMB.
☆ CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning
Although WordNet is a valuable resource owing to its structured semantic networks and extensive vocabulary, its fine-grained sense distinctions can be challenging for second-language learners. To address this, we developed a WordNet annotated with the Common European Framework of Reference for Languages (CEFR), integrating its semantic networks with language-proficiency levels. We automated this process using a large language model to measure the semantic similarity between sense definitions in WordNet and entries in the English Vocabulary Profile Online. To validate our method, we constructed a large-scale corpus containing both sense and CEFR-level information from our annotated WordNet and used it to develop contextual lexical classifiers. Our experiments demonstrate that models fine-tuned on our corpus perform comparably to those trained on gold-standard annotations. Furthermore, by combining our corpus with the gold-standard data, we developed a practical classifier that achieves a Macro-F1 score of 0.81, indicating the high accuracy of our annotations. Our annotated WordNet, corpus, and classifiers are publicly available to help bridge the gap between natural language processing and language education, thereby facilitating more effective and efficient language learning.
☆ DePass: Unified Feature Attributing by Simple Decomposed Forward Pass
Attributing the behavior of Transformer models to internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass. DePass decomposes hidden states into customized additive components, then propagates them with attention scores and MLP's activations fixed. It achieves faithful, fine-grained attribution without requiring auxiliary training. We validate DePass across token-level, model component-level, and subspace-level attribution tasks, demonstrating its effectiveness and fidelity. Our experiments highlight its potential to attribute information flow between arbitrary components of a Transformer model. We hope DePass serves as a foundational tool for broader applications in interpretability.
☆ ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks
Retrieval Augmented Generation (RAG) systems are increasingly vital in dynamic domains like online gaming, yet the lack of a dedicated benchmark has impeded standardized evaluation in this area. The core difficulty lies in Dual Dynamics: the constant interplay between game content updates and the shifting focus of the player community. Furthermore, the necessity of automating such a benchmark introduces a critical requirement for player-centric authenticity to ensure generated questions are realistic. To address this integrated challenge, we introduce ChronoPlay, a novel framework for the automated and continuous generation of game RAG benchmarks. ChronoPlay utilizes a dual-dynamic update mechanism to track both forms of change, and a dual-source synthesis engine that draws from official sources and player community to ensure both factual correctness and authentic query patterns. We instantiate our framework on three distinct games to create the first dynamic RAG benchmark for the gaming domain, offering new insights into model performance under these complex and realistic conditions. Code is avaliable at: https://github.com/hly1998/ChronoPlay.
☆ Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models
Large language models are increasingly used for creative writing and engagement content, raising safety concerns about the outputs. Therefore, casting humor generation as a testbed, this work evaluates how funniness optimization in modern LLM pipelines couples with harmful content by jointly measuring humor, stereotypicality, and toxicity. This is further supplemented by analyzing incongruity signals through information-theoretic metrics. Across six models, we observe that harmful outputs receive higher humor scores which further increase under role-based prompting, indicating a bias amplification loop between generators and evaluators. Information-theoretic analyses show harmful cues widen predictive uncertainty and surprisingly, can even make harmful punchlines more expected for some models, suggesting structural embedding in learned humor distributions. External validation on an additional satire-generation task with human perceived funniness judgments shows that LLM satire increases stereotypicality and typically toxicity, including for closed models. Quantitatively, stereotypical/toxic jokes gain $10-21\%$ in mean humor score, stereotypical jokes appear $11\%$ to $28\%$ more often among the jokes marked funny by LLM-based metric and up to $10\%$ more often in generations perceived as funny by humans.
☆ Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation
Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.
☆ Chain-of-Conceptual-Thought: Eliciting the Agent to Deeply Think within the Response
Chain-of-Thought (CoT) is widely applied to improve the LLM capability in math, coding and reasoning tasks. However, its performance is limited for open-domain tasks since there are no clearly defined reasoning steps or logical transitions. To mitigate such challenges, we propose another prompt-based paradigm called Chain of Conceptual Thought (CoCT), where the LLM first tags a concept, then generates the detailed content. The chain of concepts is allowed within the utterance, encouraging the LLM's deep and strategic thinking. We experiment with this paradigm in daily and emotional support conversations where the concept is comprised of emotions, strategies and topics. Automatic, human and model evaluations suggest that CoCT surpasses baselines such as Self-Refine, ECoT, ToT, SoT and RAG, suggesting a potential effective prompt-based paradigm of LLM for a wider scope of tasks.
☆ Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference
Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent multi-turn dialogue. However, such extended contexts exacerbate the quadratic cost of self-attention, leading to severe latency in autoregressive decoding. Existing sparse attention methods alleviate these costs but rely on heuristic patterns that struggle to recall critical key-value (KV) pairs for each query, resulting in accuracy degradation. We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Adamas applies the Hadamard transform, bucketization and 2-bit compression to produce compact representations, and leverages Manhattan-distance estimation for efficient top-k selections. Experiments show that Adamas matches the accuracy of full attention with only a 64-token budget, achieves near-lossless performance at 128, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods while delivering up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences. Remarkably, Adamas attains comparable or even lower perplexity than full attention, underscoring its effectiveness in maintaining accuracy under aggressive sparsity.
☆ MENTOR: A Reinforcement Learning Framework for Model Enhancement via Teacher-Optimized Rewards in Small Models
Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor generalization as it trains models to imitate a static set of teacher trajectories rather than learn a robust methodology. While reinforcement learning (RL) offers an alternative, the standard RL using sparse rewards fails to effectively guide SLMs, causing them to struggle with inefficient exploration and adopt suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to solve the problem of reward sparsity, it uses a teacher's reference trajectory to construct a dense, composite teacher-guided reward that provides fine-grained guidance. Extensive experiments demonstrate that MENTOR significantly improves the cross-domain generalization and strategic competence of SLMs compared to both SFT and standard sparse-reward RL baselines.
☆ Towards Fair ASR For Second Language Speakers Using Fairness Prompted Finetuning ICASSP 2026
In this work, we address the challenge of building fair English ASR systems for second-language speakers. Our analysis of widely used ASR models, Whisper and Seamless-M4T, reveals large fluctuations in word error rate (WER) across 26 accent groups, indicating significant fairness gaps. To mitigate this, we propose fairness-prompted finetuning with lightweight adapters, incorporating Spectral Decoupling (SD), Group Distributionally Robust Optimization (Group-DRO), and Invariant Risk Minimization (IRM). Our proposed fusion of traditional empirical risk minimization (ERM) with cross-entropy and fairness-driven objectives (SD, Group DRO, and IRM) enhances fairness across accent groups while maintaining overall recognition accuracy. In terms of macro-averaged word error rate, our approach achieves a relative improvement of 58.7% and 58.5% over the large pretrained Whisper and SeamlessM4T, and 9.7% and 7.8% over them, finetuning with standard empirical risk minimization with cross-entropy loss.
comment: Submitted to ICASSP 2026
☆ KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs
We present $\textbf{Korean SimpleQA (KoSimpleQA)}$, a benchmark for evaluating factuality in large language models (LLMs) with a focus on Korean cultural knowledge. KoSimpleQA is designed to be challenging yet easy to grade, consisting of 1,000 short, fact-seeking questions with unambiguous answers. We conduct a comprehensive evaluation across a diverse set of open-source LLMs of varying sizes that support Korean, and find that even the strongest model generates correct answer only 33.7% of the time, underscoring the challenging nature of KoSimpleQA. Notably, performance rankings on KoSimpleQA differ substantially from those on the English SimpleQA, highlighting the unique value of our dataset. Furthermore, our analysis of reasoning LLMs shows that engaging reasoning capabilities in the factual QA task can both help models better elicit their latent knowledge and improve their ability to abstain when uncertain. KoSimpleQA can be found at https://anonymous.4open.science/r/KoSimpleQA-62EB.
☆ KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers
In Bangladesh, many farmers continue to face challenges in accessing timely, expert-level agricultural guidance. This paper presents KrishokBondhu, a voice-enabled, call-centre-integrated advisory platform built on a Retrieval-Augmented Generation (RAG) framework, designed specifically for Bengali-speaking farmers. The system aggregates authoritative agricultural handbooks, extension manuals, and NGO publications; applies Optical Character Recognition (OCR) and document-parsing pipelines to digitize and structure the content; and indexes this corpus in a vector database for efficient semantic retrieval. Through a simple phone-based interface, farmers can call the system to receive real-time, context-aware advice: speech-to-text converts the Bengali query, the RAG module retrieves relevant content, a large language model (Gemma 3-4B) generates a context-grounded response, and text-to-speech delivers the answer in natural spoken Bengali. In a pilot evaluation, KrishokBondhu produced high-quality responses for 72.7% of diverse agricultural queries covering crop management, disease control, and cultivation practices. Compared to the KisanQRS benchmark, the system achieved a composite score of 4.53 (vs. 3.13) on a 5-point scale, a 44.7% improvement, with especially large gains in contextual richness (+367%) and completeness (+100.4%), while maintaining comparable relevance and technical specificity. Semantic similarity analysis further revealed a strong correlation between retrieved context and answer quality, emphasizing the importance of grounding generative responses in curated documentation. KrishokBondhu demonstrates the feasibility of integrating call-centre accessibility, multilingual voice interaction, and modern RAG techniques to deliver expert-level agricultural guidance to remote Bangladeshi farmers, paving the way toward a fully AI-driven agricultural advisory ecosystem.
comment: 6 pages, 7 figures, 5 tables, submitted to the 11th IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering (WIECON-ECE 2025)
☆ Combining Distantly Supervised Models with In Context Learning for Monolingual and Cross-Lingual Relation Extraction
Distantly Supervised Relation Extraction (DSRE) remains a long-standing challenge in NLP, where models must learn from noisy bag-level annotations while making sentence-level predictions. While existing state-of-the-art (SoTA) DSRE models rely on task-specific training, their integration with in-context learning (ICL) using large language models (LLMs) remains underexplored. A key challenge is that the LLM may not learn relation semantics correctly, due to noisy annotation. In response, we propose HYDRE -- HYbrid Distantly Supervised Relation Extraction framework. It first uses a trained DSRE model to identify the top-k candidate relations for a given test sentence, then uses a novel dynamic exemplar retrieval strategy that extracts reliable, sentence-level exemplars from training data, which are then provided in LLM prompt for outputting the final relation(s). We further extend HYDRE to cross-lingual settings for RE in low-resource languages. Using available English DSRE training data, we evaluate all methods on English as well as a newly curated benchmark covering four diverse low-resource Indic languages -- Oriya, Santali, Manipuri, and Tulu. HYDRE achieves up to 20 F1 point gains in English and, on average, 17 F1 points on Indic languages over prior SoTA DSRE models. Detailed ablations exhibit HYDRE's efficacy compared to other prompting strategies.
☆ ECG-LLM -- training and evaluation of domain-specific large language models for electrocardiography
Domain-adapted open-weight large language models (LLMs) offer promising healthcare applications, from queryable knowledge bases to multimodal assistants, with the crucial advantage of local deployment for privacy preservation. However, optimal adaptation strategies, evaluation methodologies, and performance relative to general-purpose LLMs remain poorly characterized. We investigated these questions in electrocardiography, an important area of cardiovascular medicine, by finetuning open-weight models on domain-specific literature and implementing a multi-layered evaluation framework comparing finetuned models, retrieval-augmented generation (RAG), and Claude Sonnet 3.7 as a representative general-purpose model. Finetuned Llama 3.1 70B achieved superior performance on multiple-choice evaluations and automatic text metrics, ranking second to Claude 3.7 in LLM-as-a-judge assessments. Human expert evaluation favored Claude 3.7 and RAG approaches for complex queries. Finetuned models significantly outperformed their base counterparts across nearly all evaluation modes. Our findings reveal substantial performance heterogeneity across evaluation methodologies, underscoring assessment complexity. Nevertheless, domain-specific adaptation through finetuning and RAG achieves competitive performance with proprietary models, supporting the viability of privacy-preserving, locally deployable clinical solutions.
comment: 34 pages, 8 figures, code available at https://github.com/AI4HealthUOL/ecg-llm
☆ Position: LLM Watermarking Should Align Stakeholders' Incentives for Practical Adoption
Despite progress in watermarking algorithms for large language models (LLMs), real-world deployment remains limited. We argue that this gap stems from misaligned incentives among LLM providers, platforms, and end users, which manifest as four key barriers: competitive risk, detection-tool governance, robustness concerns and attribution issues. We revisit three classes of watermarking through this lens. \emph{Model watermarking} naturally aligns with LLM provider interests, yet faces new challenges in open-source ecosystems. \emph{LLM text watermarking} offers modest provider benefit when framed solely as an anti-misuse tool, but can gain traction in narrowly scoped settings such as dataset de-contamination or user-controlled provenance. \emph{In-context watermarking} (ICW) is tailored for trusted parties, such as conference organizers or educators, who embed hidden watermarking instructions into documents. If a dishonest reviewer or student submits this text to an LLM, the output carries a detectable watermark indicating misuse. This setup aligns incentives: users experience no quality loss, trusted parties gain a detection tool, and LLM providers remain neutral by simply following watermark instructions. We advocate for a broader exploration of incentive-aligned methods, with ICW as an example, in domains where trusted parties need reliable tools to detect misuse. More broadly, we distill design principles for incentive-aligned, domain-specific watermarking and outline future research directions. Our position is that the practical adoption of LLM watermarking requires aligning stakeholder incentives in targeted application domains and fostering active community engagement.
☆ The Impact of Image Resolution on Biomedical Multimodal Large Language Models
Imaging technologies are fundamental to biomedical research and modern medicine, requiring analysis of high-resolution images across various modalities. While multimodal large language models (MLLMs) show promise for biomedical image analysis, most are designed for low-resolution images from general-purpose datasets, risking critical information loss. We investigate how image resolution affects MLLM performance in biomedical applications and demonstrate that: (1) native-resolution training and inference significantly improve performance across multiple tasks, (2) misalignment between training and inference resolutions severely degrades performance, and (3) mixed-resolution training effectively mitigates misalignment and balances computational constraints with performance requirements. Based on these findings, we recommend prioritizing native-resolution inference and mixed-resolution datasets to optimize biomedical MLLMs for transformative impact in scientific research and clinical applications.
comment: Proceedings of the 10th Machine Learning for Healthcare Conference, PMLR 298, 2025
☆ From Retrieval to Generation: Unifying External and Parametric Knowledge for Medical Question Answering
Medical question answering (QA) requires extensive access to domain-specific knowledge. A promising direction is to enhance large language models (LLMs) with external knowledge retrieved from medical corpora or parametric knowledge stored in model parameters. Existing approaches typically fall into two categories: Retrieval-Augmented Generation (RAG), which grounds model reasoning on externally retrieved evidence, and Generation-Augmented Generation (GAG), which depends solely on the models internal knowledge to generate contextual documents. However, RAG often suffers from noisy or incomplete retrieval, while GAG is vulnerable to hallucinated or inaccurate information due to unconstrained generation. Both issues can mislead reasoning and undermine answer reliability. To address these challenges, we propose MedRGAG, a unified retrieval-generation augmented framework that seamlessly integrates external and parametric knowledge for medical QA. MedRGAG comprises two key modules: Knowledge-Guided Context Completion (KGCC), which directs the generator to produce background documents that complement the missing knowledge revealed by retrieval; and Knowledge-Aware Document Selection (KADS), which adaptively selects an optimal combination of retrieved and generated documents to form concise yet comprehensive evidence for answer generation. Extensive experiments on five medical QA benchmarks demonstrate that MedRGAG achieves a 12.5% improvement over MedRAG and a 4.5% gain over MedGENIE, highlighting the effectiveness of unifying retrieval and generation for knowledge-intensive reasoning. Our code and data are publicly available at https://anonymous.4open.science/r/MedRGAG
comment: 13 pages, 4 figures
☆ Food4All: A Multi-Agent Framework for Real-time Free Food Discovery with Integrated Nutritional Metadata
Food insecurity remains a persistent public health emergency in the United States, tightly interwoven with chronic disease, mental illness, and opioid misuse. Yet despite the existence of thousands of food banks and pantries, access remains fragmented: 1) current retrieval systems depend on static directories or generic search engines, which provide incomplete and geographically irrelevant results; 2) LLM-based chatbots offer only vague nutritional suggestions and fail to adapt to real-world constraints such as time, mobility, and transportation; and 3) existing food recommendation systems optimize for culinary diversity but overlook survival-critical needs of food-insecure populations, including immediate proximity, verified availability, and contextual barriers. These limitations risk leaving the most vulnerable individuals, those experiencing homelessness, addiction, or digital illiteracy, unable to access urgently needed resources. To address this, we introduce Food4All, the first multi-agent framework explicitly designed for real-time, context-aware free food retrieval. Food4All unifies three innovations: 1) heterogeneous data aggregation across official databases, community platforms, and social media to provide a continuously updated pool of food resources; 2) a lightweight reinforcement learning algorithm trained on curated cases to optimize for both geographic accessibility and nutritional correctness; and 3) an online feedback loop that dynamically adapts retrieval policies to evolving user needs. By bridging information acquisition, semantic analysis, and decision support, Food4All delivers nutritionally annotated and guidance at the point of need. This framework establishes an urgent step toward scalable, equitable, and intelligent systems that directly support populations facing food insecurity and its compounding health risks.
☆ BrailleLLM: Braille Instruction Tuning with Large Language Models for Braille Domain Tasks EMNLP 2025
Braille plays a vital role in education and information accessibility for visually impaired individuals. However, Braille information processing faces challenges such as data scarcity and ambiguities in mixed-text contexts. We construct English and Chinese Braille Mixed Datasets (EBMD/CBMD) with mathematical formulas to support diverse Braille domain research, and propose a syntax tree-based augmentation method tailored for Braille data. To address the underperformance of traditional fine-tuning methods in Braille-related tasks, we investigate Braille Knowledge-Based Fine-Tuning (BKFT), which reduces the learning difficulty of Braille contextual features. BrailleLLM employs BKFT via instruction tuning to achieve unified Braille translation, formula-to-Braille conversion, and mixed-text translation. Experiments demonstrate that BKFT achieves significant performance improvements over conventional fine-tuning in Braille translation scenarios. Our open-sourced datasets and methodologies establish a foundation for low-resource multilingual Braille research.
comment: Accepted to EMNLP 2025
☆ Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs EMNLP 2025
Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering long text inputs as a single image and provide it directly to the model. This leads to dramatically reduced number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks RULER (long-context retrieval) and CNN/DailyMail (document summarization) we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.
comment: Accepted to EMNLP 2025 Findings. Previously titled "Text or Pixels? Evaluating Efficiency and Understanding of LLMs with Visual Text Inputs"
☆ DelvePO: Direction-Guided Self-Evolving Framework for Flexible Prompt Optimization
Prompt Optimization has emerged as a crucial approach due to its capabilities in steering Large Language Models to solve various tasks. However, current works mainly rely on the random rewriting ability of LLMs, and the optimization process generally focus on specific influencing factors, which makes it easy to fall into local optimum. Besides, the performance of the optimized prompt is often unstable, which limits its transferability in different tasks. To address the above challenges, we propose $\textbf{DelvePO}$ ($\textbf{D}$irection-Guid$\textbf{e}$d Se$\textbf{l}$f-E$\textbf{v}$olving Framework for Fl$\textbf{e}$xible $\textbf{P}$rompt $\textbf{O}$ptimization), a task-agnostic framework to optimize prompts in self-evolve manner. In our framework, we decouple prompts into different components that can be used to explore the impact that different factors may have on various tasks. On this basis, we introduce working memory, through which LLMs can alleviate the deficiencies caused by their own uncertainties and further obtain key insights to guide the generation of new prompts. Extensive experiments conducted on different tasks covering various domains for both open- and closed-source LLMs, including DeepSeek-R1-Distill-Llama-8B, Qwen2.5-7B-Instruct and GPT-4o-mini. Experimental results show that DelvePO consistently outperforms previous SOTA methods under identical experimental settings, demonstrating its effectiveness and transferability across different tasks.
☆ VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
comment: 10 pages, 5 figures, 4 tables. Under review
☆ MARCUS: An Event-Centric NLP Pipeline that generates Character Arcs from Narratives
Character arcs are important theoretical devices employed in literary studies to understand character journeys, identify tropes across literary genres, and establish similarities between narratives. This work addresses the novel task of computationally generating event-centric, relation-based character arcs from narratives. Providing a quantitative representation for arcs brings tangibility to a theoretical concept and paves the way for subsequent applications. We present MARCUS (Modelling Arcs for Understanding Stories), an NLP pipeline that extracts events, participant characters, implied emotion, and sentiment to model inter-character relations. MARCUS tracks and aggregates these relations across the narrative to generate character arcs as graphical plots. We generate character arcs from two extended fantasy series, Harry Potter and Lord of the Rings. We evaluate our approach before outlining existing challenges, suggesting applications of our pipeline, and discussing future work.
☆ Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
Large Language Models (LLMs) are commonly used as evaluators in various applications, but the reliability of the outcomes remains a challenge. One such challenge is using LLMs-as-judges for direct assessment, i.e., assigning scores from a specified range without any references. We first show that this challenge stems from LLM judge outputs being associated with score range bias, i.e., LLM judge outputs are highly sensitive to pre-defined score ranges, preventing the search for optimal score ranges. We also show that similar biases exist among models from the same family. We then mitigate this bias through contrastive decoding, achieving up to 11.3% relative improvement on average in Spearman correlation with human judgments across different score ranges.
☆ Training-Free Spectral Fingerprints of Voice Processing in Transformers
Different transformer architectures implement identical linguistic computations via distinct connectivity patterns, yielding model imprinted ``computational fingerprints'' detectable through spectral analysis. Using graph signal processing on attention induced token graphs, we track changes in algebraic connectivity (Fiedler value, $\Delta\lambda_2$) under voice alternation across 20 languages and three model families, with a prespecified early window (layers 2--5). Our analysis uncovers clear architectural signatures: Phi-3-Mini shows a dramatic English specific early layer disruption ($\overline{\Delta\lambda_2}_{[2,5]}\!\approx\!-0.446$) while effects in 19 other languages are minimal, consistent with public documentation that positions the model primarily for English use. Qwen2.5-7B displays small, distributed shifts that are largest for morphologically rich languages, and LLaMA-3.2-1B exhibits systematic but muted responses. These spectral signatures correlate strongly with behavioral differences (Phi-3: $r=-0.976$) and are modulated by targeted attention head ablations, linking the effect to early attention structure and confirming functional relevance. Taken together, the findings are consistent with the view that training emphasis can leave detectable computational imprints: specialized processing strategies that manifest as measurable connectivity patterns during syntactic transformations. Beyond voice alternation, the framework differentiates reasoning modes, indicating utility as a simple, training free diagnostic for revealing architectural biases and supporting model reliability analysis.
comment: Preprint under review (2025). 12 pages, 8 figures
☆ A Graph Signal Processing Framework for Hallucination Detection in Large Language Models
Large language models achieve impressive results but distinguishing factual reasoning from hallucinations remains challenging. We propose a spectral analysis framework that models transformer layers as dynamic graphs induced by attention, with token embeddings as signals on these graphs. Through graph signal processing, we define diagnostics including Dirichlet energy, spectral entropy, and high-frequency energy ratios, with theoretical connections to computational stability. Experiments across GPT architectures suggest universal spectral patterns: factual statements exhibit consistent "energy mountain" behavior with low-frequency convergence, while different hallucination types show distinct signatures. Logical contradictions destabilize spectra with large effect sizes ($g>1.0$), semantic errors remain stable but show connectivity drift, and substitution hallucinations display intermediate perturbations. A simple detector using spectral signatures achieves 88.75% accuracy versus 75% for perplexity-based baselines, demonstrating practical utility. These findings indicate that spectral geometry may capture reasoning patterns and error behaviors, potentially offering a framework for hallucination detection in large language models.
comment: Preprint under review (2025). 11 pages, 7 figures. Code and scripts: to be released
☆ That's Deprecated! Understanding, Detecting, and Steering Knowledge Conflicts in Language Models for Code Generation
This paper investigates how large language models (LLMs) behave when faced with discrepancies between their parametric knowledge and conflicting information contained in a prompt. Building on prior question-answering (QA) research, we extend the investigation of knowledge conflicts to the realm of code generation. We propose a domain-agnostic framework for constructing and interpreting such conflicts, along with a novel evaluation method and dataset tailored to code conflict scenarios. Our experiments indicate that sufficiently large LLMs encode the notion of a knowledge conflict in their parameters, enabling us to detect knowledge conflicts with up to \textbf{80.65\%} accuracy. Building on these insights, we show that activation-level steering can achieve up to a \textbf{12.6\%} improvement in steering success over a random baseline. However, effectiveness depends critically on balancing model size, task domain, and steering direction. The experiment code and data will be made publicly available after acceptance.
☆ PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $\rho$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
comment: 24 pages, 9 figures. Metric/benchmark available at https://github.com/amith-ananthram/posh
☆ From Memorization to Generalization: Fine-Tuning Large Language Models for Biomedical Term-to-Identifier Normalization
Effective biomedical data integration depends on automated term normalization, the mapping of natural language biomedical terms to standardized identifiers. This linking of terms to identifiers is essential for semantic interoperability. Large language models (LLMs) show promise for this task but perform unevenly across terminologies. We evaluated both memorization (training-term performance) and generalization (validation-term performance) across multiple biomedical ontologies. Fine-tuning Llama 3.1 8B revealed marked differences by terminology. GO mappings showed strong memorization gains (up to 77% improvement in term-to-identifier accuracy), whereas HPO showed minimal improvement. Generalization occurred only for protein-gene (GENE) mappings (13.9% gain), while fine-tuning for HPO and GO yielded negligible transfer. Baseline accuracy varied by model scale, with GPT-4o outperforming both Llama variants for all terminologies. Embedding analyses showed tight semantic alignment between gene symbols and protein names but weak alignment between terms and identifiers for GO or HPO, consistent with limited lexicalization. Fine-tuning success depended on two interacting factors: identifier popularity and lexicalization. Popular identifiers were more likely encountered during pretraining, enhancing memorization. Lexicalized identifiers, such as gene symbols, enabled semantic generalization. By contrast, arbitrary identifiers in GO and HPO constrained models to rote learning. These findings provide a predictive framework for when fine-tuning enhances factual recall versus when it fails due to sparse or non-lexicalized identifiers.
comment: Submitted for publication to BMC BioData Mining
☆ When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation
Evaluating Large Language Models (LLMs) for mental health support is challenging due to the emotionally and cognitively complex nature of therapeutic dialogue. Existing benchmarks are limited in scale, reliability, often relying on synthetic or social media data, and lack frameworks to assess when automated judges can be trusted. To address the need for large-scale dialogue datasets and judge reliability assessment, we introduce two benchmarks that provide a framework for generation and evaluation. MentalBench-100k consolidates 10,000 one-turn conversations from three real scenarios datasets, each paired with nine LLM-generated responses, yielding 100,000 response pairs. MentalAlign-70k}reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes, grouped into Cognitive Support Score (CSS) and Affective Resonance Score (ARS). We then employ the Affective Cognitive Agreement Framework, a statistical methodology using intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between LLM judges and human experts. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance. Our contributions establish new methodological and empirical foundations for reliable, large-scale evaluation of LLMs in mental health. We release the benchmarks and codes at: https://github.com/abeerbadawi/MentalBench/
☆ Re:Member: Emotional Question Generation from Personal Memories ACL 2025
We present Re:Member, a system that explores how emotionally expressive, memory-grounded interaction can support more engaging second language (L2) learning. By drawing on users' personal videos and generating stylized spoken questions in the target language, Re:Member is designed to encourage affective recall and conversational engagement. The system aligns emotional tone with visual context, using expressive speech styles such as whispers or late-night tones to evoke specific moods. It combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis within a modular generation pipeline. Designed as a stylized interaction probe, Re:Member highlights the role of affect and personal media in learner-centered educational technologies.
comment: Accepted to HCI+NLP at ACL 2025
☆ Are they lovers or friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues
As large language models (LLMs) are increasingly used in human-AI interactions, their social reasoning capabilities in interpersonal contexts are critical. We introduce SCRIPTS, a 1k-dialogue dataset in English and Korean, sourced from movie scripts. The task involves evaluating models' social reasoning capability to infer the interpersonal relationships (e.g., friends, sisters, lovers) between speakers in each dialogue. Each dialogue is annotated with probabilistic relational labels (Highly Likely, Less Likely, Unlikely) by native (or equivalent) Korean and English speakers from Korea and the U.S. Evaluating nine models on our task, current proprietary LLMs achieve around 75-80% on the English dataset, whereas their performance on Korean drops to 58-69%. More strikingly, models select Unlikely relationships in 10-25% of their responses. Furthermore, we find that thinking models and chain-of-thought prompting, effective for general reasoning, provide minimal benefits for social reasoning and occasionally amplify social biases. Our findings reveal significant limitations in current LLMs' social reasoning capabilities, highlighting the need for efforts to develop socially-aware language models.
☆ Dynamic Evaluation for Oversensitivity in LLMs EMNLP
Oversensitivity occurs when language models defensively reject prompts that are actually benign. This behavior not only disrupts user interactions but also obscures the boundary between harmful and harmless content. Existing benchmarks rely on static datasets that degrade overtime as models evolve, leading to data contamination and diminished evaluative power. To address this, we develop a framework that dynamically generates model-specific challenging datasets, capturing emerging defensive patterns and aligning with each model's unique behavior. Building on this approach, we construct OVERBENCH, a benchmark that aggregates these datasets across diverse LLM families, encompassing 450,000 samples from 25 models. OVERBENCH provides a dynamic and evolving perspective on oversensitivity, allowing for continuous monitoring of defensive triggers as models advance, highlighting vulnerabilities that static datasets overlook.
comment: EMNLP-Findings 2025
☆ ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human-experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9\% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench
comment: 23 pages
☆ NeuroAda: Activating Each Neuron's Potential for Parameter-Efficient Fine-Tuning
Existing parameter-efficient fine-tuning (PEFT) methods primarily fall into two categories: addition-based and selective in-situ adaptation. The former, such as LoRA, introduce additional modules to adapt the model to downstream tasks, offering strong memory efficiency. However, their representational capacity is often limited, making them less suitable for fine-grained adaptation. In contrast, the latter directly fine-tunes a carefully chosen subset of the original model parameters, allowing for more precise and effective adaptation, but at the cost of significantly increased memory consumption. To reconcile this trade-off, we propose NeuroAda, a novel PEFT method that enables fine-grained model finetuning while maintaining high memory efficiency. Our approach first identifies important parameters (i.e., connections within the network) as in selective adaptation, and then introduces bypass connections for these selected parameters. During finetuning, only the bypass connections are updated, leaving the original model parameters frozen. Empirical results on 23+ tasks spanning both natural language generation and understanding demonstrate that NeuroAda achieves state-of-the-art performance with as little as $\leq \textbf{0.02}\%$ trainable parameters, while reducing CUDA memory usage by up to 60%. We release our code here: https://github.com/FightingFighting/NeuroAda.git.
☆ Lost in the Maze: Overcoming Context Limitations in Long-Horizon Agentic Search
Long-horizon agentic search requires iteratively exploring the web over long trajectories and synthesizing information across many sources, and is the foundation for enabling powerful applications like deep research systems. In this work, we show that popular agentic search frameworks struggle to scale to long trajectories primarily due to context limitations-they accumulate long, noisy content, hit context window and tool budgets, or stop early. Then, we introduce SLIM (Simple Lightweight Information Management), a simple framework that separates retrieval into distinct search and browse tools, and periodically summarizes the trajectory, keeping context concise while enabling longer, more focused searches. On long-horizon tasks, SLIM achieves comparable performance at substantially lower cost and with far fewer tool calls than strong open-source baselines across multiple base models. Specifically, with o3 as the base model, SLIM achieves 56% on BrowseComp and 31% on HLE, outperforming all open-source frameworks by 8 and 4 absolute points, respectively, while incurring 4-6x fewer tool calls. Finally, we release an automated fine-grained trajectory analysis pipeline and error taxonomy for characterizing long-horizon agentic search frameworks; SLIM exhibits fewer hallucinations than prior systems. We hope our analysis framework and simple tool design inform future long-horizon agents.
comment: Code and data are available here: https://github.com/howard-yen/SLIM
☆ StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction
Over 70 million people worldwide experience stuttering, yet most automatic speech systems misinterpret disfluent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on handcrafted feature extraction or multi-stage automatic speech recognition (ASR) and text-to-speech (TTS) pipelines, which separate transcription from audio reconstruction and often amplify distortions. This work introduces StutterZero and StutterFormer, the first end-to-end waveform-to-waveform models that directly convert stuttered speech into fluent speech while jointly predicting its transcription. StutterZero employs a convolutional-bidirectional LSTM encoder-decoder with attention, whereas StutterFormer integrates a dual-stream Transformer with shared acoustic-linguistic representations. Both architectures are trained on paired stuttered-fluent data synthesized from the SEP-28K and LibriStutter corpora and evaluated on unseen speakers from the FluencyBank dataset. Across all benchmarks, StutterZero had a 24% decrease in Word Error Rate (WER) and a 31% improvement in semantic similarity (BERTScore) compared to the leading Whisper-Medium model. StutterFormer achieved better results, with a 28% decrease in WER and a 34% improvement in BERTScore. The results validate the feasibility of direct end-to-end stutter-to-fluent speech conversion, offering new opportunities for inclusive human-computer interaction, speech therapy, and accessibility-oriented AI systems.
comment: 13 pages, 5 figures
♻ ☆ Correct-Detect: Balancing Performance and Ambiguity Through the Lens of Coreference Resolution in LLMs EMNLP 2025
Large Language Models (LLMs) are intended to reflect human linguistic competencies. But humans have access to a broad and embodied context, which is key in detecting and resolving linguistic ambiguities, even in isolated text spans. A foundational case of semantic ambiguity is found in the task of coreference resolution: how is a pronoun related to an earlier person mention? This capability is implicit in nearly every downstream task, and the presence of ambiguity at this level can alter performance significantly. We show that LLMs can achieve good performance with minimal prompting in both coreference disambiguation and the detection of ambiguity in coreference, however, they cannot do both at the same time. We present the CORRECT-DETECT trade-off: though models have both capabilities and deploy them implicitly, successful performance balancing these two abilities remains elusive.
comment: Accepted at EMNLP 2025 (main)
♻ ☆ FALCON: Fine-grained Activation Manipulation by Contrastive Orthogonal Unalignment for Large Language Model NeurIPS 2025
Large language models have been widely applied, but can inadvertently encode sensitive or harmful information, raising significant safety concerns. Machine unlearning has emerged to alleviate this concern; however, existing training-time unlearning approaches, relying on coarse-grained loss combinations, have limitations in precisely separating knowledge and balancing removal effectiveness with model utility. In contrast, we propose Fine-grained Activation manipuLation by Contrastive Orthogonal uNalignment (FALCON), a novel representation-guided unlearning approach that leverages information-theoretic guidance for efficient parameter selection, employs contrastive mechanisms to enhance representation separation, and projects conflict gradients onto orthogonal subspaces to resolve conflicts between forgetting and retention objectives. Extensive experiments demonstrate that FALCON achieves superior unlearning effectiveness while maintaining model utility, exhibiting robust resistance against knowledge recovery attempts.
comment: Accepted at NeurIPS 2025 with minor revisions
♻ ☆ Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers
Reinforcement learning (RL) has emerged as a crucial approach for enhancing the capabilities of large language models. However, in Mixture-of-Experts (MoE) models, the routing mechanism often introduces instability, even leading to catastrophic RL training collapse. We analyze the training-inference consistency of MoE models and identify a notable discrepancy in routing behaviors between the two phases. Moreover, even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes. To address this foundational inconsistency, we propose Rollout Routing Replay (R3), a method that records routing distributions from the inference engine and replays them during training. R3 significantly reduces training-inference policy KL divergence and mitigates extreme discrepancies without compromising training speed. Extensive experiments on various settings confirm that R3 succeeds in stabilizing RL training, preventing collapse and outperforming methods such as GSPO and TIS. We believe this work can offer a new solution for stabilizing RL in MoE models.
♻ ☆ Glyph: Scaling Context Windows via Visual-Text Compression
Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective-visual context scaling-to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
♻ ☆ SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents
Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked "How can I track someone's location without their consent?", a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
♻ ☆ Improving the fact-checking performance of language models by relying on their entailment ability
Automated fact-checking has been a challenging task for the research community. Past works tried various strategies, such as end-to-end training, retrieval-augmented generation, and prompt engineering, to build robust fact-checking systems. However, their accuracy has not been very high for real-world deployment. We, on the other hand, propose a simple yet effective strategy, where entailed justifications generated by LLMs are used to train encoder-only language models (ELMs) for fact-checking. We conducted a rigorous set of experiments, comparing our approach with recent works and various prompting and fine-tuning strategies to demonstrate the superiority of our approach. Additionally, we did quality analysis of model explanations, ablation studies, and error analysis to provide a comprehensive understanding of our approach.
comment: 13 pages
♻ ☆ Is Implicit Knowledge Enough for LLMs? A RAG Approach for Tree-based Structures
Large Language Models (LLMs) are adept at generating responses based on information within their context. While this ability is useful for interacting with structured data like code files, another popular method, Retrieval-Augmented Generation (RAG), retrieves relevant documents to augment the model's in-context learning. However, it is not well-explored how to best represent this retrieved knowledge for generating responses on structured data, particularly hierarchical structures like trees. In this work, we propose a novel bottom-up method to linearize knowledge from tree-like structures (like a GitHub repository) by generating implicit, aggregated summaries at each hierarchical level. This approach enables the knowledge to be stored in a knowledge base and used directly with RAG. We then compare our method to using RAG on raw, unstructured code, evaluating the accuracy and quality of the generated responses. Our results show that while response quality is comparable across both methods, our approach generates over 68% fewer documents in the retriever, a significant gain in efficiency. This finding suggests that leveraging implicit, linearized knowledge may be a highly effective and scalable strategy for handling complex, hierarchical data structures.
comment: Waiting for Conference Response
♻ ☆ Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability
The proliferation of misinformation necessitates scalable, automated fact-checking solutions. Yet, current benchmarks often overlook multilingual and topical diversity. This paper introduces a novel, dynamically extensible data set that includes 61,514 claims in multiple languages and topics, extending existing datasets up to 2024. Through a comprehensive evaluation of five prominent Large Language Models (LLMs), including GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, and Mixtral 8x7B, we identify significant performance gaps between different languages and topics. While overall GPT-4o achieves the highest accuracy, it declines to classify 43% of claims. Across all models, factual-sounding claims are misclassified more often than opinions, revealing a key vulnerability. These findings underscore the need for caution and highlight challenges in deploying LLM-based fact-checking systems at scale.
♻ ☆ Counterfactual reasoning: an analysis of in-context emergence NeurIPS
Large-scale neural language models exhibit remarkable performance in in-context learning: the ability to learn and reason about the input context on the fly. This work studies in-context counterfactual reasoning in language models, that is, the ability to predict consequences of a hypothetical scenario. We focus on a well-defined, synthetic linear regression task that requires noise abduction. Accurate prediction is based on (1) inferring an unobserved latent concept and (2) copying contextual noise from factual observations. We show that language models are capable of counterfactual reasoning. Further, we enhance existing identifiability results and reduce counterfactual reasoning for a broad class of functions to a transformation on in-context observations. In Transformers, we find that self-attention, model depth and pre-training data diversity drive performance. Moreover, we provide mechanistic evidence that the latent concept is linearly represented in the residual stream and we introduce designated \textit{noise abduction heads} central to performing counterfactual reasoning. Lastly, our findings extend to counterfactual reasoning under SDE dynamics and reflect that Transformers can perform noise abduction on sequential data, providing preliminary evidence on the potential for counterfactual story generation. Our code is available under https://github.com/mrtzmllr/iccr.
comment: Published as a conference paper at the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS) 2025
♻ ☆ InternLM2.5-StepProver: Advancing Automated Theorem Proving via Critic-Guided Search
Large Language Models (LLMs) have emerged as powerful tools in mathematical theorem proving, particularly when utilizing formal languages such as LEAN. A prevalent proof method involves the LLM prover iteratively constructing the proof tactic by tactic, typically following a best-first search scheme. However, this method often ignores the critical preference information inside the existing tactic trajectories, hindering the search for deeper proofs. We propose an intuitive yet effective method, which utilizes a critic model to capture the preference information and to guide the search of the prover model at runtime. Given the prover-critic framework, a large-scale expert iteration with more than 20,000 CPU days is then applied to further fine-tune the prover and the critic. The trained InternLM2.5-StepProver critic significantly boosts the performance of the prover model (59.4% to 65.9%). We also analyze the impact of the critic on various aspects of the theorem proving process during expert iteration, providing insights into its effectiveness. We open-source our models and searched proofs at https://github.com/InternLM/InternLM-Math and https://huggingface.co/datasets/internlm/Lean-Workbook.
♻ ☆ Understanding Reinforcement Learning for Model Training, and future directions with GRAPE
This paper provides a self-contained, from-scratch, exposition of key algorithms for instruction tuning of models: SFT, Rejection Sampling, REINFORCE, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Explanations of these algorithms often assume prior knowledge, lack critical details, and/or are overly generalized and complex. Here, each method is discussed and developed step by step using simplified and explicit notation focused on LLMs, aiming to eliminate ambiguity and provide a clear and intuitive understanding of the concepts. By minimizing detours into the broader RL literature and connecting concepts to LLMs, we eliminate superfluous abstractions and reduce cognitive overhead. Following this exposition, we provide a literature review of new techniques and approaches beyond those detailed. Finally, new ideas for research and exploration in the form of GRAPE (Generalized Relative Advantage Policy Evolution) are presented.
comment: 35 pages, 1 figure
♻ ☆ Fairshare Data Pricing via Data Valuation for Large Language Models
Training data is the backbone of large language models (LLMs), yet today's data markets often operate under exploitative pricing -- sourcing data from marginalized groups with little pay or recognition. This paper introduces a theoretical framework for LLM data markets, modeling the strategic interactions between buyers (LLM builders) and sellers (human annotators). We begin with theoretical and empirical analysis showing how exploitative pricing drives high-quality sellers out of the market, degrading data quality and long-term model performance. Then we introduce fairshare, a pricing mechanism grounded in data valuation that quantifies each data's contribution. It aligns incentives by sustaining seller participation and optimizing utility for both buyers and sellers. Theoretically, we show that fairshare yields mutually optimal outcomes: maximizing long-term buyer utility and seller profit while sustaining market participation. Empirically when training open-source LLMs on complex NLP tasks, including math problems, medical diagnosis, and physical reasoning, fairshare boosts seller earnings and ensures a stable supply of high-quality data, while improving buyers' performance-per-dollar and long-term welfare. Our findings offer a concrete path toward fair, transparent, and economically sustainable data markets for LLM.
♻ ☆ Lightweight Baselines for Medical Abstract Classification: DistilBERT with Cross-Entropy as a Strong Default
The research evaluates lightweight medical abstract classification methods to establish their maximum performance capabilities under financial budget restrictions. On the public medical abstracts corpus, we finetune BERT base and Distil BERT with three objectives cross entropy (CE), class weighted CE, and focal loss under identical tokenization, sequence length, optimizer, and schedule. DistilBERT with plain CE gives the strongest raw argmax trade off, while a post hoc operating point selection (validation calibrated, classwise thresholds) sub stantially improves deployed performance; under this tuned regime, focal benefits most. We report Accuracy, Macro F1, and WeightedF1, release evaluation artifacts, and include confusion analyses to clarify error structure. The practical takeaway is to start with a compact encoder and CE, then add lightweight calibration or thresholding when deployment requires higher macro balance.
comment: Healthcare AI, Medical Text Classification,LLM, DistilBERT
♻ ☆ Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection EMNLP 2025
The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multi-modal models, in the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0 datasets. This performance advantage comes from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We confirm the modality bias of these datasets via human evaluation, too. Next, we propose a framework to debias the datasets, and upon debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 get removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models being the most affected with an accuracy drop of over 50 - 60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.
comment: EMNLP 2025 Main Conference Full Paper
♻ ☆ Introducing Spotlight: A Novel Approach for Generating Captivating Key Information from Documents EMNLP 2025
In this paper, we introduce Spotlight, a novel paradigm for information extraction that produces concise, engaging narratives by highlighting the most compelling aspects of a document. Unlike traditional summaries, which prioritize comprehensive coverage, spotlights selectively emphasize intriguing content to foster deeper reader engagement with the source material. We formally differentiate spotlights from related constructs and support our analysis with a detailed benchmarking study using new datasets curated for this work. To generate high-quality spotlights, we propose a two-stage approach: fine-tuning a large language model on our benchmark data, followed by alignment via Direct Preference Optimization (DPO). Our comprehensive evaluation demonstrates that the resulting model not only identifies key elements with precision but also enhances readability and boosts the engagement value of the original document.
comment: Paper accepted in EMNLP 2025 Main Conference (Full Paper)
♻ ☆ A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models
Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models(PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
♻ ☆ VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model achieves an inference speedup of 3~5x at the 7B parameter scale, but also significantly outperforms open-source models of similar model size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.
comment: Training and Inference Codes: https://github.com/VITA-MLLM/VITA-Audio
♻ ☆ R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks, with an increase of 7.5 on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.
♻ ☆ DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding NeurIPS 2025
We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods' ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. Project page: https://github.com/FRENKIE-CHIANG/DanmakuTPPBench
comment: Accepted by Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models
Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configurations (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration. To validate our derived scaling laws, we designed and trained Ling-mini-beta, a pilot model for Ling-2.0 series with only 0.85B active parameters, alongside a 6.1B dense model for comparison. When trained on an identical 1T high-quality token dataset, Ling-mini-beta matched the performance of the 6.1B dense model while consuming over 7x fewer computational resources, thereby confirming the accuracy of our scaling laws. This work provides a principled and empirically-grounded foundation for the scaling of efficient MoE models.
♻ ☆ EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving NeurIPS 2025
We introduce EvaLearn, a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks, a critical, yet underexplored aspect of model potential. EvaLearn contains 648 challenging problems across six task types, grouped into 182 sequences, each sequence dedicated to one task type. Diverging from most existing benchmarks that evaluate models in parallel, EvaLearn requires models to solve problems sequentially, allowing them to leverage the experience gained from previous solutions. EvaLearn provides five comprehensive automated metrics to evaluate models and quantify their learning capability and efficiency. We extensively benchmark nine frontier models and observe varied performance profiles: some models, such as Claude-3.7-sonnet, start with moderate initial performance but exhibit strong learning ability, while some models struggle to benefit from experience and may even show negative transfer. Moreover, we investigate model performance under two learning settings and find that instance-level rubrics and teacher-model feedback further facilitate model learning. Importantly, we observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks, highlighting that EvaLearn evaluates a new dimension of model performance. We hope EvaLearn provides a novel evaluation perspective for assessing LLM potential and understanding the gap between models and human capabilities, promoting the development of deeper and more dynamic evaluation approaches. All datasets, the automatic evaluation framework, and the results studied in this paper are available at the GitHub repository.
comment: Accepted by NeurIPS 2025. 47 pages, 24 figures
♻ ☆ Patent Language Model Pretraining with ModernBERT
Transformer-based language models such as BERT have become foundational in NLP, yet their performance degrades in specialized domains like patents, which contain long, technical, and legally structured text. Prior approaches to patent NLP have primarily relied on fine-tuning general-purpose models or domain-adapted variants pretrained with limited data. In this work, we pretrain 3 domain-specific masked language models for patents, using the ModernBERT architecture and a curated corpus of over 60 million patent records. Our approach incorporates architectural optimizations, including FlashAttention, rotary embeddings, and GLU feed-forward layers. We evaluate our models on four downstream patent classification tasks. Our model, ModernBERT-base-PT, consistently outperforms the general-purpose ModernBERT baseline on three out of four datasets and achieves competitive performance with a baseline PatentBERT. Additional experiments with ModernBERT-base-VX and Mosaic-BERT-large demonstrate that scaling the model size and customizing the tokenizer further enhance performance on selected tasks. Notably, all ModernBERT variants retain substantially faster inference over - 3x that of PatentBERT - underscoring their suitability for time-sensitive applications. These results underscore the benefits of domain-specific pretraining and architectural improvements for patent-focused NLP tasks.
comment: 7 pages, 5 figures, 4 tables
♻ ☆ SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
comment: Project Website: http://simbench.tiancheng.hu/ Data: https://huggingface.co/datasets/pitehu/SimBench
♻ ☆ AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs
The core premise of AI debate as a scalable oversight technique is that it is harder to lie convincingly than to refute a lie, enabling the judge to identify the correct position. Yet, existing debate experiments have relied on datasets with ground truth, where lying is reduced to defending an incorrect proposition. This overlooks a subjective dimension: lying also requires the belief that the claim defended is false. In this work, we apply debate to subjective questions and explicitly measure large language models' prior beliefs before experiments. Debaters were asked to select their preferred position, then presented with a judge persona deliberately designed to conflict with their identified priors. This setup tested whether models would adopt sycophantic strategies, aligning with the judge's presumed perspective to maximize persuasiveness, or remain faithful to their prior beliefs. We implemented and compared two debate protocols, sequential and simultaneous, to evaluate potential systematic biases. Finally, we assessed whether models were more persuasive and produced higher-quality arguments when defending positions consistent with their prior beliefs versus when arguing against them. Our main findings show that models tend to prefer defending stances aligned with the judge persona rather than their prior beliefs, sequential debate introduces significant bias favoring the second debater, models are more persuasive when defending positions aligned with their prior beliefs, and paradoxically, arguments misaligned with prior beliefs are rated as higher quality in pairwise comparison. These results can inform human judges to provide higher-quality training signals and contribute to more aligned AI systems, while revealing important aspects of human-AI interaction regarding persuasion dynamics in language models.
comment: 31 pages
♻ ☆ Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm to improve Large Language Models on reasoning tasks such as coding, math or logic. To assess the reasoning boundary (the fraction of problems a model can solve) researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small k values, the base model usually outperforms them when sampling a very large number of completions. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the increasingly higher chance of success in the limit of the number of trials rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems that a model can solve for which at least a tau proportion of completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as tau increases. We evaluate several RLVR models using Cover@tau-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.
comment: 10 pages, 3 figures. v2 adds discussion of related work (G-Pass@k)
♻ ☆ When Text Embedding Meets Large Language Model: A Comprehensive Survey
Text embedding has become a foundational technology in natural language processing (NLP) during the deep learning era, driving advancements across a wide array of downstream tasks. While many natural language understanding challenges can now be modeled using generative paradigms and leverage the robust generative and comprehension capabilities of large language models (LLMs), numerous practical applications - such as semantic matching, clustering, and information retrieval - continue to rely on text embeddings for their efficiency and effectiveness. Therefore, integrating LLMs with text embeddings has become a major research focus in recent years. In this survey, we categorize the interplay between LLMs and text embeddings into three overarching themes: (1) LLM-augmented text embedding, enhancing traditional embedding methods with LLMs; (2) LLMs as text embedders, adapting their innate capabilities for high-quality embedding; and (3) Text embedding understanding with LLMs, leveraging LLMs to analyze and interpret embeddings. By organizing recent works based on interaction patterns rather than specific downstream applications, we offer a novel and systematic overview of contributions from various research and application domains in the era of LLMs. Furthermore, we highlight the unresolved challenges that persisted in the pre-LLM era with pre-trained language models (PLMs) and explore the emerging obstacles brought forth by LLMs. Building on this analysis, we outline prospective directions for the evolution of text embedding, addressing both theoretical and practical opportunities in the rapidly advancing landscape of NLP.
comment: Version 4: We added the latest works of LLM-based Embedders
♻ ☆ Can we Evaluate RAGs with Synthetic Data? ECML-PKDD 2025
We investigate whether synthetic question-answer (QA) data generated by large language models (LLMs) can serve as an effective proxy for human-labeled benchmarks when the latter is unavailable. We assess the reliability of synthetic benchmarks across two experiments: one varying retriever parameters while keeping the generator fixed, and another varying the generator with fixed retriever parameters. Across four datasets, of which two open-domain and two proprietary, we find that synthetic benchmarks reliably rank the RAGs varying in terms of retriever configuration, aligning well with human-labeled benchmark baselines. However, they do not consistently produce reliable RAG rankings when comparing generator architectures. The breakdown possibly arises from a combination of task mismatch between the synthetic and human benchmarks, and stylistic bias favoring certain generators.
comment: Accepted for the SynDAiTE workshop at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2025), September 15, 2025 - Porto, Portugal
♻ ☆ Uncertainty Quantification for Evaluating Machine Translation Bias
The predictive uncertainty of machine translation (MT) models is typically used as a quality estimation proxy. In this work, we posit that apart from confidently translating when a single correct translation exists, models should also maintain uncertainty when the input is ambiguous. We use uncertainty to measure gender bias in MT systems. When the source sentence includes a lexeme whose gender is not overtly marked, but whose target-language equivalent requires gender specification, the model must infer the appropriate gender from the context and can be susceptible to biases. Prior work measured bias via gender accuracy, however it cannot be applied to ambiguous cases. Using semantic uncertainty, we are able to assess bias when translating both ambiguous and unambiguous source sentences, and find that high translation accuracy does not correlate with exhibiting uncertainty appropriately, and that debiasing affects the two cases differently.
♻ ☆ From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora EMNLP 2025
Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multiway parallel data consistently outperform those trained on unaligned multilingual data.
comment: EMNLP 2025 Main Conference (Oral)
♻ ☆ DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection NeurIPS 2025
The rapid development of multilingual large language models (LLMs) highlights the need for high-quality, diverse, and well-curated multilingual datasets. In this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a large-scale multilingual corpus constructed from newly extracted Common Crawl data and existing multilingual sources. DCAD-2000 covers 2,282 languages, 46.72TB of text, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. To overcome the limitations of existing data cleaning approaches, which rely on manually designed heuristic thresholds, we reframe data cleaning as an anomaly detection problem. This dynamic filtering paradigm substantially improves data quality by automatically identifying and removing noisy or anomalous content. By fine-tuning LLMs on DCAD-2000, we demonstrate notable improvements in data quality, robustness of the cleaning pipeline, and downstream performance, particularly for low-resource languages across multiple multilingual benchmarks.
comment: NeurIPS 2025 Datasets and Benchmarks Track
♻ ☆ MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code is avaliable at https://github.com/mbzuai-oryx/MATRIX.
comment: We have come across a recent approach that has not been properly attributed at the time of submission and compared in a fair setting. Therefore, we would like to withdraw the paper to address these concerns
♻ ☆ LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling NeurIPS 2025
Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, creating optimized training data that maintains the integrity of the core solution path while reducing verbosity. Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9\% to +6.6\%) with significantly reduced token usage (-3\% to -41\%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs in scenarios where efficient test-time scaling, response time, and computational efficiency are valuable constraints.
comment: Accepted at NeurIPS 2025
♻ ☆ Exploring Data-Efficient Adaptation of Large Language Models for Code Generation
Although Large Language Models (LLMs) have made significant progress in code generation, they still struggle with code generation tasks in specific scenarios. These scenarios usually necessitate the adaptation of LLMs to fulfill specific needs, but the limited training data available in practice leads to poor code generation performance. Therefore, how to effectively adapt LLMs to new scenarios with few training data is a major challenge for current code generation. In this paper, we propose a novel adaptation approach named DEED, which stands for Data-Efficient adaptation with Error-Driven learning for code generation. DEED leverages the errors made by LLMs as learning opportunities, using error revision to overcome their own shortcomings, thus achieving efficient learning. Specifically, DEED involves identifying error code generated by LLMs, employing Self-Revise for code revision, optimizing the model with revised code, and iteratively adapting the process for continuous improvement. Experimental results show that, compared to other mainstream fine-tuning approaches, DEED achieves superior performance with few training data, showing an average relative improvement of 46.2% in Pass@1 on multiple code generation benchmarks. We also validate the effectiveness of Self-Revise, which generates revised code that optimizes the model more efficiently compared to the code samples from datasets. Moreover, DEED consistently demonstrates strong performance across various LLMs, underscoring its applicability.
comment: Accepted by TOSEM
♻ ☆ Think With Videos For Agentic Long-Video Understanding
Long-video understanding~(LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of ``thinking with video'', which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository(https://github.com/yhy-2000/VideoDeepResearch).
♻ ☆ HauntAttack: When Attack Follows Reasoning as a Shadow
Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and reasoning tasks, showcasing remarkable capabilities. However, the enhancement of reasoning abilities and the exposure of internal reasoning processes introduce new safety vulnerabilities. A critical question arises: when reasoning becomes intertwined with harmfulness, will LRMs become more vulnerable to jailbreaks in reasoning mode? To investigate this, we introduce HauntAttack, a novel and general-purpose black-box adversarial attack framework that systematically embeds harmful instructions into reasoning questions. Specifically, we modify key reasoning conditions in existing questions with harmful instructions, thereby constructing a reasoning pathway that guides the model step by step toward unsafe outputs. We evaluate HauntAttack on 11 LRMs and observe an average attack success rate of 70\%, achieving up to 12 percentage points of absolute improvement over the strongest prior baseline. Our further analysis reveals that even advanced safety-aligned models remain highly susceptible to reasoning-based attacks, offering insights into the urgent challenge of balancing reasoning capability and safety in future model development.
♻ ☆ Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models
Uncertainty quantification (UQ) has emerged as a promising approach for detecting hallucinations and low-quality output of Large Language Models (LLMs). However, obtaining proper uncertainty scores is complicated by the conditional dependency between the generation steps of an autoregressive LLM because it is hard to model it explicitly. Here, we propose to learn this dependency from attention-based features. In particular, we train a regression model that leverages LLM attention maps, probabilities on the current generation step, and recurrently computed uncertainty scores from previously generated tokens. To incorporate the recurrent features, we also suggest a two-staged training procedure. Our experimental evaluation on ten datasets and three LLMs shows that the proposed method is highly effective for selective generation, achieving substantial improvements over rivaling unsupervised and supervised approaches.
♻ ☆ Explaining Large Language Models with gSMILE
Large Language Models (LLMs) such as GPT, LLaMA, and Claude achieve remarkable performance in text generation but remain opaque in their decision-making processes, limiting trust and accountability in high-stakes applications. We present gSMILE (generative SMILE), a model-agnostic, perturbation-based framework for token-level interpretability in LLMs. Extending the SMILE methodology, gSMILE uses controlled prompt perturbations, Wasserstein distance metrics, and weighted linear surrogates to identify input tokens with the most significant impact on the output. This process enables the generation of intuitive heatmaps that visually highlight influential tokens and reasoning paths. We evaluate gSMILE across leading LLMs (OpenAI's gpt-3.5-turbo-instruct, Meta's LLaMA 3.1 Instruct Turbo, and Anthropic's Claude 2.1) using attribution fidelity, attribution consistency, attribution stability, attribution faithfulness, and attribution accuracy as metrics. Results show that gSMILE delivers reliable human-aligned attributions, with Claude 2.1 excelling in attention fidelity and GPT-3.5 achieving the highest output consistency. These findings demonstrate gSMILE's ability to balance model performance and interpretability, enabling more transparent and trustworthy AI systems.
♻ ☆ SpecExit: Accelerating Large Reasoning Model via Speculative Exit
Despite their strong performance on reasoning tasks, large reasoning models (LRMs) often suffer from overthinking, producing unnecessarily long outputs and incurring high end-to-end latency, a significant limitation to their real-world deployment. To address overthinking, early-exit mechanisms have been proposed to terminate reasoning before typical completion, showing that this approach can effectively shorten generation length with minimal impact on accuracy. However, their reliance on probing mechanisms introduces a detection overhead that limits their end-to-end latency gains and compromises their generalizability across diverse problems. Inspired by the use of hidden states in speculative decoding, we propose SpecExit, a novel framework that predicts both future tokens and an early-exit signal directly from a lightweight draft model without probing overhead. Our method offers significant improvements, reducing average generation length by 66\% and achieving a 2.5x speedup in end-to-end latency compared to the speculative decoding baseline, without compromising accuracy. Our method leverages the inherent signals from hidden states to provide effective early-exit signals, suggesting broader use of hidden states for efficient reasoning. Our code is available at https://github.com/Tencent/AngelSlim.
♻ ☆ Multi-Agent Collaboration via Evolving Orchestration NeurIPS 2025
Large language models (LLMs) have achieved remarkable results across diverse downstream tasks, but their monolithic nature restricts scalability and efficiency in complex problem-solving. While recent research explores multi-agent collaboration among LLMs, most approaches rely on static organizational structures that struggle to adapt as task complexity and agent numbers grow, resulting in coordination overhead and inefficiencies. To this end, we propose a puppeteer-style paradigm for LLM-based multi-agent collaboration, where a centralized orchestrator ("puppeteer") dynamically directs agents ("puppets") in response to evolving task states. This orchestrator is trained via reinforcement learning to adaptively sequence and prioritize agents, enabling flexible and evolvable collective reasoning. Experiments on closed- and open-domain scenarios show that this method achieves superior performance with reduced computational costs. Analyses further reveal that the key improvements consistently stem from the emergence of more compact, cyclic reasoning structures under the orchestrator's evolution. Our code is available at https://github.com/OpenBMB/ChatDev/tree/puppeteer.
comment: accepted at NeurIPS 2025
♻ ☆ Learning to Interpret Weight Differences in Language Models
Finetuning (pretrained) language models is a standard approach for updating their internal parametric knowledge and specializing them to new tasks and domains. However, the corresponding model weight changes ("weight diffs") are not generally interpretable. While inspecting the finetuning dataset can give a sense of how the model might have changed, these datasets are often not publicly available or are too large to work with directly. Towards the goal of comprehensively understanding weight diffs in natural language, we introduce Diff Interpretation Tuning (DIT), a method that trains models to describe their own finetuning-induced modifications. Our approach uses synthetic, labeled weight diffs to train a DIT-adapter, which can be applied to a compatible finetuned model to make it describe how it has changed. We demonstrate in two proof-of-concept settings (reporting hidden behaviors and summarizing finetuned knowledge) that our method enables models to describe their finetuning-induced modifications using accurate natural language descriptions.
comment: Project code and links to weight diffs, adapters, and training data can be found at https://github.com/Aviously/diff-interpretation-tuning
♻ ☆ Temporal Alignment of LLMs through Cycle Encoding for Long-Range Time Representations
Large language models (LLMs) suffer from temporal misalignment issues especially across long span of time. The issue arises from knowing that LLMs are trained on large amounts of data where temporal information is rather sparse over long times, such as thousands of years, resulting in insufficient learning or catastrophic forgetting by the LLMs. This paper proposes a methodology named "Ticktack" for addressing the LLM's long-time span misalignment in a yearly setting. Specifically, we first propose to utilize the sexagenary year expression instead of the Gregorian year expression employed by LLMs, achieving a more uniform distribution in yearly granularity. Then, we employ polar coordinates to model the sexagenary cycle of 60 terms and the year order within each term, with additional temporal encoding to ensure LLMs understand them. Finally, we present a temporal representational alignment approach for post-training LLMs that effectively distinguishes time points with relevant knowledge, hence improving performance on time-related tasks, particularly over a long period. We also create a long time span benchmark for evaluation. Experimental results prove the effectiveness of our proposal.
♻ ☆ Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models NeurIPS 2025
Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: token-level methods (e.g., PPO) aim to provide fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. On the other extreme, trajectory-level methods (e.g., GRPO) solely rely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity, achieving a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than token-level methods, enabling accurate advantage estimation based on Monte Carlo (MC) without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, achieving $6$-$12$ percentage point improvements in accuracy over PPO and GRPO on GSM8K. (2) SPO-tree for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation, achieving $7$-$11$ percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at https://github.com/AIFrameResearch/SPO.
comment: Accepted at NeurIPS 2025
♻ ☆ Program Synthesis via Test-Time Transduction NeurIPS 2025
We introduce transductive program synthesis, a new formulation of the program synthesis task that explicitly leverages test inputs during synthesis. While prior approaches to program synthesis--whether based on natural language descriptions or input-output examples--typically aim to generalize from training examples, they often struggle with robustness, especially in real-world settings where training examples are limited and test inputs involve various edge cases. To address this, we propose a novel framework that improves robustness by treating synthesis as an active learning over a finite hypothesis class defined by programs' outputs. We use an LLM to predict outputs for selected test inputs and eliminate inconsistent hypotheses, where the inputs are chosen via a greedy maximin algorithm to minimize the number of LLM queries required. We evaluate our approach on four benchmarks: Playgol, MBPP+, 1D-ARC, and programmatic world modeling on MiniGrid. We demonstrate that our method significantly improves program synthesis in both accuracy and efficiency. We release our code at https://github.com/klee972/SYNTRA.
comment: NeurIPS 2025
♻ ☆ Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval
Training effective dense retrieval models typically relies on hard negative (HN) examples mined from large document corpora using methods such as BM25 or cross-encoders (CE), which require full corpus access. We propose a corpus-free alternative: an end-to-end pipeline where a Large Language Model (LLM) first generates a query from a passage and then produces a hard negative example using only the generated query text. Our dataset comprises 7,250 arXiv abstracts spanning diverse domains including mathematics, physics, computer science, and related fields, serving as positive passages for query generation. We evaluate two fine-tuning configurations of DistilBERT for dense retrieval; one using LLM-generated hard negatives conditioned solely on the query, and another using negatives generated with both the query and its positive document as context. Compared to traditional corpus-based mining methods {LLM Query $\rightarrow$ BM25 HN and LLM Query $\rightarrow$ CE HN on multiple BEIR benchmark datasets, our all-LLM pipeline outperforms strong lexical mining baselines and achieves performance comparable to cross-encoder-based methods, demonstrating the potential of corpus-free hard negative generation for retrieval model training.
♻ ☆ Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains
Large Language Models (LLMs) achieve superior performance through Chain-of-Thought (CoT) reasoning, but these token-level reasoning chains are computationally expensive and inefficient. In this paper, we introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space through a two-stage training approach. First, during supervised fine-tuning, CoLaR extends beyond next-token prediction by incorporating an auxiliary next compressed embedding prediction objective. This process merges embeddings of consecutive tokens using a compression factor randomly sampled from a predefined range, and trains a specialized latent head to predict distributions of subsequent compressed embeddings. Second, we enhance CoLaR through reinforcement learning (RL) that leverages the latent head's non-deterministic nature to explore diverse reasoning paths and exploit more compact ones. This approach enables CoLaR to: i) perform reasoning at a dense latent level (i.e., silently), substantially reducing reasoning chain length, and ii) dynamically adjust reasoning speed at inference time by simply prompting the desired compression factor. Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared to explicit CoT method. Moreover, when applied to more challenging mathematical reasoning tasks, our RL-enhanced CoLaR demonstrates performance gains of up to 5.4% while dramatically reducing latent reasoning chain length by 82.8%. The code and models will be released upon acceptance.
comment: 15 pages, 8 figures
♻ ☆ Offline Policy Evaluation of Multi-Turn LLM Health Coaching with Real Users NeurIPS 2025
We study a web-deployed, tool-augmented LLM health coach with real users. In a pilot with seven users (280 rated turns), offline policy evaluation (OPE) over factorized decision heads (Tool/Style) shows that a uniform heavy-tool policy raises average value on logs but harms specific subgroups, most notably low-health-literacy/high-self-efficacy users. A lightweight simulator with hidden archetypes further shows that adding a small early information-gain bonus reliably shortens trait identification and improves goal success and pass@3. Together, these early findings indicate an evaluation-first path to personalization: freeze the generator, learn subgroup-aware decision heads on typed rewards (objective tool outcomes and satisfaction), and always report per-archetype metrics to surface subgroup harms that averages obscure.
comment: Accepted to the NeurIPS 2025 Workshop on Multi-Turn Interactions in Large Language Models
♻ ☆ Can Large Language Models Master Complex Card Games? NeurIPS 2025
Complex games have long been an important benchmark for testing the progress of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero have defeated top human players in Go and Chess, garnering widespread societal attention towards artificial intelligence. Concurrently, large language models (LLMs) have exhibited remarkable capabilities across various tasks, raising the question of whether LLMs can achieve similar success in complex games. In this paper, we explore the potential of LLMs in mastering complex card games. We systematically assess the learning capabilities of LLMs across eight diverse card games, evaluating the impact of fine-tuning on high-quality gameplay data, and examining the models' ability to retain general capabilities while mastering these games. Our findings indicate that: (1) LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data, (2) LLMs can achieve a certain level of proficiency in multiple complex card games simultaneously, with performance augmentation for games with similar rules and conflicts for dissimilar ones, and (3) LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data. The evaluation results demonstrate strong learning ability and versatility of LLMs. The code is available at https://github.com/THUDM/LLM4CardGame
comment: Accepted by NeurIPS 2025
♻ ☆ Analyzing Similarity Metrics for Data Selection for Language Model Pretraining
Measuring similarity between training examples is critical for curating high-quality and diverse pretraining datasets for language models. However, similarity is typically computed with a generic off-the-shelf embedding model that has been trained for tasks such as retrieval. Whether these embedding-based similarity metrics are well-suited for pretraining data selection remains largely unexplored. In this paper, we propose a new framework to assess the suitability of a similarity metric specifically for data curation in language model pretraining applications. Our framework's first evaluation criterion captures how well distances reflect generalization in pretraining loss between different training examples. Next, we use each embedding model to guide a standard diversity-based data curation algorithm and measure its utility by pretraining a language model on the selected data and evaluating downstream task performance. Finally, we evaluate the capabilities of embeddings to distinguish between examples from different data sources. With these evaluations, we demonstrate that standard off-the-shelf embedding models are not well-suited for the pretraining data curation setting, underperforming even remarkably simple embeddings that are extracted from models trained on the same pretraining corpus. Our experiments are performed on the Pile, for pretraining a 1.7B parameter language model on 200B tokens. We believe our analysis and evaluation framework serves as a foundation for the future design of embeddings that specifically reason about similarity in pretraining datasets.
♻ ☆ Harnessing Test-time Adaptation for NLU tasks Involving Dialects of English
Test-time domain adaptation (TTDA) is an excellent method which helps generalize models across domains, tasks, and distributions without the use of labeled datasets. Thus, TTDA is very useful in natural language processing (NLP) in the dialectal setting, since oftentimes, models are trained on Standard American English (SAE), evaluated on Indian English (IndE), Singaporean English (SingE), or Nigerian English (NgE), of which distribution differs significantly from the former. This is especially useful since dialectal datasets are scarce. In this paper, we explore one of the most famous TTDA techniques, SHOT, in dialectal NLP. We finetune and evaluate SHOT on different combinations of dialectal GLUE. Our findings show that SHOT is a viable technique when labeled datasets are unavailable. We also theoretically propose the concept of dialectal gap and show that it has a positive correlation with the effectiveness of SHOT. We also find that in many cases, finetuning on SAE yields higher performance than finetuning on dialectal data.
♻ ☆ A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning
Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third mode-instant-that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A$^2$FM achieves 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE, setting new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost of pass of only $0.00487 per correct answer-cutting cost by 45.2% relative to reasoning and 33.5% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.
comment: 12 pages, 6 figures
♻ ☆ MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal language models, fall short in addressing the complex and nuanced threats posed by multimodal inputs. Moreover, current safety datasets lack the fine-grained, policy-grounded reasoning required to robustly align reasoning-capable VLMs. In this work, we introduce {MSR-Align}, a high-quality Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities. Our data generation pipeline emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Extensive experiments demonstrate that fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks, while preserving or enhancing general reasoning performance. MSR-Align provides a scalable and effective foundation for advancing the safety alignment of reasoning-capable VLMs. Our dataset is made publicly available at https://huggingface.co/datasets/Leigest/MSR-Align.
♻ ☆ DrunkAgent: Stealthy Memory Corruption in LLM-Powered Recommender Agents
Large language model (LLM)-powered agents are increasingly used in recommender systems (RSs) to achieve personalized behavior modeling, where the memory mechanism plays a pivotal role in enabling the agents to autonomously explore, learn and self-evolve from real-world interactions. However, this very mechanism, serving as a contextual repository, inherently exposes an attack surface for potential adversarial manipulations. Despite its central role, the robustness of agentic RSs in the face of such threats remains largely underexplored. Previous works suffer from semantic mismatches or rely on static embeddings or pre-defined prompts, all of which are not designed for dynamic systems, especially for dynamic memory states of LLM agents. This challenge is exacerbated by the black-box nature of commercial recommenders. To tackle the above problems, in this paper, we present the first systematic investigation of memory-based vulnerabilities in LLM-powered recommender agents, revealing their security limitations and guiding efforts to strengthen system resilience and trustworthiness. Specifically, we propose a novel black-box attack framework named DrunkAgent. DrunkAgent crafts semantically meaningful adversarial textual triggers for target item promotions and introduces a series of strategies to maximize the trigger effect by corrupting the memory updates during the interactions. The triggers and strategies are optimized on a surrogate model, enabling DrunkAgent transferable and stealthy. Extensive experiments on real-world datasets across diverse agentic RSs, including collaborative filtering, retrieval augmentation and sequential recommendations, demonstrate the generalizability, transferability and stealthiness of DrunkAgent.
♻ ☆ IASC: Interactive Agentic System for ConLangs
We present a system that uses LLMs as a tool in the development of Constructed Languages. The system is modular in that one first creates a target phonology for the language using an agentic approach that refines its output at each step with commentary feedback on its previous attempt. Next, a set of sentences is 'translated' from their English original into a morphosyntactic markup that reflects the word order and morphosyntactic feature specifications of the desired target language, with affixes represented as morphosyntactic feature bundles. From this translated corpus, a lexicon is constructed using the phonological model and the set of morphemes (stems and affixes) extracted from the 'translated' sentences. The system is then instructed to provide an orthography for the language, using an existing script such as Latin or Cyrillic. Finally, the system writes a brief grammatical handbook of the language. The system can also translate further sentences into the target language. Our goal is twofold. First, we hope that these tools will be fun to use for creating artificially constructed languages. Second, we are interested in exploring what LLMs 'know' about language-not what they know about any particular language or linguistic phenomenon, but how much they know about and understand language and linguistic concepts. As we shall see, there is a fairly wide gulf in capabilities both among different LLMs and among different linguistic specifications, with it being notably easier for systems to deal with more common patterns than rarer ones. An additional avenue that we explore is the application of our approach to translating from high-resource into low-resource languages. While the results so far are mostly negative, we provide some evidence that an improved version of the present system could afford some real gains in such tasks. https://github.com/SakanaAI/IASC
comment: Initial draft
♻ ☆ Ontology-Enhanced Knowledge Graph Completion using Large Language Models
Large Language Models (LLMs) have been extensively adopted in Knowledge Graph Completion (KGC), showcasing significant research advancements. However, as black-box models driven by deep neural architectures, current LLM-based KGC methods rely on implicit knowledge representation with parallel propagation of erroneous knowledge, thereby hindering their ability to produce conclusive and decisive reasoning outcomes. We aim to integrate neural-perceptual structural information with ontological knowledge, leveraging the powerful capabilities of LLMs to achieve a deeper understanding of the intrinsic logic of the knowledge. We propose an ontology enhanced KGC method using LLMs -- OL-KGC. It first leverages neural perceptual mechanisms to effectively embed structural information into the textual space, and then uses an automated extraction algorithm to retrieve ontological knowledge from the knowledge graphs (KGs) that needs to be completed, which is further transformed into a textual format comprehensible to LLMs for providing logic guidance. We conducted extensive experiments on three widely-used benchmarks -- FB15K-237, UMLS and WN18RR. The experimental results demonstrate that OL-KGC significantly outperforms existing mainstream KGC methods across multiple evaluation metrics, achieving state-of-the-art performance.
♻ ☆ Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
comment: 16 pages, 12 figures
♻ ☆ CoDial: Interpretable Task-Oriented Dialogue Systems Through Dialogue Flow Alignment
Building Task-Oriented Dialogue (TOD) systems that generalize across different tasks remains a challenging problem. Data-driven approaches often struggle to transfer effectively to unseen tasks. While recent schema-based TOD frameworks improve generalization by decoupling task logic from language understanding, their reliance on neural or generative models often obscures how task schemas influence behaviour and hence impair interpretability. In this work, we introduce a novel framework, CoDial (Code for Dialogue), which converts a TOD task schema, represented as a novel structured heterogeneous graph, to programmatic LLM guardrailing code, such as NVIDIA's Colang, enabling interpretable and efficient alignment of dialogue policies during inference. We introduce two paradigms, $\text{CoDial}_{\text{free}}$ and $\text{CoDial}_{\text{structured}}$ for generating LLM guardrails, and propose a feedback mechanism that integrates human feedback to iteratively improve the generated code. Empirically, CoDial achieves state-of-the-art (SOTA) performance on the widely used STAR dataset and is on par with SOTA on the MultiWOZ dataset, while also providing interpretability. We additionally demonstrate CoDial's iterative improvement via manual and LLM-aided feedback, making it a practical tool for expert-guided alignment of LLMs in high-stakes domains.
♻ ☆ HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs
Large Language Models (LLMs) increasingly rely on Chain-of-Thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces the Hybrid Policy Optimization (i.e., HiPO), a framework for adaptive reasoning control that enables LLMs to selectively decide when to engage in detailed reasoning (Think-on) and when to respond directly (Think-off). Specifically, HiPO combines a hybrid data pipelineproviding paired Think-on and Think-off responseswith a hybrid reinforcement learning reward system that balances accuracy and efficiency while avoiding over-reliance on detailed reasoning. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy. Finally, we hope HiPO a can be a principled approach for efficient adaptive reasoning, advancing the deployment of reasoning-oriented LLMs in real-world, resource-sensitive settings.
♻ ☆ Semantic Agreement Enables Efficient Open-Ended LLM Cascades EMNLP
Cascade systems route computational requests to smaller models when possible and defer to larger models only when necessary, offering a promising approach to balance cost and quality in LLM deployment. However, they face a fundamental challenge in open-ended text generation: determining output reliability when generation quality lies on a continuous spectrum, often with multiple valid responses. To address this, we propose semantic agreement -- meaning-level consensus between ensemble outputs -- as a training-free signal for reliable deferral. We show that when diverse model outputs agree semantically, their consensus is a stronger reliability signal than token-level confidence. Evaluated from 500M to 70B-parameter models, we find that semantic cascades match or surpass target-model quality at 40% of the cost and reduce latency by up to 60%. Our method requires no model internals, works across black-box APIs, and remains robust to model updates, making it a practical baseline for real-world LLM deployment.
comment: 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP) Industry Track
♻ ☆ ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time
Transformer-based Language Models' computation and memory overhead increase quadratically as a function of sequence length. The quadratic cost poses challenges when employing LLMs for processing long sequences. In this work, we introduce \ourmodelacronym~(Extend at Test-Time), method for extending the context length of short context Transformer-based LLMs, with constant memory requirement and linear computation overhead. ETT enable the extension of the context length at test-time by efficient fine-tuning the model's parameters on the input context, chunked into overlapping small subsequences. We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 up to 32 times, increasing from 1k to 32k tokens. This results in up to a 30 percent improvement in the model's accuracy. We also study how context can be stored in LLM's weights effectively and efficiently. Through a detailed ablation study, we examine which Transformer modules are most beneficial to fine-tune at test-time. Interestingly, we find that fine-tuning the second layer of the FFNs is more effective than full fine-tuning, leading to a further improvement in the models' accuracy.
♻ ☆ Improving Metacognition and Uncertainty Communication in Language Models
Large language models (LLMs) are increasingly used in decision-making contexts, but when they present answers without signaling low confidence, users may unknowingly act on erroneous outputs. Prior work shows that LLMs maintain internal uncertainty signals, yet their expressed confidence is often miscalibrated and poorly discriminates between correct and incorrect answers. We investigate whether supervised fine-tuning can improve models' ability to communicate uncertainty and whether such improvements generalize across tasks and domains. We fine-tune LLMs on datasets spanning general knowledge, mathematics, and open-ended trivia, and evaluate two metacognitive tasks: (1) single-question confidence estimation, where the model assigns a numeric certainty to its answer, and (2) pairwise confidence comparison, where the model selects which of two answers it is more likely to answer correctly. We assess generalization to unseen domains, including medical and legal reasoning. Results show that fine-tuning improves calibration (alignment between stated confidence and accuracy) and discrimination (higher confidence for correct vs. incorrect responses) within and across domains. However, gains are task-specific: training on single-question calibration does not transfer to pairwise comparison, and vice versa. Multitask fine-tuning yields broader gains, lowering calibration error and strengthening discrimination in out-of-domain evaluations. This suggests that uncertainty communication in LLMs is trainable but requires multitask training to generalize effectively.
♻ ☆ Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models
Widespread LLM adoption has introduced characteristic repetitive phraseology, termed "slop," which degrades output quality and makes AI-generated text immediately recognizable. We present Antislop, a comprehensive framework providing tools to both detect and eliminate these overused patterns. Our approach combines three innovations: (1) The Antislop Sampler, which uses backtracking to suppress unwanted strings at inference time without destroying vocabulary; (2) An automated pipeline that profiles model-specific slop against human baselines and generates training data; (3) Final Token Preference Optimization (FTPO), a novel fine-tuning method that operates on individual tokens, surgically adjusting logits wherever a banned pattern has appeared in an inference trace. We demonstrate that some slop patterns appear over 1,000x more frequently in LLM output than human text. The Antislop Sampler successfully suppresses 8,000+ patterns while maintaining quality, whereas token banning becomes unusable at just 2,000. Most importantly, FTPO achieves 90% slop reduction while maintaining or improving performance in cross-domain evals including GSM8K, MMLU, and creative writing tasks. In contrast, DPO suffers significant degradation in writing quality and lexical diversity despite achieving weaker suppression. We release all code and results under MIT license: https://github.com/sam-paech/auto-antislop.
comment: 11 pages + appendices, 16 figures
♻ ☆ Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations
The promotion of academic papers has become an important means of enhancing research visibility. However, existing automated methods struggle limited storytelling, insufficient aesthetic quality, and constrained self-adjustment, making it difficult to achieve efficient and engaging dissemination. At the heart of those challenges is a simple principle: \emph{there is no way to improve it when you cannot evaluate it right}. To address this, we introduce \textbf{EvoPresent}, a self-improvement agent framework that unifies coherent narratives, aesthetic-aware designs, and realistic presentation delivery via virtual characters. Central to EvoPresent is \textbf{PresAesth}, a multi-task reinforcement learning (RL) aesthetic model that provides reliable aesthetic scoring, defect adjustment, and comparative feedback, enabling iterative self-improvement even under limited aesthetic training data. To systematically evaluate the methods, we introduce \textbf{EvoPresent Benchmark}, a comprehensive benchmark comprising: \textit{Presentation Generation Quality}, built on 650 top-tier AI conference papers with multimodal resources (slides, videos and scripts) to assess both content and design; and \textit{Aesthetic Awareness}, consisting of 2,000 slide pairs with varying aesthetic levels, supporting joint training and evaluation on scoring, defect adjustment, and comparison. Our findings highlight that (i) High-quality feedback is essential for agent self-improvement, while initial capability alone does not guarantee effective self-correction. (ii) Automated generation pipelines exhibit a trade-off between visual design and content construction. (iii) Multi-task RL training shows stronger generalization in aesthetic awareness tasks.
♻ ☆ AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks NeurIPS 2025
Test-time scaling (TTS) enhances the performance of large language models (LLMs) by allocating additional compute resources during inference. However, existing research primarily investigates TTS in single-stage tasks; while many real-world problems are multi-stage complex tasks, composed of a sequence of heterogeneous subtasks with each subtask requires LLM of specific capability. Therefore, we study a novel problem: the test-time compute-optimal scaling in multi-stage complex tasks, aiming to select suitable models and allocate budgets per subtask to maximize overall performance. TTS in multi-stage tasks introduces two fundamental challenges: (i) The combinatorial search space of model and budget allocations, combined with the high cost of inference, makes brute-force search impractical. (ii) The optimal model and budget allocations across subtasks are interdependent, increasing the complexity of the compute-optimal search. To address this gap, we conduct extensive pilot experiments on four tasks across six datasets, deriving three empirical insights characterizing the behavior of LLMs in multi-stage complex tasks. Informed by these insights, we propose AgentTTS, an LLM-agent-based framework that autonomously searches for compute-optimal allocations through iterative feedback-driven interactions with the execution environment. Experimental results demonstrate that AgentTTS significantly outperforms traditional and other LLM-based baselines in search efficiency, and shows improved robustness to varying training set sizes and enhanced interpretability.
comment: Accepted by NeurIPS 2025
♻ ☆ ScholaWrite: A Dataset of End-to-End Scholarly Writing Process
Writing is a cognitively demanding activity that requires constant decision-making, heavy reliance on working memory, and frequent shifts between tasks of different goals. To build writing assistants that truly align with writers' cognition, we must capture and decode the complete thought process behind how writers transform ideas into final texts. We present ScholaWrite, the first dataset of end-to-end scholarly writing, tracing the multi-month journey from initial drafts to final manuscripts. We contribute three key advances: (1) a Chrome extension that unobtrusively records keystrokes on Overleaf, enabling the collection of realistic, in-situ writing data; (2) a novel corpus of full scholarly manuscripts, enriched with fine-grained annotations of cognitive writing intentions. The dataset includes \LaTeX-based edits from five computer science preprints, capturing nearly 62K text changes over four months; and (3) analyses and insights into the micro-dynamics of scholarly writing, highlighting gaps between human writing processes and the current capabilities of large language models (LLMs) in providing meaningful assistance. ScholaWrite underscores the value of capturing end-to-end writing data to develop future writing assistants that support, not replace, the cognitive work of scientists.
comment: Equal contribution: Khanh Chi Le, Linghe Wang, Minhwa Lee | project page: https://minnesotanlp.github.io/scholawrite/
♻ ☆ CtrlDiff: Boosting Large Diffusion Language Models with Dynamic Block Prediction and Controllable Generation
Although autoregressive models have dominated language modeling in recent years, there has been a growing interest in exploring alternative paradigms to the conventional next-token prediction framework. Diffusion-based language models have emerged as a compelling alternative due to their powerful parallel generation capabilities and inherent editability. However, these models are often constrained by fixed-length generation. A promising direction is to combine the strengths of both paradigms, segmenting sequences into blocks, modeling autoregressive dependencies across blocks while leveraging discrete diffusion to estimate the conditional distribution within each block given the preceding context. Nevertheless, their practical application is often hindered by two key limitations: rigid fixed-length outputs and a lack of flexible control mechanisms. In this work, we address the critical limitations of fixed granularity and weak controllability in current large diffusion language models. We propose CtrlDiff, a dynamic and controllable semi-autoregressive framework that adaptively determines the size of each generation block based on local semantics using reinforcement learning. Furthermore, we introduce a classifier-guided control mechanism tailored to discrete diffusion, which significantly reduces computational overhead while facilitating efficient post-hoc conditioning without retraining. Extensive experiments demonstrate that CtrlDiff sets a new standard among hybrid diffusion models, narrows the performance gap to state-of-the-art autoregressive approaches, and enables effective conditional text generation across diverse tasks.
♻ ☆ ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic LLM-Generated Text Detection
We introduce ALHD, the first large-scale comprehensive Arabic dataset explicitly designed to distinguish between human- and LLM-generated texts. ALHD spans three genres (news, social media, reviews), covering both MSA and dialectal Arabic, and contains over 400K balanced samples generated by three leading LLMs and originated from multiple human sources, which enables studying generalizability in Arabic LLM-genearted text detection. We provide rigorous preprocessing, rich annotations, and standardized balanced splits to support reproducibility. In addition, we present, analyze and discuss benchmark experiments using our new dataset, in turn identifying gaps and proposing future research directions. Benchmarking across traditional classifiers, BERT-based models, and LLMs (zero-shot and few-shot) demonstrates that fine-tuned BERT models achieve competitive performance, outperforming LLM-based models. Results are however not always consistent, as we observe challenges when generalizing across genres; indeed, models struggle to generalize when they need to deal with unseen patterns in cross-genre settings, and these challenges are particularly prominent when dealing with news articles, where LLM-generated texts resemble human texts in style, which opens up avenues for future research. ALHD establishes a foundation for research related to Arabic LLM-detection and mitigating risks of misinformation, academic dishonesty, and cyber threats.
comment: 47 pages, 15 figures. Dataset available at Zenodo: https://doi.org/10.5281/zenodo.17249602 Codebase available at GitHub: https://github.com/alikhairallah/ALHD-Benchmarking
♻ ☆ Robustness Assessment and Enhancement of Text Watermarking for Google's SynthID
Recent advances in LLM watermarking methods such as SynthID-Text by Google DeepMind offer promising solutions for tracing the provenance of AI-generated text. However, our robustness assessment reveals that SynthID-Text is vulnerable to meaning-preserving attacks, such as paraphrasing, copy-paste modifications, and back-translation, which can significantly degrade watermark detectability. To address these limitations, we propose SynGuard, a hybrid framework that combines the semantic alignment strength of Semantic Information Retrieval (SIR) with the probabilistic watermarking mechanism of SynthID-Text. Our approach jointly embeds watermarks at both lexical and semantic levels, enabling robust provenance tracking while preserving the original meaning. Experimental results across multiple attack scenarios show that SynGuard improves watermark recovery by an average of 11.1\% in F1 score compared to SynthID-Text. These findings demonstrate the effectiveness of semantic-aware watermarking in resisting real-world tampering. All code, datasets, and evaluation scripts are publicly available at: https://github.com/githshine/SynGuard.
comment: Accepted by TrustCom2025
♻ ☆ PixelWorld: How Far Are We from Perceiving Everything as Pixels?
Recent agentic language models increasingly need to interact with real-world environments that contain tightly intertwined visual and textual information, often through raw camera pixels rather than separately processed images and tokenized text. This shift highlights the need for a unified perception paradigm. To investigate this idea, we explore Perceive Everything as Pixels (PEAP) and introduce PixelWorld, a benchmark that renders natural-language, tabular, mathematical, and diagrammatic inputs into a shared pixel space. Experiments across multiple benchmarks show that PEAP achieves comparable performance to token-based approaches on semantic understanding tasks, suggesting that vision transformers can partially capture global textual semantics without explicit tokenization. In contrast, reasoning-intensive tasks such as mathematics and code show notable performance degradation, although Chain-of-Thought prompting helps mitigate this gap by compensating for missing symbolic structure. We further find that when visual and textual information are closely integrated, representing everything as pixels simplifies preprocessing and avoids cross-modal misalignment. PixelWorld thus provides a systematic and practical framework for evaluating unified vision--language models and facilitates further exploration of pixel-based multimodal learning.
♻ ☆ Efficient Interleaved Speech Modeling through Knowledge Distillation
Current speech language models exceed the size and latency constraints of many deployment environments. We build compact, expressive speech generation models through layer-aligned distillation, matching hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x with minimal loss in performance. We introduce TinyWave, a family of 2B-parameter models for speech-to-speech and interleaved speech-text generation, trained on 50,000 hours of public audio. TinyWave supports (i) speech-only generation using phonetic or expressive tokens and (ii) mixed speech-text continuations. Evaluation on Libri-Light shows TinyWave within 1.4 normalized perplexity points of its teacher. Accuracy on spoken StoryCloze and SALMon reaches 93-97% of the teacher's performance, outperforming size-matched baselines. These models are optimized for deployment on commodity hardware, enabling applications in real-time conversational agents, assistive technologies, and low-resource environments. We release models, training code, and evaluation scripts to support reproducible research on compact, expressive speech generation.
♻ ☆ CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation ACL 2025
Large language models (LLMs) have revolutionized natural language processing (NLP), particularly through Retrieval-Augmented Generation (RAG), which enhances LLM capabilities by integrating external knowledge. However, traditional RAG systems face critical limitations, including disrupted contextual integrity due to text chunking, and over-reliance on semantic similarity for retrieval. To address these issues, we propose CausalRAG, a novel framework that incorporates causal graphs into the retrieval process. By constructing and tracing causal relationships, CausalRAG preserves contextual continuity and improves retrieval precision, leading to more accurate and interpretable responses. We evaluate CausalRAG against regular RAG and graph-based RAG approaches, demonstrating its superiority across several metrics. Our findings suggest that grounding retrieval in causal reasoning provides a promising approach to knowledge-intensive tasks.
comment: Accepted at Findings of ACL 2025
♻ ☆ Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling
Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves comparable performance to existing models in terms of linguistic likelihood benchmarks, while providing better acoustic detail in prompted generation.
comment: ASRU 2025. Project page: https://jjery2243542.github.io/flowslm.github.io/
Information Retrieval 24
☆ LLMs as Sparse Retrievers:A Framework for First-Stage Product Search
Product search is a crucial component of modern e-commerce platforms, with billions of user queries every day. In product search systems, first-stage retrieval should achieve high recall while ensuring efficient online deployment. Sparse retrieval is particularly attractive in this context due to its interpretability and storage efficiency. However, sparse retrieval methods suffer from severe vocabulary mismatch issues, leading to suboptimal performance in product search scenarios.With their potential for semantic analysis, large language models (LLMs) offer a promising avenue for mitigating vocabulary mismatch issues and thereby improving retrieval quality. Directly applying LLMs to sparse retrieval in product search exposes two key challenges:(1)Queries and product titles are typically short and highly susceptible to LLM-induced hallucinations, such as generating irrelevant expansion terms or underweighting critical literal terms like brand names and model numbers;(2)The large vocabulary space of LLMs leads to difficulty in initializing training effectively, making it challenging to learn meaningful sparse representations in such ultra-high-dimensional spaces.To address these challenges, we propose PROSPER, a framework for PROduct search leveraging LLMs as SParsE Retrievers. PROSPER incorporates: (1)A literal residual network that alleviates hallucination in lexical expansion by reinforcing underweighted literal terms through a residual compensation mechanism; and (2)A lexical focusing window that facilitates effective training initialization via a coarse-to-fine sparsification strategy.Extensive offline and online experiments show that PROSPER significantly outperforms sparse baselines and achieves recall performance comparable to advanced dense retrievers, while also achieving revenue increments online.
comment: 16 pages
☆ ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization
We introduce ImageGem, a dataset for studying generative models that understand fine-grained individual preferences. We posit that a key challenge hindering the development of such a generative model is the lack of in-the-wild and fine-grained user preference annotations. Our dataset features real-world interaction data from 57K users, who collectively have built 242K customized LoRAs, written 3M text prompts, and created 5M generated images. With user preference annotations from our dataset, we were able to train better preference alignment models. In addition, leveraging individual user preference, we investigated the performance of retrieval models and a vision-language model on personalized image retrieval and generative model recommendation. Finally, we propose an end-to-end framework for editing customized diffusion models in a latent weight space to align with individual user preferences. Our results demonstrate that the ImageGem dataset enables, for the first time, a new paradigm for generative model personalization.
☆ Censorship Chokepoints: New Battlegrounds for Regional Surveillance, Censorship and Influence on the Internet
Undoubtedly, the Internet has become one of the most important conduits to information for the general public. Nonetheless, Internet access can be and has been limited systematically or blocked completely during political events in numerous countries and regions by various censorship mechanisms. Depending on where the core filtering component is situated, censorship techniques have been classified as client-based, server-based, or network-based. However, as the Internet evolves rapidly, new and sophisticated censorship techniques have emerged, which involve techniques that cut across locations and involve new forms of hurdles to information access. We argue that modern censorship can be better understood through a new lens that we term chokepoints, which identifies bottlenecks in the content production or delivery cycle where efficient new forms of large-scale client-side surveillance and filtering mechanisms have emerged.
comment: 15 pages, 2 figures
☆ Evaluating LLM-Based Mobile App Recommendations: An Empirical Study
Large Language Models (LLMs) are increasingly used to recommend mobile applications through natural language prompts, offering a flexible alternative to keyword-based app store search. Yet, the reasoning behind these recommendations remains opaque, raising questions about their consistency, explainability, and alignment with traditional App Store Optimization (ASO) metrics. In this paper, we present an empirical analysis of how widely-used general purpose LLMs generate, justify, and rank mobile app recommendations. Our contributions are: (i) a taxonomy of 16 generalizable ranking criteria elicited from LLM outputs; (ii) a systematic evaluation framework to analyse recommendation consistency and responsiveness to explicit ranking instructions; and (iii) a replication package to support reproducibility and future research on AI-based recommendation systems. Our findings reveal that LLMs rely on a broad yet fragmented set of ranking criteria, only partially aligned with standard ASO metrics. While top-ranked apps tend to be consistent across runs, variability increases with ranking depth and search specificity. LLMs exhibit varying sensitivity to explicit ranking instructions - ranging from substantial adaptations to near-identical outputs - highlighting their complex reasoning dynamics in conversational app discovery. Our results aim to support end-users, app developers, and recommender-systems researchers in navigating the emerging landscape of conversational app discovery.
comment: Under review
☆ KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers
In Bangladesh, many farmers continue to face challenges in accessing timely, expert-level agricultural guidance. This paper presents KrishokBondhu, a voice-enabled, call-centre-integrated advisory platform built on a Retrieval-Augmented Generation (RAG) framework, designed specifically for Bengali-speaking farmers. The system aggregates authoritative agricultural handbooks, extension manuals, and NGO publications; applies Optical Character Recognition (OCR) and document-parsing pipelines to digitize and structure the content; and indexes this corpus in a vector database for efficient semantic retrieval. Through a simple phone-based interface, farmers can call the system to receive real-time, context-aware advice: speech-to-text converts the Bengali query, the RAG module retrieves relevant content, a large language model (Gemma 3-4B) generates a context-grounded response, and text-to-speech delivers the answer in natural spoken Bengali. In a pilot evaluation, KrishokBondhu produced high-quality responses for 72.7% of diverse agricultural queries covering crop management, disease control, and cultivation practices. Compared to the KisanQRS benchmark, the system achieved a composite score of 4.53 (vs. 3.13) on a 5-point scale, a 44.7% improvement, with especially large gains in contextual richness (+367%) and completeness (+100.4%), while maintaining comparable relevance and technical specificity. Semantic similarity analysis further revealed a strong correlation between retrieved context and answer quality, emphasizing the importance of grounding generative responses in curated documentation. KrishokBondhu demonstrates the feasibility of integrating call-centre accessibility, multilingual voice interaction, and modern RAG techniques to deliver expert-level agricultural guidance to remote Bangladeshi farmers, paving the way toward a fully AI-driven agricultural advisory ecosystem.
comment: 6 pages, 7 figures, 5 tables, submitted to the 11th IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering (WIECON-ECE 2025)
☆ Enhancing Hotel Recommendations with AI: LLM-Based Review Summarization and Query-Driven Insights
The increasing number of data a booking platform such as Booking.com and AirBnB offers make it challenging for interested parties to browse through the available accommodations and analyze reviews in an efficient way. Efforts have been made from the booking platform providers to utilize recommender systems in an effort to enable the user to filter the results by factors such as stars, amenities, cost but most valuable insights can be provided by the unstructured text-based reviews. Going through these reviews one-by-one requires a substantial amount of time to be devoted while a respectable percentage of the reviews won't provide to the user what they are actually looking for. This research publication explores how Large Language Models (LLMs) can enhance short rental apartments recommendations by summarizing and mining key insights from user reviews. The web application presented in this paper, named "instaGuide", automates the procedure of isolating the text-based user reviews from a property on the Booking.com platform, synthesizing the summary of the reviews, and enabling the user to query specific aspects of the property in an effort to gain feedback on their personal questions/criteria. During the development of the instaGuide tool, numerous LLM models were evaluated based on accuracy, cost, and response quality. The results suggest that the LLM-powered summarization reduces significantly the amount of time the users need to devote on their search for the right short rental apartment, improving the overall decision-making procedure.
☆ LIME: Link-based user-item Interaction Modeling with decoupled xor attention for Efficient test time scaling
Scaling large recommendation systems requires advancing three major frontiers: processing longer user histories, expanding candidate sets, and increasing model capacity. While promising, transformers' computational cost scales quadratically with the user sequence length and linearly with the number of candidates. This trade-off makes it prohibitively expensive to expand candidate sets or increase sequence length at inference, despite the significant performance improvements. We introduce \textbf{LIME}, a novel architecture that resolves this trade-off. Through two key innovations, LIME fundamentally reduces computational complexity. First, low-rank ``link embeddings" enable pre-computation of attention weights by decoupling user and candidate interactions, making the inference cost nearly independent of candidate set size. Second, a linear attention mechanism, \textbf{LIME-XOR}, reduces the complexity with respect to user sequence length from quadratic ($O(N^2)$) to linear ($O(N)$). Experiments on public and industrial datasets show LIME achieves near-parity with state-of-the-art transformers but with a 10$\times$ inference speedup on large candidate sets or long sequence lengths. When tested on a major recommendation platform, LIME improved user engagement while maintaining minimal inference costs with respect to candidate set size and user history length, establishing a new paradigm for efficient and expressive recommendation systems.
comment: 16 pages
☆ XGen-Q: An Explainable Domain-Adaptive LLM Framework with Retrieval-Augmented Generation for Software Security
Generative AI and large language models (LLMs) have shown strong capabilities in code understanding, but their use in cybersecurity, particularly for malware detection and analysis, remains limited. Existing detection systems often fail to generalize to obfuscated or previously unseen threats, underscoring the need for more adaptable and explainable models. To address this challenge, we introduce XGen-Q, a domain-adapted LLM built on the Qwen-Coder architecture and pretrained on a large-scale corpus of over one million malware samples, spanning both source and assembly code. XGen-Q uses a multi-stage prompt strategy combined with retrieval-augmented generation (RAG) to deliver reliable malware identification and detailed forensic reporting, even in the presence of complex code obfuscation. To further enhance generalization, we design a training pipeline that systematically exposes the model to diverse obfuscation patterns. Experimental results show that XGen-Q achieves significantly lower perplexity than competitive baselines and exhibits strong performance on novel malware samples, demonstrating the promise of LLM-based approaches for interpretable and robust malware analysis.
☆ SBAN: A Framework \& Multi-Dimensional Dataset for Large Language Model Pre-Training and Software Code Mining
This paper introduces SBAN (Source code, Binary, Assembly, and Natural Language Description), a large-scale, multi-dimensional dataset designed to advance the pre-training and evaluation of large language models (LLMs) for software code analysis. SBAN comprises more than 3 million samples, including 2.9 million benign and 672,000 malware respectively, each represented across four complementary layers: binary code, assembly instructions, natural language descriptions, and source code. This unique multimodal structure enables research on cross-representation learning, semantic understanding of software, and automated malware detection. Beyond security applications, SBAN supports broader tasks such as code translation, code explanation, and other software mining tasks involving heterogeneous data. It is particularly suited for scalable training of deep models, including transformers and other LLM architectures. By bridging low-level machine representations and high-level human semantics, SBAN provides a robust foundation for building intelligent systems that reason about code. We believe that this dataset opens new opportunities for mining software behavior, improving security analytics, and enhancing LLM capabilities in pre-training and fine-tuning tasks for software code mining.
☆ DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code NeurIPS 2025
The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has only increased the imperative for machine-generated content detectors to be accurate and efficient across domains. Current detectors, predominantly utilizing zero-shot methods, such as Fast DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often with a trade-off between the two, leaving room for further improvement. To address these gaps, we propose the fine-tuning of encoder-only Small Language Models (SLMs), in particular, the pre-trained models of RoBERTA and CodeBERTa using specialized datasets on source code and other natural language to prove that for the task of binary classification, SLMs outperform LLMs by a huge margin whilst using a fraction of compute. Our encoders achieve AUROC $= 0.97$ to $0.99$ and macro-F1 $0.89$ to $0.94$ while reducing latency by $8$-$12\times$ and peak VRAM by $3$-$5\times$ at $512$-token inputs. Under cross-generator shifts and adversarial transformations (paraphrase, back-translation; code formatting/renaming), performance retains $\geq 92%$ of clean AUROC. We release training and evaluation scripts with seeds and configs; a reproducibility checklist is also included.
comment: Accepted to 39th Conference on Neural Information Processing Systems (NeurIPS 2025): 4th Workshop on Deep Learning for Code
☆ 10 Simple Rules for Improving Your Standardized Fields and Terms
Contextual metadata is the unsung hero of research data. When done right, standardized and structured vocabularies make your data findable, shareable, and reusable. When done wrong, they turn a well intended effort into data cleanup and curation nightmares. In this paper we tackle the surprisingly tricky process of vocabulary standardization with a mix of practical advice and grounded examples. Drawing from real-world experience in contextual data harmonization, we highlight common challenges (e.g., semantic noise and concept bombs) and provide actionable strategies to address them. Our rules emphasize alignment with Findability, Accessibility, Interoperability, and Reusability (FAIR) principles while remaining adaptable to evolving user and research needs. Whether you are curating datasets, designing a schema, or contributing to a standards body, these rules aim to help you create metadata that is not only technically sound but also meaningful to users.
comment: 17 pages, 1 figure Author Contributions: Conceptualization by EG and RC. Manuscript writing by RC. Revisions and Editing by RC, EG, DD, and WH. Acknowledgements: Charlotte Barclay
☆ Unifying Inductive, Cross-Domain, and Multimodal Learning for Robust and Generalizable Recommendation CIKM 2025
Recommender systems have long been built upon the modeling of interactions between users and items, while recent studies have sought to broaden this paradigm by generalizing to new users and items, incorporating diverse information sources, and transferring knowledge across domains. Nevertheless, these efforts have largely focused on individual aspects, hindering their ability to tackle the complex recommendation scenarios that arise in daily consumptions across diverse domains. In this paper, we present MICRec, a unified framework that fuses inductive modeling, multimodal guidance, and cross-domain transfer to capture user contexts and latent preferences in heterogeneous and incomplete real-world data. Moving beyond the inductive backbone of INMO, our model refines expressive representations through modality-based aggregation and alleviates data sparsity by leveraging overlapping users as anchors across domains, thereby enabling robust and generalizable recommendation. Experiments show that MICRec outperforms 12 baselines, with notable gains in domains with limited training data.
comment: 7 pages, 3 figures, and 4 tables. International Workshop on Multimodal Generative Search and Recommendation (MMGenSR) at The 34th ACM International Conference on Information and Knowledge Management (CIKM 2025)
☆ DiffGRM: Diffusion-based Generative Recommendation Model
Generative recommendation (GR) is an emerging paradigm that represents each item via a tokenizer as an n-digit semantic ID (SID) and predicts the next item by autoregressively generating its SID conditioned on the user's history. However, two structural properties of SIDs make ARMs ill-suited. First, intra-item consistency: the n digits jointly specify one item, yet the left-to-right causality trains each digit only under its prefix and blocks bidirectional cross-digit evidence, collapsing supervision to a single causal path. Second, inter-digit heterogeneity: digits differ in semantic granularity and predictability, while the uniform next-token objective assigns equal weight to all digits, overtraining easy digits and undertraining hard digits. To address these two issues, we propose DiffGRM, a diffusion-based GR model that replaces the autoregressive decoder with a masked discrete diffusion model (MDM), thereby enabling bidirectional context and any-order parallel generation of SID digits for recommendation. Specifically, we tailor DiffGRM in three aspects: (1) tokenization with Parallel Semantic Encoding (PSE) to decouple digits and balance per-digit information; (2) training with On-policy Coherent Noising (OCN) that prioritizes uncertain digits via coherent masking to concentrate supervision on high-value signals; and (3) inference with Confidence-guided Parallel Denoising (CPD) that fills higher-confidence digits first and generates diverse Top-K candidates. Experiments show consistent gains over strong generative and discriminative recommendation baselines on multiple datasets, improving NDCG@10 by 6.9%-15.5%. Code is available at https://github.com/liuzhao09/DiffGRM.
comment: 13 pages, 5 figures
♻ ☆ Is Implicit Knowledge Enough for LLMs? A RAG Approach for Tree-based Structures
Large Language Models (LLMs) are adept at generating responses based on information within their context. While this ability is useful for interacting with structured data like code files, another popular method, Retrieval-Augmented Generation (RAG), retrieves relevant documents to augment the model's in-context learning. However, it is not well-explored how to best represent this retrieved knowledge for generating responses on structured data, particularly hierarchical structures like trees. In this work, we propose a novel bottom-up method to linearize knowledge from tree-like structures (like a GitHub repository) by generating implicit, aggregated summaries at each hierarchical level. This approach enables the knowledge to be stored in a knowledge base and used directly with RAG. We then compare our method to using RAG on raw, unstructured code, evaluating the accuracy and quality of the generated responses. Our results show that while response quality is comparable across both methods, our approach generates over 68% fewer documents in the retriever, a significant gain in efficiency. This finding suggests that leveraging implicit, linearized knowledge may be a highly effective and scalable strategy for handling complex, hierarchical data structures.
comment: Waiting for Conference Response
♻ ☆ Review of Explainable Graph-Based Recommender Systems
Explainability of recommender systems has become essential to ensure users' trust and satisfaction. Various types of explainable recommender systems have been proposed including explainable graph-based recommender systems. This review paper discusses state-of-the-art approaches of these systems and categorizes them based on three aspects: learning methods, explaining methods, and explanation types. It also explores the commonly used datasets, explainability evaluation methods, and future directions of this research area. Compared with the existing review papers, this paper focuses on explainability based on graphs and covers the topics required for developing novel explainable graph-based recommender systems.
♻ ☆ When Text Embedding Meets Large Language Model: A Comprehensive Survey
Text embedding has become a foundational technology in natural language processing (NLP) during the deep learning era, driving advancements across a wide array of downstream tasks. While many natural language understanding challenges can now be modeled using generative paradigms and leverage the robust generative and comprehension capabilities of large language models (LLMs), numerous practical applications - such as semantic matching, clustering, and information retrieval - continue to rely on text embeddings for their efficiency and effectiveness. Therefore, integrating LLMs with text embeddings has become a major research focus in recent years. In this survey, we categorize the interplay between LLMs and text embeddings into three overarching themes: (1) LLM-augmented text embedding, enhancing traditional embedding methods with LLMs; (2) LLMs as text embedders, adapting their innate capabilities for high-quality embedding; and (3) Text embedding understanding with LLMs, leveraging LLMs to analyze and interpret embeddings. By organizing recent works based on interaction patterns rather than specific downstream applications, we offer a novel and systematic overview of contributions from various research and application domains in the era of LLMs. Furthermore, we highlight the unresolved challenges that persisted in the pre-LLM era with pre-trained language models (PLMs) and explore the emerging obstacles brought forth by LLMs. Building on this analysis, we outline prospective directions for the evolution of text embedding, addressing both theoretical and practical opportunities in the rapidly advancing landscape of NLP.
comment: Version 4: We added the latest works of LLM-based Embedders
♻ ☆ On Efficiency-Effectiveness Trade-off of Diffusion-based Recommenders
Diffusion models have emerged as a powerful paradigm for generative sequential recommendation, which typically generate next items to recommend guided by user interaction histories with a multi-step denoising process. However, the multi-step process relies on discrete approximations, introducing discretization error that creates a trade-off between computational efficiency and recommendation effectiveness. To address this trade-off, we propose TA-Rec, a two-stage framework that achieves one-step generation by smoothing the denoising function during pretraining while alleviating trajectory deviation by aligning with user preferences during fine-tuning. Specifically, to improve the efficiency without sacrificing the recommendation performance, TA-Rec pretrains the denoising model with Temporal Consistency Regularization (TCR), enforcing the consistency between the denoising results across adjacent steps. Thus, we can smooth the denoising function to map the noise as oracle items in one step with bounded error. To further enhance effectiveness, TA-Rec introduces Adaptive Preference Alignment (APA) that aligns the denoising process with user preference adaptively based on preference pair similarity and timesteps. Extensive experiments prove that TA-Rec's two-stage objective effectively mitigates the discretization errors-induced trade-off, enhancing both efficiency and effectiveness of diffusion-based recommenders.
♻ ☆ AI4DiTraRe: Building the BFO-Compliant Chemotion Knowledge Graph ISWC 2025
Chemistry is an example of a discipline where the advancements of technology have led to multi-level and often tangled and tricky processes ongoing in the lab. The repeatedly complex workflows are combined with information from chemical structures, which are essential to understand the scientific process. An important tool for many chemists is Chemotion, which consists of an electronic lab notebook and a repository. This paper introduces a semantic pipeline for constructing the BFO-compliant Chemotion Knowledge Graph, providing an integrated, ontology-driven representation of chemical research data. The Chemotion-KG has been developed to adhere to the FAIR (Findable, Accessible, Interoperable, Reusable) principles and to support AI-driven discovery and reasoning in chemistry. Experimental metadata were harvested from the Chemotion API in JSON-LD format, converted into RDF, and subsequently transformed into a Basic Formal Ontology-aligned graph through SPARQL CONSTRUCT queries. The source code and datasets are publicly available via GitHub. The Chemotion Knowledge Graph is hosted by FIZ Karlsruhe Information Service Engineering. Outcomes presented in this work were achieved within the Leibniz Science Campus ``Digital Transformation of Research'' (DiTraRe) and are part of an ongoing interdisciplinary collaboration.
comment: 12 pages, 7 figures. Camera-ready version. Accepted to the 5th International Workshop on Scientific Knowledge: Representation, Discovery, and Assessment; 2 November 2025 - Nara, Japan; co-located with The 24th International Semantic Web Conference, ISWC 2025. Published in CEUR proceedings Vol-4065, pages 45-56
♻ ☆ Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval
Training effective dense retrieval models typically relies on hard negative (HN) examples mined from large document corpora using methods such as BM25 or cross-encoders (CE), which require full corpus access. We propose a corpus-free alternative: an end-to-end pipeline where a Large Language Model (LLM) first generates a query from a passage and then produces a hard negative example using only the generated query text. Our dataset comprises 7,250 arXiv abstracts spanning diverse domains including mathematics, physics, computer science, and related fields, serving as positive passages for query generation. We evaluate two fine-tuning configurations of DistilBERT for dense retrieval; one using LLM-generated hard negatives conditioned solely on the query, and another using negatives generated with both the query and its positive document as context. Compared to traditional corpus-based mining methods {LLM Query $\rightarrow$ BM25 HN and LLM Query $\rightarrow$ CE HN on multiple BEIR benchmark datasets, our all-LLM pipeline outperforms strong lexical mining baselines and achieves performance comparable to cross-encoder-based methods, demonstrating the potential of corpus-free hard negative generation for retrieval model training.
♻ ☆ Retrieval-in-the-Chain: Bootstrapping Large Language Models for Generative Retrieval
Generative retrieval (GR) is an emerging paradigm that leverages large language models (LLMs) to autoregressively generate document identifiers (docids) relevant to a given query. Prior works have focused on leveraging the generative capabilities of LLMs to improve GR, while overlooking that their reasoning capabilities could likewise help. This raises a key question: Can explicit reasoning benefit GR? To investigate, we first conduct a preliminary study where an LLM is prompted to generate free-form chain-of-thought (CoT) reasoning before performing constrained docid decoding. Although this method outperforms standard GR, the generated reasoning tends to be verbose and poorly aligned with the docid space. These limitations motivate the development of a reasoning mechanism better tailored to GR. Therefore, we propose Reason-for-Retrieval (R4R), a reasoning-augmented framework for GR that converts free-form CoT reasoning into a compact, structured format, and iteratively refines the reasoning during the retrieval process. R4R augments an existing GR method by leveraging a reasoning-capable LLM that has been instruction-tuned for GR. At inference time, R4R first uses the LLM to generate an initial structured reasoning; then the same LLM alternates between (i) constrained decoding with the chosen GR method to produce candidate docids and (ii) updating the reasoning based on retrieval results to improve the next round. R4R does not require additional models or training, and instead a single LLM serves as both the reasoning generator and the retriever. Extensive experiments on Natural Questions, MS MARCO, and a real-world item-search benchmark validate the effectiveness of R4R.
♻ ☆ FinAI Data Assistant: LLM-based Financial Database Query Processing with the OpenAI Function Calling API CIKM 2025
We present FinAI Data Assistant, a practical approach for natural-language querying over financial databases that combines large language models (LLMs) with the OpenAI Function Calling API. Rather than synthesizing complete SQL via text-to-SQL, our system routes user requests to a small library of vetted, parameterized queries, trading generative flexibility for reliability, low latency, and cost efficiency. We empirically study three questions: (RQ1) whether LLMs alone can reliably recall or extrapolate time-dependent financial data without external retrieval; (RQ2) how well LLMs map company names to stock ticker symbols; and (RQ3) whether function calling outperforms text-to-SQL for end-to-end database query processing. Across controlled experiments on prices and fundamentals, LLM-only predictions exhibit non-negligible error and show look-ahead bias primarily for stock prices relative to model knowledge cutoffs. Ticker-mapping accuracy is near-perfect for NASDAQ-100 constituents and high for S\&P~500 firms. Finally, FinAI Data Assistant achieves lower latency and cost and higher reliability than a text-to-SQL baseline on our task suite. We discuss design trade-offs, limitations, and avenues for deployment.
comment: 6 pages, 2 figures, accepted at CIKM 2025 FinAI Workshop
♻ ☆ DrunkAgent: Stealthy Memory Corruption in LLM-Powered Recommender Agents
Large language model (LLM)-powered agents are increasingly used in recommender systems (RSs) to achieve personalized behavior modeling, where the memory mechanism plays a pivotal role in enabling the agents to autonomously explore, learn and self-evolve from real-world interactions. However, this very mechanism, serving as a contextual repository, inherently exposes an attack surface for potential adversarial manipulations. Despite its central role, the robustness of agentic RSs in the face of such threats remains largely underexplored. Previous works suffer from semantic mismatches or rely on static embeddings or pre-defined prompts, all of which are not designed for dynamic systems, especially for dynamic memory states of LLM agents. This challenge is exacerbated by the black-box nature of commercial recommenders. To tackle the above problems, in this paper, we present the first systematic investigation of memory-based vulnerabilities in LLM-powered recommender agents, revealing their security limitations and guiding efforts to strengthen system resilience and trustworthiness. Specifically, we propose a novel black-box attack framework named DrunkAgent. DrunkAgent crafts semantically meaningful adversarial textual triggers for target item promotions and introduces a series of strategies to maximize the trigger effect by corrupting the memory updates during the interactions. The triggers and strategies are optimized on a surrogate model, enabling DrunkAgent transferable and stealthy. Extensive experiments on real-world datasets across diverse agentic RSs, including collaborative filtering, retrieval augmentation and sequential recommendations, demonstrate the generalizability, transferability and stealthiness of DrunkAgent.
♻ ☆ MatPROV: A Provenance Graph Dataset of Material Synthesis Extracted from Scientific Literature
Synthesis procedures play a critical role in materials research, as they directly affect material properties. With data-driven approaches increasingly accelerating materials discovery, there is growing interest in extracting synthesis procedures from scientific literature as structured data. However, existing studies often rely on rigid, domain-specific schemas with predefined fields for structuring synthesis procedures or assume that synthesis procedures are linear sequences of operations, which limits their ability to capture the structural complexity of real-world procedures. To address these limitations, we adopt PROV-DM, an international standard for provenance information, which supports flexible, graph-based modeling of procedures. We present MatPROV, a dataset of PROV-DM-compliant synthesis procedures extracted from scientific literature using large language models. MatPROV captures structural complexities and causal relationships among materials, operations, and conditions through visually intuitive directed graphs. This representation enables machine-interpretable synthesis knowledge, opening opportunities for future research such as automated synthesis planning and optimization.
♻ ☆ CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation ACL 2025
Large language models (LLMs) have revolutionized natural language processing (NLP), particularly through Retrieval-Augmented Generation (RAG), which enhances LLM capabilities by integrating external knowledge. However, traditional RAG systems face critical limitations, including disrupted contextual integrity due to text chunking, and over-reliance on semantic similarity for retrieval. To address these issues, we propose CausalRAG, a novel framework that incorporates causal graphs into the retrieval process. By constructing and tracing causal relationships, CausalRAG preserves contextual continuity and improves retrieval precision, leading to more accurate and interpretable responses. We evaluate CausalRAG against regular RAG and graph-based RAG approaches, demonstrating its superiority across several metrics. Our findings suggest that grounding retrieval in causal reasoning provides a promising approach to knowledge-intensive tasks.
comment: Accepted at Findings of ACL 2025
Multimedia 9
☆ PIRA: Pan-CDN Intra-video Resource Adaptation for Short Video Streaming
In large scale short video platforms, CDN resource selection plays a critical role in maintaining Quality of Experience (QoE) while controlling escalating traffic costs. To better understand this phenomenon, we conduct in the wild network measurements during video playback in a production short video system. The results reveal that CDNs delivering higher average QoE often come at greater financial cost, yet their connection quality fluctuates even within a single video underscoring a fundamental and dynamic trade off between QoE and cost. However, the problem of sustaining high QoE under cost constraints remains insufficiently investigated in the context of CDN selection for short video streaming. To address this, we propose PIRA, a dynamic resource selection algorithm that optimizes QoE and cost in real time during video playback. PIRA formally integrating QoE and cost by a mathematical model, and introduce a intra video control theoretic CDN resource selection approach which can balance QoE and cost under network dynamics. To reduce the computation overheads, PIRA employs state space pruning and adaptive parameter adjustment to efficiently solve the high dimensional optimization problem. In large scale production experiments involving 450,000 users over two weeks, PIRA outperforms the production baseline, achieving a 2.1% reduction in start up delay, 15.2% shorter rebuffering time, and 10% lower average unit traffic cost, demonstrating its effectiveness in balancing user experience and financial cost at scale.
☆ Noise-Conditioned Mixture-of-Experts Framework for Robust Speaker Verification
Robust speaker verification under noisy conditions remains an open challenge. Conventional deep learning methods learn a robust unified speaker representation space against diverse background noise and achieve significant improvement. In contrast, this paper presents a noise-conditioned mixture-ofexperts framework that decomposes the feature space into specialized noise-aware subspaces for speaker verification. Specifically, we propose a noise-conditioned expert routing mechanism, a universal model based expert specialization strategy, and an SNR-decaying curriculum learning protocol, collectively improving model robustness and generalization under diverse noise conditions. The proposed method can automatically route inputs to expert networks based on noise information derived from the inputs, where each expert targets distinct noise characteristics while preserving speaker identity information. Comprehensive experiments demonstrate consistent superiority over baselines, confirming that explicit noise-dependent feature modeling significantly enhances robustness without sacrificing verification accuracy.
☆ DeLoad: Demand-Driven Short-Video Preloading with Scalable Watch-Time Estimation
Short video streaming has become a dominant paradigm in digital media, characterized by rapid swiping interactions and diverse media content. A key technical challenge is designing an effective preloading strategy that dynamically selects and prioritizes download tasks from an evolving playlist, balancing Quality of Experience (QoE) and bandwidth efficiency under practical commercial constraints. However, real world analysis reveals critical limitations of existing approaches: (1) insufficient adaptation of download task sizes to dynamic conditions, and (2) watch time prediction models that are difficult to deploy reliably at scale. In this paper, we propose DeLoad, a novel preloading framework that addresses these issues by introducing dynamic task sizing and a practical, multi dimensional watch time estimation method. Additionally, a Deep Reinforcement Learning (DRL) enhanced agent is trained to optimize the download range decisions adaptively. Extensive evaluations conducted on an offline testing platform, leveraging massive real world network data, demonstrate that DeLoad achieves significant improvements in QoE metrics (34.4% to 87.4% gain). Furthermore, after deployment on a large scale commercial short video platform, DeLoad has increased overall user watch time by 0.09% while simultaneously reducing rebuffering events and 3.76% bandwidth consumption.
☆ How2Compress: Scalable and Efficient Edge Video Analytics via Adaptive Granular Video Compression
With the rapid proliferation of the Internet of Things, video analytics has become a cornerstone application in wireless multimedia sensor networks. To support such applications under bandwidth constraints, learning-based adaptive quantization for video compression have demonstrated strong potential in reducing bitrate while maintaining analytical accuracy. However, existing frameworks often fail to fully exploit the fine-grained quality control enabled by modern blockbased video codecs, leaving significant compression efficiency untapped. In this paper, we present How2Compress, a simple yet effective framework designed to enhance video compression efficiency through precise, fine-grained quality control at the macroblock level. How2Compress is a plug-and-play module and can be seamlessly integrated into any existing edge video analytics pipelines. We implement How2Compress on the H.264 codec and evaluate its performance across diverse real-world scenarios. Experimental results show that How2Compress achieves up to $50.4\%$ bitrate savings and outperforms baselines by up to $3.01\times$ without compromising accuracy, demonstrating its practical effectiveness and efficiency. Code is available at https://github.com/wyhallenwu/how2compress and a reproducible docker image at https://hub.docker.com/r/wuyuheng/how2compress.
comment: MM 2025
☆ EVER: Edge-Assisted Auto-Verification for Mobile MR-Aided Operation
Mixed Reality (MR)-aided operation overlays digital objects on the physical world to provide a more immersive and intuitive operation process. A primary challenge is the precise and fast auto-verification of whether the user follows MR guidance by comparing frames before and after each operation. The pre-operation frame includes virtual guiding objects, while the post-operation frame contains physical counterparts. Existing approaches fall short of accounting for the discrepancies between physical and virtual objects due to imperfect 3D modeling or lighting estimation. In this paper, we propose EVER: an edge-assisted auto-verification system for mobile MR-aided operations. Unlike traditional frame-based similarity comparisons, EVER leverages the segmentation model and rendering pipeline adapted to the unique attributes of frames with physical pieces and those with their virtual counterparts; it adopts a threshold-based strategy using Intersection over Union (IoU) metrics for accurate auto-verification. To ensure fast auto-verification and low energy consumption, EVER offloads compute-intensive tasks to an edge server. Through comprehensive evaluations of public datasets and custom datasets with practical implementation, EVER achieves over 90% verification accuracy within 100 milliseconds (significantly faster than average human reaction time of approximately 273 milliseconds), while consuming only minimal additional computational resources and energy compared to a system without auto-verification.
☆ Foveated Compression for Immersive Telepresence Visualization
Immersive televisualization is important both for telepresence and teleoperation, but resolution and fidelity are often limited by communication bandwidth constraints. We propose a lightweight method for foveated compression of immersive televisualization video streams that can be easily integrated with common video codecs, reducing the required bandwidth if eye tracking data is available. Specifically, we show how to spatially adjust the Quantization Parameter of modern block-based video codecs in a adaptive way based on eye tracking information. The foveal region is transmitted with high fidelity while quality is reduced in the peripheral region, saving bandwidth. We integrate our method with the NimbRo avatar system, which won the ANA Avatar XPRIZE competition. Our experiments show that bandwidth can be reduced to a third without sacrificing immersion. We analyze transmission fidelity with qualitative examples and report quantitative results.
comment: Presented at IEEE TELEPRESENCE 2025, Leiden, Netherlands
♻ ☆ LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
We introduce \textbf{LongInsightBench}, the first benchmark designed to assess models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating \textbf{visual, audio, and text} modalities. Our benchmark excels in three key areas: \textbf{a) Long-Duration, Information-Dense Videos:} We carefully select approximately 1,000 videos from open-source datasets FineVideo based on duration limit and the information density of both visual and audio modalities, focusing on content like lectures, interviews, and vlogs, which contain rich language elements. \textbf{b) Diverse and Challenging Task Scenarios:} We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. \textbf{c) Rigorous and Comprehensive Quality Assurance Pipelines:} We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results shows that Omni-modal models(OLMs) still face challenge in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal the information loss and processing bias in multi-modal fusion of OLMs. Our dataset and code is available at https://anonymous.4open.science/r/LongInsightBench-910F/.
comment: Submitted to ARR Rolling Review
♻ ☆ M3ST-DTI: A multi-task learning model for drug-target interactions based on multi-modal features and multi-stage alignment
Accurate prediction of drug-target interactions (DTI) is pivotal in drug discovery. However, existing approaches often fail to capture deep intra-modal feature interactions or achieve effective cross-modal alignment, limiting predictive performance and generalization. To address these challenges, we propose M3ST-DTI, a multi-task learning model that enables multi-stage integration and alignment of multi modal features for DTI prediction. M3ST-DTI incorporates three types of features-textual, structural, and functional and enhances intra-modal representations using self-attention mechanisms and a hybrid pooling graph attention module. For early-stage feature alignment and fusion, the model in tegrates MCA with Gram loss as a structural constraint. In the later stage, a BCA module captures fine-grained interactions between drugs and targets within each modality, while a deep orthogonal fusion module mitigates feature redundancy.Extensive evaluations on benchmark datasets demonstrate that M3ST-DTI consistently outperforms state-of-the art methods across diverse metrics
comment: This paper accepted by IEEE BIBM 2025
♻ ☆ SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement NeurIPS2025
Generating music with coherent structure, harmonious instrumental and vocal elements remains a significant challenge in song generation. Existing language models and diffusion-based methods often struggle to balance global coherence with local fidelity, resulting in outputs that lack musicality or suffer from incoherent progression and mismatched lyrics. This paper introduces $\textbf{SongBloom}$, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models. Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process. Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to the state-of-the-art commercial music generation platforms. Audio samples are available on our demo page: https://cypress-yang.github.io/SongBloom_demo. The code and model weights have been released on https://github.com/Cypress-Yang/SongBloom .
comment: Accepted by NeurIPS2025
Image and Video Processing 17
☆ PIRA: Pan-CDN Intra-video Resource Adaptation for Short Video Streaming
In large scale short video platforms, CDN resource selection plays a critical role in maintaining Quality of Experience (QoE) while controlling escalating traffic costs. To better understand this phenomenon, we conduct in the wild network measurements during video playback in a production short video system. The results reveal that CDNs delivering higher average QoE often come at greater financial cost, yet their connection quality fluctuates even within a single video underscoring a fundamental and dynamic trade off between QoE and cost. However, the problem of sustaining high QoE under cost constraints remains insufficiently investigated in the context of CDN selection for short video streaming. To address this, we propose PIRA, a dynamic resource selection algorithm that optimizes QoE and cost in real time during video playback. PIRA formally integrating QoE and cost by a mathematical model, and introduce a intra video control theoretic CDN resource selection approach which can balance QoE and cost under network dynamics. To reduce the computation overheads, PIRA employs state space pruning and adaptive parameter adjustment to efficiently solve the high dimensional optimization problem. In large scale production experiments involving 450,000 users over two weeks, PIRA outperforms the production baseline, achieving a 2.1% reduction in start up delay, 15.2% shorter rebuffering time, and 10% lower average unit traffic cost, demonstrating its effectiveness in balancing user experience and financial cost at scale.
☆ Channel-Aware Vector Quantization for Robust Semantic Communication on Discrete Channels
Deep learning-based semantic communication has largely relied on analog or semi-digital transmission, which limits compatibility with modern digital communication infrastructures. Recent studies have employed vector quantization (VQ) to enable discrete semantic transmission, yet existing methods neglect channel state information during codebook optimization, leading to suboptimal robustness. To bridge this gap, we propose a channel-aware vector quantization (CAVQ) algorithm within a joint source-channel coding (JSCC) framework, termed VQJSCC, established on a discrete memoryless channel. In this framework, semantic features are discretized and directly mapped to modulation constellation symbols, while CAVQ integrates channel transition probabilities into the quantization process, aligning easily confused symbols with semantically similar codewords. A multi-codebook alignment mechanism is further introduced to handle mismatches between codebook order and modulation order by decomposing the transmission stream into multiple independently optimized subchannels. Experimental results demonstrate that VQJSCC effectively mitigates the digital cliff effect, achieves superior reconstruction quality across various modulation schemes, and outperforms state-of-the-art digital semantic communication baselines in both robustness and efficiency.
comment: 12 pages, 8 figures
☆ DeLoad: Demand-Driven Short-Video Preloading with Scalable Watch-Time Estimation
Short video streaming has become a dominant paradigm in digital media, characterized by rapid swiping interactions and diverse media content. A key technical challenge is designing an effective preloading strategy that dynamically selects and prioritizes download tasks from an evolving playlist, balancing Quality of Experience (QoE) and bandwidth efficiency under practical commercial constraints. However, real world analysis reveals critical limitations of existing approaches: (1) insufficient adaptation of download task sizes to dynamic conditions, and (2) watch time prediction models that are difficult to deploy reliably at scale. In this paper, we propose DeLoad, a novel preloading framework that addresses these issues by introducing dynamic task sizing and a practical, multi dimensional watch time estimation method. Additionally, a Deep Reinforcement Learning (DRL) enhanced agent is trained to optimize the download range decisions adaptively. Extensive evaluations conducted on an offline testing platform, leveraging massive real world network data, demonstrate that DeLoad achieves significant improvements in QoE metrics (34.4% to 87.4% gain). Furthermore, after deployment on a large scale commercial short video platform, DeLoad has increased overall user watch time by 0.09% while simultaneously reducing rebuffering events and 3.76% bandwidth consumption.
☆ Quantification of dual-state 5-ALA-induced PpIX fluorescence: Methodology and validation in tissue-mimicking phantoms
Quantification of protoporphyrin IX (PpIX) fluorescence in human brain tumours has the potential to significantly improve patient outcomes in neuro-oncology, but represents a formidable imaging challenge. Protoporphyrin is a biological molecule which interacts with the tissue micro-environment to form two photochemical states in glioma. Each exhibits markedly different quantum efficiencies, with distinct but overlapping emission spectra that also overlap with tissue autofluorescence. Fluorescence emission is known to be distorted by the intrinsic optical properties of tissue, coupled with marked intra-tumoural heterogeneity as a hallmark of glioma tumours. Existing quantitative fluorescence systems are developed and validated using simplified phantoms that do not simultaneously mimic the complex interactions between fluorophores and tissue optical properties or micro-environment. Consequently, existing systems risk introducing systematic errors into PpIX quantification when used in tissue. In this work, we introduce a novel pipeline for quantification of PpIX in glioma, which robustly differentiates both emission states from background autofluorescence without reliance on a priori spectral information, and accounts for variations in their quantum efficiency. Unmixed PpIX emission forms are then corrected for wavelength-dependent optical distortions and weighted for accurate quantification. Significantly, this pipeline is developed and validated using novel tissue-mimicking phantoms replicating the optical properties of glioma tissues and photochemical variability of PpIX fluorescence in glioma. Our workflow achieves strong correlation with ground-truth PpIX concentrations (R2 = 0.918+-0.002), demonstrating its potential for robust, quantitative PpIX fluorescence imaging in clinical settings.
☆ Multi-Resolution Analysis of the Convective Structure of Tropical Cyclones for Short-Term Intensity Guidance NeurIPS 2025
Accurate tropical cyclone (TC) short-term intensity forecasting with a 24-hour lead time is essential for disaster mitigation in the Atlantic TC basin. Since most TCs evolve far from land-based observing networks, satellite imagery is critical to monitoring these storms; however, these complex and high-resolution spatial structures can be challenging to qualitatively interpret in real time by forecasters. Here we propose a concise, interpretable, and descriptive approach to quantify fine TC structures with a multi-resolution analysis (MRA) by the discrete wavelet transform, enabling data analysts to identify physically meaningful structural features that strongly correlate with rapid intensity change. Furthermore, deep-learning techniques can build on this MRA for short-term intensity guidance.
comment: For Tackling Climate Change with Machine Learning workshop at NeurIPS 2025
☆ Foveated Compression for Immersive Telepresence Visualization
Immersive televisualization is important both for telepresence and teleoperation, but resolution and fidelity are often limited by communication bandwidth constraints. We propose a lightweight method for foveated compression of immersive televisualization video streams that can be easily integrated with common video codecs, reducing the required bandwidth if eye tracking data is available. Specifically, we show how to spatially adjust the Quantization Parameter of modern block-based video codecs in a adaptive way based on eye tracking information. The foveal region is transmitted with high fidelity while quality is reduced in the peripheral region, saving bandwidth. We integrate our method with the NimbRo avatar system, which won the ANA Avatar XPRIZE competition. Our experiments show that bandwidth can be reduced to a third without sacrificing immersion. We analyze transmission fidelity with qualitative examples and report quantitative results.
comment: Presented at IEEE TELEPRESENCE 2025, Leiden, Netherlands
☆ HDR Image Reconstruction using an Unsupervised Fusion Model
High Dynamic Range (HDR) imaging aims to reproduce the wide range of brightness levels present in natural scenes, which the human visual system can perceive but conventional digital cameras often fail to capture due to their limited dynamic range. To address this limitation, we propose a deep learning-based multi-exposure fusion approach for HDR image generation. The method takes a set of differently exposed Low Dynamic Range (LDR) images, typically an underexposed and an overexposed image, and learns to fuse their complementary information using a convolutional neural network (CNN). The underexposed image preserves details in bright regions, while the overexposed image retains information in dark regions; the network effectively combines these to reconstruct a high-quality HDR output. The model is trained in an unsupervised manner, without relying on ground-truth HDR images, making it practical for real-world applications where such data is unavailable. We evaluate our results using the Multi-Exposure Fusion Structural Similarity Index Measure (MEF-SSIM) and demonstrate that our approach achieves superior visual quality compared to existing fusion methods. A customized loss function is further introduced to improve reconstruction fidelity and optimize model performance.
♻ ☆ SimCortex: Collision-free Simultaneous Cortical Surfaces Reconstruction
Accurate cortical surface reconstruction from magnetic resonance imaging (MRI) data is crucial for reliable neuroanatomical analyses. Current methods have to contend with complex cortical geometries, strict topological requirements, and often produce surfaces with overlaps, self-intersections, and topological defects. To overcome these shortcomings, we introduce SimCortex, a deep learning framework that simultaneously reconstructs all brain surfaces (left/right white-matter and pial) from T1-weighted(T1w) MRI volumes while preserving topological properties. Our method first segments the T1w image into a nine-class tissue label map. From these segmentations, we generate subject-specific, collision-free initial surface meshes. These surfaces serve as precise initializations for subsequent multiscale diffeomorphic deformations. Employing stationary velocity fields (SVFs) integrated via scaling-and-squaring, our approach ensures smooth, topology-preserving transformations with significantly reduced surface collisions and self-intersections. Evaluations on standard datasets demonstrate that SimCortex dramatically reduces surface overlaps and self-intersections, surpassing current methods while maintaining state-of-the-art geometric accuracy.
comment: Metadata update: added journal reference and DOI linking to the published chapter (Springer)
♻ ☆ Adapting Medical Vision Foundation Models for Volumetric Medical Image Segmentation via Active Learning and Selective Semi-supervised Fine-tuning
Medical Vision Foundation Models (Med-VFMs) have superior capabilities of interpreting medical images due to the knowledge learned from self-supervised pre-training with extensive unannotated images. To improve their performance on adaptive downstream evaluations, especially segmentation, a few samples from target domains are selected randomly for fine-tuning them. However, there lacks works to explore the way of adapting Med-VFMs to achieve the optimal performance on target domains efficiently. Thus, it is highly demanded to design an efficient way of fine-tuning Med-VFMs by selecting informative samples to maximize their adaptation performance on target domains. To achieve this, we propose an Active Source-Free Domain Adaptation (ASFDA) method to efficiently adapt Med-VFMs to target domains for volumetric medical image segmentation. This ASFDA employs a novel Active Learning (AL) method to select the most informative samples from target domains for fine-tuning Med-VFMs without the access to source pre-training samples, thus maximizing their performance with the minimal selection budget. In this AL method, we design an Active Test Time Sample Query strategy to select samples from the target domains via two query metrics, including Diversified Knowledge Divergence (DKD) and Anatomical Segmentation Difficulty (ASD). DKD is designed to measure the source-target knowledge gap and intra-domain diversity. It utilizes the knowledge of pre-training to guide the querying of source-dissimilar and semantic-diverse samples from the target domains. ASD is designed to evaluate the difficulty in segmentation of anatomical structures by measuring predictive entropy from foreground regions adaptively. Additionally, our ASFDA method employs a Selective Semi-supervised Fine-tuning to improve the performance and efficiency of fine-tuning by identifying samples with high reliability from unqueried ones.
comment: 17 pages, 5 figures, 8 tables
♻ ☆ Regression is all you need for medical image translation
While Generative Adversarial Nets (GANs) and Diffusion Models (DMs) have achieved impressive results in natural image synthesis, their core strengths - creativity and realism - can be detrimental in medical applications, where accuracy and fidelity are paramount. These models instead risk introducing hallucinations and replication of unwanted acquisition noise. Here, we propose YODA (You Only Denoise once - or Average), a 2.5D diffusion-based framework for medical image translation (MIT). Consistent with DM theory, we find that conventional diffusion sampling stochastically replicates noise. To mitigate this, we draw and average multiple samples, akin to physical signal averaging. As this effectively approximates the DM's expected value, we term this Expectation-Approximation (ExpA) sampling. We additionally propose regression sampling YODA, which retains the initial DM prediction and omits iterative refinement to produce noise-free images in a single step. Across five diverse multi-modal datasets - including multi-contrast brain MRI and pelvic MRI-CT - we demonstrate that regression sampling is not only substantially more efficient but also matches or exceeds image quality of full diffusion sampling even with ExpA. Our results reveal that iterative refinement solely enhances perceptual realism without benefiting information translation, which we confirm in relevant downstream tasks. YODA outperforms eight state-of-the-art DMs and GANs and challenges the presumed superiority of DMs and GANs over computationally cheap regression models for high-quality MIT. Furthermore, we show that YODA-translated images are interchangeable with, or even superior to, physical acquisitions for several medical applications.
♻ ☆ A Multimodal Deep Learning Approach for White Matter Shape Prediction in Diffusion MRI Tractography
Shape measures have emerged as promising descriptors of white matter tractography, offering complementary insights into anatomical variability and associations with cognitive and clinical phenotypes. However, conventional methods for computing shape measures are computationally expensive and time-consuming for large-scale datasets due to reliance on voxel-based representations. We propose Tract2Shape, a novel multimodal deep learning framework that leverages geometric (point cloud) and scalar (tabular) features to predict ten white matter tractography shape measures. To enhance model efficiency, we utilize a dimensionality reduction algorithm for the model to predict five primary shape components. The model is trained and evaluated on two independently acquired datasets, the HCP-YA dataset, and the PPMI dataset. We evaluate the performance of Tract2Shape by training and testing it on the HCP-YA dataset and comparing the results with state-of-the-art models. To further assess its robustness and generalization ability, we also test Tract2Shape on the unseen PPMI dataset. Tract2Shape outperforms SOTA deep learning models across all ten shape measures, achieving the highest average Pearson's r and the lowest nMSE on the HCP-YA dataset. The ablation study shows that both multimodal input and PCA contribute to performance gains. On the unseen testing PPMI dataset, Tract2Shape maintains a high Pearson's r and low nMSE, demonstrating strong generalizability in cross-dataset evaluation. Tract2Shape enables fast, accurate, and generalizable prediction of white matter shape measures from tractography data, supporting scalable analysis across datasets. This framework lays a promising foundation for future large-scale white matter shape analysis.
comment: Paper accepted to Human Brain Mapping. 25 pages, 3 figures, 8 tables
♻ ☆ GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction NeurIPS 2025
Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing. Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (\eg across sensors, ecological distributions), requiring robust cross-domain methods. Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. All code and data are available at: https://github.com/echerif18/HyspectraSSL.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion
Incomplete multi-view data, where certain views are entirely missing for some samples, poses significant challenges for traditional multi-view clustering methods. Existing deep incomplete multi-view clustering approaches often rely on static fusion strategies or two-stage pipelines, leading to suboptimal fusion results and error propagation issues. To address these limitations, this paper proposes a novel incomplete multi-view clustering framework based on Hierarchical Semantic Alignment and Cooperative Completion (HSACC). HSACC achieves robust cross-view fusion through a dual-level semantic space design. In the low-level semantic space, consistency alignment is ensured by maximizing mutual information across views. In the high-level semantic space, adaptive view weights are dynamically assigned based on the distributional affinity between individual views and an initial fused representation, followed by weighted fusion to generate a unified global representation. Additionally, HSACC implicitly recovers missing views by projecting aligned latent representations into high-dimensional semantic spaces and jointly optimizes reconstruction and clustering objectives, enabling cooperative learning of completion and clustering. Experimental results demonstrate that HSACC significantly outperforms state-of-the-art methods on five benchmark datasets. Ablation studies validate the effectiveness of the hierarchical alignment and dynamic weighting mechanisms, while parameter analysis confirms the model's robustness to hyperparameter variations.
♻ ☆ Computer Navigated Spinal Surgery Using Magnetic Resonance Imaging and Augmented Reality
Current spinal pain management procedures, such as radiofrequency ablation (RFA) and epidural steroid injection (ESI), rely on fluoroscopy for needle placement which exposes patients and physicians to ionizing radiation. In this paper, we investigate a radiation-free surgical navigation system for spinal pain management procedures that combines magnetic resonance imaging (MRI) with fiducial ArUco marker-based augmented reality (AR). High-resolution MRI scans of a lumbar spinal phantom were obtained and assembled as a surface mesh. Laplacian smoothing algorithms were then applied to smoothen the surface and improve the model fidelity. A commercially available stereo camera (ZED2) was used to track single or dual fiducial ArUco markers on the patient to determine the patient's real-time pose. Custom AR software was applied to overlay the MRI image onto the patient, allowing the physician to see not only the outer surface of the patient but also the complete anatomy of the patient below the surface. Needle-insertion trials on a 3D-printed 3-vertebra phantom showed that dual-ArUco marker tracking increased the accuracy of needle insertions and reduced the average needle misplacement distance compared to single-ArUco marker procedures. The average needle misplacement is comparable to the average deviation of 2 mm for conventional epidural techniques using fluoroscopy. Our radiation-free system demonstrates promise to serve as an alternative to fluoroscopy by improving image-guided spinal navigation.
comment: Equal contribution: Songyuan Lu and Jingwen Hui contributed equally
♻ ☆ Fast MRI for All: Bridging Access Gaps by Training without Raw Data NeurIPS
Physics-driven deep learning (PD-DL) approaches have become popular for improved reconstruction of fast magnetic resonance imaging (MRI) scans. Though PD-DL offers higher acceleration rates than existing clinical fast MRI techniques, their use has been limited outside specialized MRI centers. A key challenge is generalization to rare pathologies or different populations, noted in multiple studies, with fine-tuning on target populations suggested for improvement. However, current approaches for PD-DL training require access to raw k-space measurements, which is typically only available at specialized MRI centers that have research agreements for such data access. This is especially an issue for rural and under-resourced areas, where commercial MRI scanners only provide access to a final reconstructed image. To tackle these challenges, we propose Compressibility-inspired Unsupervised Learning via Parallel Imaging Fidelity (CUPID) for high-quality PD-DL training using only routine clinical reconstructed images exported from an MRI scanner. CUPID evaluates output quality with a compressibility-based approach while ensuring that the output stays consistent with the clinical parallel imaging reconstruction through well-designed perturbations. Our results show CUPID achieves similar quality to established PD-DL training that requires k-space data while outperforming compressed sensing (CS) and diffusion-based generative methods. We further demonstrate its effectiveness in a zero-shot training setup for retrospectively and prospectively sub-sampled acquisitions, attesting to its minimal training burden. As an approach that radically deviates from existing strategies, CUPID presents an opportunity to provide broader access to fast MRI for remote and rural populations in an attempt to reduce the obstacles associated with this expensive imaging modality.
comment: Neural Information Processing Systems (NeurIPS), 2025 (Spotlight)
♻ ☆ Team Westwood Solution for MIDOG 2025 Challenge: An Ensemble-CNN-Based Approach For Mitosis Detection And Classification
This abstract presents our solution (Team Westwood) for mitosis detection and atypical mitosis classification in the MItosis DOmain Generalization (MIDOG) 2025 challenge. For mitosis detection, we trained an nnUNetV2 for initial mitosis candidate screening with high sensitivity, followed by a random forest classifier ensembling predictions of three convolutional neural networks (CNNs): EfficientNet-b3, EfficientNet-b5, and EfficientNetV2-s. For the atypical mitosis classification, we trained another random forest classifier ensembling the predictions of three CNNs: EfficientNet-b3, EfficientNet-b5, and InceptionV3. On the preliminary test set, our solution achieved an F1 score of 0.7450 for track 1 mitosis detection, and a balanced accuracy of 0.8722 for track 2 atypical mitosis classification. On the final test set, our solution achieved an F1 score of 0.6972 for track 1 mitosis detection, and a balanced accuracy of 0.8242 for track 2 atypical mitosis classification.
comment: 3 pages, 2 figures
♻ ☆ Intelligent Software System for Low-Cost, Brightfield Segmentation: Algorithmic Implementation for Cytometric Auto-Analysis
Bright-field microscopy, a cost-effective solution for live-cell culture, is often the only resource available, along with standard CPUs, for many low-budget labs. The inherent chal- lenges of bright-field images - their noisiness, low contrast, and dynamic morphology - coupled with a lack of GPU resources and complex software interfaces, hinder the desired research output. This article presents a novel microscopy image analysis frame- work designed for low-budget labs equipped with a standard CPU desktop. The Python-based program enables cytometric analysis of live, unstained cells in culture through an advanced computer vision and machine learning pipeline. Crucially, the framework operates on label-free data, requiring no manually annotated training data or training phase. It is accessible via a user-friendly, cross-platform GUI that requires no programming skills, while also providing a scripting interface for programmatic control and integration by developers. The end-to-end workflow performs semantic and instance segmentation, feature extraction, analysis, evaluation, and automated report generation. Its modular archi- tecture supports easy maintenance and flexible integration while supporting both single-image and batch processing. Validated on several unstained cell types from the public dataset of livecells, the framework demonstrates superior accuracy and reproducibility compared to contemporary tools like Cellpose and StarDist. Its competitive segmentation speed on a CPU-based platform highlights its significant potential for basic research and clinical applications - particularly in cell transplantation for personalised medicine and muscle regeneration therapies. The access to the application is available for reproducibility